Holo1.5 is a state-of-the-art Action Vision-Language Model developed by H Company, achieving accuracy gains of 10% or more over Holo1. Fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct, it excels at UI element localization, the critical skill that lets Computer Use agents navigate digital interfaces the way humans do.

Prerequisite: API Key

Log in to Portal-H to retrieve or generate an API key for accessing the model hosted by H Company.

Step 1: Install required Python dependencies

Set up the required Python dependencies before starting.
!python -m pip install -q -U openai pillow pydantic python-dotenv rich
import pathlib as pl
import sys
from typing import Any, Literal, TypeAlias

from IPython.display import display
from PIL import Image
from pydantic import BaseModel, Field

# Add the project root to the Python path so the utils package can be imported
project_root_dir = pl.Path.cwd().resolve().parents[1]
sys.path.append(str(project_root_dir))

# Path to the cursor .png used to visualize the predicted click
cursor_img_path = project_root_dir / "data" / "cursor_image_red.png"

Step 2: Prepare the Input Payload

Holo1.5 is hosted on H Company’s inference platform and is compatible with the OpenAI Chat Completions API protocol. In this example, Holo1.5 is prompted to identify the correct location to click in order to select a date on a calendar. Important: the input image must be resized with Qwen’s smart_resize method before it is sent. The served model performs the same resizing internally, so if the client-side image does not match those dimensions, the model’s absolute coordinate predictions will be misaligned with your image.
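The smart_resize helper used later in this section comes from the cookbook’s local utils package and mirrors the image-resizing logic of Qwen2.5-VL’s preprocessor. If you are not running from the cookbook repository, a minimal sketch of that logic is shown below; the factor and pixel-budget defaults follow Qwen’s reference implementation and may differ from what utils.image.smart_resize actually uses.
import math

def smart_resize(
    height: int,
    width: int,
    factor: int = 28,
    min_pixels: int = 56 * 56,
    max_pixels: int = 14 * 14 * 4 * 1280,
) -> tuple[int, int]:
    """Round (height, width) to multiples of `factor` while keeping the pixel count within budget."""
    # Round each dimension to the nearest multiple of the patch factor.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Scale down proportionally if the image exceeds the pixel budget.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Scale up proportionally if the image falls below the minimum.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar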
from utils.image import convert_image_to_base64_url

class ClickAbsoluteAction(BaseModel):
    """Click at absolute coordinates."""

    action: Literal["click_absolute"] = "click_absolute"
    x: int = Field(description="The x coordinate, number of pixels from the left edge.")
    y: int = Field(description="The y coordinate, number of pixels from the top edge.")


ChatMessage: TypeAlias = dict[str, Any]

LOCALIZATION_TASK_PROMPT = f"""Localize an element on the GUI image according to the provided target and output a click position.
     * You must output a valid JSON following the format: {ClickAbsoluteAction.model_json_schema()}
     Your target is:"""

def build_messages(
    task: str, image: Image.Image, image_format: str
) -> list[ChatMessage]:
    """Build the messages for the localization task.

    Args:
        task: User instruction for the localization task.
        image: Image providing context for the localization task.
        image_format: PIL image format (see https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html).

    Returns:
        List of messages for the localization task.
    """
    image_url = convert_image_to_base64_url(image=image, format=image_format)
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url", 
                    "image_url": {
                        "url": image_url,
                    },
                },
                {"type": "text", "text": f"{LOCALIZATION_TASK_PROMPT}\n{task}"},
            ],
        }
    ]
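build_messages relies on convert_image_to_base64_url from the local utils package to embed the screenshot as a data URL inside the image_url content block. If you don’t have that package, a minimal stand-in could look like the sketch below; the real helper may handle additional formats or options.
import base64
import io

def convert_image_to_base64_url(image: Image.Image, format: str) -> str:
    """Serialize a PIL image into a base64 data URL for the OpenAI image_url field."""
    buffer = io.BytesIO()
    # Encode the image in the requested format (e.g. "JPEG" or "PNG") into memory.
    image.save(buffer, format=format)
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/{format.lower()};base64,{encoded}"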
Prepare the input request
# Prepare inference request
from utils.image import smart_resize

# Load image
image_path = project_root_dir / "data" / "calendar_example.jpg"
image = Image.open(image_path)

# Resize the image so that predicted absolute coordinates match the size of the image.
resized_height, resized_width = smart_resize(height=image.height, width=image.width)
image = image.resize(size=(resized_width, resized_height), resample=None)

task = "Select July 14th as the check-out date"

Step 3: Invoke Holo1.5 via API

This section shows you how to call the Holo1.5 model via API to perform a localization task.

Step 3a: Set up your API key

You can provide the API key in two ways:
  • Directly by assigning it to the API_KEY variable.
  • Indirectly by adding it to a .env file under the variable name HAI_API_KEY.
If no API key is found, an error will be raised.
API_KEY = ""
if not API_KEY:
    import os
    if os.path.exists("../../.env"):
        from dotenv import load_dotenv
        # By default, looks for a .env file in the current directory or parents
        load_dotenv(override=True)
        API_KEY = os.getenv("HAI_API_KEY")
    else:
        !cp ../../.env.example ../../.env
        assert False, "Please fill in the API key in the .env file"


assert API_KEY, "API_KEY is not set, please set the HAI_API_KEY environment variable or fill in the API_KEY variable"

Step 3b: Set up the model

Prepare the model endpoint by providing the model name and base URL, replacing the placeholders below with the values for your deployment.
MODEL_NAME = "<model-name>"
API_BASE_URL = "<api-base-url>"
BASE_URL = f"{API_BASE_URL}/{MODEL_NAME}"

Step 3c: Invoke the model

Send the image and task to the model via the API, then handle and display the model’s output.
from utils.image import draw_image_with_click
from openai import OpenAI
import rich
import json

client = OpenAI(
    base_url=BASE_URL,
    api_key=API_KEY
)

chat_completion = client.chat.completions.create(
    model=MODEL_NAME,
    messages=build_messages(
        task=task,
        image=image,
        image_format="JPEG",
    ),
    extra_body={
        "guided_json": ClickAbsoluteAction.model_json_schema(),
    },
    temperature=0,
)


click = ClickAbsoluteAction(**json.loads(chat_completion.choices[0].message.content))
rich.print(click)

display(draw_image_with_click(image, click.x, click.y, cursor_img_path))
ClickAbsoluteAction(action='click_absolute', x=342, y=345)
Output image: the calendar screenshot with the predicted click position marked by the cursor overlay.
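Because the screenshot was resized with smart_resize before being sent, the predicted coordinates refer to the resized image. If your agent needs to click on the original, un-resized screenshot, the prediction can be scaled back to native resolution. The sketch below is illustrative; original_image is not defined in the cells above.
# Map the predicted click from resized-image coordinates back to the
# original screenshot resolution (illustrative sketch).
original_image = Image.open(image_path)
scale_x = original_image.width / resized_width
scale_y = original_image.height / resized_height
original_click_x = round(click.x * scale_x)
original_click_y = round(click.y * scale_y)
print(f"Click at ({original_click_x}, {original_click_y}) on the original screenshot")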