Holo2 is a state-of-the-art Action Vision-Language Model developed by H Company. Fine-tuned from the Qwen3-VL Thinking family, it excels at UI element localization and UI navigation across computer, web, and mobile environments. This quickstart demonstrates the core grounding capability: (screenshot + task) → click coordinates.

Prerequisite: API Key

Log in to Portal-H to retrieve or generate an API key that grants access to the model hosted by H Company.

Step 1: Install required Python dependencies

Set up the required Python dependencies before starting.
!python -m pip install -q -U openai pillow pydantic python-dotenv rich
import pathlib as pl
import sys
from typing import Any, Literal, TypeAlias

from IPython.display import display
from PIL import Image
from pydantic import BaseModel, Field

# Add the project root to the Python path so the utils package can be imported
project_root_dir = pl.Path.cwd().resolve().parents[1]
sys.path.append(str(project_root_dir))

# Path to the cursor image used to visualize the predicted click
cursor_img_path = project_root_dir / "data" / "cursor_image_red.png"

Step 2: Prepare the Input Payload

Holo2 is hosted on H Company’s inference platform and is compatible with the OpenAI Chat Completions API protocol. In this example, Holo2 is prompted to identify the correct location to click in order to select a date on the calendar. Important: the input image must first be resized using Qwen’s smart_resize method. The served model applies the same resizing internally, so if your local copy of the image has different dimensions, the absolute coordinates it returns will be misaligned with your pixels.
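The smart_resize helper used in this notebook ships with the repo’s utils package. For reference, here is a minimal sketch of the algorithm, closely following the implementation in Qwen’s qwen_vl_utils (the defaults shown for factor and the pixel bounds are assumptions and may differ from the repo’s version):

import math

def smart_resize_sketch(
    height: int,
    width: int,
    factor: int = 28,
    min_pixels: int = 56 * 56,
    max_pixels: int = 14 * 14 * 4 * 1280,
) -> tuple[int, int]:
    """Round (height, width) to multiples of `factor`, keeping the pixel count within bounds.

    (The real implementation also guards against extreme aspect ratios.)
    """
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too many pixels: scale down, flooring to the nearest multiple of factor
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too few pixels: scale up, ceiling to the nearest multiple of factor
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar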
from utils.image import convert_image_to_base64_url

class ClickAbsoluteAction(BaseModel):
    """Click at absolute coordinates."""

    action: Literal["click_absolute"] = "click_absolute"
    x: int = Field(description="The x coordinate, number of pixels from the left edge.")
    y: int = Field(description="The y coordinate, number of pixels from the top edge.")


ChatMessage: TypeAlias = dict[str, Any]

LOCALIZATION_TASK_PROMPT = f"""Localize an element on the GUI image according to the provided target and output a click position.
* You must output a valid JSON following the format: {ClickAbsoluteAction.model_json_schema()}
Your target is:"""

def build_messages(
    task: str, image: Image.Image, image_format: str
) -> list[ChatMessage]:
    """Build the messages for the localization task.

    Args:
        task: User instruction for the localization task.
        image: Image providing context for the localization task.
        image_format: PIL image format (see https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html).

    Returns:
        List of messages for the localization task.
    """
    image_url = convert_image_to_base64_url(image=image, format=image_format)
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url", 
                    "image_url": {
                        "url": image_url,
                    },
                },
                {"type": "text", "text": f"{LOCALIZATION_TASK_PROMPT}\n{task}"},
            ],
        }
    ]
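convert_image_to_base64_url also comes from the repo’s utils package. If you are adapting this notebook outside the repo, an equivalent helper takes only a few lines (a sketch; the repo’s version may differ in details):

import base64
import io

def convert_image_to_base64_url(image: Image.Image, format: str) -> str:
    """Serialize a PIL image into a base64 data URL accepted by the Chat Completions API."""
    buffer = io.BytesIO()
    image.save(buffer, format=format)
    encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
    return f"data:image/{format.lower()};base64,{encoded}"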
Prepare the input request
# Prepare the inference request
from utils.image import smart_resize

# Load the example screenshot
image_path = project_root_dir / "data" / "calendar_example.jpg"
image = Image.open(image_path)

# Resize the image so that the predicted absolute coordinates match the pixel
# grid the served model actually sees.
resized_height, resized_width = smart_resize(height=image.height, width=image.width)
image = image.resize(size=(resized_width, resized_height))

task = "Select July 14th as the check-out date"
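Since the model’s coordinates are expressed in the resized pixel space, it is worth printing both sizes to see what changed:

# Confirm the dimensions the model will see (predictions are expressed
# in this resized pixel space)
print(f"Original size: {Image.open(image_path).size} -> resized size: {image.size}")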

Step 3: Invoke Holo2 via API

This section shows how to call the Holo2 model via the API to perform a localization task.

Step 3a: Set up your API key

You can provide the API key in two ways:
  • Directly by assigning it to the API_KEY variable.
  • Indirectly by adding it to a .env file under the variable name HAI_API_KEY.
If no API key is found, an error will be raised.
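For the second option, the .env file needs just the one variable (the value below is a placeholder):

HAI_API_KEY=<your-api-key>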
API_KEY = ""
if not API_KEY:
    import os

    if os.path.exists("../../.env"):
        from dotenv import load_dotenv

        # By default, load_dotenv looks for a .env file in the current directory or its parents
        load_dotenv(override=True)
        API_KEY = os.getenv("HAI_API_KEY")
    else:
        # Create a .env file from the template, then fill in your key and re-run this cell
        !cp ../../.env.example ../../.env
        raise RuntimeError("Please fill in the API key in the .env file")


assert API_KEY, "API_KEY is not set, please set the HAI_API_KEY environment variable or fill in the API_KEY variable"

Step 3b: Set up the model

Prepare the model endpoint by providing the model name and base URL.
MODEL_NAME = "<model-name>"
API_BASE_URL = "<api-base-url>"
BASE_URL = f"{API_BASE_URL}/{MODEL_NAME}"
API Configuration

Here are the relevant URLs and names to use when configuring the API for Holo:
# Example values

MODEL_NAME = "holo1-5-7b-20250915"
API_BASE_URL = "https://api.hcompanyprod.fr/v1/models"
Model names
  • holo1-5-7b-20250915
  • holo1-5-3b-20250915
  • holo1-7b-20250521
  • holo1-3b-20250521
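Substituting one of these names into the Step 3b template yields the final per-model endpoint, for example:

BASE_URL = "https://api.hcompanyprod.fr/v1/models/holo1-5-7b-20250915"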

Step 3c: Invoke the model

Send the image and task to the model via the API, then parse and display the model’s output. The guided_json option passed through extra_body asks the server to constrain decoding to the ClickAbsoluteAction JSON schema, so the response can be parsed directly into the Pydantic model.
from utils.image import draw_image_with_click
from openai import OpenAI
import rich
import json

client = OpenAI(
    base_url=BASE_URL,
    api_key=API_KEY
)

chat_completion = client.chat.completions.create(
    model=MODEL_NAME,
    messages=build_messages(
        task=task,
        image=image,
        image_format="JPEG",
    ),
    extra_body={
        "guided_json": ClickAbsoluteAction.model_json_schema(),
    },
    temperature=0,
)


click = ClickAbsoluteAction(**json.loads(chat_completion.choices[0].message.content))
rich.print(click)

display(draw_image_with_click(image, click.x, click.y, cursor_img_path))
ClickAbsoluteAction(action='click_absolute', x=342, y=345)
[Output: the calendar screenshot with the cursor drawn at the predicted click position]
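Note that the predicted coordinates live in the resized image’s pixel space. If you need to act on the original, full-resolution screenshot (for example, to drive an automation tool), scale them back using the sizes from Step 2 (a sketch using the variables defined above):

# Map the click from the resized pixel space back to the original screenshot
original_image = Image.open(image_path)
scale_x = original_image.width / resized_width
scale_y = original_image.height / resized_height
original_click = (round(click.x * scale_x), round(click.y * scale_y))
print(f"Click on the original screenshot at {original_click}")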