UI localization (also called grounding) is how AI agents interact with software through the same visual interface humans use. Given a screenshot and a task like “Book a hotel in Paris,” the model identifies exactly where to click by outputting precise pixel coordinates. This guide shows you how to get started by installing dependencies, loading the model and processor, and running your first localization task.
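
For example, given a calendar screenshot and the target “August 3rd”, the model answers with a compact JSON action like the one below (the coordinates are taken from the worked example at the end of this guide):
{"action": "click_absolute", "x": 947, "y": 338}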

Step 1: Install required dependencies

First, install the required packages, then import the dependencies. These instructions were tested on Python >= 3.11.
pip install -q -U accelerate matplotlib pillow pydantic requests torchvision "transformers>=4.54.0,<4.57.0"
from typing import Any, Literal, TypeAlias
import requests
import torch
from PIL import Image
from pydantic import BaseModel, Field
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

Step 2: Load model and processor

Our Holo1.5 family of VLMs comes in three sizes to cover different hardware and deployment requirements: 3B, 7B, and 72B. All checkpoints are state-of-the-art (SOTA) for their respective size class.
model_name = "Hcompany/Holo1.5-3B"  # or "Hcompany/Holo1.5-7B", "Hcompany/Holo1.5-72B"

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_name)
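
Optionally, you can sanity-check where the weights landed and how much memory the checkpoint occupies. This quick check is our own addition and is not required for the rest of the guide.
# Optional sanity check: confirm device placement, dtype, and approximate memory use
print(f"Device: {model.device}, dtype: {model.dtype}")
print(f"Approximate memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")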

Step 3: Define actions and prompt

The model uses structured output to return precise navigation actions. This example specifically demonstrates the localization capability of our Holo1.5 model.
class ClickAbsoluteAction(BaseModel):
    """Click at absolute coordinates."""

    action: Literal["click_absolute"] = "click_absolute"
    x: int = Field(description="The x coordinate, number of pixels from the left edge.")
    y: int = Field(description="The y coordinate, number of pixels from the top edge.")

ChatMessage: TypeAlias = dict[str, Any]

def get_chat_messages(task: str, image: Image.Image) -> list[ChatMessage]:
    """Create the prompt structure for navigation task"""

    prompt = f"""Localize an element on the GUI image according to the provided target and output a click position.

* You must output a valid JSON following the format: {ClickAbsoluteAction.model_json_schema()}

Your target is:"""

    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": f"{prompt}\n{task}"},
            ],
        },
    ]
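
If you want to see exactly what the model is asked to produce, you can print the schema that gets embedded in the prompt. This inspection step is our own addition and is not required for the rest of the guide.
# Inspect the JSON schema that get_chat_messages embeds in the prompt text
import json

print(json.dumps(ClickAbsoluteAction.model_json_schema(), indent=2))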

Step 4: Download image

Download a sample screenshot and define the navigation task.
# Example image URL for a web screenshot (replace with your own image)
image_url: str = "https://raw.githubusercontent.com/hcompai/hai-cookbook/445d2017fcc8a0867081ea4786c34f87ed7053eb/data/calendar_example.jpg"

# Download and open image
response = requests.get(image_url, stream=True)
image = Image.open(response.raw)

# Define task
task: str = "Book a hotel in Paris on August 3rd for 3 nights"
print(f"Task: {task}")

image.show()
(Figure: the example calendar screenshot used for the navigation task.)

Step 5: Prepare the image and model inputs

Resize the input image, build the task prompt, and process everything into model-ready inputs.
# Resize image according to model's image processor
image_processor_config = processor.image_processor

resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=image_processor_config.patch_size * image_processor_config.merge_size,
    min_pixels=image_processor_config.min_pixels,
    max_pixels=image_processor_config.max_pixels,
)

processed_image: Image.Image = image.resize(size=(resized_width, resized_height), resample=Image.Resampling.LANCZOS)

# Create the prompt
messages: list[dict[str, Any]] = get_chat_messages(task, processed_image)

# Apply chat template
text_prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Process inputs
inputs = processor(
    text=[text_prompt],
    images=[processed_image],
    padding=True,
    return_tensors="pt",
).to(model.device)
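
Note that the model predicts coordinates in the resized image's pixel space, since that is the image it receives (the visualization in Step 7 draws on the same resized image). If you need to click on the original, full-resolution screenshot instead, keep the scale factors around and map the prediction back. The helper below is our own sketch, not part of the model or processor API.
# Scale factors from the resized (model-input) image back to the original screenshot.
# This helper is an illustrative addition; adapt it to your own pipeline.
scale_x: float = image.width / resized_width
scale_y: float = image.height / resized_height

def to_original_coords(x: int, y: int) -> tuple[int, int]:
    """Map a click predicted on the resized image back to original-image pixels."""
    return round(x * scale_x), round(y * scale_y)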

Step 6: Run inference

Because structured output is not enforced in this transformers-based setup and generation is stochastic, parsing the output can occasionally fail. For production we recommend serving the model with vLLM (a minimal client sketch follows the parsing code at the end of this step), which offers:
  • Faster inference performance
  • Built-in structured output validation
  • Reliable JSON schema compliance
# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=256)

# Decode output
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
result = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print("Raw output:", result)

# Parse the JSON output (the visualization in the next step requires a parsed action)
try:
    action = ClickAbsoluteAction.model_validate_json(result)
    print("Successfully parsed model output")
except Exception as e:
    raise ValueError(f"Could not parse JSON output: {e}") from e

Step 7: Visualize results

Display the model’s predicted click position on the image, highlighting the target with a marker and an annotation for easy interpretation.
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.figure import Figure
from matplotlib.axes import Axes

def visualize_click(action: ClickAbsoluteAction, processed_img: Image.Image, task: str = "") -> tuple[Figure, Axes]:
    """
    Visualize the navigation prediction on the image.

    Args:
        action: The parsed navigation step with action coordinates
        processed_img: The processed PIL image
        task: The task description for the title
    Returns:
        Tuple of matplotlib Figure and Axes objects
    """
    fig, ax = plt.subplots(1, 1, figsize=(12, 8))
    ax.imshow(processed_img)

    # Plot red cross at predicted coordinates
    ax.plot(action.x, action.y, "r+", markersize=20, markeredgewidth=3)

    # Add a circle around the cross for better visibility
    circle = patches.Circle((action.x, action.y), 10, linewidth=2, edgecolor="red", facecolor="none")
    ax.add_patch(circle)

    # Add text annotation
    if task:
        ax.annotate(
            f"Click: {task}",
            xy=(action.x, action.y),
            xytext=(action.x + 20, action.y - 20),
            arrowprops=dict(arrowstyle="->", color="red"),
            bbox=dict(boxstyle="round,pad=0.3", facecolor="yellow", alpha=0.7),
            fontsize=10,
        )

    ax.set_title(f"Navigation Prediction for: {task}")
    ax.axis("off")
    plt.tight_layout()
    return fig, ax

print(f"Predicted action: Click on '{task}' at coordinates ({action.x}, {action.y})")
fig, ax = visualize_click(action, processed_image, task=task)
plt.show()

Example output

(Figure: the model’s predicted click position annotated on the calendar screenshot.) The visualization demonstrates the following:
  • Red cross (⊕): Exact pixel coordinates where the model wants to click
  • Red circle: Visual boundary around the target area
  • Yellow annotation: Description of what element the model identified
The model correctly identified “August 3rd on the calendar” at coordinates (947, 338), indicating precise localization within the complex calendar interface.