Element localization

Pass Holo a screenshot (URL or base64 data URI) and a text description of an element; get click coordinates back. Single-turn, no history, no thinking: set temperature=0.0 and enable_thinking=False. It is a grounding primitive you can use as a vision tool inside any agent, and both Holo3 and Holo3.1 support it. Set up the OpenAI client first by following the Quickstart.

from pydantic import BaseModel, Field

MODEL_NAME = "holo3-1-35b-a3b"
SCREENSHOT_URL = "https://your-host/screenshot.png"  # or "data:image/png;base64,..."
SCREENSHOT_WIDTH, SCREENSHOT_HEIGHT = 1280, 720
ELEMENT = "the 'Sign in' button in the top-right corner"

class VisualLocalizerOutput(BaseModel):
    x: int = Field(ge=0, le=1000, description="X coordinate as integer in [0, 1000]")
    y: int = Field(ge=0, le=1000, description="Y coordinate as integer in [0, 1000]")

schema = VisualLocalizerOutput.model_json_schema()

prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    f" * You must output a valid JSON following the format: {schema}\n"
    f" Your target is:\n{ELEMENT}"
)

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": SCREENSHOT_URL}},
            {"type": "text", "text": prompt},
        ],
    }],
    extra_body={
        "structured_outputs": {"json": schema},
        "chat_template_kwargs": {"enable_thinking": False},
    },
    temperature=0.0,
)

point = VisualLocalizerOutput.model_validate_json(response.choices[0].message.content)
abs_x = int(point.x / 1000 * SCREENSHOT_WIDTH)
abs_y = int(point.y / 1000 * SCREENSHOT_HEIGHT)
print(f"Click at ({abs_x}, {abs_y})")

const MODEL_NAME = "holo3-1-35b-a3b";
const SCREENSHOT_URL = "https://your-host/screenshot.png"; // or "data:image/png;base64,..."
const SCREENSHOT_WIDTH = 1280;
const SCREENSHOT_HEIGHT = 720;
const ELEMENT = "the 'Sign in' button in the top-right corner";

const schema = {
  type: "object",
  properties: {
    x: { type: "integer", minimum: 0, maximum: 1000, description: "X coordinate as integer in [0, 1000]" },
    y: { type: "integer", minimum: 0, maximum: 1000, description: "Y coordinate as integer in [0, 1000]" },
  },
  required: ["x", "y"],
};

const prompt =
  "Localize an element on the GUI image according to the provided target " +
  "and output a click position.\n" +
  ` * You must output a valid JSON following the format: ${JSON.stringify(schema)}\n` +
  ` Your target is:\n${ELEMENT}`;

const response = await client.chat.completions.create({
  model: MODEL_NAME,
  messages: [
    {
      role: "user",
      content: [
        { type: "image_url", image_url: { url: SCREENSHOT_URL } },
        { type: "text", text: prompt },
      ],
    },
  ],
  temperature: 0.0,
  // structured_outputs and chat_template_kwargs are H-specific, passed through in the request body
  ...({
    structured_outputs: { json: schema },
    chat_template_kwargs: { enable_thinking: false },
  } as any),
});

const point = JSON.parse(response.choices[0].message.content!) as { x: number; y: number };
const absX = Math.round((point.x / 1000) * SCREENSHOT_WIDTH);
const absY = Math.round((point.y / 1000) * SCREENSHOT_HEIGHT);
console.log(`Click at (${absX}, ${absY})`);

Coordinates come back as integers in [0, 1000], normalized to the image you sent. Scale them to pixels with the image’s own dimensions, and send and scale against the same image bytes: any resize, crop, or DPI mismatch will misplace the point.

Next steps

Agent loop

How to use Holo in your computer-use harness.

API reference

Endpoint, models, parameters, and limits.

Quickstart

Back to setup and your first call.

Get started

Guides

Element localization

Next steps

Agent loop

API reference

Quickstart

​Next steps

Agent loop

API reference

Quickstart

Next steps