Holo is trained to act as a multi-step agent inside a specific harness, and a few of those conventions have to come along for the model to behave well in yours: an output format, a chat layout for screenshots and tool results, an image budget, and a coordinate convention. Skip any one and quality suffers. Holo supports two output formats, and which ones are available depends on the model:
  • Structured outputs: the model returns a single constrained JSON object per step. Works on Holo3.1 and Holo3.
  • Native function calling: the model returns OpenAI-style tool_calls. Holo3.1 only; Holo3 does not support it.
Pick one and stay in it. The reasoning channel, coordinate convention, and image budget below are identical either way; only how you declare tools and read the model’s output changes. See Output format and tool calls. Set up the OpenAI client first by following the Quickstart.

Reasoning

Holo returns two streams on every call: a thinking trace in reasoning_content and the action in content. Reasoning is essential in agent mode (Holo was trained to plan in reasoning_content before each step), so leave it on; reasoning_effort: "medium" is a sensible default.
extra_body={"chat_template_kwargs": {"enable_thinking": True}}
Past reasoning is dropped between turns by the Qwen 3.5 chat template Holo inherits, so anything the model needs to remember has to flow through content (that is what the note field, below, is for). When re-adding the assistant message to the conversation, push only the parsed output; do not splice the reasoning back in.
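A minimal sketch of keeping the two streams apart when building history. It assumes the client exposes the trace as a reasoning_content attribute on the message (how the field surfaces may depend on your client version); split_reply and history_entry are hypothetical helper names, and a SimpleNamespace stands in for a real response message.

```python
import json
from types import SimpleNamespace

def split_reply(message):
    """Return (reasoning, content) from one chat completion message."""
    # Assumption: the trace is exposed as `reasoning_content`; fall back to None if absent.
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning, message.content

def history_entry(content):
    """Build the assistant message to append: parsed output only, no reasoning."""
    parsed = json.loads(content)  # raises if the action is not valid JSON
    return {"role": "assistant", "content": json.dumps(parsed)}

# Stand-in for resp.choices[0].message, for illustration only.
reply = SimpleNamespace(
    reasoning_content="The form is filled in; Continue is the next click.",
    content='{"note": null, "thought": "Click Continue.", '
            '"tool_call": {"tool_name": "click", "element": "Continue button", "x": 932, "y": 880}}',
)

reasoning, content = split_reply(reply)
entry = history_entry(content)  # the reasoning string is logged or discarded, never re-sent
```

The reasoning string can still be logged for debugging; it just never goes back into messages.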

Coordinates in [0, 1000]

Send a screenshot at any size. Holo returns coordinates as integers in [0, 1000], normalized to that image. Scale back to pixels using its dimensions:
abs_x = int((x / 1000) * screenshot.width)
abs_y = int((y / 1000) * screenshot.height)
Origin is top-left. Send and scale against the same image bytes; any resize, crop, or DPI mismatch will misclick. Pick one pixel unit (CSS or device) and stay in it end to end.
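The scaling above can be wrapped in a small helper. This is a sketch, not part of the Holo API: to_pixels is a hypothetical name, and the clamp to the last valid pixel (so that 1000 does not map one pixel past the edge) is an added assumption rather than documented behavior.

```python
def to_pixels(x, y, width, height):
    """Map normalized [0, 1000] coordinates onto a width x height screenshot."""
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        raise ValueError(f"coordinates out of range: ({x}, {y})")
    # Integer arithmetic avoids float rounding; the clamp keeps 1000 inside the image.
    px = min(x * width // 1000, width - 1)
    py = min(y * height // 1000, height - 1)
    return px, py

# The click from the example output, on a 1920x1080 screenshot.
click_px = to_pixels(932, 880, 1920, 1080)    # (1789, 950)
corner_px = to_pixels(1000, 1000, 1920, 1080)  # (1919, 1079) after clamping
```

Pass the dimensions of the exact image that was sent to the model, not those of the live window.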

Image budget

Keep at most the last 3 screenshots in context; more degrades accuracy. Older screenshots should be replaced with a short text placeholder, while keeping the <observation> wrapper. This works the same in both output formats, since observations are always user messages:
def trim_to_last_n_images(messages, n=3):
    seen = 0
    for msg in reversed(messages):
        if msg["role"] != "user" or not isinstance(msg["content"], list):
            continue
        for chunk in msg["content"]:
            if chunk.get("type") != "image_url":
                continue
            seen += 1
            if seen > n:
                chunk["type"] = "text"
                chunk["text"] = "[screenshot evicted]"
                chunk.pop("image_url", None)
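As a sanity check before each request, it can help to count the images still in context. The sketch below uses a hypothetical count_images helper and hand-written messages that mimic the shape left behind after eviction (wrapper kept, image replaced by a text placeholder).

```python
def count_images(messages):
    """Count image_url chunks across all list-style message contents."""
    return sum(
        1
        for msg in messages
        if isinstance(msg.get("content"), list)
        for chunk in msg["content"]
        if chunk.get("type") == "image_url"
    )

def observation(chunk):
    """Wrap one middle chunk in the <observation> text markers."""
    return {"role": "user", "content": [
        {"type": "text", "text": "<observation>\n"},
        chunk,
        {"type": "text", "text": "\n</observation>"},
    ]}

# Placeholder data URLs, for illustration only.
messages = [
    {"role": "system", "content": "your prompt"},
    observation({"type": "text", "text": "[screenshot evicted]"}),  # already trimmed
    observation({"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}}),
    observation({"type": "image_url", "image_url": {"url": "data:image/png;base64,BBBB"}}),
]

images_in_context = count_images(messages)
assert images_in_context <= 3, "image budget exceeded"
```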

Output format and tool calls

In structured-output mode, the model is constrained, at the decoding level, to emit a single JSON object matching a schema you provide. Tool calls are fields inside that object, so output is always valid JSON.

Output JSON

Each step, the model emits one object with three fields:
{
  "note": "Submit succeeded; receipt URL is /orders/8421.",
  "thought": "Recording the receipt URL before navigating away.",
  "tool_call": {
    "tool_name": "click",
    "element": "Continue button at the bottom right",
    "x": 932,
    "y": 880
  }
}
note is the model’s durable memory: anything from the current screen that future steps will need (URLs, IDs, intermediate answers). Set it to null when nothing new is worth recording. thought is a one-line plan for the next action. tool_call is flat: tool_name is a sibling of the arguments, not nested in an args object.
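Because tool_call is flat, separating the tool name from its arguments is a single pop. The sketch below uses only the standard library; split_step is a hypothetical helper name.

```python
import json

def split_step(raw_content):
    """Return (note, tool_name, args) from one step's JSON text."""
    step = json.loads(raw_content)
    args = dict(step["tool_call"])    # copy so the parsed step is left untouched
    tool_name = args.pop("tool_name")  # sibling of the arguments, not nested
    return step.get("note"), tool_name, args

raw = """{
  "note": "Submit succeeded; receipt URL is /orders/8421.",
  "thought": "Recording the receipt URL before navigating away.",
  "tool_call": {
    "tool_name": "click",
    "element": "Continue button at the bottom right",
    "x": 932,
    "y": 880
  }
}"""

note, tool_name, args = split_step(raw)
# tool_name is "click"; args holds element, x, and y only.
```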

Constrain output to a tool union

Define each tool as a Pydantic model with a Literal[tool_name] field, then use their union as the response schema. The server’s constrained decoder ensures the model emits exactly one variant, and tool_name is the tag you dispatch on at execution time. The example below ships three tools (click, write, answer) for illustration; real agents register a wider toolbox following the same pattern.
from typing import Literal
from pydantic import BaseModel, Field

class ClickArgs(BaseModel):
    """Click at (x, y) coordinates"""
    tool_name: Literal["click"]
    element: str = Field(description="Detailed description of the target UI element to click on")
    x: int = Field(description="X coordinate as integer in [0, 1000]")
    y: int = Field(description="Y coordinate as integer in [0, 1000]")

class WriteArgs(BaseModel):
    """Type text into the currently focused element without clicking first"""
    tool_name: Literal["write"]
    content: str = Field(description="Content to write")
    press_enter: bool = Field(default=False, description="Whether to press Enter after typing")

class AnswerArgs(BaseModel):
    """Provide a final answer"""
    tool_name: Literal["answer"]
    content: str = Field(description="The answer content")

class Step(BaseModel):
    note: str | None = Field(default=None, description="Task-relevant information from the previous observation. Empty if nothing new.")
    thought: str = Field(description="Reasoning about next steps")
    tool_call: ClickArgs | WriteArgs | AnswerArgs
Embed the same schema inside the system prompt under an <output_format> block (shown in the loop below). The model was trained with the schema visible in both the prompt and structured_outputs, and dropping either copy noticeably hurts reliability.
Use extra_body={"structured_outputs": {"json": ...}}, not OpenAI native function calling (tools=[...] / tool_choice=...). In this mode the model emits a flat {note, thought, tool_call} object in content, not a tool_calls array.
Pass the schema to structured_outputs, then parse content back into your models. Because tool_name is a discriminator, the parsed tool_call narrows to exactly one variant, which is what you dispatch on:
schema = Step.model_json_schema()

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,  # the system prompt embeds this same schema; see the loop below
    extra_body={"structured_outputs": {"json": schema}},
)

step = Step.model_validate_json(resp.choices[0].message.content)

# step.tool_call is now typed as the matching variant (ClickArgs, WriteArgs, ...)
if step.tool_call.tool_name == "answer":
    print(step.tool_call.content)
else:
    execute(step.tool_call)
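The execute(...) dispatcher is left to you. One possible shape, sketched below, is a registry keyed on tool_name. Everything here is hypothetical: the handler decorator, the handler bodies (which only return strings instead of driving a browser or OS), and the choice to take a plain dict such as step.tool_call.model_dump() rather than the Pydantic object.

```python
HANDLERS = {}

def handler(name):
    """Register a function as the implementation of one tool."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@handler("click")
def do_click(element, x, y):
    # A real handler would scale (x, y) to pixels and click there.
    return f"clicked {element} at ({x}, {y})"

@handler("write")
def do_write(content, press_enter=False):
    # A real handler would type into the focused element.
    return f"typed {content!r}" + (" and pressed Enter" if press_enter else "")

def execute(tool_call):
    """Dispatch a flat tool_call dict to its handler and return the result text."""
    args = dict(tool_call)
    name = args.pop("tool_name")
    if name not in HANDLERS:
        raise ValueError(f"unknown tool: {name}")
    return HANDLERS[name](**args)

click_result = execute({"tool_name": "click", "element": "Continue button", "x": 932, "y": 880})
write_result = execute({"tool_name": "write", "content": "hello", "press_enter": True})
```

Keeping handler signatures identical to the schema fields means a new tool only needs a new model and a new decorated function.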

Chat layout

User observations alternate with assistant JSON; tool results come back as user messages:
Role | Body
--- | ---
system | your prompt, then the appended <output_format> schema block
user | <observation> + screenshot and/or text + </observation>
assistant | the JSON object: {note, thought, tool_call}
user | <tool_output tool="click"> + result + </tool_output>
user | next <observation>
assistant | next JSON
Wrap tool results as user messages with <tool_output tool="...">, not as OpenAI tool-role messages.
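The two user-message shapes in this layout can be built with small helpers. This is a sketch with hypothetical function names; where any accompanying text sits inside the observation is an assumption, since the layout only says "screenshot and/or text".

```python
import base64

def observation_message(png_bytes=None, text=None):
    """Build a user message wrapping a screenshot and/or text in <observation>."""
    chunks = [{"type": "text", "text": "<observation>\n"}]
    if png_bytes is not None:
        b64 = base64.b64encode(png_bytes).decode()
        chunks.append({"type": "image_url",
                       "image_url": {"url": f"data:image/png;base64,{b64}"}})
    if text:
        # Assumption: extra text goes after the image, inside the wrapper.
        chunks.append({"type": "text", "text": text})
    chunks.append({"type": "text", "text": "\n</observation>"})
    return {"role": "user", "content": chunks}

def tool_output_message(tool_name, result):
    """Build a user message (not a tool-role message) carrying a tool result."""
    return {
        "role": "user",
        "content": f'<tool_output tool="{tool_name}">\n{result}\n</tool_output>',
    }

obs = observation_message(png_bytes=b"\x89PNG", text="URL: /orders/8421")
out = tool_output_message("click", "ok")
```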

A complete loop

Plug in your own screenshot() (browser, OS, emulator) and execute(...) dispatcher.
import json, base64

schema = Step.model_json_schema()
system = render_prompt(tools=...) + f"\n\n<output_format>\n```json\n{json.dumps(schema)}\n```\n</output_format>"

messages = [{"role": "system", "content": system}]

for _ in range(MAX_STEPS):
    image_bytes = screenshot()
    b64 = base64.b64encode(image_bytes).decode()
    messages.append({"role": "user", "content": [
        {"type": "text", "text": "<observation>\n"},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "\n</observation>"},
    ]})
    trim_to_last_n_images(messages, n=3)

    resp = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        temperature=0.8,
        extra_body={"structured_outputs": {"json": schema}},
    )
    step = Step.model_validate_json(resp.choices[0].message.content)
    messages.append({"role": "assistant", "content": step.model_dump_json()})

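    if step.tool_call.tool_name == "answer":
        final_answer = step.tool_call.content  # task is done; keep the answer
        break  # the loop runs at top level, so exit with break rather than return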

    result = execute(step.tool_call)
    messages.append({
        "role": "user",
        "content": f'<tool_output tool="{step.tool_call.tool_name}">\n{result}\n</tool_output>',
    })

Common pitfalls

Symptom | Likely cause
--- | ---
Clicks land far from the target | Coordinates not scaled to screenshot dimensions, or the screenshot was resized between send and execute
Model loops, forgets earlier facts | Durable facts not carried forward (note empty in structured mode, or nothing written to content in function-calling mode), or older <observation> wrappers dropped instead of stripped to text
Context window fills up | Image budget not enforced
Reasoning leaks into the action | <think>...</think> written inline in content instead of read from reasoning_content
Quality collapses after one bad step | Raw model output replayed in history instead of the parsed result
(Structured) Model returns free-form text | extra_body.structured_outputs.json is missing, or the schema lacks Literal[tool_name] discrimination
(Structured) Tool result has no effect | Sent as a tool-role message instead of a user message with a <tool_output> wrapper
(Function calling) Tool calls come back as plain text | tool_choice not set to required, or the system prompt contains conflicting tool-format examples
(Function calling) Tool result ignored | Sent as a user message instead of a tool-role message with a matching tool_call_id

Next steps

Element localization

Get click coordinates from a screenshot.

API reference

Endpoint, models, parameters, and limits.

Quickstart

Back to setup and your first call.