- Structured outputs: the model returns a single constrained JSON object per step. Works on Holo3.1 and Holo3.
- Native function calling: the model returns OpenAI-style
tool_calls. Holo3.1 only; Holo3 does not support it.
Reasoning
Holo returns two streams on every call: a thinking trace inreasoning_content and the action in content. Reasoning is essential in agent mode (Holo was trained to plan in reasoning_content before each step), so leave it on; reasoning_effort: "medium" is a sensible default.
content (that is what the note field, below, is for). When re-adding the assistant message to the conversation, push only the parsed output; do not splice the reasoning back in.
Coordinates in [0, 1000]
Send a screenshot at any size. Holo returns coordinates as integers in [0, 1000], normalized to that image. Scale back to pixels using its dimensions:
Image budget
Keep at most the last 3 screenshots in context; more degrades accuracy. Older screenshots should be replaced with a short text placeholder, while keeping the<observation> wrapper. This works the same in both output formats, since observations are always user messages:
Output format and tool calls
- Structured outputs (Holo3.1, Holo3)
- Native function calling (Holo3.1)
The model is constrained, at the decoding level, to emit a single JSON object matching a schema you provide. Tool calls are fields inside that object, so output is always valid JSON.Embed the same schema inside the system prompt under an Pass the schema to
Wrap tool results as
Output JSON
Each step, the model emits one object with three fields:note is the model’s durable memory: anything from the current screen that future steps will need (URLs, IDs, intermediate answers). Set it to null when nothing new is worth recording. thought is a one-line plan for the next action. tool_call is flat: tool_name is a sibling of the arguments, not nested in an args object.Constrain output to a tool union
Define each tool as a Pydantic model with aLiteral[tool_name] field, then use their union as the response schema. The server’s constrained decoder ensures the model emits exactly one variant, and tool_name is the tag you dispatch on at execution time. The example below ships three tools (click, write, answer) for illustration; real agents register a wider toolbox following the same pattern.<output_format> block (shown in the loop below). The model was trained with the schema visible in both the prompt and structured_outputs, and dropping either copy noticeably hurts reliability.Use
extra_body={"structured_outputs": {"json": ...}}, not OpenAI native function calling (tools=[...] / tool_choice=...). In this mode the model emits a flat {note, thought, tool_call} object in content, not a tool_calls array.structured_outputs, then parse content back into your models. Because tool_name is a discriminator, the parsed tool_call narrows to exactly one variant, which is what you dispatch on:Chat layout
User observations alternate with assistant JSON; tool results come back asuser messages:| Role | Body |
|---|---|
system | your prompt, then the appended <output_format> schema block |
user | <observation> + screenshot and/or text + </observation> |
assistant | the JSON object: {note, thought, tool_call} |
user | <tool_output tool="click"> + result + </tool_output> |
user | next <observation> |
assistant | next JSON |
user messages with <tool_output tool="...">, not as OpenAI tool-role messages.A complete loop
Plug in your ownscreenshot() (browser, OS, emulator) and execute(...) dispatcher.Common pitfalls
| Symptom | Likely cause |
|---|---|
| Clicks land far from the target | Coordinates not scaled to screenshot dimensions, or the screenshot was resized between send and execute |
| Model loops, forgets earlier facts | Durable facts not carried forward (note empty in structured mode, or nothing written to content in function-calling mode), or older <observation> wrappers dropped instead of stripped to text |
| Context window fills up | Image budget not enforced |
| Reasoning leaks into the action | <think>...</think> written inline in content instead of read from reasoning_content |
| Quality collapses after one bad step | Raw model output replayed in history instead of the parsed result |
| (Structured) Model returns free-form text | extra_body.structured_outputs.json is missing, or the schema lacks Literal[tool_name] discrimination |
| (Structured) Tool result has no effect | Sent as a tool-role message instead of a user message with a <tool_output> wrapper |
| (Function calling) Tool calls come back as plain text | tool_choice not set to required, or the system prompt contains conflicting tool-format examples |
| (Function calling) Tool result ignored | Sent as a user message instead of a tool-role message with a matching tool_call_id |
Next steps
Element localization
Get click coordinates from a screenshot.
API reference
Endpoint, models, parameters, and limits.
Quickstart
Back to setup and your first call.