Holo3 is trained to act as a multi-step agent inside a specific harness, and a few of those conventions have to come along for the model to behave well in yours: an output JSON shape, a chat layout for screenshots and tool results, an image budget, and a coordinate convention. Skip any one and quality suffers. The same contract applies to both models. Set up the OpenAI client first by following the Quickstart.
## Reasoning
Holo3 returns two streams on every call: a thinking trace in `reasoning_content` and the structured JSON in `content`. Reasoning is essential in agent mode (Holo3 was trained to plan in `reasoning_content` before each tool call), so leave it on; `reasoning_effort: "medium"` is a sensible default.
Reasoning stays out of `content` (that's what `note` and `thought` are for, below). When re-adding the assistant message to the conversation, push only the parsed JSON; don't splice the reasoning back in.
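A minimal sketch of handling the two streams, assuming an OpenAI-compatible `client` configured per the Quickstart; `MODEL` is a placeholder and `STEP_SCHEMA` is built under Constrain output to a tool union below:

```python
import json

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    extra_body={"structured_outputs": {"json": STEP_SCHEMA}},
    reasoning_effort="medium",
)
msg = response.choices[0].message
trace = getattr(msg, "reasoning_content", None)  # thinking trace: log it, never replay it
step = json.loads(msg.content)                   # the {note, thought, tool_call} object

# Re-add only the structured JSON to the conversation; the reasoning stays out.
messages.append({"role": "assistant", "content": msg.content})
```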
## Output JSON

Each step, Holo3 emits a single JSON object with three fields:

- `note` is the model's durable memory: anything from the current screen that future steps will need (URLs, IDs, intermediate answers). Set it to null when nothing new is worth recording.
- `thought` is a one-line plan for the next action.
- `tool_call` is flat: `tool_name` is a sibling of the arguments, not nested in an `args` object.
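For concreteness, one step's output might look like this (all values illustrative):

```json
{
  "note": "Order confirmation URL: https://example.com/orders/4812",
  "thought": "The search field is focused; type the query next.",
  "tool_call": {
    "tool_name": "write",
    "text": "wireless keyboard"
  }
}
```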
## Constrain output to a tool union

Define each tool as a Pydantic model with a `tool_name: Literal[...]` field, then use their union as the response schema. The server's constrained decoder ensures the model emits exactly one variant, and `tool_name` is the tag you dispatch on at execution time.
The example below ships three tools (click, write, answer) for illustration; real agents typically register a wider toolbox following the same pattern.
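A sketch of those three tools as Pydantic models; the field names other than `tool_name` are assumptions, so shape them to whatever your executor expects:

```python
from typing import Literal, Union
from pydantic import BaseModel

class Click(BaseModel):
    tool_name: Literal["click"]
    x: int  # normalized to [0, 1000]; see Coordinates below
    y: int

class Write(BaseModel):
    tool_name: Literal["write"]
    text: str

class Answer(BaseModel):
    tool_name: Literal["answer"]
    answer: str

class Step(BaseModel):
    note: str | None
    thought: str
    tool_call: Union[Click, Write, Answer]

# JSON schema handed to the server's constrained decoder via extra_body.
STEP_SCHEMA = Step.model_json_schema()
```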
The same schema must also appear in the system prompt's `<output_format>` block (shown in Putting it together). Holo3 was trained with the schema visible in both places, and dropping either copy noticeably hurts reliability.
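One way to mirror it, assuming the schema is embedded as plain JSON inside the tags (the surrounding instructions are yours; only the `<output_format>` wrapper is the convention here):

```python
import json

SYSTEM_PROMPT = (
    "You are an agent operating a computer step by step."  # your instructions
    + "\n\n<output_format>\n"
    + json.dumps(STEP_SCHEMA, indent=2)
    + "\n</output_format>"
)
```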
Use `extra_body={"structured_outputs": {"json": ...}}`, not OpenAI native function calling (`tools=[...]` / `tool_choice=...`). Holo3 emits a flat `{note, thought, tool_call}` JSON object, not OpenAI-style `tool_calls` arrays; routing through native function calling silently degrades quality.

## Coordinates in [0, 1000]
Send a screenshot at any size. Holo3 returns coordinates as integers in [0, 1000], normalized to that image. Scale back to pixels using the dimensions of the image you actually sent, as in this minimal sketch:
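```python
def to_pixels(x: int, y: int, width: int, height: int) -> tuple[int, int]:
    # width/height are the pixel dimensions of the screenshot that was sent
    return round(x / 1000 * width), round(y / 1000 * height)
```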
## Chat layout

User observations alternate with assistant JSON; tool results come back as user messages:

| Role | Body |
|---|---|
| system | your prompt, then the appended `<output_format>` schema block |
| user | `<observation>` + screenshot and/or text + `</observation>` |
| assistant | the JSON object: `{note, thought, tool_call}` |
| user | `<tool_output tool="click">` + result + `</tool_output>` |
| user | next `<observation>` |
| assistant | next JSON |
Note that tool results go back as user messages with `<tool_output tool="...">`, not as OpenAI tool-role messages.
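The same layout as a raw message list, with illustrative content values and a hypothetical `screenshot_data_url`:

```python
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "text", "text": "<observation>"},
        {"type": "image_url", "image_url": {"url": screenshot_data_url}},
        {"type": "text", "text": "</observation>"},
    ]},
    {"role": "assistant", "content": '{"note": null, "thought": "Click the search box.", "tool_call": {"tool_name": "click", "x": 512, "y": 88}}'},
    {"role": "user", "content": '<tool_output tool="click">ok</tool_output>'},
]
```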
## Image budget

Keep at most the last 3 screenshots in context; more degrades accuracy. Older screenshots should be replaced with a short text placeholder, while keeping the `<observation>` wrapper. A sketch, assuming the multi-part user-message layout shown above:
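```python
def prune_images(messages: list[dict], budget: int = 3) -> None:
    """Replace all but the newest `budget` screenshots with a text placeholder."""
    seen = 0
    for msg in reversed(messages):
        parts = msg.get("content")
        if msg["role"] != "user" or not isinstance(parts, list):
            continue
        if not any(p.get("type") == "image_url" for p in parts):
            continue
        seen += 1
        if seen > budget:
            # Keep the <observation> text parts; swap each image for text.
            msg["content"] = [
                p if p.get("type") == "text"
                else {"type": "text", "text": "[screenshot no longer available]"}
                for p in parts
            ]
```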
## Putting it together

A complete loop. Plug in your own `screenshot()` (browser, OS, emulator) and `execute(...)` dispatcher.
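A sketch that stitches the earlier pieces together; `client`, `MODEL`, `SYSTEM_PROMPT`, `Step`, `Click`, `Answer`, `STEP_SCHEMA`, `to_pixels`, and `prune_images` come from the snippets above, while `screenshot()` and `execute()` are assumed stubs whose signatures are invented here for illustration:

```python
import base64

def observe(messages: list[dict]) -> tuple[int, int]:
    png, width, height = screenshot()  # your capture: PNG bytes plus pixel size
    url = "data:image/png;base64," + base64.b64encode(png).decode()
    messages.append({"role": "user", "content": [
        {"type": "text", "text": "<observation>"},
        {"type": "image_url", "image_url": {"url": url}},
        {"type": "text", "text": "</observation>"},
    ]})
    return width, height

def run(task: str, max_steps: int = 30) -> str | None:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"<observation>{task}</observation>"},
    ]
    for _ in range(max_steps):
        width, height = observe(messages)
        prune_images(messages)  # enforce the image budget every step
        resp = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            extra_body={"structured_outputs": {"json": STEP_SCHEMA}},
            reasoning_effort="medium",
        )
        raw = resp.choices[0].message.content
        step = Step.model_validate_json(raw)
        # Replay only the structured JSON; the reasoning trace stays out.
        messages.append({"role": "assistant", "content": raw})
        call = step.tool_call
        if isinstance(call, Answer):
            return call.answer  # terminal tool: the task is done
        if isinstance(call, Click):
            result = execute("click", *to_pixels(call.x, call.y, width, height))
        else:
            result = execute(call.tool_name, **call.model_dump(exclude={"tool_name"}))
        messages.append({
            "role": "user",
            "content": f'<tool_output tool="{call.tool_name}">{result}</tool_output>',
        })
    return None
```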
## Common pitfalls

| Symptom | Likely cause |
|---|---|
| Clicks land far from the target | Coordinates not scaled to the screenshot's dimensions, or the screenshot was resized between send and execute |
| Model loops, forgets earlier facts | `note` is empty, or older `<observation>` wrappers were dropped instead of stripped to text |
| Model returns free-form text | `extra_body.structured_outputs.json` is missing, or the schema lacks `Literal` `tool_name` discrimination |
| Context window fills up | Image budget not enforced |
| Tool results have no effect | Sent as `tool` role instead of `user` with `<tool_output>` wrapper |
| Quality collapses after one bad step | Raw model output replayed in history instead of the parsed JSON |
| Reasoning leaks into the parsed JSON | `<think>...</think>` written inline in `content` instead of read from `reasoning_content` |
| Output lands in `message.tool_calls` instead of `content` | Used OpenAI native function calling (`tools=[...]`) instead of `extra_body.structured_outputs.json` |
