Holo3 is trained to act as a multi-step agent inside a specific harness, and a few of those conventions have to come along for the model to behave well in yours: an output JSON shape, a chat layout for screenshots and tool results, an image budget, and a coordinate convention. Skip any one and quality suffers. The same contract applies to both models. Set up the OpenAI client first by following the Quickstart.

Reasoning

Holo3 returns two streams on every call: a thinking trace in reasoning_content and the structured JSON in content. Reasoning is essential in agent mode (Holo3 was trained to plan in reasoning_content before each tool call), so leave it on; reasoning_effort: "medium" is a sensible default. If you do need to disable thinking entirely, pass:
extra_body={"chat_template_kwargs": {"enable_thinking": False}}
Past reasoning is dropped between turns by the Qwen 3.5 chat template Holo3 inherits, so anything the model needs to remember has to flow through content (that’s what note and thought are for, below). When re-adding the assistant message to the conversation, push only the parsed JSON; don’t splice the reasoning back in.
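
A minimal sketch of handling the two streams, assuming a vLLM-style server that exposes reasoning_content on the message object:
msg = resp.choices[0].message
trace = getattr(msg, "reasoning_content", None)  # thinking trace: log or discard, never replay
step_json = msg.content                          # structured JSON: parse it and keep it in history
messages.append({"role": "assistant", "content": step_json})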

Output JSON

Each step, Holo3 emits a single JSON object with three fields:
{
  "note": "Submit succeeded; receipt URL is /orders/8421.",
  "thought": "Recording the receipt URL before navigating away.",
  "tool_call": {
    "tool_name": "click",
    "element": "Continue button at the bottom right",
    "x": 932,
    "y": 880
  }
}
note is the model’s durable memory: anything from the current screen that future steps will need (URLs, IDs, intermediate answers). Set it to null when nothing new is worth recording. thought is a one-line plan for the next action. tool_call is flat: tool_name is a sibling of the arguments, not nested in an args object.

Constrain output to a tool union

Define each tool as a Pydantic model with a Literal[tool_name] field, then use their union as the response schema. The server’s constrained decoder ensures the model emits exactly one variant, and tool_name is the tag you dispatch on at execution time. The example below ships three tools (click, write, answer) for illustration; real agents typically register a wider toolbox following the same pattern.
from typing import Literal
from pydantic import BaseModel, Field

MouseButton = Literal["left", "right", "middle"]

class ClickArgs(BaseModel):
    """Click at (x, y) coordinates"""
    tool_name: Literal["click"]
    element: str = Field(description="Detailed description of the target UI element to click on")
    x: int = Field(description="X coordinate as integer in [0, 1000]")
    y: int = Field(description="Y coordinate as integer in [0, 1000]")
    button: MouseButton = Field(default="left", description="Mouse button to click (left, right, middle)")

class WriteArgs(BaseModel):
    """Type text into the currently focused element without clicking first"""
    tool_name: Literal["write"]
    content: str = Field(description="Content to write")
    press_enter: bool = Field(default=False, description="Whether to press Enter after typing")
    overwrite: bool = Field(default=False, description="Whether to clear existing text before typing")

class AnswerArgs(BaseModel):
    """Provide a final answer"""
    tool_name: Literal["answer"]
    content: str = Field(description="The answer content")

class Step(BaseModel):
    note: str | None = Field(
        default=None,
        description="Task-relevant information extracted from the previous observation. Keep empty if no new info.",
    )
    thought: str = Field(description="Reasoning about next steps")
    tool_call: ClickArgs | WriteArgs | AnswerArgs

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    extra_body={"structured_outputs": {"json": Step.model_json_schema()}},
    temperature=0.8,
)
step = Step.model_validate_json(resp.choices[0].message.content)
Embed the same schema inside the system prompt under an <output_format> block (shown in Putting it together). Holo3 was trained with the schema visible in both places, and dropping either copy noticeably hurts reliability.
Use extra_body={"structured_outputs": {"json": ...}}, not OpenAI native function calling (tools=[...] / tool_choice=...). Holo3 emits a flat {note, thought, tool_call} JSON object, not OpenAI-style tool_calls arrays; routing through native function calling silently degrades quality.

Coordinates in [0, 1000]

Send a screenshot at any size. Holo3 returns coordinates as integers in [0, 1000], normalized to that image. Scale back to pixels using its dimensions:
abs_x = int((step.tool_call.x / 1000) * screenshot.width)
abs_y = int((step.tool_call.y / 1000) * screenshot.height)
Origin is top-left. Send and scale against the same image bytes; any resize, crop, or DPI mismatch will misclick. Pick one pixel unit (CSS or device) and stay in it end to end.
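
For example, on a 1920×1080 screenshot the sample tool call above (x=932, y=880) maps to int(932 / 1000 * 1920) = 1789 and int(880 / 1000 * 1080) = 950.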

Chat layout

User observations alternate with assistant JSON; tool results come back as user messages:
| Role | Body |
| --- | --- |
| system | your prompt, then the appended <output_format> schema block |
| user | <observation> + screenshot and/or text + </observation> |
| assistant | the JSON object: {note, thought, tool_call} |
| user | <tool_output tool="click"> + result + </tool_output> |
| user | next <observation> |
| assistant | next JSON |
Wrap tool results as user messages with <tool_output tool="...">, not as OpenAI tool-role messages.
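
Concretely, the history after one completed step looks like this sketch (JSON bodies abbreviated):
messages = [
    {"role": "system", "content": system},  # prompt + appended <output_format> block
    {"role": "user", "content": [...]},     # <observation> with screenshot (see the loop below)
    {"role": "assistant", "content": '{"note": null, "thought": "...", "tool_call": {...}}'},
    {"role": "user", "content": '<tool_output tool="click">\nok\n</tool_output>'},
    # next <observation> follows, and the cycle repeats
]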

Image budget

Keep at most the last 3 screenshots in context; more degrades accuracy. Older screenshots should be replaced with a short text placeholder, while keeping the <observation> wrapper:
def trim_to_last_n_images(messages, n=3):
    """Replace all but the newest n screenshots with text placeholders, in place."""
    seen = 0
    for msg in reversed(messages):  # walk newest-to-oldest
        if msg["role"] != "user" or not isinstance(msg["content"], list):
            continue
        for chunk in msg["content"]:
            if chunk.get("type") != "image_url":
                continue
            seen += 1
            if seen > n:
                # Swap the image for a placeholder; the <observation> text chunks stay intact
                chunk["type"] = "text"
                chunk["text"] = "[screenshot evicted]"
                chunk.pop("image_url", None)

Putting it together

A complete loop, wrapped in a function so the final answer can be returned. Plug in your own screenshot() (browser, OS, emulator) and execute(...) dispatcher; a sketch of the latter follows the loop.
import json, base64

schema = Step.model_json_schema()
system = render_prompt(tools=...) + f"\n\n<output_format>\n```json\n{json.dumps(schema)}\n```\n</output_format>"

def run_agent() -> str:
    messages = [{"role": "system", "content": system}]

    for _ in range(MAX_STEPS):
        image_bytes = screenshot()
        b64 = base64.b64encode(image_bytes).decode()
        messages.append({"role": "user", "content": [
            {"type": "text", "text": "<observation>\n"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": "\n</observation>"},
        ]})
        trim_to_last_n_images(messages, n=3)

        resp = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            temperature=0.8,
            extra_body={"structured_outputs": {"json": schema}},
        )
        step = Step.model_validate_json(resp.choices[0].message.content)
        # Replay only the parsed JSON; never splice reasoning_content back in
        messages.append({"role": "assistant", "content": step.model_dump_json()})

        if step.tool_call.tool_name == "answer":
            return step.tool_call.content

        result = execute(step.tool_call)
        messages.append({
            "role": "user",
            "content": f'<tool_output tool="{step.tool_call.tool_name}">\n{result}\n</tool_output>',
        })

    raise RuntimeError("No answer within MAX_STEPS")
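
As an illustration, a minimal execute for the three example tools. pyautogui and the full-screen setup are assumptions (the screenshot must cover exactly what pyautogui.size() reports); swap in your own backend and scale against your own screenshot dimensions:
def execute(call) -> str:
    """Illustrative dispatcher, not part of the API; replace with your own backend."""
    import pyautogui  # assumption: screenshot() captures the full screen

    if call.tool_name == "click":
        w, h = pyautogui.size()  # must match the dimensions of the screenshot you sent
        pyautogui.click(int(call.x / 1000 * w), int(call.y / 1000 * h), button=call.button)
        return f"clicked {call.element}"
    if call.tool_name == "write":
        if call.overwrite:
            pyautogui.hotkey("ctrl", "a")  # select existing text so typing replaces it
        pyautogui.write(call.content)
        if call.press_enter:
            pyautogui.press("enter")
        return "typed"
    return ""  # "answer" never reaches execute; the loop returns first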

Common pitfalls

| Symptom | Likely cause |
| --- | --- |
| Clicks land far from the target | Coordinates not scaled to screenshot dimensions, or the screenshot was resized between send and execute |
| Model loops, forgets earlier facts | note is empty, or older <observation> wrappers were dropped instead of stripped to text |
| Model returns free-form text | extra_body.structured_outputs.json is missing, or the schema lacks Literal[tool_name] discrimination |
| Context window fills up | Image budget not enforced |
| Tool results have no effect | Sent as tool role instead of user with <tool_output> wrapper |
| Quality collapses after one bad step | Raw model output replayed in history instead of the parsed JSON |
| Reasoning leaks into the parsed JSON | <think>...</think> written inline in content instead of read from reasoning_content |
| Output lands in message.tool_calls instead of content | Used OpenAI native function calling (tools=[...]) instead of extra_body.structured_outputs.json |