Agent loop - H Tech Hub

Holo3 is trained to act as a multi-step agent inside a specific harness, and a few of that harness's conventions have to carry over for the model to behave well in yours: an output JSON shape, a chat layout for screenshots and tool results, an image budget, and a coordinate convention. Skip any one of them and quality suffers. The same contract applies to both models. Set up the OpenAI client first by following the Quickstart.

Reasoning

Holo3 returns two streams on every call: a thinking trace in reasoning_content and the structured JSON in content. Reasoning is essential in agent mode (Holo3 was trained to plan in reasoning_content before each tool call), so leave it on; reasoning_effort: "medium" is a sensible default.

For reference, the chat-template switch below turns thinking off, so avoid it in agent mode:

extra_body={"chat_template_kwargs": {"enable_thinking": False}}

Past reasoning is dropped between turns by the Qwen 3.5 chat template Holo3 inherits, so anything the model needs to remember has to flow through content (that’s what note and thought are for, below). When re-adding the assistant message to the conversation, push only the parsed JSON; don’t splice the reasoning back in.
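A minimal sketch of that hand-off, assuming the response message exposes the trace as a `reasoning_content` attribute next to `content`. The helper name `split_reply` is hypothetical, and `SimpleNamespace` only stands in for a real `resp.choices[0].message`:

```python
import json
from types import SimpleNamespace


def split_reply(message):
    """Return (reasoning, history_entry) for one assistant message."""
    # The thinking trace is read separately and never replayed.
    reasoning = getattr(message, "reasoning_content", None)
    # Parse and re-serialize so only clean JSON enters the history.
    parsed = json.loads(message.content)
    history_entry = {"role": "assistant", "content": json.dumps(parsed)}
    return reasoning, history_entry


# Stand-in for resp.choices[0].message (illustrative values only).
fake = SimpleNamespace(
    reasoning_content="The form was submitted; next I should continue.",
    content=(
        '{"note": null, "thought": "Continue.", "tool_call": '
        '{"tool_name": "click", "element": "Continue button", "x": 932, "y": 880}}'
    ),
)
reasoning, entry = split_reply(fake)
```

The trace can still be logged for debugging; it just never goes back into `messages`.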

Output JSON

Each step, Holo3 emits a single JSON object with three fields:

{
  "note": "Submit succeeded; receipt URL is /orders/8421.",
  "thought": "Recording the receipt URL before navigating away.",
  "tool_call": {
    "tool_name": "click",
    "element": "Continue button at the bottom right",
    "x": 932,
    "y": 880
  }
}

note is the model’s durable memory: anything from the current screen that future steps will need (URLs, IDs, intermediate answers). Set it to null when nothing new is worth recording. thought is a one-line plan for the next action. tool_call is flat: tool_name is a sibling of the arguments, not nested in an args object.
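To make the flat shape concrete, here is a small standard-library sketch that parses one step and separates `tool_name` from its sibling arguments. The function `parse_step` and its error messages are illustrative assumptions, not part of Holo3's tooling:

```python
import json

REQUIRED_KEYS = {"note", "thought", "tool_call"}


def parse_step(raw):
    """Parse one step and split the flat tool_call into (name, arguments)."""
    step = json.loads(raw)
    missing = REQUIRED_KEYS - step.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    call = dict(step["tool_call"])
    if "tool_name" not in call:
        # Catches the nested {"args": {...}} shape, which is not the contract.
        raise ValueError("tool_call has no top-level tool_name")
    name = call.pop("tool_name")  # whatever remains are the arguments
    return step["note"], step["thought"], name, call


raw = """{
  "note": "Submit succeeded; receipt URL is /orders/8421.",
  "thought": "Recording the receipt URL before navigating away.",
  "tool_call": {"tool_name": "click", "element": "Continue button at the bottom right", "x": 932, "y": 880}
}"""
note, thought, name, args = parse_step(raw)
```

Here `name` is `"click"` and `args` holds `element`, `x`, and `y`, ready to pass to whatever handler is registered for that tool.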

Constrain output to a tool union

Define each tool as a Pydantic model with a Literal[tool_name] field, then use their union as the response schema. The server’s constrained decoder ensures the model emits exactly one variant, and tool_name is the tag you dispatch on at execution time. The example below ships three tools (click, write, answer) for illustration; real agents typically register a wider toolbox following the same pattern.

from typing import Literal
from pydantic import BaseModel, Field

MouseButton = Literal["left", "right", "middle"]

class ClickArgs(BaseModel):
    """Click at (x, y) coordinates"""
    tool_name: Literal["click"]
    element: str = Field(description="Detailed description of the target UI element to click on")
    x: int = Field(description="X coordinate as integer in [0, 1000]")
    y: int = Field(description="Y coordinate as integer in [0, 1000]")
    button: MouseButton = Field(default="left", description="Mouse button to click (left, right, middle)")

class WriteArgs(BaseModel):
    """Type text into the currently focused element without clicking first"""
    tool_name: Literal["write"]
    content: str = Field(description="Content to write")
    press_enter: bool = Field(default=False, description="Whether to press Enter after typing")
    overwrite: bool = Field(default=False, description="Whether to clear existing text before typing")

class AnswerArgs(BaseModel):
    """Provide a final answer"""
    tool_name: Literal["answer"]
    content: str = Field(description="The answer content")

class Step(BaseModel):
    note: str | None = Field(
        default=None,
        description="Task-relevant information extracted from the previous observation. Keep empty if no new info.",
    )
    thought: str = Field(description="Reasoning about next steps")
    tool_call: ClickArgs | WriteArgs | AnswerArgs

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    extra_body={"structured_outputs": {"json": Step.model_json_schema()}},
    temperature=0.8,
)
step = Step.model_validate_json(resp.choices[0].message.content)

Embed the same schema inside the system prompt under an <output_format> block (shown in Putting it together). Holo3 was trained with the schema visible in both places, and dropping either copy noticeably hurts reliability.

Use extra_body={"structured_outputs": {"json": ...}}, not OpenAI native function calling (tools=[...] / tool_choice=...). Holo3 supports a flat {note, thought, tool_call} JSON output, not OpenAI-style tool_calls arrays; routing through native function calling silently degrades quality.

Coordinates in `[0, 1000]`

Send a screenshot at any size. Holo3 returns coordinates as integers in [0, 1000], normalized to that image. Scale back to pixels using its dimensions:

abs_x = int((step.tool_call.x / 1000) * screenshot.width)
abs_y = int((step.tool_call.y / 1000) * screenshot.height)

Origin is top-left. Send and scale against the same image bytes; any resize, crop, or DPI mismatch will misclick. Pick one pixel unit (CSS or device) and stay in it end to end.
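The scaling can be wrapped in a small helper. This is a sketch under two assumptions of its own: it uses integer arithmetic, and it clamps the result so that a coordinate of exactly 1000 maps to the last pixel rather than one past the edge. The name `to_pixels` is hypothetical:

```python
def to_pixels(x, y, width, height):
    """Map model coordinates in [0, 1000] to pixels of a width x height image."""
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        raise ValueError(f"coordinates out of range: ({x}, {y})")
    # Clamp so x == 1000 lands on the last column instead of one past it.
    px = min(x * width // 1000, width - 1)
    py = min(y * height // 1000, height - 1)
    return px, py


print(to_pixels(932, 880, 1920, 1080))    # (1789, 950)
print(to_pixels(1000, 1000, 1920, 1080))  # (1919, 1079)
```

Pass the dimensions of the exact image that was sent to the model, not those of the live window, so both sides agree on the pixel unit.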

Chat layout

User observations alternate with assistant JSON; tool results come back as user messages:

Role	Body
`system`	your prompt, then the appended `<output_format>` schema block
`user`	`<observation>` + screenshot and/or text + `</observation>`
`assistant`	the JSON object: `{note, thought, tool_call}`
`user`	`<tool_output tool="click">` + result + `</tool_output>`
`user`	next `<observation>`
`assistant`	next JSON

Wrap tool results as user messages with <tool_output tool="...">, not as OpenAI tool-role messages.
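The two user-message shapes in the table can be produced by a pair of small builders. This is a sketch with hypothetical helper names (`observation_message`, `tool_output_message`); it assumes PNG screenshots sent as base64 data URLs:

```python
import base64


def observation_message(png_bytes=None, text=None):
    """Build a user message wrapping a screenshot and/or text in <observation>."""
    if png_bytes is None and text is None:
        raise ValueError("an observation needs a screenshot, text, or both")
    content = [{"type": "text", "text": "<observation>\n"}]
    if png_bytes is not None:
        b64 = base64.b64encode(png_bytes).decode()
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        )
    if text is not None:
        content.append({"type": "text", "text": text})
    content.append({"type": "text", "text": "\n</observation>"})
    return {"role": "user", "content": content}


def tool_output_message(tool_name, result):
    """Build a user message carrying a tool result; never the `tool` role."""
    body = f'<tool_output tool="{tool_name}">\n{result}\n</tool_output>'
    return {"role": "user", "content": body}
```

Keeping both builders in one place makes it harder for a `tool`-role message to slip into the history by accident.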

Image budget

Keep at most the last 3 screenshots in context; more degrades accuracy. Older screenshots should be replaced with a short text placeholder, while keeping the <observation> wrapper:

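def trim_to_last_n_images(messages, n=3):
    """Replace all but the newest n screenshots with a text placeholder, in place."""
    seen = 0
    # Walk newest-to-oldest so the most recent screenshots are the ones kept.
    for msg in reversed(messages):
        if msg["role"] != "user" or not isinstance(msg["content"], list):
            continue
        # Reverse inside the message too, in case one message holds several images.
        for chunk in reversed(msg["content"]):
            if chunk.get("type") != "image_url":
                continue
            seen += 1
            if seen > n:
                # Swap the image for text; the <observation> text chunks stay put.
                chunk["type"] = "text"
                chunk["text"] = "[screenshot evicted]"
                chunk.pop("image_url", None)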
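If the full history should stay intact for logging or replay, one alternative is to trim a copy just before each request. The sketch below is a non-mutating variant under that assumption; `with_image_budget` and the fake observations are illustrative, not part of the harness:

```python
import copy

PLACEHOLDER = "[screenshot evicted]"


def with_image_budget(messages, n=3):
    """Return a deep copy of messages in which only the newest n images survive."""
    out = copy.deepcopy(messages)
    seen = 0
    for msg in reversed(out):
        if msg["role"] != "user" or not isinstance(msg["content"], list):
            continue
        for i in range(len(msg["content"]) - 1, -1, -1):
            if msg["content"][i].get("type") != "image_url":
                continue
            seen += 1
            if seen > n:
                # Only the image chunk is replaced; the wrapper chunks remain.
                msg["content"][i] = {"type": "text", "text": PLACEHOLDER}
    return out


def fake_observation(i):
    """Illustrative observation with a dummy data URL."""
    return {"role": "user", "content": [
        {"type": "text", "text": "<observation>\n"},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,IMG{i}"}},
        {"type": "text", "text": "\n</observation>"},
    ]}


history = [{"role": "system", "content": "prompt"}] + [fake_observation(i) for i in range(5)]
trimmed = with_image_budget(history, n=3)
```

With five observations and `n=3`, the two oldest screenshots become placeholders in `trimmed` while `history` keeps all five images.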

Putting it together

A complete loop. Plug in your own screenshot() (browser, OS, emulator) and execute(...) dispatcher.

import base64
import json

schema = Step.model_json_schema()
# The schema reaches the model twice: here in the prompt and again via structured_outputs.
system = render_prompt(tools=...) + f"\n\n<output_format>\n```json\n{json.dumps(schema)}\n```\n</output_format>"

def run_agent():
    """Run the loop until the model answers or MAX_STEPS is exhausted."""
    messages = [{"role": "system", "content": system}]

    for _ in range(MAX_STEPS):
        # Observe: wrap the fresh screenshot in <observation> tags.
        image_bytes = screenshot()
        b64 = base64.b64encode(image_bytes).decode()
        messages.append({"role": "user", "content": [
            {"type": "text", "text": "<observation>\n"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": "\n</observation>"},
        ]})
        trim_to_last_n_images(messages, n=3)

        resp = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            temperature=0.8,
            extra_body={"structured_outputs": {"json": schema}},
        )
        # Replay only the parsed JSON, never the raw output or the reasoning trace.
        step = Step.model_validate_json(resp.choices[0].message.content)
        messages.append({"role": "assistant", "content": step.model_dump_json()})

        if step.tool_call.tool_name == "answer":
            return step.tool_call.content

        # Act, then report the result as a user message in a <tool_output> wrapper.
        result = execute(step.tool_call)
        messages.append({
            "role": "user",
            "content": f'<tool_output tool="{step.tool_call.tool_name}">\n{result}\n</tool_output>',
        })

    return None  # step budget exhausted without an answer
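The loop leaves `execute(...)` to you. One possible shape is sketched below with a hypothetical `RecordingBackend` that only records calls in place of a real browser or OS driver. Unlike the loop above, this version takes the tool call as a plain dict (for example `step.tool_call.model_dump()`) plus an explicit backend argument; both are assumptions of the sketch:

```python
class RecordingBackend:
    """Hypothetical stand-in for a browser/OS driver; it only records calls."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.calls = []

    def click(self, px, py, button):
        self.calls.append(("click", px, py, button))

    def type_text(self, text, press_enter, overwrite):
        self.calls.append(("write", text, press_enter, overwrite))


def execute(tool_call, backend):
    """Dispatch one flat tool call on tool_name and return a short result string."""
    name = tool_call["tool_name"]
    if name == "click":
        # Scale [0, 1000] coordinates to the backend's pixel grid, clamped to the edge.
        px = min(tool_call["x"] * backend.width // 1000, backend.width - 1)
        py = min(tool_call["y"] * backend.height // 1000, backend.height - 1)
        backend.click(px, py, tool_call.get("button", "left"))
        return f"clicked at ({px}, {py})"
    if name == "write":
        backend.type_text(
            tool_call["content"],
            tool_call.get("press_enter", False),
            tool_call.get("overwrite", False),
        )
        return "text written"
    raise ValueError(f"unknown tool: {name}")


backend = RecordingBackend(1920, 1080)
result = execute({"tool_name": "click", "element": "Continue button", "x": 932, "y": 880}, backend)
```

The returned string is what goes inside the `<tool_output>` wrapper on the next user message.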

Common pitfalls

Symptom	Likely cause
Clicks land far from the target	Coordinates not scaled to screenshot dimensions, or the screenshot was resized between send and execute
Model loops, forgets earlier facts	`note` is empty, or older `<observation>` wrappers were dropped instead of stripped to text
Model returns free-form text	`extra_body.structured_outputs.json` is missing, or the schema lacks `Literal[tool_name]` discrimination
Context window fills up	Image budget not enforced
Tool results have no effect	Sent as `tool` role instead of `user` with `<tool_output>` wrapper
Quality collapses after one bad step	Raw model output replayed in history instead of the parsed JSON
Reasoning leaks into the parsed JSON	`<think>...</think>` written inline in `content` instead of read from `reasoning_content`
Output lands in `message.tool_calls` instead of `content`	Used OpenAI native function calling (`tools=[...]`) instead of `extra_body.structured_outputs.json`
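Several of these causes can be caught mechanically before a request is sent. The check below is a rough sketch, not an official validator: `check_history` is a hypothetical helper that looks for three of the pitfalls above (a `tool` role, inline `<think>` text in assistant content, and too many screenshots):

```python
def check_history(messages, max_images=3):
    """Return a list of human-readable problems found in a conversation."""
    problems = []
    images = 0
    for i, msg in enumerate(messages):
        if msg["role"] == "tool":
            problems.append(f"message {i}: tool role used; send a user <tool_output> instead")
        content = msg["content"]
        if isinstance(content, list):
            images += sum(1 for c in content if c.get("type") == "image_url")
        elif msg["role"] == "assistant" and "<think>" in content:
            problems.append(f"message {i}: reasoning replayed in assistant content")
    if images > max_images:
        problems.append(f"{images} screenshots in context; budget is {max_images}")
    return problems


bad = [
    {"role": "system", "content": "prompt"},
    {"role": "assistant", "content": '<think>plan</think>{"note": null}'},
    {"role": "tool", "content": "ok"},
]
issues = check_history(bad)
```

An empty list means none of these three problems was detected; it does not prove the rest of the contract is met.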
