# Agent loop

Holo3 is trained to act as a multi-step agent inside a specific harness, and a few of that harness's conventions have to carry over for the model to behave well in yours: an output JSON shape, a chat layout for screenshots and tool results, an image budget, and a coordinate convention. Skip any one of them and quality suffers. The same contract applies to both models.

Set up the OpenAI client first by following the [Quickstart](/quickstart).

## Reasoning

Holo3 returns two streams on every call: a thinking trace in `reasoning_content` and the structured JSON in `content`. Reasoning is essential in agent mode (Holo3 was trained to plan in `reasoning_content` before each tool call), so leave it on; `reasoning_effort: "medium"` is a sensible default. If you set the chat-template switch explicitly, keep it enabled:

```python theme={null}
extra_body={"chat_template_kwargs": {"enable_thinking": True}}
```

Past reasoning is dropped between turns by the [Qwen 3.5 chat template](https://huggingface.co/Qwen/Qwen3.5-35B-A3B/blob/main/chat_template.jinja) Holo3 inherits, so anything the model needs to remember has to flow through `content` (that's what `note` and `thought` are for, below). When re-adding the assistant message to the conversation, push only the parsed JSON; don't splice the reasoning back in.
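
The history rule above can be sketched as a small helper. This is a minimal illustration, assuming the response message carries `content` and `reasoning_content` attributes as described in this section; `to_history` is a hypothetical name, not part of any SDK, and the `SimpleNamespace` object only stands in for `resp.choices[0].message`.

```python
import json
from types import SimpleNamespace


def to_history(message) -> dict:
    """Build the assistant turn to append to `messages`.

    Only `content` (the structured JSON) is kept; `reasoning_content`
    is deliberately ignored so the trace never re-enters the context.
    """
    parsed = json.loads(message.content)  # raises if content is not valid JSON
    return {"role": "assistant", "content": json.dumps(parsed)}


# Stand-in for resp.choices[0].message, for illustration only.
fake = SimpleNamespace(
    reasoning_content="The form was submitted, so I should click Continue.",
    content=(
        '{"note": null, "thought": "Click Continue.", '
        '"tool_call": {"tool_name": "click", "element": "Continue button", '
        '"x": 932, "y": 880}}'
    ),
)
turn = to_history(fake)
```

Parsing before re-serializing is a design choice in this sketch: it guarantees that what goes back into the history is the JSON object alone.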

## Output JSON

Each step, Holo3 emits a single JSON object with three fields:

```json theme={null}
{
  "note": "Submit succeeded; receipt URL is /orders/8421.",
  "thought": "Recording the receipt URL before navigating away.",
  "tool_call": {
    "tool_name": "click",
    "element": "Continue button at the bottom right",
    "x": 932,
    "y": 880
  }
}
```

`note` is the model's durable memory: anything from the current screen that future steps will need (URLs, IDs, intermediate answers). Set it to `null` when nothing new is worth recording. `thought` is a one-line plan for the next action.

`tool_call` is flat: `tool_name` is a sibling of the arguments, not nested in an `args` object.
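
Because the shape is flat, a dispatcher can pop the `tool_name` tag and pass the remaining keys straight through as keyword arguments. The sketch below assumes a plain `dict` tool call; `split_tool_call`, `click`, and `HANDLERS` are hypothetical names used only for illustration.

```python
def split_tool_call(tool_call: dict) -> tuple[str, dict]:
    """Separate the `tool_name` tag from its sibling arguments."""
    args = dict(tool_call)        # copy so the caller's dict is untouched
    name = args.pop("tool_name")  # KeyError if the tag is missing
    return name, args


# Hypothetical handler; a real agent would drive a browser or OS here.
def click(element: str, x: int, y: int, button: str = "left") -> str:
    return f"clicked {element!r} at ({x}, {y}) with {button}"


HANDLERS = {"click": click}

name, args = split_tool_call(
    {
        "tool_name": "click",
        "element": "Continue button at the bottom right",
        "x": 932,
        "y": 880,
    }
)
result = HANDLERS[name](**args)
```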

## Constrain output to a tool union

Define each tool as a Pydantic model with a `Literal[tool_name]` field, then use their union as the response schema. The server's constrained decoder ensures the model emits exactly one variant, and `tool_name` is the tag you dispatch on at execution time.

The example below ships three tools (click, write, answer) for illustration; real agents typically register a wider toolbox following the same pattern.

```python theme={null}
from typing import Literal
from pydantic import BaseModel, Field

MouseButton = Literal["left", "right", "middle"]

class ClickArgs(BaseModel):
    """Click at (x, y) coordinates"""
    tool_name: Literal["click"]
    element: str = Field(description="Detailed description of the target UI element to click on")
    x: int = Field(description="X coordinate as integer in [0, 1000]")
    y: int = Field(description="Y coordinate as integer in [0, 1000]")
    button: MouseButton = Field(default="left", description="Mouse button to click (left, right, middle)")

class WriteArgs(BaseModel):
    """Type text into the currently focused element without clicking first"""
    tool_name: Literal["write"]
    content: str = Field(description="Content to write")
    press_enter: bool = Field(default=False, description="Whether to press Enter after typing")
    overwrite: bool = Field(default=False, description="Whether to clear existing text before typing")

class AnswerArgs(BaseModel):
    """Provide a final answer"""
    tool_name: Literal["answer"]
    content: str = Field(description="The answer content")

class Step(BaseModel):
    note: str | None = Field(
        default=None,
        description="Task-relevant information extracted from the previous observation. Keep empty if no new info.",
    )
    thought: str = Field(description="Reasoning about next steps")
    tool_call: ClickArgs | WriteArgs | AnswerArgs

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    extra_body={"structured_outputs": {"json": Step.model_json_schema()}},
    temperature=0.8,
)
step = Step.model_validate_json(resp.choices[0].message.content)
```

Embed the same schema inside the system prompt under an `<output_format>` block (shown in [Putting it together](#putting-it-together)). Holo3 was trained with the schema visible in both places, and dropping either copy noticeably hurts reliability.

<Note>
  Use `extra_body={"structured_outputs": {"json": ...}}`, not OpenAI native function calling (`tools=[...]` / `tool_choice=...`). Holo3 emits a flat `{note, thought, tool_call}` JSON object in `content`, not OpenAI-style `tool_calls` arrays; routing through native function calling silently degrades quality.
</Note>

## Coordinates in `[0, 1000]`

Send a screenshot at any size. Holo3 returns coordinates as integers in `[0, 1000]`, normalized to that image. Scale back to pixels using its dimensions:

```python theme={null}
abs_x = int((step.tool_call.x / 1000) * screenshot.width)
abs_y = int((step.tool_call.y / 1000) * screenshot.height)
```

Origin is top-left. Send and scale against the same image bytes; any resize, crop, or DPI mismatch will misclick. Pick one pixel unit (CSS or device) and stay in it end to end.
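
The scaling step can be wrapped in a helper that also guards the edges. This is a sketch under two assumptions that go beyond the text above: it rejects out-of-range inputs, and it clamps the result to the last valid pixel, since `x = 1000` would otherwise map to `width`, one past the final column. `to_pixels` is a hypothetical name.

```python
def to_pixels(x: int, y: int, width: int, height: int) -> tuple[int, int]:
    """Map normalized [0, 1000] coordinates to pixels of a width x height image."""
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        raise ValueError(f"coordinates out of range: ({x}, {y})")
    # Integer arithmetic truncates toward zero like int(...), without float rounding.
    px = x * width // 1000
    py = y * height // 1000
    # Clamp so the far edge stays inside the image (a choice made by this sketch).
    return min(px, width - 1), min(py, height - 1)


center = to_pixels(932, 880, 1920, 1080)
corner = to_pixels(1000, 1000, 1920, 1080)
```

Here `center` is `(1789, 950)` and `corner` is `(1919, 1079)`; pass the width and height of the exact image that was sent to the model.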

## Chat layout

User observations alternate with assistant JSON; tool results come back as user messages:

| Role        | Body                                                          |
| :---------- | :------------------------------------------------------------ |
| `system`    | your prompt, then the appended `<output_format>` schema block |
| `user`      | `<observation>` + screenshot and/or text + `</observation>`   |
| `assistant` | the JSON object: `{note, thought, tool_call}`                 |
| `user`      | `<tool_output tool="click">` + result + `</tool_output>`      |
| `user`      | next `<observation>`                                          |
| `assistant` | next JSON                                                     |

Wrap tool results as `user` messages with `<tool_output tool="...">`, not as OpenAI `tool`-role messages.
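
The two `user` message shapes in the table can be built with small helpers. This is a sketch that mirrors the message dictionaries used in the full loop later on this page; `observation_message` and `tool_output_message` are hypothetical names, and the PNG bytes are a placeholder rather than a real image.

```python
import base64


def observation_message(png_bytes: bytes) -> dict:
    """User message wrapping one screenshot in <observation> tags."""
    b64 = base64.b64encode(png_bytes).decode()
    return {"role": "user", "content": [
        {"type": "text", "text": "<observation>\n"},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "\n</observation>"},
    ]}


def tool_output_message(tool_name: str, result: str) -> dict:
    """User message (not the `tool` role) carrying a tool result."""
    return {
        "role": "user",
        "content": f'<tool_output tool="{tool_name}">\n{result}\n</tool_output>',
    }


obs = observation_message(b"\x89PNG placeholder bytes")
out = tool_output_message("click", "ok")
```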

## Image budget

Keep at most the last 3 screenshots in context; more degrades accuracy. Replace older screenshots with a short text placeholder and keep the `<observation>` wrapper:

```python theme={null}
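def trim_to_last_n_images(messages, n=3):
    """Replace all but the newest `n` screenshots with a text placeholder, in place."""
    seen = 0
    # Walk newest-first so the most recent screenshots are the ones kept.
    for msg in reversed(messages):
        if msg["role"] != "user" or not isinstance(msg["content"], list):
            continue
        # Reverse the chunks too, so a message holding several images keeps its newest.
        for chunk in reversed(msg["content"]):
            if chunk.get("type") != "image_url":
                continue
            seen += 1
            if seen > n:
                # Swap the image for text; the <observation> text chunks stay untouched.
                chunk["type"] = "text"
                chunk["text"] = "[screenshot evicted]"
                chunk.pop("image_url", None)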
```
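
A cheap way to confirm the budget holds is to count the image chunks left before each request. The helper below is a sketch assuming the same message shape as the trimming function above; `count_images` and the sample `history` are hypothetical and only for illustration.

```python
def count_images(messages) -> int:
    """Count image chunks still present in list-style user messages."""
    return sum(
        1
        for msg in messages
        if msg["role"] == "user" and isinstance(msg["content"], list)
        for chunk in msg["content"]
        if chunk.get("type") == "image_url"
    )


# One evicted observation followed by one live screenshot.
history = [
    {"role": "system", "content": "prompt"},
    {"role": "user", "content": [
        {"type": "text", "text": "<observation>\n"},
        {"type": "text", "text": "[screenshot evicted]"},
        {"type": "text", "text": "\n</observation>"},
    ]},
    {"role": "assistant", "content": "{}"},
    {"role": "user", "content": [
        {"type": "text", "text": "<observation>\n"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
        {"type": "text", "text": "\n</observation>"},
    ]},
]
remaining = count_images(history)
```

In a loop, an `assert count_images(messages) <= 3` just before the request would catch a trimming step that was skipped.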

## Putting it together

A complete loop. Plug in your own `screenshot()` (browser, OS, emulator) and `execute(...)` dispatcher.

````python theme={null}
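import base64
import json

# Assumes `client` and `MODEL_NAME` from the Quickstart, plus `Step` and
# `trim_to_last_n_images` from the sections above. `render_prompt` is whatever
# builds your own system prompt; `screenshot`, `execute`, and `MAX_STEPS` are yours.
schema = Step.model_json_schema()
system = render_prompt(tools=...) + f"\n\n<output_format>\n```json\n{json.dumps(schema)}\n```\n</output_format>"

messages = [{"role": "system", "content": system}]
final_answer = None  # stays None if MAX_STEPS runs out before an `answer` call

for _ in range(MAX_STEPS):
    # Observe: wrap the current screenshot in <observation> tags.
    image_bytes = screenshot()
    b64 = base64.b64encode(image_bytes).decode()
    messages.append({"role": "user", "content": [
        {"type": "text", "text": "<observation>\n"},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "\n</observation>"},
    ]})
    trim_to_last_n_images(messages, n=3)

    # Decide: constrained decoding against the Step schema.
    resp = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        temperature=0.8,
        extra_body={"structured_outputs": {"json": schema}},
    )
    step = Step.model_validate_json(resp.choices[0].message.content)
    # Replay only the parsed JSON, never the reasoning trace.
    messages.append({"role": "assistant", "content": step.model_dump_json()})

    if step.tool_call.tool_name == "answer":
        final_answer = step.tool_call.content
        break

    # Act: run the tool and report the result as a user message.
    result = execute(step.tool_call)
    messages.append({
        "role": "user",
        "content": f'<tool_output tool="{step.tool_call.tool_name}">\n{result}\n</tool_output>',
    })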
````

## Common pitfalls

| Symptom                                                   | Likely cause                                                                                             |
| :-------------------------------------------------------- | :------------------------------------------------------------------------------------------------------- |
| Clicks land far from the target                           | Coordinates not scaled to screenshot dimensions, or the screenshot was resized between send and execute  |
| Model loops, forgets earlier facts                        | `note` is empty, or older `<observation>` wrappers were dropped instead of stripped to text              |
| Model returns free-form text                              | `extra_body.structured_outputs.json` is missing, or the schema lacks `Literal[tool_name]` discrimination |
| Context window fills up                                   | Image budget not enforced                                                                                |
| Tool results have no effect                               | Sent as `tool` role instead of `user` with `<tool_output>` wrapper                                       |
| Quality collapses after one bad step                      | Raw model output replayed in history instead of the parsed JSON                                          |
| Reasoning leaks into the parsed JSON                      | `<think>...</think>` written inline in `content` instead of read from `reasoning_content`                |
| Output lands in `message.tool_calls` instead of `content` | Used OpenAI native function calling (`tools=[...]`) instead of `extra_body.structured_outputs.json`      |
