Create chat completion

curl --request POST \
  --url https://api.hcompany.ai/v1/chat/completions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "<string>",
  "messages": [
    {}
  ],
  "structured_outputs": {},
  "chat_template_kwargs": {},
  "reasoning_effort": "<string>",
  "tools": [
    {}
  ],
  "tool_choice": "<string>",
  "stream": true,
  "max_tokens": 123,
  "temperature": 123
}
'

{
  "choices[].message.content": "<string>",
  "choices[].message.reasoning": "<string>",
  "choices[].message.tool_calls": [
    {}
  ],
  "choices[].finish_reason": "<string>",
  "usage": {}
}

POST /v1/chat/completions

The single inference endpoint. It is OpenAI-compatible: the official OpenAI clients work as-is with base_url pointed at https://api.hcompany.ai/v1/. Holo-specific behavior (structured outputs, the reasoning toggle) is controlled by extra body fields documented below. Returns a chat completion object, or a stream of chunk objects when stream is true.

Body parameters

model

string

required

Model ID to run. One of the IDs listed on the Models page, e.g. holo3-1-35b-a3b.

messages

array

required

The conversation so far. Standard OpenAI message objects (role, content); content can be a string or an array of text and image_url parts. Images accept HTTPS URLs or base64 data URIs (JPEG, PNG, WebP), up to 5 per request.
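As a minimal sketch of the multi-part form, the helpers below build one user message with a text part followed by image parts encoded as base64 data URIs. The part shapes are the standard OpenAI ones. The helper names (`image_part`, `user_message`) and the sample bytes are illustrative assumptions, not part of the API.

```python
import base64

MAX_IMAGES_PER_REQUEST = 5  # limit stated in the messages description


def image_part(data: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes as an image_url part carrying a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def user_message(text: str, images: list) -> dict:
    """Build one user message: a text part followed by the image parts."""
    if len(images) > MAX_IMAGES_PER_REQUEST:
        raise ValueError(f"at most {MAX_IMAGES_PER_REQUEST} images per request")
    parts = [{"type": "text", "text": text}]
    parts += [image_part(img) for img in images]
    return {"role": "user", "content": parts}


# Placeholder bytes stand in for a real screenshot.
msg = user_message("Where is the Submit button?", [b"\x89PNG placeholder bytes"])
```

The check here covers a single message only. Because the limit applies per request, a conversation with several image-bearing messages would need the count summed across all of them.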

structured_outputs

object

Holo-specific. Constrain the response, at the decoding level, to a JSON object matching a schema: pass {"json": <JSON Schema>}. The object is returned in message.content. Use this for the structured-output agent loop and element localization.

With the OpenAI SDKs, pass this (and chat_template_kwargs) via extra_body in Python or an untyped spread in TypeScript; the SDK merges them into the request body. On the raw wire they are top-level fields, as in the cURL example. The API silently ignores a body nested under a literal "extra_body" key.
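The difference between the SDK form and the wire form can be sketched without making a request. In the example below, the schema (`CLICK_SCHEMA`) and the prompt are hypothetical. `sdk_kwargs` is what would be passed to `client.chat.completions.create(**sdk_kwargs)`, and `wire_body` shows the top-level layout the request body is expected to have once the SDK has merged `extra_body`.

```python
import json

# Hypothetical schema for illustration: a click action with pixel coordinates.
CLICK_SCHEMA = {
    "type": "object",
    "properties": {
        "x": {"type": "integer"},
        "y": {"type": "integer"},
    },
    "required": ["x", "y"],
}

messages = [{"role": "user", "content": "Locate the Submit button."}]

# Keyword arguments for the Python SDK: Holo-specific fields go in extra_body.
sdk_kwargs = {
    "model": "holo3-1-35b-a3b",
    "messages": messages,
    "temperature": 0.0,
    "extra_body": {
        "structured_outputs": {"json": CLICK_SCHEMA},
        "chat_template_kwargs": {"enable_thinking": False},
    },
}

# Expected wire layout: extra_body's fields sit at the top level and no
# literal "extra_body" key remains.
extra = sdk_kwargs["extra_body"]
wire_body = {**{k: v for k, v in sdk_kwargs.items() if k != "extra_body"}, **extra}

print(json.dumps(wire_body, indent=2))
```

Since the constrained object comes back as a string in message.content, `json.loads` on that string should recover it as a dict.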

chat_template_kwargs

object

Holo-specific. {"enable_thinking": bool} toggles the reasoning channel. Use true for agent loops (Holo plans before acting), false for single-shot calls like grounding and OCR.

reasoning_effort

string

How much the model plans before acting: "low", "medium", or "high". "medium" is a sensible default for agent loops.

tools

array

OpenAI-style function declarations for native function calling. Supported by holo3-1-35b-a3b only. Set tool_choice: "required" so the model acts on every step, and do not mix with structured_outputs.
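A function declaration and the matching request arguments might look like the sketch below. The `click` tool, its parameters, and the helper `function_calling_kwargs` are hypothetical. Only the declaration shape (standard OpenAI), the model ID, and the `tool_choice` value come from this page.

```python
# Hypothetical tool for illustration; the name and parameters are not part of the API.
CLICK_TOOL = {
    "type": "function",
    "function": {
        "name": "click",
        "description": "Click at a pixel coordinate on the screen.",
        "parameters": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            "required": ["x", "y"],
        },
    },
}


def function_calling_kwargs(messages: list, tools: list) -> dict:
    """Keyword arguments for client.chat.completions.create in function-calling mode."""
    return {
        "model": "holo3-1-35b-a3b",  # the model this page lists as supporting tools
        "messages": messages,
        "tools": tools,
        "tool_choice": "required",  # the model must call a tool on every step
        # structured_outputs is deliberately absent: the two modes are not mixed.
    }


kwargs = function_calling_kwargs(
    [{"role": "user", "content": "Press the Submit button."}], [CLICK_TOOL]
)
```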

tool_choice

string

Standard OpenAI semantics. Use "required" in function-calling agent loops.

stream

boolean

default: false

Stream the response as server-sent chunk events. Reasoning tokens arrive in delta.reasoning, content in delta.content.

max_tokens

integer

Output cap for this request. The hard per-model ceilings differ: 4,096 for holo3-1-35b-a3b, 32,768 for holo3-122b-a10b (see Models).
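This page does not say how the API reacts to a max_tokens value above the model ceiling, so one cautious option is to clamp on the client. The sketch below uses the two ceilings quoted above. The table and the helper `clamp_max_tokens` are illustrative and would need updating from the Models page.

```python
# Ceilings as quoted in the max_tokens description; check the Models page for current values.
MODEL_OUTPUT_CEILING = {
    "holo3-1-35b-a3b": 4096,
    "holo3-122b-a10b": 32768,
}


def clamp_max_tokens(model: str, requested: int) -> int:
    """Return `requested`, lowered to the model's ceiling when it exceeds it."""
    if requested < 1:
        raise ValueError("max_tokens must be a positive integer")
    ceiling = MODEL_OUTPUT_CEILING.get(model)
    # Unknown model: pass the value through and let the API decide.
    return requested if ceiling is None else min(requested, ceiling)
```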

temperature

number

Sampling temperature. Use 0.0 for deterministic single-shot calls (localization, OCR); 0.8 works well in agent loops. Also supported: top_p, top_k, stop, frequency_penalty, presence_penalty, seed.
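The recommendations for temperature, reasoning_effort, and enable_thinking pair up naturally into two presets. The following is a sketch only: the mode names and the helper `preset_kwargs` are assumptions, while the values are the ones suggested in the parameter descriptions.

```python
import copy

# Values taken from the parameter descriptions; the mode names are illustrative.
PRESETS = {
    "single_shot": {
        "temperature": 0.0,
        "extra_body": {"chat_template_kwargs": {"enable_thinking": False}},
    },
    "agent_loop": {
        "temperature": 0.8,
        "reasoning_effort": "medium",
        "extra_body": {"chat_template_kwargs": {"enable_thinking": True}},
    },
}


def preset_kwargs(mode: str) -> dict:
    """Return a fresh copy of the preset so callers can modify it safely."""
    if mode not in PRESETS:
        raise ValueError(f"unknown mode: {mode!r}")
    return copy.deepcopy(PRESETS[mode])


grounding = preset_kwargs("single_shot")
agent = preset_kwargs("agent_loop")
```

Either dict can be spread into `client.chat.completions.create(model=..., messages=..., **grounding)`.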

Response

choices[].message.content

string

The action or answer: the constrained JSON object (structured-output mode) or the assistant text. null when the model responded with tool_calls only.

choices[].message.reasoning

string

The thinking trace, present when thinking is enabled. Read it for visibility; do not feed it back into the conversation. The chat template drops it between turns, so anything the model must remember has to flow through content. See the Agent loop for carrying state forward.
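One way to keep the trace out of the history is to copy only the fields that belong in the next turn. The sketch below works on a plain dict shaped like choices[0].message; the sample reply and the helper `assistant_turn` are hypothetical. With the OpenAI Python SDK, `message.model_dump()` should produce a dict of this kind.

```python
def assistant_turn(message: dict) -> dict:
    """Keep only what belongs in the history; the reasoning trace is left out."""
    turn = {"role": "assistant", "content": message.get("content")}
    if message.get("tool_calls"):
        turn["tool_calls"] = message["tool_calls"]
    return turn


# Hypothetical response message, shaped like choices[0].message as a dict.
reply = {
    "role": "assistant",
    "content": '{"action": "click", "x": 120, "y": 48}',
    "reasoning": "The Submit button is in the top-left toolbar...",
}

history = [{"role": "user", "content": "Press the Submit button."}]
history.append(assistant_turn(reply))
```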

choices[].message.tool_calls

array

Present in native function-calling mode only. Each call carries an id and a function object with name and a JSON-encoded arguments string.
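Because arguments is a JSON-encoded string rather than an object, it has to be parsed before use. A minimal sketch, assuming each call has been converted to a plain dict; the sample call and the helper `decode_tool_calls` are hypothetical:

```python
import json


def decode_tool_calls(tool_calls) -> list:
    """Turn raw tool_calls into (id, name, arguments-as-dict) tuples."""
    decoded = []
    for call in tool_calls or []:
        fn = call["function"]
        # arguments is a JSON-encoded string, not an object, so parse it.
        decoded.append((call["id"], fn["name"], json.loads(fn["arguments"])))
    return decoded


# Hypothetical tool call, shaped as described above.
sample = [
    {
        "id": "call_1",
        "type": "function",
        "function": {"name": "click", "arguments": '{"x": 120, "y": 48}'},
    }
]
print(decode_tool_calls(sample))  # [('call_1', 'click', {'x': 120, 'y': 48})]
```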

choices[].finish_reason

string

One of stop, length (the output hit max_tokens or the model ceiling), or tool_calls.
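A caller will usually branch on this value. The sketch below maps each documented value to a next step. The returned labels and the helper `next_step` are illustrative, not part of the API.

```python
def next_step(finish_reason: str) -> str:
    """Map a finish_reason to what the caller should do next (labels are illustrative)."""
    if finish_reason == "stop":
        return "use_content"
    if finish_reason == "tool_calls":
        return "run_tools"
    if finish_reason == "length":
        # Output was cut off, so content may be incomplete.
        return "retry_or_raise_max_tokens"
    return "unexpected"
```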

usage

object

Token counts for the request: prompt_tokens, completion_tokens, and total_tokens.

Examples

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.hcompany.ai/v1/",
    api_key=os.environ["HAI_API_KEY"],
)

resp = client.chat.completions.create(
    model="holo3-1-35b-a3b",
    messages=[{"role": "user", "content": "In one sentence, what is a computer-use agent?"}],
    reasoning_effort="medium",
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

print(resp.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="holo3-1-35b-a3b",
    messages=[{"role": "user", "content": "In one sentence, what is a computer-use agent?"}],
    stream=True,
)

for chunk in stream:
    # Defensive: skip any chunk that carries no choices.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
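When thinking is enabled, reasoning and content arrive in separate delta fields, so it can help to accumulate them separately. The sketch below does this over any iterable of chunks. `collect_stream` is a hypothetical helper, and `fake_chunk` is only a stand-in for SDK chunk objects so the logic can be exercised offline.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> tuple:
    """Accumulate streamed deltas into (reasoning, content) strings."""
    reasoning, content = [], []
    for chunk in chunks:
        if not chunk.choices:  # defensive: skip chunks with no choices
            continue
        delta = chunk.choices[0].delta
        # getattr guards against deltas that omit one of the two fields.
        if getattr(delta, "reasoning", None):
            reasoning.append(delta.reasoning)
        if getattr(delta, "content", None):
            content.append(delta.content)
    return "".join(reasoning), "".join(content)


def fake_chunk(**delta):
    """Stand-in for an SDK chunk object, for offline testing only."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**delta))])


chunks = [
    fake_chunk(reasoning="Plan: "),
    fake_chunk(reasoning="answer briefly."),
    fake_chunk(content="An agent "),
    fake_chunk(content="that operates a GUI."),
]
print(collect_stream(chunks))
```

In real use, the `stream` object returned by `client.chat.completions.create(..., stream=True)` would be passed in place of `chunks`.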
