POST /v1/chat/completions
Create chat completion
curl --request POST \
  --url https://api.hcompany.ai/v1/chat/completions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "<string>",
  "messages": [
    {}
  ],
  "structured_outputs": {},
  "chat_template_kwargs": {},
  "reasoning_effort": "<string>",
  "tools": [
    {}
  ],
  "tool_choice": "<string>",
  "stream": true,
  "max_tokens": 123,
  "temperature": 123
}
'
{
  "choices[].message.content": "<string>",
  "choices[].message.reasoning": "<string>",
  "choices[].message.tool_calls": [
    {}
  ],
  "choices[].finish_reason": "<string>",
  "usage": {}
}
The single inference endpoint. It is OpenAI-compatible: the official OpenAI clients work as-is with base_url pointed at https://api.hcompany.ai/v1/. Holo-specific behavior (structured outputs, the reasoning toggle) is controlled by extra body fields documented below. Returns a chat completion object, or a stream of chunk objects when stream is true.

Body parameters

model
string
required
Model ID to run. One of the IDs listed on the Models page, e.g. holo3-1-35b-a3b.
messages
array
required
The conversation so far. Standard OpenAI message objects (role, content); content can be a string or an array of text and image_url parts. Images accept HTTPS URLs or base64 data URIs (JPEG, PNG, WebP), up to 5 per request.
structured_outputs
object
Holo-specific. Constrain the response, at the decoding level, to a JSON object matching a schema: pass {"json": <JSON Schema>}. The object is returned in message.content. Use this for the structured-output agent loop and element localization.
With the OpenAI SDKs, pass this (and chat_template_kwargs) via extra_body in Python or an untyped spread in TypeScript; the SDK merges them into the request body. On the raw wire they are top-level fields, as in the cURL example. The API silently ignores a body nested under a literal "extra_body" key.
chat_template_kwargs
object
Holo-specific. {"enable_thinking": bool} toggles the reasoning channel. Use true for agent loops (Holo plans before acting), false for single-shot calls like grounding and OCR.
reasoning_effort
string
How much the model plans before acting: "low", "medium", or "high". "medium" is a sensible default for agent loops.
tools
array
OpenAI-style function declarations for native function calling. Supported by holo3-1-35b-a3b only. Set tool_choice: "required" so the model acts on every step, and do not combine tools with structured_outputs in the same request.
tool_choice
string
Standard OpenAI semantics. Use "required" in function-calling agent loops.
stream
boolean
default: false
Stream the response as server-sent chunk events. Reasoning tokens arrive in delta.reasoning, content in delta.content.
max_tokens
integer
Output cap for this request. The hard per-model ceilings differ: 4,096 for holo3-1-35b-a3b, 32,768 for holo3-122b-a10b (see Models).
temperature
number
Sampling temperature. Use 0.0 for deterministic single-shot calls (localization, OCR); 0.8 works well in agent loops. Also supported: top_p, top_k, stop, frequency_penalty, presence_penalty, seed.
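
A single-shot localization call might combine several of the fields above. The sketch below is illustrative only: CLICK_SCHEMA, the helper names, and the placeholder image bytes are assumptions, not part of this reference. It builds the keyword arguments for the OpenAI Python SDK, then uses a small helper that imitates the SDK merge to show that the Holo-specific fields end up at the top level of the body rather than under an "extra_body" key.

```python
import base64
import json

# Hypothetical schema for illustration; this reference does not define one.
CLICK_SCHEMA = {
    "type": "object",
    "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
    "required": ["x", "y"],
}


def image_part(png_bytes: bytes) -> dict:
    """Wrap raw PNG bytes as an image_url content part using a base64 data URI."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}


def localization_kwargs(png_bytes: bytes, instruction: str) -> dict:
    """Keyword arguments intended for client.chat.completions.create(**kwargs)."""
    return {
        "model": "holo3-1-35b-a3b",
        "messages": [
            {
                "role": "user",
                "content": [
                    image_part(png_bytes),
                    {"type": "text", "text": instruction},
                ],
            }
        ],
        "temperature": 0.0,  # deterministic single-shot call
        "extra_body": {
            "structured_outputs": {"json": CLICK_SCHEMA},
            "chat_template_kwargs": {"enable_thinking": False},
        },
    }


def wire_body(sdk_kwargs: dict) -> dict:
    """Imitate the SDK merge: extra_body entries become top-level body fields."""
    body = {k: v for k, v in sdk_kwargs.items() if k != "extra_body"}
    body.update(sdk_kwargs.get("extra_body", {}))
    return body


kwargs = localization_kwargs(b"\x89PNG placeholder bytes", "Click the Submit button")
body = wire_body(kwargs)
print(json.dumps(sorted(body)))
```

With a real client, passing these arguments as client.chat.completions.create(**kwargs) should return the constrained object as a JSON string in message.content, which json.loads can parse.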

Response

choices[].message.content
string
The action or answer: the constrained JSON object (structured-output mode) or the assistant text. null when the model responded with tool_calls only.
choices[].message.reasoning
string
The thinking trace, present when thinking is enabled. Read it for visibility; do not feed it back into the conversation. The chat template drops it between turns, so anything the model must remember has to flow through content. See the Agent loop for carrying state forward.
choices[].message.tool_calls
array
Present in native function-calling mode only. Each call carries an id and a function object with name and a JSON-encoded arguments string.
choices[].finish_reason
string
stop, length (hit max_tokens or the model ceiling), or tool_calls.
usage
object
Token counts for the request: prompt_tokens, completion_tokens, and total_tokens.
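
The response fields above can be read defensively, since content may be null and tool_calls may be absent. The following is a minimal sketch that works on a plain dict, for instance the parsed JSON body of a raw HTTP response. The helper name interpret_choice and the sample payload (including the click tool) are hypothetical.

```python
import json


def interpret_choice(choice: dict) -> dict:
    """Normalize one entry of choices[] into text, decoded tool calls, and a truncation flag."""
    message = choice["message"]
    calls = [
        {
            "id": call["id"],
            "name": call["function"]["name"],
            # arguments arrives as a JSON-encoded string, so decode it here
            "arguments": json.loads(call["function"]["arguments"]),
        }
        for call in (message.get("tool_calls") or [])
    ]
    return {
        "text": message.get("content"),  # None when the model answered with tool calls only
        "tool_calls": calls,
        "truncated": choice.get("finish_reason") == "length",
    }


# Hypothetical payload shaped like the fields documented above.
sample_choice = {
    "message": {
        "content": None,
        "tool_calls": [
            {
                "id": "call_0",
                "function": {"name": "click", "arguments": '{"x": 120, "y": 48}'},
            }
        ],
    },
    "finish_reason": "tool_calls",
}

result = interpret_choice(sample_choice)
print(result["tool_calls"][0]["name"], result["tool_calls"][0]["arguments"])
```

A truncated flag of True means the output hit max_tokens or the model ceiling, in which case any JSON in content may be incomplete.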

Examples

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.hcompany.ai/v1/",
    api_key=os.environ["HAI_API_KEY"],
)

resp = client.chat.completions.create(
    model="holo3-1-35b-a3b",
    messages=[{"role": "user", "content": "In one sentence, what is a computer-use agent?"}],
    reasoning_effort="medium",
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

print(resp.choices[0].message.content)
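
Native function calling follows the same pattern. The sketch below only assembles the request arguments, so it runs without network access. The click tool and the helper name are hypothetical; the model ID, tool_choice value, temperature, and reasoning settings follow the recommendations in the parameter descriptions.

```python
# Hypothetical tool declaration for illustration; not defined by this reference.
CLICK_TOOL = {
    "type": "function",
    "function": {
        "name": "click",
        "description": "Click at pixel coordinates on the current screenshot.",
        "parameters": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            "required": ["x", "y"],
        },
    },
}


def function_calling_kwargs(messages: list, tools: list) -> dict:
    """Keyword arguments intended for client.chat.completions.create(**kwargs)."""
    return {
        "model": "holo3-1-35b-a3b",  # the only model documented as supporting tools
        "messages": messages,
        "tools": tools,
        "tool_choice": "required",  # the model must act on every step
        "temperature": 0.8,
        "reasoning_effort": "medium",
        # structured_outputs is deliberately left out: it should not be mixed with tools
        "extra_body": {"chat_template_kwargs": {"enable_thinking": True}},
    }


step_kwargs = function_calling_kwargs(
    [{"role": "user", "content": "Open the settings menu."}],
    [CLICK_TOOL],
)
print(sorted(step_kwargs))
```
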

Streaming

stream = client.chat.completions.create(
    model="holo3-1-35b-a3b",
    messages=[{"role": "user", "content": "In one sentence, what is a computer-use agent?"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
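
When thinking is enabled, reasoning tokens arrive separately in delta.reasoning. A minimal sketch of keeping the two channels apart follows. It assumes the SDK exposes the non-standard reasoning field as an attribute on the delta, which is why it is read with getattr; the collect_stream and fake_chunk helpers and the sample chunks are hypothetical stand-ins so the block runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> tuple[str, str]:
    """Accumulate reasoning and content deltas from an iterable of chunk objects."""
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        if not chunk.choices:  # defensive: skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            content_parts.append(content)
    return "".join(reasoning_parts), "".join(content_parts)


def fake_chunk(**delta_fields):
    """Build a stand-in object shaped like a streamed chunk (for offline use only)."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk(reasoning="The user wants a definition. "),
    fake_chunk(content="A computer-use agent "),
    fake_chunk(content="operates a GUI on the user's behalf."),
]
reasoning_text, answer_text = collect_stream(fake_stream)
print(answer_text)
```

With a real client, the stream object returned by client.chat.completions.create(..., stream=True) would be passed in place of fake_stream. The reasoning text is for visibility only and should not be fed back into the conversation.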