Create chat completion
Endpoints
Create chat completion
OpenAI-compatible chat completion with Holo-specific structured outputs and reasoning.
POST
Create chat completion
The single inference endpoint. It is OpenAI-compatible: the official OpenAI clients work as-is with
base_url pointed at https://api.hcompany.ai/v1/. Holo-specific behavior (structured outputs, the reasoning toggle) is controlled by extra body fields documented below.
Returns a chat completion object, or a stream of chunk objects when stream is true.
Body parameters
Model ID to run. One of the IDs listed on the Models page, e.g.
holo3-1-35b-a3b.The conversation so far. Standard OpenAI message objects (
role, content); content can be a string or an array of text and image_url parts. Images accept HTTPS URLs or base64 data URIs (JPEG, PNG, WebP), up to 5 per request.Holo-specific. Constrain the response, at the decoding level, to a JSON object matching a schema: pass
{"json": <JSON Schema>}. The object is returned in message.content. Use this for the structured-output agent loop and element localization.With the OpenAI SDKs, pass this (and
chat_template_kwargs) via extra_body in Python or an untyped spread in TypeScript; the SDK merges them into the request body. On the raw wire they are top-level fields, as in the cURL example. The API silently ignores a body nested under a literal "extra_body" key.Holo-specific.
{"enable_thinking": bool} toggles the reasoning channel. Use true for agent loops (Holo plans before acting), false for single-shot calls like grounding and OCR.How much the model plans before acting:
"low", "medium", or "high". "medium" is a sensible default for agent loops.OpenAI-style function declarations for native function calling. Supported by
holo3-1-35b-a3b only. Set tool_choice: "required" so the model acts on every step, and do not mix with structured_outputs.Standard OpenAI semantics. Use
"required" in function-calling agent loops.Stream the response as server-sent chunk events. Reasoning tokens arrive in
delta.reasoning, content in delta.content.Output cap for this request. The hard per-model ceilings differ: 4,096 for
holo3-1-35b-a3b, 32,768 for holo3-122b-a10b (see Models).Sampling temperature. Use
0.0 for deterministic single-shot calls (localization, OCR); 0.8 works well in agent loops. Also supported: top_p, top_k, stop, frequency_penalty, presence_penalty, seed.Response
The action or answer: the constrained JSON object (structured-output mode) or the assistant text.
null when the model responded with tool_calls only.The thinking trace, present when thinking is enabled. Read it for visibility; do not feed it back into the conversation. The chat template drops it between turns, so anything the model must remember has to flow through
content. See the Agent loop for carrying state forward.Present in native function-calling mode only. Each call carries an
id and a function object with name and a JSON-encoded arguments string.stop, length (hit max_tokens or the model ceiling), or tool_calls.prompt_tokens, completion_tokens, total_tokens for the request.