Run a vLLM server locally
This section shows you how to install vLLM and run the server locally.
Pre-requisites
- An NVIDIA GPU with drivers installed
Installation
- Install vLLM using the instructions provided by the vLLM project
- Install a compatible version of transformers (an example install is sketched below)
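The snippet below is a minimal sketch of the install commands, assuming a pip-based environment; the exact transformers version required by Holo1 is an assumption, so check the model card if a specific pin is needed.

```bash
# Install vLLM (see the vLLM documentation for GPU/CUDA-specific instructions)
pip install vllm

# Install a recent transformers release; the minimum version required for Holo1
# is not pinned here, check the model card for the exact requirement
pip install --upgrade transformers
```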
Example
The snippet below shows you how to run Holo1 3B.
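This is a minimal sketch, assuming a standard vLLM install; the context length, memory fraction, and port are illustrative values to tune for your GPU.

```bash
# Serve Holo1-3B through vLLM's OpenAI-compatible server
vllm serve HCompany/Holo1-3B \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.9 \
  --port 8000
```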
Deploy via Docker
This section shows you how to deploy our models via a Docker container.
Pre-requisites
- An NVIDIA GPU with drivers installed
- NVIDIA Container Toolkit to allow Docker to access your GPU
- Docker installed and running
Example
The command below shows you how to run a Holo model.
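This is a minimal sketch using the official vllm/vllm-openai image; the image tag, Hugging Face cache mount, and flag values are assumptions to adjust for your setup.

```bash
docker run --gpus=all --rm -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model HCompany/Holo1-3B \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.9
```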
Keep in mind
To run a different Holo model, simply change --model to HCompany/Holo1-7B, for example.
Invoking Holo via API
When the vLLM server is running, you can send requests to its OpenAI-compatible API (http://localhost:8000/v1 by default).
Test with curl
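A minimal sketch of a chat completion request; the model name and prompt are placeholders, so use the model you are actually serving.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HCompany/Holo1-3B",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
```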
Test with Python (OpenAI SDK)
- Install the OpenAI client (see the command below)
- Run an example Python script (see the sketch below)
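Installing the client:

```bash
pip install openai
```

The script below is a minimal sketch, assuming the server from the previous sections is running locally and serving HCompany/Holo1-3B; the prompt is a placeholder.

```python
from openai import OpenAI

# Point the SDK at the local vLLM server; the API key is a placeholder (see note below)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="HCompany/Holo1-3B",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```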
Keep in mind
The API key is not used by vLLM, but it is required by the OpenAI SDK; use "EMPTY" as a placeholder.
Notes
- --model can be set to HCompany/Holo1-3B, HCompany/Holo1-7B, or any other applicable Holo model available at the time.
- --gpus=all enables all NVIDIA GPUs for the container.
- Our Holo models are multimodal, so you can adjust image/video limits using --limit-mm-per-prompt.
- Reduce --max-model-len or --gpu-memory-utilization if your GPU runs out of memory.
- Ensure your GPU supports bfloat16 (e.g., A100, L40S, RTX 4090); use float16 otherwise.
- Port 8000 must be free; change it with -p <host>:8000 if needed.
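As a sketch of how these notes combine, the command below serves Holo1-7B on a different host port with tighter limits; all values are illustrative, and the --limit-mm-per-prompt syntax can vary between vLLM versions, so check the documentation for yours.

```bash
docker run --gpus=all --rm -p 9000:8000 \
  vllm/vllm-openai:latest \
  --model HCompany/Holo1-7B \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --limit-mm-per-prompt image=2 \
  --dtype float16
```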