The instructions below show you how to serve our models with vLLM, either locally or via a Docker container.

Run a vLLM server locally

This section shows you how to install vLLM and run the server locally.

Pre-requisites

  • An NVIDIA GPU with drivers installed

Installation

  1. Install vLLM using the instructions provided by vLLM
  2. Install a compatible version of transformers:
pip install "transformers<4.53.0"
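A quick way to confirm an installed version satisfies the pin above is a naive numeric comparison (a stdlib-only sketch; `is_compatible` is a hypothetical helper, not part of vLLM or transformers):

```python
def is_compatible(v: str) -> bool:
    """Naively check that a transformers version is older than 4.53.0,
    matching the pin 'transformers<4.53.0'. Ignores pre-release suffixes."""
    return tuple(int(p) for p in v.split(".")[:3]) < (4, 53, 0)

print(is_compatible("4.52.4"))  # True
print(is_compatible("4.53.0"))  # False
```

For your own environment, you could pass in `importlib.metadata.version("transformers")`.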

Example

The snippet below shows you how to run Holo1 3B:
vllm serve Hcompany/Holo1-3B \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --limit-mm-per-prompt 'image=3,video=0' \
    --mm-processor-kwargs '{"max_pixels": 1003520}' \
    --max-model-len 16384
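The `max_pixels` value caps each image's resolution before it reaches the model. Assuming a Qwen2.5-VL-style vision encoder (which, to our understanding, Holo1 builds on), each visual token covers a 28×28 pixel area, so this budget corresponds to roughly 1280 visual tokens per image. A sketch of the arithmetic:

```python
# Relating --mm-processor-kwargs '{"max_pixels": 1003520}' to visual tokens.
# Assumes a Qwen2.5-VL-style encoder where one visual token covers a
# 28x28 pixel area (an assumption about Holo1's vision backbone).
MAX_PIXELS = 1_003_520
PIXELS_PER_TOKEN = 28 * 28  # 784 pixels per visual token

max_visual_tokens = MAX_PIXELS // PIXELS_PER_TOKEN
print(max_visual_tokens)  # 1280
```

Lowering `max_pixels` reduces per-image token usage, which can help fit more images into the `--max-model-len 16384` context window.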

Deploy via Docker

This section shows you how to deploy our models via Docker container.

Pre-requisites

  • An NVIDIA GPU with drivers installed
  • Docker with the NVIDIA Container Toolkit (required for --gpus=all)
Example

The command below shows you how to run Holo1 3B:
docker run -it --gpus=all --rm -p 8000:8000 vllm/vllm-openai:v0.9.1 \
    --model Hcompany/Holo1-3B \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --limit-mm-per-prompt 'image=3,video=0' \
    --mm-processor-kwargs '{"max_pixels": 1003520}' \
    --max-model-len 16384
Keep in mind: To run Holo1 7B, simply change --model to Hcompany/Holo1-7B.

Invoking Holo1 via API

When the vLLM server is running, you can send requests to:
http://localhost:8000/v1/chat/completions

Test with curl

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo1-3B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
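The server replies with an OpenAI-style JSON body. A minimal sketch of extracting the assistant's reply from a captured response (the payload below is illustrative, not real model output):

```python
import json

# Illustrative /v1/chat/completions response shape (example values only).
raw = '''{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "Hcompany/Holo1-3B",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "The Los Angeles Dodgers."},
     "finish_reason": "stop"}
  ]
}'''

response = json.loads(raw)
# The reply text lives at choices[0].message.content.
answer = response["choices"][0]["message"]["content"]
print(answer)
```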

Test with Python (OpenAI SDK)

  1. Install the OpenAI client:
pip install openai
  2. Example Python script:
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"
API_KEY = "EMPTY"
MODEL = "Hcompany/Holo1-3B"

client = OpenAI(
    base_url=BASE_URL,
    api_key=API_KEY
)

chat_completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ]
)

print(chat_completion.choices[0].message.content)
Keep in mind: The API key is not used by vLLM but is required by the OpenAI SDK; use "EMPTY" as a placeholder.
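Since Holo1 is multimodal, requests can also carry images. vLLM's OpenAI-compatible endpoint accepts the standard image_url content parts, including inline base64 data URIs. A sketch of building such a message (`image_message` is a hypothetical helper; the placeholder bytes stand in for a real PNG screenshot):

```python
import base64

def image_message(image_bytes: bytes, prompt: str) -> dict:
    """Build an OpenAI-style user message carrying one inline PNG image.
    Hypothetical helper; the content-part format follows the OpenAI
    chat-completions schema that vLLM's server accepts."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": prompt},
        ],
    }

# Placeholder bytes; in practice, read a real PNG file.
msg = image_message(b"\x89PNG...", "Describe this screenshot.")
print(msg["content"][0]["image_url"]["url"][:22])  # data:image/png;base64,
```

The resulting dict can be passed directly in the `messages` list of `client.chat.completions.create(...)`. Note the server above is started with `--limit-mm-per-prompt 'image=3,video=0'`, so at most three images fit in one prompt.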

Notes

  • --model can be set to Hcompany/Holo1-3B or Hcompany/Holo1-7B.
  • --gpus=all enables all NVIDIA GPUs for the container.
  • Holo1 is a multimodal model, so you can adjust image/video limits using --limit-mm-per-prompt.
  • Reduce --max-model-len or --gpu-memory-utilization if your GPU runs out of memory.
  • Ensure your GPU supports bfloat16 (e.g., A100, L40S, RTX 4090, etc.), use float16 otherwise.
  • Port 8000 must be free; change it with -p <host>:8000 if needed.

Examples

Once the endpoint is in service, you can use the OpenAI client to perform real-time inference on the deployed Holo1 model.