The instructions below show you how to serve our models with vLLM, either locally or via a Docker container.

Run a vLLM server locally

This section shows you how to install vLLM and run the server locally.

Pre-requisites

  • An NVIDIA GPU with drivers installed

Installation

  1. Install vLLM using the instructions provided by vLLM
  2. Install a compatible version of transformers:
pip install "transformers<4.53.0"
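A quick way to confirm an installed version satisfies the pin above is a naive numeric comparison (a stdlib-only sketch; `is_compatible` is a hypothetical helper, not part of vLLM or transformers):

```python
def is_compatible(v: str) -> bool:
    """Naively check that a transformers version is older than 4.53.0,
    matching the pin 'transformers<4.53.0'. Ignores pre-release suffixes."""
    return tuple(int(p) for p in v.split(".")[:3]) < (4, 53, 0)

print(is_compatible("4.52.4"))  # True
print(is_compatible("4.53.0"))  # False
```

For your own environment, you could pass in `importlib.metadata.version("transformers")`.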

Example

The snippet below shows you how to run Holo1 3B:
vllm serve Hcompany/Holo1-3B \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --limit-mm-per-prompt 'image=3,video=0' \
    --mm-processor-kwargs '{"max_pixels": 1003520}' \
    --max-model-len 16384
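The `max_pixels` value caps each image's resolution before it reaches the model. Assuming a Qwen2.5-VL-style vision encoder (which, to our understanding, Holo1 builds on), each visual token covers a 28×28 pixel area, so this budget corresponds to roughly 1280 visual tokens per image. A sketch of the arithmetic:

```python
# Relating --mm-processor-kwargs '{"max_pixels": 1003520}' to visual tokens.
# Assumes a Qwen2.5-VL-style encoder where one visual token covers a
# 28x28 pixel area (an assumption about Holo1's vision backbone).
MAX_PIXELS = 1_003_520
PIXELS_PER_TOKEN = 28 * 28  # 784 pixels per visual token

max_visual_tokens = MAX_PIXELS // PIXELS_PER_TOKEN
print(max_visual_tokens)  # 1280
```

Lowering `max_pixels` reduces per-image token usage, which can help fit more images into the `--max-model-len 16384` context window.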

Deploy via Docker

This section shows you how to deploy our models via Docker container.

Pre-requisites

  • An NVIDIA GPU with drivers installed
  • Docker with the NVIDIA Container Toolkit (required for --gpus=all)
Example

The command below shows you how to run Holo1 3B:
docker run -it --gpus=all --rm -p 8000:8000 vllm/vllm-openai:v0.9.1 \
    --model Hcompany/Holo1-3B \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --limit-mm-per-prompt 'image=3,video=0' \
    --mm-processor-kwargs '{"max_pixels": 1003520}' \
    --max-model-len 16384
Keep in mind: To run Holo1 7B, simply change --model to Hcompany/Holo1-7B.

Invoking Holo1 via API

When the vLLM server is running, you can send requests to:
http://localhost:8000/v1/chat/completions

Test with curl

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo1-3B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
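The server replies with an OpenAI-style JSON body. A minimal sketch of extracting the assistant's reply from a captured response (the payload below is illustrative, not real model output):

```python
import json

# Illustrative /v1/chat/completions response shape (example values only).
raw = '''{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "Hcompany/Holo1-3B",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "The Los Angeles Dodgers."},
     "finish_reason": "stop"}
  ]
}'''

response = json.loads(raw)
# The reply text lives at choices[0].message.content.
answer = response["choices"][0]["message"]["content"]
print(answer)
```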

Test with Python (OpenAI SDK)

  1. Install the OpenAI client:
pip install openai
  2. Example Python script:
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"
API_KEY = "EMPTY"
MODEL = "Hcompany/Holo1-3B"

client = OpenAI(
    base_url=BASE_URL,
    api_key=API_KEY
)

chat_completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ]
)

print(chat_completion.choices[0].message.content)
Keep in mind: The API key is not used by vLLM but is required by the OpenAI SDK; use "EMPTY" as a placeholder.
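Since Holo1 is multimodal, requests can also carry images. vLLM's OpenAI-compatible endpoint accepts the standard image_url content parts, including inline base64 data URIs. A sketch of building such a message (`image_message` is a hypothetical helper; the placeholder bytes stand in for a real PNG screenshot):

```python
import base64

def image_message(image_bytes: bytes, prompt: str) -> dict:
    """Build an OpenAI-style user message carrying one inline PNG image.
    Hypothetical helper; the content-part format follows the OpenAI
    chat-completions schema that vLLM's server accepts."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": prompt},
        ],
    }

# Placeholder bytes; in practice, read a real PNG file.
msg = image_message(b"\x89PNG...", "Describe this screenshot.")
print(msg["content"][0]["image_url"]["url"][:22])  # data:image/png;base64,
```

The resulting dict can be passed directly in the `messages` list of `client.chat.completions.create(...)`. Note the server above is started with `--limit-mm-per-prompt 'image=3,video=0'`, so at most three images fit in one prompt.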

Notes

  • --model can be set to Hcompany/Holo1-3B or Hcompany/Holo1-7B.
  • --gpus=all enables all NVIDIA GPUs for the container.
  • Holo1 is a multimodal model, so you can adjust image/video limits using --limit-mm-per-prompt.
  • Reduce --max-model-len or --gpu-memory-utilization if your GPU runs out of memory.
  • Ensure your GPU supports bfloat16 (e.g., A100, L40S, RTX 4090, etc.), use float16 otherwise.
  • Port 8000 must be free; change it with -p <host>:8000 if needed.

Examples

Once the endpoint is in service, you can use the OpenAI client to perform real-time inference on the deployed Holo1 model.