This guide shows you how to deploy the Holo2 model with vLLM on NVIDIA GPUs.

Run a vLLM server locally

First, install vLLM by following the instructions provided by the vLLM project. On a typical Linux setup with CUDA, this comes down to a single pip command (check the vLLM docs for platform-specific options):
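pip install vllm

You can then launch the vLLM server from the command line, for example: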
vllm serve Hcompany/Holo2-4B \
    --dtype bfloat16 \
    --max-model-len=65536 \
    --reasoning-parser=deepseek_r1 \
    --limit-mm-per-prompt='{"image": 3, "video": 0}'
Good to know: to disable thinking mode, remove the --reasoning-parser argument.
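
Once the server has started, you can sanity-check the OpenAI-compatible endpoint by listing the served models:

curl http://localhost:8000/v1/models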

Deploy via Docker

First, make sure you’ve met the prerequisites: Docker is installed, and the NVIDIA Container Toolkit is set up so containers can access your GPUs. Next, run the Holo2 model, for example:
docker run -it --gpus=all --rm -p 8000:8000 vllm/vllm-openai:v0.11.0 \
    --model Hcompany/Holo2-4B \
    --dtype bfloat16 \
    --max-model-len=65536 \
    --reasoning-parser=deepseek_r1 \
    --limit-mm-per-prompt='{"image": 3, "video": 0}'
Good to know
  • To disable thinking mode, remove the --reasoning-parser argument.
  • To run Holo2 8B, change --model to Hcompany/Holo2-8B.
  • To run Holo2 30B A3B, change --model to Hcompany/Holo2-30B-A3B and add --tensor-parallel-size 2 (see the example below).
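
For example, a launch command for Holo2 30B A3B on two GPUs could look like this (a sketch assembled from the flags above):

docker run -it --gpus=all --rm -p 8000:8000 vllm/vllm-openai:v0.11.0 \
    --model Hcompany/Holo2-30B-A3B \
    --dtype bfloat16 \
    --max-model-len=65536 \
    --reasoning-parser=deepseek_r1 \
    --tensor-parallel-size 2 \
    --limit-mm-per-prompt='{"image": 3, "video": 0}'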

Holo2 reasoning parser compatibility

Holo2 models are reasoning models. To extract the reasoning content from a request, set the --reasoning-parser flag accordingly in vLLM (Docker or vllm serve). The Holo2 chat template is configurable to enable or disable thinking; by default, Holo2 runs in thinking mode. To configure thinking at the request level, pass: {"chat_template_kwargs": {"thinking": false}}
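
When a reasoning parser is set, vLLM returns the extracted reasoning in a separate reasoning_content field on the response message, with the final answer in content. A quick way to inspect both fields, assuming jq is installed (the prompt is just a placeholder):

curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo2-4B",
        "messages": [{"role": "user", "content": "Who won the world series in 2020?"}]
    }' | jq '{reasoning: .choices[0].message.reasoning_content, answer: .choices[0].message.content}'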

Invoking Holo2 via API

When vLLM is running, you can send requests to:
http://localhost:8000/v1/chat/completions

Test with curl: thinking mode

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo2-4B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

Test with curl: thinking mode disabled

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo2-4B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ],
        "chat_template_kwargs": {
            "thinking": false
        }
    }'

Test with Python (OpenAI SDK)

First, install the OpenAI client.
pip install openai
Next, run the Python script, for example:
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"
API_KEY = "EMPTY"
MODEL = "HCompany/Holo2-4B"

client = OpenAI(
    base_url=BASE_URL,
    api_key=API_KEY
)

# Thinking mode (default when the server runs with --reasoning-parser=deepseek_r1)
chat_completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ]
)

print(chat_completion.choices[0].message.content)

# Thinking disabled at the request level via chat_template_kwargs
chat_completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    extra_body={"chat_template_kwargs": {"thinking": False }}
)
Good to know
  • The API key is not used by vLLM but is required by the OpenAI SDK; use "EMPTY" as a placeholder.
  • --model can be set to Hcompany/Holo2-4B, Hcompany/Holo2-8B, or Hcompany/Holo2-30B-A3B.
  • --gpus=all enables all NVIDIA GPUs for the container.
  • Holo2 is a multimodal model, so you can adjust image limits using --limit-mm-per-prompt; see the image request example after this list.
  • Reduce --max-model-len or --gpu-memory-utilization if your GPU runs out of memory.
  • Ensure your GPU supports bfloat16 (e.g., H100, A100, L40S, RTX 4090); use float16 otherwise.
  • Port 8000 must be free; change the host side with -p <host_port>:8000 if needed.
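
Because Holo2 accepts images, you can pass them using the standard OpenAI-compatible multimodal message format that vLLM supports. A sketch, with a placeholder image URL:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo2-4B",
        "messages": [
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
                {"type": "text", "text": "Describe what is on this screen."}
            ]}
        ]
    }'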