The instructions below show you how to serve our models locally.

Installing vLLM

Follow the installation instructions provided by vLLM: https://docs.vllm.ai/en/latest/getting_started/installation/index.html (we use version 0.8.4 with CUDA 12.8, installed via pip install vllm==0.8.4 --extra-index-url https://download.pytorch.org/whl/cu128).
Then run pip install "transformers<4.53.0" to get a compatible version of transformers.
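As a quick, optional sanity check that the installation worked, you can print the installed versions (this simply imports both packages):
python -c "import vllm, transformers; print(vllm.__version__, transformers.__version__)"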

Serving a model

You can either serve a single model or multiple models.

Serving a single model

This is how you serve Holo1-3B:
vllm serve Hcompany/Holo1-3B --port 8081
In this case, the server address and API key will default to:
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8081/v1"
You can test it out with this example (adapted from the vLLM Quickstart guide):
curl http://localhost:8081/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo1-3B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
The above request should return a response similar to the following (the id, created timestamp, and generated content will vary):
{
    "id": "chatcmpl-7ee08d9882c44adbac1872c9073d81da",
    "object": "chat.completion",
    "created": 1751378545,
    "model": "Hcompany/Holo1-3B",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "reasoning_content": null,
                "content": "The New York Yankees won the World Series in 2020.",
                "tool_calls": []
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 31,
        "total_tokens": 47,
        "completion_tokens": 16,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null
}

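Equivalently, you can query the server from Python. Below is a minimal sketch using the openai client (this assumes you have the openai package installed; the model name must match the one passed to vllm serve):
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is unused and can be any string
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8081/v1")

response = client.chat.completions.create(
    model="Hcompany/Holo1-3B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)
print(response.choices[0].message.content)
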
Serving more than one model

This is only necessary if you want different components of your agent to use different models that you also host locally. In this case you will need more than one GPU, as the memory requirements of the models are likely to exceed the memory of a single GPU. To achieve this, we limit the visible GPUs to the one each model should use, and we give each model its own port via --port so that requests are routed correctly.

Step 1: Launch the first model

Run the following command in your terminal to start the first model server:
CUDA_VISIBLE_DEVICES=0 vllm serve Hcompany/Holo1-3B --port 8081
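Before launching the second model, you can optionally confirm the first server is up by listing the models it serves (the /v1/models endpoint is part of the OpenAI-compatible API):
curl http://localhost:8081/v1/models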

Step 2: Launch the second model

Run the following command in your terminal to start the second model server:
# this launches the second model
CUDA_VISIBLE_DEVICES=1 vllm serve Hcompany/Holo1-7B --port 8082

Step 3: Query the models

With both servers running, direct each request to the port of the model you want to use:
# query the first model
curl http://localhost:8081/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo1-3B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

# query the second model
curl http://localhost:8082/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Hcompany/Holo1-7B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
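If your agent components are written in Python, here is a minimal sketch of pointing each component at its own server, again using the openai client (the variable names are only for illustration):
from openai import OpenAI

# one client per server; only the port in base_url differs
holo_3b = OpenAI(api_key="EMPTY", base_url="http://localhost:8081/v1")
holo_7b = OpenAI(api_key="EMPTY", base_url="http://localhost:8082/v1")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
]
print(holo_3b.chat.completions.create(model="Hcompany/Holo1-3B", messages=messages).choices[0].message.content)
print(holo_7b.chat.completions.create(model="Hcompany/Holo1-7B", messages=messages).choices[0].message.content)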

Modifications

You can adjust settings such as tensor and pipeline parallelism to tune the inference speed of your model(s). Please refer to the vLLM documentation for more details.
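For example, to shard one model across two GPUs with tensor parallelism (a sketch; check which flags are available in your vLLM version):
CUDA_VISIBLE_DEVICES=0,1 vllm serve Hcompany/Holo1-7B --port 8082 --tensor-parallel-size 2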