Run a vLLM server locally
This section shows you how to install vLLM and run the server locally.
Pre-requisites
- An NVIDIA GPU with drivers installed
Installation
- Install vLLM using the instructions provided by the vLLM project
- Install a compatible version of transformers (an example install is sketched below)
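The snippet below is a minimal sketch of the install commands, assuming a pip-based environment; the exact transformers version required by Holo1 is an assumption, so check the model card if a specific pin is needed.

```bash
# Install vLLM (see the vLLM documentation for GPU/CUDA-specific instructions)
pip install vllm

# Install a recent transformers release; the minimum version required for Holo1
# is not pinned here, check the model card for the exact requirement
pip install --upgrade transformers
```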
Example
The snippet below shows you how to run Holo1 3B.
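This is a minimal sketch, assuming a standard vLLM install; the context length, memory fraction, and port are illustrative values to tune for your GPU.

```bash
# Serve Holo1-3B through vLLM's OpenAI-compatible server
vllm serve HCompany/Holo1-3B \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.9 \
  --port 8000
```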
Deploy via Docker
This section shows you how to deploy our models via a Docker container.
Pre-requisites
- An NVIDIA GPU with drivers installed
- NVIDIA Container Toolkit to allow Docker to access your GPU
- Docker installed and running
Example
The command below shows you how to run a Holo model.
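This is a minimal sketch using the official vllm/vllm-openai image; the image tag, Hugging Face cache mount, and flag values are assumptions to adjust for your setup.

```bash
docker run --gpus=all --rm -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model HCompany/Holo1-3B \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.9
```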
Keep in mind
To run a different Holo model, simply change --model to HCompany/Holo1-7B, for example.
Invoking Holo via API
When the vLLM server is running, you can send requests to its OpenAI-compatible API (http://localhost:8000/v1 by default).
Test with curl
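A minimal sketch of a chat completion request; the model name and prompt are placeholders, so use the model you are actually serving.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HCompany/Holo1-3B",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
```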
Test with Python (OpenAI SDK)
- Install the OpenAI client (see the command below)
- Run an example Python script (see the sketch below)
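Installing the client:

```bash
pip install openai
```

The script below is a minimal sketch, assuming the server from the previous sections is running locally and serving HCompany/Holo1-3B; the prompt is a placeholder.

```python
from openai import OpenAI

# Point the SDK at the local vLLM server; the API key is a placeholder (see note below)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="HCompany/Holo1-3B",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```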
Keep in mind
The API key is not used by vLLM, but it is required by the OpenAI SDK; use "EMPTY" as a placeholder.
Notes
- --model can be set to HCompany/Holo1-3B, HCompany/Holo1-7B, or any other applicable Holo model available at the time.
- --gpus=all enables all NVIDIA GPUs for the container.
- Our Holo models are multimodal, so you can adjust image/video limits using --limit-mm-per-prompt.
- Reduce --max-model-len or --gpu-memory-utilization if your GPU runs out of memory.
- Ensure your GPU supports bfloat16 (e.g., A100, L40S, RTX 4090); use float16 otherwise.
- Port 8000 must be free; change it with -p <host>:8000 if needed.
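As a sketch of how these notes combine, the command below serves Holo1-7B on a different host port with tighter limits; all values are illustrative, and the --limit-mm-per-prompt syntax can vary between vLLM versions, so check the documentation for yours.

```bash
docker run --gpus=all --rm -p 9000:8000 \
  vllm/vllm-openai:latest \
  --model HCompany/Holo1-7B \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --limit-mm-per-prompt image=2 \
  --dtype float16
```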