Run a vLLM server locally
First, install vLLM using the instructions provided by the vLLM project. You can then launch the server from the command line, for example:
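The original launch command is not shown here, so the following is a minimal sketch. The parser name qwen3 is an assumption (Holo2 builds on the Qwen model family); use whichever parser the Holo2 model card recommends.

```bash
# Sketch of a local launch; "qwen3" is an assumed parser name.
# Remove --reasoning-parser to disable thinking mode.
vllm serve HCompany/Holo2-4B \
    --reasoning-parser qwen3
```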
Good to know: to disable thinking mode, remove the --reasoning-parser argument.

Deploy via Docker
First, make sure you’ve met the following prerequisites:

- An NVIDIA GPU with drivers installed
- NVIDIA Container Toolkit to allow Docker to access your GPU
- Docker installed and running
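You can then start the server in a container. This is a sketch based on vLLM’s standard Docker invocation; everything other than --model is a common default or, in the case of the qwen3 parser name, an assumption rather than text from the original.

```bash
# Sketch of a Docker launch; adjust paths and flags to your setup.
docker run --runtime nvidia --gpus=all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model HCompany/Holo2-4B \
    --reasoning-parser qwen3   # assumed parser name; remove to disable thinking
```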
Good to know
- To disable thinking mode, remove the --reasoning-parser argument.
- To run Holo2 8B, change --model to HCompany/Holo2-8B.
- To run Holo2 30B A3B, change --model to HCompany/Holo2-30B-A3B and add --tensor-parallel-size 2 (see the sketch below).
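For instance, applying those changes to the Docker sketch above (the parser name is still an assumption):

```bash
# Sketch: Holo2 30B A3B split across two GPUs.
docker run --runtime nvidia --gpus=all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model HCompany/Holo2-30B-A3B \
    --tensor-parallel-size 2 \
    --reasoning-parser qwen3
```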
Holo2 reasoning parser compatibility
Holo2 models are reasoning models. To extract the reasoning content of a request, set --reasoning-parser accordingly in vLLM (Docker or vllm serve).
The Holo2 chat template is configurable to enable or disable thinking. By default, Holo2 is in thinking mode. To configure thinking mode at the request level, pass:
{"chat_template_kwargs": {"thinking": false }}
Invoking Holo2 via API
When vLLM is running, you can send requests to the OpenAI-compatible API at http://localhost:8000/v1.

Test with curl: thinking mode
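A minimal sketch, assuming the server was launched with HCompany/Holo2-4B; the prompt is a placeholder:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HCompany/Holo2-4B",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

With a reasoning parser configured, vLLM returns the extracted reasoning in a separate reasoning_content field alongside the final answer.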
Test with curl: thinking mode disabled
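The same sketch with thinking disabled at the request level via chat_template_kwargs:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HCompany/Holo2-4B",
    "messages": [{"role": "user", "content": "Hello"}],
    "chat_template_kwargs": {"thinking": false}
  }'
```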
Test with Python (OpenAI SDK)
First, install the OpenAI client:
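```bash
pip install openai
```

Then send a request. This is a minimal sketch: the model name and prompt are placeholders, and the base URL matches the default port above.

```python
from openai import OpenAI

# vLLM ignores the API key, but the SDK requires one; "EMPTY" is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="HCompany/Holo2-4B",  # or HCompany/Holo2-8B / HCompany/Holo2-30B-A3B
    messages=[{"role": "user", "content": "Hello"}],
    # Uncomment to disable thinking mode at the request level:
    # extra_body={"chat_template_kwargs": {"thinking": False}},
)

# With a reasoning parser configured, vLLM exposes the extracted reasoning
# on the message (field name per vLLM's reasoning-outputs behavior).
print(response.choices[0].message.reasoning_content)
print(response.choices[0].message.content)
```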
Good to know

- The API key is not used by vLLM but is required by the OpenAI SDK; use "EMPTY" as a placeholder.
- --model can be set to HCompany/Holo2-4B, HCompany/Holo2-8B, or HCompany/Holo2-30B-A3B.
- --gpus=all enables all NVIDIA GPUs for the container.
- Holo2 is a multimodal model, so you can adjust image limits using --limit-mm-per-prompt.
- Reduce --max-model-len or --gpu-memory-utilization if your GPU runs out of memory (see the sketch after this list).
- Ensure your GPU supports bfloat16 (e.g., H100, A100, L40S, RTX 4090); use float16 otherwise.
- Port 8000 must be free; change it with -p <host>:8000 if needed.
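Putting several of these options together, a launch might look like the following sketch. The values are illustrative only, the parser name is an assumption, and the --limit-mm-per-prompt value syntax varies across vLLM versions.

```bash
# Illustrative values; add --dtype float16 on GPUs without bfloat16.
vllm serve HCompany/Holo2-8B \
    --reasoning-parser qwen3 \
    --limit-mm-per-prompt image=4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90
```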
