Run a local model server

Local mode needs an OpenAI-compatible model server on your own machine. This guide covers llama.cpp on macOS and vLLM on DGX Spark, including the launch flags that matter. Open-weight weights live in the H Company Hugging Face org. Holo3 122B is hosted API-only. Use holo3-122b-a10b in hosted mode rather than downloading local weights. Good starting model IDs for local servers:

Model	Example model ID
Holo 3.1 35B	`Hcompany/Holo-3.1-35B-A3B`
Holo 3.1 35B GGUF	`Hcompany/Holo-3.1-35B-A3B-GGUF`
Holo 3.1 35B NVFP4	`Hcompany/Holo-3.1-35B-A3B-NVFP4`

macOS (llama.cpp)
DGX Spark (vLLM)

For native performance on macOS with Metal GPU acceleration, use llama.cpp with the open-weight Q4_K_M GGUF weights, which balance precision and speed. A MacBook Pro or Max with an M3 chip or newer and at least 36 GB of unified memory is recommended: the Q4_K_M weights use roughly 21 GB, and prefix caching improves performance but needs extra memory to pre-allocate the KV cache. Lower --cache-ram and --ctx-size to reduce memory use.Install llama.cpp:

brew install llama.cpp

Start the server:

llama-server -hf Hcompany/Holo-3.1-35B-A3B-GGUF

For better efficiency, apply these tuned parameters:

llama-server \
  --hf Hcompany/Holo-3.1-35B-A3B-GGUF \
  --n-gpu-layers 999 \
  --ctx-size 65536 \
  --batch-size 16384 \
  --ubatch-size 2048 \
  --flash-attn 1 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --image-min-tokens 1024 \
  --ctx-checkpoints 8 \
  --cache-ram 32768 \
  --kv-unified \
  --threads 16

llama.cpp serves on port 8080 by default, so the base URL is http://localhost:8080/v1. Any string works as the --model value.

On DGX Spark, use the latest stable vLLM image for aarch64 (v0.23.0): vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404. The open-weight NVFP4 weights use the Blackwell architecture’s NVFP4 support for a good speed and precision tradeoff on GB10 GPUs.Pull the image:

docker pull vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404

Launch the server:

docker run -d --gpus all \
  --shm-size=16g \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404 \
  vllm serve Hcompany/Holo-3.1-35B-A3B-NVFP4 \
  --served-model-name holo3-1-35b \
  --host 0.0.0.0

For better efficiency, apply these tuning parameters:

docker run -d --gpus all \
  --shm-size=16g \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404 \
  vllm serve Hcompany/Holo-3.1-35B-A3B-NVFP4 \
  --served-model-name holo3-1-35b \
  --host 0.0.0.0 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 65537 \
  --max-num-batched-tokens 32768 \
  --chat-template-content-format openai \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --limit-mm-per-prompt '{"image": 5, "video": 0}' \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --mm-processor-cache-gb 15

vLLM serves on port 8000 by default, so the base URL is http://localhost:8000/v1. The --served-model-name you set (holo3-1-35b) is the model ID you must pass to HoloDesktop CLI.

Connect HoloDesktop CLI

With your server running, open a new terminal and point HoloDesktop CLI at it:

llama.cpp
vLLM

holo run "Open TextEdit and write a short note saying HoloDesktop CLI is installed" \
  --base-url http://localhost:8080/v1 \
  --model holo3-1-35b

holo run "Open TextEdit and write a short note saying HoloDesktop CLI is installed" \
  --base-url http://localhost:8000/v1 \
  --model holo3-1-35b

Local mode does not require holo login.

What’s next

Wire local mode into hosts with the environment variables in Hosted or local models, then run your first task with the Quickstart.

​Connect HoloDesktop CLI

​What’s next

Connect HoloDesktop CLI

What’s next