Skip to main content
Local mode needs an OpenAI-compatible model server on your own machine. This guide covers llama.cpp on macOS and vLLM on DGX Spark, including the launch flags that matter. Open-weight weights live in the H Company Hugging Face org. Holo3 122B is hosted API-only. Use holo3-122b-a10b in hosted mode rather than downloading local weights. Good starting model IDs for local servers:
ModelExample model ID
Holo 3.1 35BHcompany/Holo-3.1-35B-A3B
Holo 3.1 35B GGUFHcompany/Holo-3.1-35B-A3B-GGUF
Holo 3.1 35B NVFP4Hcompany/Holo-3.1-35B-A3B-NVFP4
For native performance on macOS with Metal GPU acceleration, use llama.cpp with the open-weight Q4_K_M GGUF weights, which balance precision and speed. A MacBook Pro or Max with an M3 chip or newer and at least 36 GB of unified memory is recommended: the Q4_K_M weights use roughly 21 GB, and prefix caching improves performance but needs extra memory to pre-allocate the KV cache. Lower --cache-ram and --ctx-size to reduce memory use.Install llama.cpp:
brew install llama.cpp
Start the server:
llama-server -hf Hcompany/Holo-3.1-35B-A3B-GGUF
For better efficiency, apply these tuned parameters:
llama-server \
  --hf Hcompany/Holo-3.1-35B-A3B-GGUF \
  --n-gpu-layers 999 \
  --ctx-size 65536 \
  --batch-size 16384 \
  --ubatch-size 2048 \
  --flash-attn 1 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --image-min-tokens 1024 \
  --ctx-checkpoints 8 \
  --cache-ram 32768 \
  --kv-unified \
  --threads 16
llama.cpp serves on port 8080 by default, so the base URL is http://localhost:8080/v1. Any string works as the --model value.

Connect HoloDesktop CLI

With your server running, open a new terminal and point HoloDesktop CLI at it:
holo run "Open TextEdit and write a short note saying HoloDesktop CLI is installed" \
  --base-url http://localhost:8080/v1 \
  --model holo3-1-35b
Local mode does not require holo login.

What’s next

Wire local mode into hosts with the environment variables in Hosted or local models, then run your first task with the Quickstart.