HoloDesktop CLI needs a model backend before it can run desktop tasks. You can use H Company’s hosted Models API, or run a model yourself and point the CLI at your OpenAI-compatible server. Hosted mode is the default. Local mode starts when you provide a local base URL.

Hosted mode

Hosted mode calls an H Company hosted model. Sign in once from the holo-desktop checkout:
uv run holo login
This opens the H Company Portal in your browser. After sign-in, HoloDesktop CLI writes a hosted API key to:
~/.holo/.env
Check that the key is available:
uv run holo whoami
If you run holo run from an interactive terminal without a key, HoloDesktop CLI can start browser login for you. MCP and ACP hosts cannot do that during startup because they launch the CLI non-interactively. If you plan to use hosted mode from a host, run uv run holo login in a terminal first, then restart the host so it can see the saved key.
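Before wiring the CLI into a host, a small preflight can confirm that a hosted key is reachable. The sketch below is not part of HoloDesktop CLI: the function name has_hosted_key is hypothetical, and it assumes the saved file stores the key as a HAI_API_KEY=... line, which this page does not confirm. Treat uv run holo whoami as the authoritative check.

```shell
# Hypothetical helper: succeeds if a hosted API key looks reachable.
# Assumption: ~/.holo/.env stores the key as a line starting with HAI_API_KEY=.
has_hosted_key() (
  env_file="${1:-$HOME/.holo/.env}"
  # A key exported in the process environment is enough.
  if [ -n "${HAI_API_KEY:-}" ]; then
    exit 0
  fi
  # Otherwise look for a non-empty assignment in the saved file.
  [ -f "$env_file" ] && grep -Eq '^(export +)?HAI_API_KEY=.+' "$env_file"
)

# Usage (from the holo-desktop checkout):
#   has_hosted_key || uv run holo login
```

The function only reports whether a key appears to exist. It does not validate the key against the hosted API.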
You can also provide HAI_API_KEY through the process environment. When the key comes from the environment, holo whoami may not have a cached Portal identity to print.
To pick a hosted model explicitly, pass its API model ID:
uv run holo run \
  --model holo3-1-35b-a3b \
  "Open TextEdit and write a short note saying HoloDesktop CLI is installed"
For the larger Holo3 hosted model, use:
uv run holo run \
  --model holo3-122b-a10b \
  "Open TextEdit and write a short note saying HoloDesktop CLI is installed"

Local mode

Use local mode to run one of H Company’s open-weight models yourself. Start from the H Company Hugging Face org, which lists Holo and Holotron model releases.
HoloDesktop has two parts:
  • The harness — the core logic that lets the agent carry out tasks for you through H Company’s models.
  • The inference server — the runtime that serves Holo3.1 models.
You can point the harness at H Company’s hosted Models API or run the inference server yourself with a local engine such as vLLM or llama.cpp on supported hardware. The rest of this section walks through configuring a local inference server for a fully private, on-device agent.
Good starting model IDs for local servers are:
Model               Example model ID
Holo 3.1 35B        Hcompany/Holo-3.1-35B-A3B
Holo 3.1 35B GGUF   Hcompany/Holo-3.1-35B-A3B-GGUF
Holo 3.1 35B NVFP4  Hcompany/Holo-3.1-35B-A3B-NVFP4
Holo3 122B is hosted API-only. Use holo3-122b-a10b in hosted mode rather than trying to download local weights.
Local mode is selected by providing --base-url or HAI_AGENT_RUNTIME_BASE_URL. It does not require holo login when the endpoint is reachable.
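The selection rule can be sketched as a small shell function. This is an illustration of the rule as stated here, not the CLI's actual implementation: holo_mode is a hypothetical name, its first argument stands in for a --base-url value, and the sketch does not say which source takes precedence when both are set, because this page does not specify it.

```shell
# Hypothetical helper mirroring the documented rule: a base URL from the
# flag or from the environment means local mode, otherwise hosted mode.
holo_mode() (
  flag_base_url="${1:-}"   # stands in for a --base-url value; may be empty
  if [ -n "$flag_base_url" ] || [ -n "${HAI_AGENT_RUNTIME_BASE_URL:-}" ]; then
    echo local
  else
    echo hosted
  fi
)

# Usage:
#   holo_mode                            # uses only the environment
#   holo_mode http://localhost:8080/v1   # as if --base-url were passed
```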
Setting up local mode takes two steps:
  1. Start an inference server for your hardware: llama.cpp on macOS (Apple Silicon) or vLLM on NVIDIA DGX Spark.
  2. Point HoloDesktop at that server with --base-url and --model.
The sections below cover each step.

macOS with llama.cpp

For the best native performance on macOS with Metal GPU acceleration, use llama.cpp. For the Holo3.1 launch, H Company published open-weight GGUF weights in Q4_K_M, which balances precision and speed.
A MacBook Pro or Max with an M3 chip or newer and at least 36 GB of unified memory is recommended. The Q4_K_M weights use roughly 21 GB. Enabling prefix caching improves performance but needs extra memory to pre-allocate the KV cache. Lower --cache-ram and --ctx-size to reduce overall memory use.
Install llama.cpp:
brew install llama.cpp

Serve Holo3.1 locally

Start the local model server:
llama-server -hf Hcompany/Holo-3.1-35B-A3B-GGUF
For better efficiency, apply these tuned parameters:
llama-server \
  -hf Hcompany/Holo-3.1-35B-A3B-GGUF \
  --n-gpu-layers 999 \
  --ctx-size 65536 \
  --batch-size 16384 \
  --ubatch-size 2048 \
  --flash-attn 1 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --image-min-tokens 1024 \
  --ctx-checkpoints 8 \
  --cache-ram 32768 \
  --kv-unified \
  --threads 16
By default, llama.cpp serves on port 8080. When you connect HoloDesktop, any string works as the --model value.
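On machines with less unified memory, the two knobs named earlier (--ctx-size and --cache-ram) are the ones to lower. The sketch below parameterizes just those two values and prints the resulting command instead of running it. The function name llama_server_cmd is hypothetical, and the smaller values in the usage note are illustrative rather than tuned recommendations.

```shell
# Hypothetical wrapper: print the tuned llama-server command with
# adjustable --ctx-size and --cache-ram. Defaults keep the tuned values.
llama_server_cmd() (
  ctx_size="${1:-65536}"
  cache_ram="${2:-32768}"
  # Backslash-newline inside double quotes joins the lines into one.
  printf '%s\n' "llama-server -hf Hcompany/Holo-3.1-35B-A3B-GGUF \
--n-gpu-layers 999 --ctx-size $ctx_size --batch-size 16384 --ubatch-size 2048 \
--flash-attn 1 --cache-type-k q8_0 --cache-type-v q8_0 --image-min-tokens 1024 \
--ctx-checkpoints 8 --cache-ram $cache_ram --kv-unified --threads 16"
)

# Usage: inspect the command first, then run it.
#   llama_server_cmd 32768 16384
#   sh -c "$(llama_server_cmd 32768 16384)"
```

Printing the command before running it makes it easy to compare memory use across a few settings without editing a long command by hand.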

DGX Spark with vLLM

On DGX Spark, use the latest stable vLLM image for aarch64 (v0.23.0): vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404. For Holo3.1, H Company published open-weight NVFP4 weights. These weights rely on the NVFP4 support in the Blackwell architecture and give a good speed and precision tradeoff on GB10 GPUs.

Serve Holo3.1 on DGX Spark

Pull the vLLM image:
docker pull vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404
Launch the server:
docker run -d --gpus all \
  --shm-size=16g \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404 \
  vllm serve Hcompany/Holo-3.1-35B-A3B-NVFP4 \
  --served-model-name holo3-1-35b \
  --host 0.0.0.0
For better efficiency, apply these tuning parameters:
docker run -d --gpus all \
  --shm-size=16g \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404 \
  vllm serve Hcompany/Holo-3.1-35B-A3B-NVFP4 \
  --served-model-name holo3-1-35b \
  --host 0.0.0.0 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 65537 \
  --max-num-batched-tokens 32768 \
  --chat-template-content-format openai \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --limit-mm-per-prompt '{"image": 5, "video": 0}' \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --mm-processor-cache-gb 15
By default, vLLM serves on port 8000. The --served-model-name you set here (holo3-1-35b) is the model ID you must pass to HoloDesktop when you connect.
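To confirm which model ID a running server reports, you can query the model list of its OpenAI-compatible API (GET /v1/models). The sketch below assumes curl and python3 are installed; the function names model_ids and list_served_models are hypothetical.

```shell
# Print the top-level model IDs from an OpenAI-style model list on stdin.
# Uses python3 for JSON parsing so nested "id" fields are ignored.
model_ids() {
  python3 -c 'import json, sys
for model in json.load(sys.stdin)["data"]:
    print(model["id"])'
}

# Query a server's model list. The base URL defaults to the vLLM port
# used in the examples above.
list_served_models() {
  curl -fsS "${1:-http://localhost:8000/v1}/models" | model_ids
}

# Usage:
#   list_served_models
#   list_served_models http://localhost:8080/v1
```

The printed ID is the value to pass as --model for a vLLM server.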

Connect HoloDesktop to your local server

With your inference server running, open a new terminal and point HoloDesktop at it. HoloDesktop expects an OpenAI-compatible endpoint and needs two flags:
  • --base-url: the address of your local server.
    • llama.cpp serves at http://localhost:8080/v1 by default.
    • vLLM serves at http://localhost:8000/v1 by default.
  • --model: the model identifier.
    • With llama.cpp, any string works.
    • With vLLM, it must match the --served-model-name you set when launching the server (holo3-1-35b in the examples above).
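A freshly started server may take a while to answer, for example while it downloads or loads weights on first launch. The sketch below polls the server's model list until it responds. The function name wait_for_server is hypothetical, curl is assumed to be installed, and the limit counts polling attempts of roughly one second each rather than exact wall-clock time.

```shell
# Hypothetical helper: poll <base-url>/models until it answers or the
# attempt limit is reached. Returns non-zero if the server never responds.
wait_for_server() (
  base_url="$1"
  max_attempts="${2:-120}"
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    if curl -fsS --max-time 2 -o /dev/null "$base_url/models" 2>/dev/null; then
      exit 0
    fi
    sleep 1
    attempt=$((attempt + 1))
  done
  echo "server at $base_url did not respond after $max_attempts attempts" >&2
  exit 1
)

# Usage:
#   wait_for_server http://localhost:8000/v1 300 && echo "server is up"
```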
Connect to a llama.cpp server:
uv run holo run "Open TextEdit and write a short note saying HoloDesktop CLI is installed" \
  --base-url http://localhost:8080/v1 \
  --model holo3-1-35b
Connect to a vLLM server:
uv run holo run "Open TextEdit and write a short note saying HoloDesktop CLI is installed" \
  --base-url http://localhost:8000/v1 \
  --model holo3-1-35b
Local mode does not require holo login.
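If you switch between the two engines often, a tiny lookup keeps the default ports in one place. This is a convenience sketch only: local_base_url is a hypothetical helper, and the URLs are the engine defaults listed above, so adjust them if you changed the port.

```shell
# Hypothetical helper: map an engine name to its default local base URL.
local_base_url() {
  case "$1" in
    llama.cpp|llamacpp) echo "http://localhost:8080/v1" ;;
    vllm)               echo "http://localhost:8000/v1" ;;
    *) echo "unknown engine: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   uv run holo run "Open TextEdit and write a short note" \
#     --base-url "$(local_base_url vllm)" \
#     --model holo3-1-35b
```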

Local mode from hosts

MCP and ACP hosts start HoloDesktop CLI over stdio, so they read model settings from the environment that launched the host. Terminal-launched hosts inherit shell exports, but GUI apps launched from the Dock or Finder usually do not. If local mode works in your terminal but fails in a host, put HAI_AGENT_RUNTIME_BASE_URL and HAI_AGENT_RUNTIME_MODEL in the host’s own MCP or ACP environment config.
Set the local server URL before the host starts the CLI:
export HAI_AGENT_RUNTIME_BASE_URL=http://localhost:8000/v1
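If your server needs a model ID, set it too. With vLLM, the value must match the name the server reports: the --served-model-name if you set one (holo3-1-35b in the examples above), otherwise the model you passed to vllm serve: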
export HAI_AGENT_RUNTIME_MODEL=Hcompany/Holo-3.1-35B-A3B
When HAI_AGENT_RUNTIME_BASE_URL is set, MCP and ACP startup does not require HAI_API_KEY.
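When a host fails to pick up local mode, it helps to check which values a given environment actually carries. The sketch below is a hypothetical diagnostic (check_local_env is not part of HoloDesktop CLI). Run it in the shell you launch the host from to see what that host would inherit.

```shell
# Hypothetical diagnostic: report the local-mode variables visible in the
# current environment. Fails if the base URL is missing.
check_local_env() (
  if [ -n "${HAI_AGENT_RUNTIME_BASE_URL:-}" ]; then
    echo "HAI_AGENT_RUNTIME_BASE_URL=$HAI_AGENT_RUNTIME_BASE_URL"
  else
    echo "HAI_AGENT_RUNTIME_BASE_URL is not set" >&2
    exit 1
  fi
  # The model ID is only needed by some servers, so just report it.
  echo "HAI_AGENT_RUNTIME_MODEL=${HAI_AGENT_RUNTIME_MODEL:-<unset>}"
)

# Usage:
#   check_local_env || echo "local mode would not be selected here"
```

A passing check in the terminal does not prove that a GUI-launched host sees the same values. For those hosts, set the variables in the host's own environment config.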

Which should I use?

Use hosted mode for the fastest setup. Use local mode for private inference, local model serving, or direct control over the model runtime.

What’s next

After hosted login succeeds or your local server is running, run your first task.