Hosted mode
Hosted mode calls an H Company hosted model. Sign in once from the holo-desktop checkout.
If you start holo run from an interactive terminal without a key, HoloDesktop CLI can start browser login for you. MCP and ACP hosts cannot do that during startup because they launch the CLI non-interactively. Sign in from a terminal before you wire the CLI into a host.
You can also provide HAI_API_KEY through the process environment. When the key comes from the environment, holo whoami may not have a cached Portal identity to print.
To pick a hosted model explicitly, pass its API model ID.
Local mode
Use local mode to run one of H Company’s open-weight models yourself. Start from the H Company Hugging Face org, which lists Holo and Holotron model releases. HoloDesktop has two parts:
- The harness: the core logic that lets the agent carry out tasks for you through H Company’s models.
- The inference server: the runtime that serves Holo3.1 models.
| Model | Example model ID |
|---|---|
| Holo 3.1 35B | Hcompany/Holo-3.1-35B-A3B |
| Holo 3.1 35B GGUF | Hcompany/Holo-3.1-35B-A3B-GGUF |
| Holo 3.1 35B NVFP4 | Hcompany/Holo-3.1-35B-A3B-NVFP4 |
Use holo3-122b-a10b in hosted mode rather than trying to download local weights.
Setting up local mode takes two steps:
- Start an inference server for your hardware: llama.cpp on macOS (Apple Silicon) or vLLM on NVIDIA DGX Spark.
- Point HoloDesktop at that server with --base-url and --model.
macOS with llama.cpp
For the best native performance on macOS with Metal GPU acceleration, use llama.cpp. For the Holo3.1 launch, H Company published open-weight GGUF weights in Q4_K_M, which balances precision and speed.
A MacBook Pro or Max with an M3 chip or newer and at least 36 GB of unified memory is recommended. The Q4_K_M weights use roughly 21 GB. Enabling prefix caching improves performance but needs extra memory to pre-allocate the KV cache. Lower --cache-ram and --ctx-size to reduce overall memory use.
Serve Holo3.1 locally
Start the local model server. It listens on port 8080 by default. When you connect HoloDesktop, any string works as the --model value.
DGX Spark with vLLM
On DGX Spark, use the latest stable vLLM image for aarch64 (v0.23.0): vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404. For Holo3.1, H Company published open-weight NVFP4 weights. Using the NVFP4 support in the Blackwell architecture, these weights give a good speed and precision tradeoff on GB10 GPUs.
Serve Holo3.1 on DGX Spark
Pull the vLLM image and start the server, which listens on port 8000 by default. The --served-model-name you set here (holo3-1-35b) is the model ID you must pass to HoloDesktop when you connect.
Connect HoloDesktop to your local server
With your inference server running, open a new terminal and point HoloDesktop at it. HoloDesktop expects an OpenAI-compatible endpoint and needs two flags:
- --base-url: the address of your local server.
  - llama.cpp serves at http://localhost:8080/v1 by default.
  - vLLM serves at http://localhost:8000/v1 by default.
- --model: the model identifier.
  - With llama.cpp, any string works.
  - With vLLM, it must match the --served-model-name you set when launching the server (holo3-1-35b in the examples above).
holo login.
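Before connecting, it can help to confirm which model IDs the local server actually reports, since a vLLM server only accepts the name given in --served-model-name. The sketch below is not part of HoloDesktop. It assumes the server exposes the standard OpenAI-compatible model listing at `<base-url>/models`, and the helper names (`models_url`, `served_model_ids`, `fetch_model_ids`) are made up for illustration.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Return the model-listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def served_model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style model-listing response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str, timeout: float = 5.0) -> list:
    """Ask a running server which models it serves.

    Raises urllib.error.URLError if nothing is listening at base_url.
    """
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return served_model_ids(json.load(resp))
```

With a vLLM server running as described above, `fetch_model_ids("http://localhost:8000/v1")` should return a list that contains holo3-1-35b. If the name is missing, the value passed to --model will not match what the server serves.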
Local mode from hosts
MCP and ACP hosts start HoloDesktop CLI over stdio, so they read model settings from the environment that launched the host. Terminal-launched hosts inherit shell exports. GUI apps launched from the Dock or Finder usually need the same values in the host’s MCP or ACP environment settings. Set the local server URL before the host starts the CLI. When HAI_AGENT_RUNTIME_BASE_URL is set, MCP and ACP startup does not require HAI_API_KEY.
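Because the host passes its own environment on to the CLI, a small preflight check can show what a non-interactive launch would see. The following is a minimal sketch, not HoloDesktop code. It assumes HAI_AGENT_RUNTIME_BASE_URL holds the local server URL, the function names are hypothetical, and the order in which the two variables are checked is only an illustration rather than documented precedence. It cannot detect a cached sign-in.

```python
import os


def startup_auth_source(env=None):
    """Report which environment setting a non-interactive launch could rely on.

    Returns "local" if HAI_AGENT_RUNTIME_BASE_URL is set, "api-key" if only
    HAI_API_KEY is set, and None if neither is present. A cached sign-in from
    an earlier terminal session is NOT detected here.
    """
    env = os.environ if env is None else env
    if env.get("HAI_AGENT_RUNTIME_BASE_URL"):
        return "local"
    if env.get("HAI_API_KEY"):
        return "api-key"
    return None


def host_env(base_url, base=None):
    """Copy an environment and add the local server URL for a child process."""
    env = dict(os.environ if base is None else base)
    env["HAI_AGENT_RUNTIME_BASE_URL"] = base_url
    return env
```

A script that launches a host itself could pass `env=host_env("http://localhost:8080/v1")` to `subprocess.Popen`. GUI hosts still need the value entered in their own MCP or ACP environment settings.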