holo3-122b-a10b in hosted mode rather than downloading local weights.
Good starting model IDs for local servers:
| Model | Example model ID |
|---|---|
| Holo 3.1 35B | Hcompany/Holo-3.1-35B-A3B |
| Holo 3.1 35B GGUF | Hcompany/Holo-3.1-35B-A3B-GGUF |
| Holo 3.1 35B NVFP4 | Hcompany/Holo-3.1-35B-A3B-NVFP4 |
- macOS (llama.cpp)
- DGX Spark (vLLM)
For native performance on macOS with Metal GPU acceleration, use llama.cpp with the open-weight Start the server:For better efficiency, apply these tuned parameters:llama.cpp serves on port
Q4_K_M GGUF weights, which balance precision and speed. A MacBook Pro or Max with an M3 chip or newer and at least 36 GB of unified memory is recommended: the Q4_K_M weights use roughly 21 GB, and prefix caching improves performance but needs extra memory to pre-allocate the KV cache. Lower --cache-ram and --ctx-size to reduce memory use.Install llama.cpp:8080 by default, so the base URL is http://localhost:8080/v1. Any string works as the --model value.Connect HoloDesktop CLI
With your server running, open a new terminal and point HoloDesktop CLI at it:- llama.cpp
- vLLM
holo login.