Models and families

| Term | Definition |
|---|---|
| Holo3.1 | Latest-generation Vision-Language Model (VLM) family for GUI agents that interact with real digital environments (web, desktop, mobile). |
| Holo3.1 family | Model sizes from 0.8B to 35B-A3B, spanning on-device to state-of-the-art deployments. |
| Holo3.1-35B-A3B | Open-source (Apache 2.0) model variant, available in BF16, FP8, NVFP4, and Q4 GGUF for cloud and local inference. |
| Holo3 | Prior generation that Holo3.1 builds on. |
| Holo2 | Earlier-generation model that Holo3 improved upon. |
| Qwen/Qwen3.5-35B-A3B | Base model used for fine-tuning Holo3.1-35B-A3B. |
| Surfer-H | Example next-generation computer-use agent built on the Holo model family. |
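
To illustrate why the Holo3.1-35B-A3B entry lists several precisions, the sketch below estimates weight-only memory for a 35B-parameter checkpoint. It rests on assumptions not stated in the glossary: "35B" is read as the total parameter count, BF16 and FP8 are taken at their nominal 16 and 8 bits per weight, and NVFP4 and Q4 GGUF are approximated at a flat 4 bits, which ignores per-block scale factors and any layers kept at higher precision. Activations and KV cache are excluded, so real footprints will be larger.

```python
# Rough weight-only memory estimate; every figure here is an approximation.
TOTAL_PARAMS = 35e9  # assumption: "35B" read as total parameter count

# Nominal bits per weight (assumed; 4-bit formats also store block scales).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4, "Q4 GGUF": 4}


def weight_gigabytes(params: float, bits: int) -> float:
    """Return approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


estimates = {
    name: weight_gigabytes(TOTAL_PARAMS, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB")
```

Under these assumptions, each halving of the bit width halves the weight storage, which is the main reason lower-precision variants are offered for local inference.
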

Capabilities and tasks

| Term | Definition |
|---|---|
| Vision-Language Model (VLM) | A model that understands both visual inputs (like UI screens) and text, enabling it to interpret interfaces and perform actions. |
| GUI Agents | AI agents that operate graphical user interfaces by observing screens, reasoning about them, and executing actions. |
| Computer Use (CU) | The ability of an AI system to perform tasks on a computer, such as navigating interfaces and executing commands. |
| Navigation (in AI agents) | The process of completing tasks through multi-step reasoning and actions across interfaces. |
| Element Localization | Single-turn vision task: given a screenshot and a text description of a target UI element, return click coordinates. A grounding primitive that can be used inside larger agent harnesses. |
| Action Grounding | Connecting model decisions to actual executable actions in an environment. |
| Cross-environment Generalization | Ability to perform well across different platforms (web, desktop, mobile), including unseen environments. |
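
The observe, reason, act cycle described for GUI agents, with element localization as the grounding primitive, can be sketched with a toy environment. Everything in this example is hypothetical and for illustration only: `ToyScreen`, `Click`, `localize`, and `run_agent` are invented names, the "screenshot" is a plain dictionary, the reasoning step is replaced by a fixed plan, and `localize` is a lookup standing in for a call to a VLM. It is not the Holo or Surfer-H interface.

```python
from dataclasses import dataclass


@dataclass
class Click:
    """A grounded action: pixel coordinates to click."""
    x: int
    y: int


class ToyScreen:
    """Stand-in environment: a 'screenshot' is a dict of label -> (x, y)."""

    def __init__(self):
        self.elements = {"Search box": (120, 40), "Submit": (300, 40)}
        self.clicked = []

    def observe(self) -> dict:
        # Return a copy so the agent cannot mutate the environment state.
        return dict(self.elements)

    def execute(self, action: Click) -> None:
        # Action grounding: the decision becomes a concrete event here.
        for label, (x, y) in self.elements.items():
            if (x, y) == (action.x, action.y):
                self.clicked.append(label)


def localize(screenshot: dict, description: str) -> Click:
    """Stub for element localization; a real system would query a VLM."""
    x, y = screenshot[description]
    return Click(x, y)


def run_agent(env: ToyScreen, plan: list) -> list:
    """Observe, localize the target, and act once per step of a fixed plan."""
    for target in plan:
        screenshot = env.observe()
        action = localize(screenshot, target)
        env.execute(action)
    return env.clicked


clicked = run_agent(ToyScreen(), ["Search box", "Submit"])
print(clicked)
```

In this sketch, navigation is the loop in `run_agent`, element localization is the single-turn `localize` call, and action grounding happens in `execute`.
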

Benchmarks

| Term | Definition |
|---|---|
| OSWorld | Benchmark evaluating performance in real Ubuntu desktop environments. |
| WebVoyager / WebArena | Benchmarks for testing web navigation and task completion abilities. |
| AndroidWorld | Benchmark evaluating performance in Android mobile environments. |

Training and methods

| Term | Definition |
|---|---|
| Policy Learning | Training method where the model learns which actions to take in different situations. |
| Supervised Fine-Tuning (SFT) | Training stage where the model learns from labeled examples. |
| Reinforcement Learning (GRPO) | Training method where the model improves through reward feedback on the actions it takes. GRPO stands for Group Relative Policy Optimization. |
| Synthetic Data | Artificially generated data used to supplement training. |
| Human-Annotated Data | Data labeled by humans to improve model accuracy. |
| State-of-the-Art (SOTA) | Performance that is among the best currently available. |
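
For the GRPO entry above, the "group relative" part can be illustrated with a small sketch: several attempts at the same task are scored, and each attempt's advantage is its reward normalized against the group's mean and standard deviation. This is a generic illustration of the commonly described normalization, not the training recipe used for Holo models; the reward values, the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` term are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group of sampled attempts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a choice made for this sketch
    # eps avoids division by zero when every attempt gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four attempts at one task (1.0 = task completed).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Attempts that beat the group average get a positive advantage and are reinforced, while below-average attempts get a negative one. When all attempts score the same, every advantage is zero and the group contributes no learning signal.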