Model | Size | Tensor type | General purpose | Use case |
---|---|---|---|---|
Holo1 7B | 8.29B params | BF16 | Higher accuracy; suited to large-scale inference | Full-scale tasks |
Holo1 3B | 3.75B params | BF16 | Optimized for efficiency and for running locally on modest hardware | Common tasks |
- Developed by: H Company
- Model type: Action Vision-Language Model
- Finetuned from model: Qwen/Qwen2.5-VL-3B-Instruct
- Research paper
- Blog post
- License
## Results

### Surfer-H: Pareto-Optimal Performance on WebVoyager
Surfer-H is designed to be flexible and modular. It is composed of three independent components:
- A Policy model that plans, decides, and drives the agent’s behavior
- A Localizer model that sees and understands visual UIs to drive precise interactions
- A Validator model that checks whether the answer is valid
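To make the division of labor concrete, here is a minimal sketch of how three such components could be composed into an agent loop. All class, function, and return values below are illustrative stubs, not H Company's actual Surfer-H API.

```python
# Hypothetical sketch of a Policy/Localizer/Validator agent loop.
# The names and stubbed behaviors are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "answer"
    target: str = ""   # UI element description, or the proposed answer

def policy(task: str, history: list) -> Action:
    """Plans and decides the next step (stubbed: answer after one click)."""
    if any(a.kind == "click" for a in history):
        return Action("answer", "done")
    return Action("click", "search button")

def localizer(element: str) -> tuple:
    """Maps a described UI element to precise screen coordinates (stubbed)."""
    return (100, 200)

def validator(task: str, answer: str) -> bool:
    """Checks whether the proposed answer is acceptable (stubbed)."""
    return bool(answer)

def run_agent(task: str, max_steps: int = 5):
    history = []
    for _ in range(max_steps):
        action = policy(task, history)
        if action.kind == "answer" and validator(task, action.target):
            return action.target
        if action.kind == "click":
            x, y = localizer(action.target)  # where to click on screen
        history.append(action)
    return None
```

Because the three roles are independent, each can be backed by a different model, which is how the accuracy/cost pairings below arise.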
On the WebVoyager benchmark, this modular design yields a Pareto-optimal accuracy–cost tradeoff:
- Surfer-H + Holo1-7B: 92.2% accuracy at $0.13 per task
- Surfer-H + GPT-4.1: 92.0% at $0.54 per task
- Surfer-H + Holo1-3B: 89.7% at $0.11 per task
- Surfer-H + GPT-4.1-mini: 88.8% at $0.26 per task
### Holo1: State-of-the-Art UI Localization
A key skill underpinning the real-world utility of our VLMs within agents is localization: the ability to identify the precise coordinates on a user interface (UI) that must be interacted with to complete a task or follow an instruction. To assess this capability, we evaluated the Holo1 models on several established localization benchmarks, including Screenspot, Screenspot-V2, Screenspot-Pro, GroundUI-Web, and our newly introduced benchmark, WebClick.
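As an illustration of how a localization prediction might be consumed downstream, here is a minimal sketch that parses pixel coordinates out of a model response. The `Click(x, y)` output format is an assumption made for this example, not Holo1's documented response schema.

```python
import re

def parse_click(response: str) -> tuple:
    """Extract (x, y) pixel coordinates from a response such as
    'Click(832, 411)'. The 'Click(x, y)' format is assumed for
    illustration; a real agent would match the model's actual schema."""
    m = re.search(r"Click\((\d+),\s*(\d+)\)", response)
    if m is None:
        raise ValueError(f"no coordinates found in: {response!r}")
    return int(m.group(1)), int(m.group(2))

print(parse_click("Click(832, 411)"))  # → (832, 411)
```

A localization benchmark then scores whether the predicted point falls inside the ground-truth bounding box of the target element.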