Our models are trained using high-quality proprietary data for UI understanding and action prediction, following a multi-stage training pipeline. The training dataset is a carefully curated mix of open-source datasets, large-scale synthetic data, learnings from agent executions, and human-annotated samples.

Training process

Our training follows two key phases:
  • Large-scale supervised fine-tuning: The model learns from labeled data to predict actions accurately.
  • Online reinforcement learning (GRPO): The model is further refined through interaction with environments, optimizing performance on real-world tasks.
The resulting Holo1.5 models are natively high-resolution (up to 3840 × 2160 pixels), capable of interpreting complex UIs and performing actions with high accuracy and efficiency.
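The details of our reinforcement learning stage are proprietary, but the core idea of GRPO is to sample a group of rollouts per task and normalize each rollout's reward against its group, so no separate value model is needed. A minimal sketch of that group-relative advantage computation (function name and the binary success reward are illustrative):

```python
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-6):
    # GRPO scores each rollout relative to its sampling group:
    # advantage_i = (r_i - mean(group)) / (std(group) + eps)
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts of the same UI task, scored by a
# binary task-success reward. Successful rollouts get positive
# advantages, failed ones negative; the group mean is zero.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because advantages are centered within each group, the policy is pushed toward rollouts that beat their siblings on the same task rather than toward high absolute rewards.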

Benchmarks

Benchmarks provide an objective way of measuring model capabilities. Our Holo models are assessed on benchmarks for UI localization and screen content understanding.

UI Localization

These benchmarks evaluate an agent’s ability to locate elements on a screen (buttons, text boxes, images, etc.) precisely. This is critical for agents performing interactions in GUIs.
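Localization benchmarks of this kind typically count a prediction as correct when the predicted click point falls inside the target element's bounding box. A minimal sketch of that scoring rule (function names are ours, not from any benchmark's official harness):

```python
def click_hit(pred_xy, bbox):
    """True if the predicted click (x, y) lands inside the target
    element's bounding box (x_min, y_min, x_max, y_max)."""
    x, y = pred_xy
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

def localization_accuracy(preds, bboxes):
    """Fraction of predicted clicks that hit their target element."""
    hits = sum(click_hit(p, b) for p, b in zip(preds, bboxes))
    return hits / len(bboxes)

# One hit inside the box, one miss outside it -> 50% accuracy.
print(localization_accuracy([(10, 10), (25, 10)],
                            [(0, 0, 20, 20), (0, 0, 20, 20)]))
```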

Holo1.5

Tested on Screenspot-V2, Screenspot-Pro, GroundUI-Web, Showdown, and WebClick.
  • 7B and 72B models achieve an average 4.5% improvement over prior models.
  • 7B and 3B models remain competitive with models one weight class above, delivering fast and efficient inference for UI localization.

Holo1

Evaluated on Screenspot, Screenspot-V2, Screenspot-Pro, GroundUI-Web, and WebClick, demonstrating strong localization capabilities in real-world scenarios.

Screen Content Understanding (Holo1.5)

Beyond localization, understanding UI structure and functionality is essential. Holo1.5 was evaluated on GUI-focused QA benchmarks:
  • ScreenQA Short & Complex
  • VisualWebBench
  • WebSRC
These benchmarks measure the model’s ability to interpret UIs and reason about elements to complete tasks accurately. Holo1.5 models are also the top performers in UI understanding, from the largest model (state of the art) down to the smallest 3B model, which remains competitive with large-scale alternatives.

Surfer-H: Modular web task performance

Our Surfer-H agent was evaluated on WebVoyager, a benchmark of 643 real-world web tasks. Surfer-H demonstrates Pareto-optimal performance, balancing accuracy, speed, and cost depending on the model configuration.
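Pareto-optimal here means that no other configuration is simultaneously at least as accurate and cheaper (or faster). A sketch of that dominance check over (accuracy, cost) pairs, with configuration names and numbers purely illustrative, not Surfer-H results:

```python
def pareto_frontier(configs):
    """Keep configurations not dominated by any other. A config
    (accuracy, cost) dominates another if it is at least as accurate
    and no more expensive, and strictly better on at least one axis."""
    frontier = []
    for name, acc, cost in configs:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for n, a, c in configs if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical configurations: "bad" is dominated by "mid"
# (less accurate and more expensive), so it drops off the frontier.
configs = [
    ("small-fast", 0.72, 0.10),
    ("large-slow", 0.91, 1.00),
    ("mid",        0.80, 0.40),
    ("bad",        0.70, 0.50),
]
print(pareto_frontier(configs))
```

Every surviving configuration represents a genuine trade-off: choosing among them is a matter of how much accuracy a given cost budget justifies.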