Model | Size | Tensor type | General purpose | Use case |
---|---|---|---|---|
Holo1 7B | 8.29B params | BF16 | Higher accuracy; suited to large-scale inference | Full-scale tasks |
Holo1 3B | 3.75B params | BF16 | Optimized for efficiency and for running locally on modest hardware | Common tasks |
- Developed by: H Company
- Model type: Action Vision-Language Model
- Finetuned from model: Qwen/Qwen2.5-VL-3B-Instruct
- Research paper
- Blog post
- License
## Results

### Surfer-H: Pareto-Optimal Performance on WebVoyager
Surfer-H is designed to be flexible and modular. It is composed of three independent components:
- A Policy model that plans, decides, and drives the agent’s behavior
- A Localizer model that sees and understands visual UIs to drive precise interactions
- A Validator model that checks whether the answer is valid
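To make the division of labor concrete, here is a minimal sketch of how three such components could be composed into an agent loop. All class, function, and return values below are illustrative stubs, not H Company's actual Surfer-H API.

```python
# Hypothetical sketch of a Policy/Localizer/Validator agent loop.
# The names and stubbed behaviors are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "answer"
    target: str = ""   # UI element description, or the proposed answer

def policy(task: str, history: list) -> Action:
    """Plans and decides the next step (stubbed: answer after one click)."""
    if any(a.kind == "click" for a in history):
        return Action("answer", "done")
    return Action("click", "search button")

def localizer(element: str) -> tuple:
    """Maps a described UI element to precise screen coordinates (stubbed)."""
    return (100, 200)

def validator(task: str, answer: str) -> bool:
    """Checks whether the proposed answer is acceptable (stubbed)."""
    return bool(answer)

def run_agent(task: str, max_steps: int = 5):
    history = []
    for _ in range(max_steps):
        action = policy(task, history)
        if action.kind == "answer" and validator(task, action.target):
            return action.target
        if action.kind == "click":
            x, y = localizer(action.target)  # where to click on screen
        history.append(action)
    return None
```

Because the three roles are independent, each can be backed by a different model, which is how the accuracy/cost pairings below arise.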
On the WebVoyager benchmark, this modular design yields a Pareto-optimal accuracy–cost tradeoff:
- Surfer-H + Holo1-7B: 92.2% accuracy at $0.13 per task
- Surfer-H + GPT-4.1: 92.0% at $0.54 per task
- Surfer-H + Holo1-3B: 89.7% at $0.11 per task
- Surfer-H + GPT-4.1-mini: 88.8% at $0.26 per task
### Holo1: State-of-the-Art UI Localization
A key skill underpinning the real-world utility of our VLMs within agents is localization: the ability to identify the precise coordinates on a user interface (UI) that must be interacted with to complete a task or follow an instruction. To assess this capability, we evaluated the Holo1 models on several established localization benchmarks, including Screenspot, Screenspot-V2, Screenspot-Pro, GroundUI-Web, and our newly introduced benchmark, WebClick.
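As an illustration of how a localization prediction might be consumed downstream, here is a minimal sketch that parses pixel coordinates out of a model response. The `Click(x, y)` output format is an assumption made for this example, not Holo1's documented response schema.

```python
import re

def parse_click(response: str) -> tuple:
    """Extract (x, y) pixel coordinates from a response such as
    'Click(832, 411)'. The 'Click(x, y)' format is assumed for
    illustration; a real agent would match the model's actual schema."""
    m = re.search(r"Click\((\d+),\s*(\d+)\)", response)
    if m is None:
        raise ValueError(f"no coordinates found in: {response!r}")
    return int(m.group(1)), int(m.group(2))

print(parse_click("Click(832, 411)"))  # → (832, 411)
```

A localization benchmark then scores whether the predicted point falls inside the ground-truth bounding box of the target element.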