Step 1: Install required dependencies
First, install the required dependencies. These instructions were tested on Python >= 3.11.
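The exact dependency list is not shown here; a typical setup for a transformers-based VLM would look something like the following (the package names are assumptions, not the official requirements file):

```bash
pip install -U transformers torch accelerate pillow requests pydantic
```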
Step 2: Load model and processor
Our Holo1.5 family of VLMs includes three model sizes to cover hardware requirements for deployment: 3B, 7B, and 72B. All checkpoints are state-of-the-art (SOTA) with respect to model size.
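As a sketch of what loading might look like with the Hugging Face transformers API (the repository id `Hcompany/Holo1.5-7B` and the use of `AutoModelForImageTextToText` are assumptions; swap in the 3B or 72B checkpoint to match your hardware):

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Repository id is an assumption; Holo1.5 also ships 3B and 72B checkpoints.
model_id = "Hcompany/Holo1.5-7B"

# Load the checkpoint in half precision and let accelerate place it on available devices.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```

Step 3: Define actions and prompt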
The model uses structured output to return precise navigation actions. This example specifically demonstrates the localization capability of our Holo1.5 model.
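A hedged sketch of a possible action schema and prompt builder; the `ClickAction` fields and the prompt wording are illustrative assumptions, not the official Holo1.5 action space:

```python
from pydantic import BaseModel

# Hypothetical structured action: a single click with absolute pixel coordinates.
class ClickAction(BaseModel):
    action: str = "click"
    x: int
    y: int

def build_localization_prompt(task: str) -> str:
    """Hypothetical prompt asking the model to answer with a ClickAction JSON object."""
    schema = ClickAction.model_json_schema()
    return (
        "Localize the element to interact with for the task below. "
        f"Answer only with a JSON object that matches this schema: {schema}\n"
        f"Task: {task}"
    )
```

Step 4: Download image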
Download a sample screenshot and define the navigation task.
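Something along these lines downloads a screenshot and sets the task; the URL is a placeholder and the task string is illustrative (the example output later in this post uses a calendar UI):

```python
from io import BytesIO

import requests
from PIL import Image

# Placeholder URL; substitute any UI screenshot you want to test with.
image_url = "https://example.com/calendar_screenshot.png"
response = requests.get(image_url, timeout=30)
response.raise_for_status()
image = Image.open(BytesIO(response.content)).convert("RGB")

# Illustrative navigation task.
task = "Select July 14 as the check-out date"
```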
Step 5: Set task and prepare image
Resize the input image, build the task prompt, and process everything into model-ready inputs.
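Continuing from the previous steps, one plausible way to do this is shown below; using the Qwen2-VL-style `smart_resize` helper and the processor attributes referenced here is an assumption about Holo1.5's preprocessing:

```python
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

# Resize so the image satisfies the processor's patch and pixel-budget constraints.
image_processor = processor.image_processor
resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=image_processor.patch_size * image_processor.merge_size,
    min_pixels=image_processor.min_pixels,
    max_pixels=image_processor.max_pixels,
)
resized_image = image.resize((resized_width, resized_height))

# Pair the screenshot with the localization prompt in a chat-style message.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": resized_image},
            {"type": "text", "text": build_localization_prompt(task)},
        ],
    }
]

# Turn the conversation into model-ready tensors.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[resized_image], return_tensors="pt").to(model.device)
```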
Step 6: Run inference
Because structured output is not enabled and generation is stochastic, parsing the output might fail; a defensive parsing sketch follows the list below. For production we recommend vLLM for:
- Faster inference performance
- Built-in structured output validation
- Reliable JSON schema compliance
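Continuing from Step 5, a minimal transformers generation call with defensive JSON parsing might look like this; `max_new_tokens` and the parsing logic are illustrative choices, not the official recipe:

```python
# Generate the answer; with default sampling the output is stochastic,
# so the JSON returned by the model may occasionally be malformed.
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Keep only the newly generated tokens, then decode them to text.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
raw_output = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(raw_output)

# Parse defensively: without constrained decoding there is no schema guarantee.
try:
    click = ClickAction.model_validate_json(raw_output)
    print(f"Predicted click point: ({click.x}, {click.y})")
except Exception as err:
    print(f"Could not parse structured output: {err}")
```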
Step 7: Visualize results
Display the model’s predicted click positions on the image, highlighting targets and annotations for easy interpretation.
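A simple Pillow-based sketch of the annotation described above; marker sizes, colors, and the output filename are arbitrary choices:

```python
from PIL import ImageDraw

# Annotate the resized screenshot with the predicted click point.
annotated = resized_image.copy()
draw = ImageDraw.Draw(annotated)

x, y = click.x, click.y   # coordinates parsed in the previous step
radius = 20

# Red circle marking the target area and a red cross at the exact click pixel.
draw.ellipse((x - radius, y - radius, x + radius, y + radius), outline="red", width=3)
draw.line((x - radius, y, x + radius, y), fill="red", width=3)
draw.line((x, y - radius, x, y + radius), fill="red", width=3)

# Yellow annotation describing the identified element (here, the task text).
draw.text((x + radius + 5, y - 10), task, fill="yellow")

annotated.save("holo15_prediction.png")
```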

Example output

- Red cross (⊕): Exact pixel coordinates where the model wants to click
- Red circle: Visual boundary around the target area
- Yellow annotation: Description of what element the model identified
In this example, the model predicts the click point (947, 338), indicating precise localization within the complex calendar interface.