mode field sets what that observation contains and which actions are on the table.
| Mode | What the agent sees | How it acts | Reach for it when |
|---|---|---|---|
| `visual` (default) | A screenshot of the viewport, with the current URL and open tabs. | Points at on-screen targets; the platform resolves each one to exact click and type coordinates. | General web work: clicking, filling forms, anything that needs the rendered page. |
| `multimodal` | The screenshot and the page’s text in the same observation. | The same coordinate actions as `visual`. | Pages where reading the text alongside the screenshot helps the agent decide. |
| `text` | The page as text only, no screenshot, split into chunks of about `page_chars` characters. | Reads and pages through the chunks and follows links by URL. Read-only: no clicking or typing. | Reading and research at scale: search, scraping, link-heavy navigation. |
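
The "Reach for it when" column can be read as a small decision rule. The sketch below is one way to encode it; `choose_mode` and its two boolean parameters are hypothetical names used for illustration, not part of the platform's API.

```python
def choose_mode(needs_interaction: bool, needs_page_text: bool) -> str:
    """Pick a mode name following the guidance in the table above.

    Hypothetical helper: the platform only exposes the mode field itself.
    """
    if needs_interaction:
        # Clicking and typing exist only in the coordinate-based modes.
        return "multimodal" if needs_page_text else "visual"
    # No interaction needed: text mode suits pure reading,
    # otherwise fall back to the default.
    return "text" if needs_page_text else "visual"


print(choose_mode(needs_interaction=True, needs_page_text=False))
print(choose_mode(needs_interaction=False, needs_page_text=True))
```

The first call selects `visual` and the second selects `text`.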
- Cost and speed track the screenshots. `text` mode sends no images, so it uses the fewest tokens and runs fastest; `multimodal` sends the most (a screenshot plus the full page text every step); `visual` sits in between. Default to `visual`, and switch to `text` when the task is pure reading.
- `text` mode pages through long content. Rather than scrolling, the page is cut into `page_chars`-sized chunks; the agent moves between them and each observation tells it which chunk it is on. Raise `page_chars` to fit more per step, at the cost of more tokens per observation.
- Watch what the agent saw. Every observation rides the event stream, as a `web` observation in `visual`/`multimodal` and a `textual_web` observation in `text`, so you can replay each step exactly as the agent perceived it.
Actions
Each mode also fixes the set of actions available to the agent. It chooses them autonomously as it works; you never call them directly and there is no per-agent tool list to configure. To shape how it uses them, set the agent’s `instructions`.
| Action | Description | Visual & multimodal | Text |
|---|---|---|---|
| `go_to_web` | Navigate to a URL. | ✓ | ✓ |
| `go_back_web` | Go back in the browser history. | ✓ | ✓ |
| `refresh_web` | Refresh the current page. | ✓ | ✓ |
| `switch_tab_web` | Switch to another tab, or open a new one. | ✓ | ✓ |
| `close_tab_web` | Close a tab. | ✓ | ✓ |
| `click_web` | Click at viewport coordinates. | ✓ | |
| `write` | Focus an input at coordinates and type into it. | ✓ | |
| `fill_secret_at` | Fill a vault-resolved secret (e.g. `password`, `totp`) into a field at coordinates so the agent can sign in on your behalf. The value is injected directly into the page and never enters the agent’s context. Offered only when a vault is bound to the browser via `vault_id` and can match a credential for the current page. | ✓ | |
| `select_option` | Pick an option from a native `<select>` dropdown. | ✓ | |
| `move_mouse_web` | Move the mouse to reveal hovers, tooltips, or menus. | ✓ | |
| `press_keys_web` | Press keys or keyboard shortcuts. | ✓ | |
| `scroll_web` | Scroll the page or a nested scrollable container. | ✓ | |
| `ctrl_f_web` | Jump to the next on-page match of a text query. | ✓ | |
| `reader_mode` | Extract the page’s main content as clean markdown. | | ✓ |
| `find_in_page` | Find a text query and jump to the chunk that contains it. | | ✓ |
| `switch_chunk` | Page forward or backward through the page’s text chunks. | | ✓ |
| `wait_web` | Pause for the page to settle (up to 60 seconds). | ✓ | ✓ |
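
Because the agent picks actions on its own, a lookup like the one below is useful only on your side, for example to check that a replayed step used an action its mode allows. It is a sketch covering a few representative actions. It assumes the chunk-based actions (`find_in_page`, `switch_chunk`) belong to `text` mode, since chunks exist only there. `ACTION_MODES` and `actions_for` are hypothetical names, not platform APIs.

```python
COORDINATE_MODES = frozenset({"visual", "multimodal"})
TEXT_ONLY = frozenset({"text"})
ALL_MODES = COORDINATE_MODES | TEXT_ONLY

# Representative subset of the actions table; not exhaustive.
ACTION_MODES = {
    "go_to_web": ALL_MODES,
    "wait_web": ALL_MODES,
    "click_web": COORDINATE_MODES,
    "write": COORDINATE_MODES,
    "scroll_web": COORDINATE_MODES,
    "find_in_page": TEXT_ONLY,   # assumption: chunk actions are text-only
    "switch_chunk": TEXT_ONLY,   # assumption: chunk actions are text-only
}


def actions_for(mode: str) -> list[str]:
    """List the actions from the subset above that a mode offers."""
    if mode not in ALL_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return sorted(name for name, modes in ACTION_MODES.items() if mode in modes)


print(actions_for("text"))
print(actions_for("visual"))
```

Under these assumptions, `visual` and `multimodal` return the same list, and `text` never includes a coordinate action such as `click_web`.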