At each step the agent receives a fresh observation of the page, then chooses one action. The mode field sets what that observation contains and which actions are on the table.
| Mode | What the agent sees | How it acts | Reach for it when |
| --- | --- | --- | --- |
| visual (default) | A screenshot of the viewport, with the current URL and open tabs. | Points at on-screen targets; the platform resolves each one to exact click and type coordinates. | General web work: clicking, filling forms, anything that needs the rendered page. |
| multimodal | The screenshot and the page’s text in the same observation. | The same coordinate actions as visual. | Pages where reading the text alongside the screenshot helps the agent decide. |
| text | The page as text only, no screenshot, split into chunks of about page_chars characters. | Reads and pages through the chunks and follows links by URL. Read-only: no clicking or typing. | Reading and research at scale: search, scraping, link-heavy navigation. |
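
The "Reach for it when" column can be read as a small decision rule. The sketch below encodes it in Python; `choose_mode` and its two parameters are hypothetical names used for illustration, not part of the platform, and only the three mode values come from the table.

```python
from typing import Literal

Mode = Literal["visual", "multimodal", "text"]


def choose_mode(needs_rendered_page: bool, text_helps: bool = False) -> Mode:
    """Pick a mode value from the table's guidance (hypothetical helper)."""
    if needs_rendered_page:
        # Clicking, typing, and form filling rely on coordinate actions,
        # which only the screenshot-based modes offer.
        return "multimodal" if text_helps else "visual"
    # Pure reading: text mode is read-only and sends no screenshot.
    return "text"


print(choose_mode(needs_rendered_page=True))                   # visual
print(choose_mode(needs_rendered_page=True, text_helps=True))  # multimodal
print(choose_mode(needs_rendered_page=False))                  # text
```

The returned string is what you would place in the mode field; how that field is passed to the platform is not shown here.
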
A few things that follow from the table:
  • Cost and speed track the size of each observation. text mode sends no images, so it uses the fewest tokens and runs fastest; multimodal sends the most (a screenshot plus the full page text every step); visual sits in between. Default to visual, and switch to text when the task is pure reading.
  • text mode pages through long content. Instead of scrolling, the agent reads the page in page_chars-sized chunks; it moves between them, and each observation tells it which chunk it is on. Raise page_chars to fit more per step, at the cost of more tokens per observation.
  • Watch what the agent saw. Every observation rides the event stream, as a web observation in visual/multimodal and a textual_web observation in text, so you can replay each step exactly as the agent perceived it.
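
To make the page_chars trade-off concrete, here is a minimal sketch of fixed-width chunking. It is an assumption for illustration only: the text above says chunks are "about" page_chars characters, so the platform's splitter may well respect word or block boundaries, and `split_into_chunks` is a hypothetical name.

```python
def split_into_chunks(page_text: str, page_chars: int) -> list[str]:
    """Cut text into consecutive pieces of at most page_chars characters."""
    if page_chars <= 0:
        raise ValueError("page_chars must be positive")
    return [
        page_text[start:start + page_chars]
        for start in range(0, len(page_text), page_chars)
    ]


page = "x" * 10_000  # stand-in for a long page's extracted text
for page_chars in (2_000, 5_000):
    chunks = split_into_chunks(page, page_chars)
    print(page_chars, len(chunks))
# prints:
# 2000 5
# 5000 2
```

A larger page_chars means fewer chunks to page through, but each observation carries more text and therefore more tokens.
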

Actions

Each mode also fixes the set of actions available to the agent. It chooses them autonomously as it works; you never call them directly and there is no per-agent tool list to configure. To shape how it uses them, set the agent’s instructions.
| Action | Description | Visual & multimodal | Text |
| --- | --- | --- | --- |
| go_to_web | Navigate to a URL. | | |
| go_back_web | Go back in the browser history. | | |
| refresh_web | Refresh the current page. | | |
| switch_tab_web | Switch to another tab, or open a new one. | | |
| close_tab_web | Close a tab. | | |
| click_web | Click at viewport coordinates. | | |
| write | Focus an input at coordinates and type into it. | | |
| fill_secret_at | Fill a vault-resolved secret (e.g. password, totp) into a field at coordinates so the agent can sign in on your behalf. The value is injected directly into the page and never enters the agent’s context. Offered only when a vault is bound to the browser via vault_id and can match a credential for the current page. | | |
| select_option | Pick an option from a native `<select>` dropdown. | | |
| move_mouse_web | Move the mouse to reveal hovers, tooltips, or menus. | | |
| press_keys_web | Press keys or keyboard shortcuts. | | |
| scroll_web | Scroll the page or a nested scrollable container. | | |
| ctrl_f_web | Jump to the next on-page match of a text query. | | |
| reader_mode | Extract the page’s main content as clean markdown. | | |
| find_in_page | Find a text query and jump to the chunk that contains it. | | |
| switch_chunk | Page forward or backward through the page’s text chunks. | | |
| wait_web | Pause for the page to settle (up to 60 seconds). | | |
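
The two chunk-oriented actions, switch_chunk and find_in_page, can be pictured with a small model. The sketch below is an illustration under stated assumptions, not the platform's implementation: the `TextPager` class, case-insensitive matching, clamping at the first and last chunk, and the observation header format are all hypothetical. The agent invokes the real actions itself; you never call them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextPager:
    """Toy model of paging through a page's text chunks (hypothetical)."""
    chunks: list[str]
    index: int = 0

    def __post_init__(self) -> None:
        if not self.chunks:
            raise ValueError("a page needs at least one chunk")

    def observe(self) -> str:
        # Each observation says which chunk the reader is on.
        return f"[chunk {self.index + 1}/{len(self.chunks)}]\n{self.chunks[self.index]}"

    def switch_chunk(self, step: int) -> str:
        # Page forward (+1) or backward (-1); assumed to clamp at the ends.
        self.index = max(0, min(len(self.chunks) - 1, self.index + step))
        return self.observe()

    def find_in_page(self, query: str) -> Optional[str]:
        # Jump to the first chunk containing the query; stay put if none does.
        # A match that straddles a chunk boundary is missed in this sketch.
        for i, chunk in enumerate(self.chunks):
            if query.lower() in chunk.lower():
                self.index = i
                return self.observe()
        return None


pager = TextPager(["intro text", "pricing table", "contact details"])
print(pager.switch_chunk(+1))         # moves to chunk 2 of 3
print(pager.find_in_page("contact"))  # jumps to chunk 3 of 3
```
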