Modes - H Tech Hub

At each step the agent receives a fresh observation of the page, then chooses one action. The `mode` field sets what that observation contains and which actions are on the table.

Mode	What the agent sees	How it acts	Reach for it when
`visual` (default)	A screenshot of the viewport, with the current URL and open tabs.	Points at on-screen targets; the platform resolves each one to exact click and type coordinates.	General web work: clicking, filling forms, anything that needs the rendered page.
`multimodal`	The screenshot and the page’s text in the same observation.	The same coordinate actions as `visual`.	Pages where reading the text alongside the screenshot helps the agent decide.
`text`	The page as text only, no screenshot, split into chunks of about `page_chars` characters.	Reads and pages through the chunks and follows links by URL. Read-only: no clicking or typing.	Reading and research at scale: search, scraping, link-heavy navigation.

A few things that follow from the table:

Cost and speed track the screenshots. `text` mode sends no images, so it uses the fewest tokens and runs fastest; `multimodal` sends the most (a screenshot plus the full page text every step); `visual` sits in between. Default to `visual`, and switch to `text` when the task is pure reading.
`text` mode pages through long content. Instead of scrolling, the agent works through the page in `page_chars`-sized chunks; it moves between them, and each observation tells it which chunk it is on. Raise `page_chars` to fit more per step, at the cost of more tokens per observation.
Watch what the agent saw. Every observation rides the event stream, as a `web` observation in `visual`/`multimodal` and a `textual_web` observation in `text`, so you can replay each step exactly as the agent perceived it.
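
The chunked paging used by `text` mode can be sketched roughly as follows. This is an illustration only: the `ChunkedPage` class, its method bodies, and the plain fixed-width slicing are assumptions made for the example, not the platform's implementation. Only the names `page_chars`, `switch_chunk`, and `find_in_page` come from this page.

```python
from dataclasses import dataclass


@dataclass
class ChunkedPage:
    """Hypothetical stand-in for a text-mode page split into chunks."""

    text: str
    page_chars: int
    index: int = 0  # zero-based position of the chunk currently observed

    def __post_init__(self) -> None:
        if self.page_chars <= 0:
            raise ValueError("page_chars must be positive")
        # Assumption: plain fixed-width slicing. The real splitter may
        # respect word or block boundaries ("about page_chars characters").
        self.chunks = [
            self.text[i:i + self.page_chars]
            for i in range(0, len(self.text), self.page_chars)
        ] or [""]

    def observe(self) -> str:
        """Return the current chunk, prefixed with its position."""
        header = f"[chunk {self.index + 1}/{len(self.chunks)}]"
        return f"{header}\n{self.chunks[self.index]}"

    def switch_chunk(self, step: int) -> str:
        """Page forward (step > 0) or backward (step < 0), clamped to the page."""
        self.index = max(0, min(len(self.chunks) - 1, self.index + step))
        return self.observe()

    def find_in_page(self, query: str) -> str:
        """Jump to the chunk where the first match starts; stay put if none."""
        pos = self.text.find(query)
        if pos >= 0:
            self.index = pos // self.page_chars
        return self.observe()
```

With a 25-character page and `page_chars=10`, this sketch produces three chunks; a larger `page_chars` yields fewer, longer observations, which is the trade-off between steps and tokens per observation described above.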

Actions

Each mode also fixes the set of actions available to the agent. The agent chooses among them autonomously as it works; you never call them directly and there is no per-agent tool list to configure. To shape how it uses them, set the agent’s instructions.

Action	Description	Visual & multimodal	Text
`go_to_web`	Navigate to a URL.	✓	✓
`go_back_web`	Go back in the browser history.	✓	✓
`refresh_web`	Refresh the current page.	✓	✓
`switch_tab_web`	Switch to another tab, or open a new one.	✓	✓
`close_tab_web`	Close a tab.	✓	✓
`click_web`	Click at viewport coordinates.	✓
`write`	Focus an input at coordinates and type into it.	✓
`fill_secret_at`	Fill a vault-resolved secret (e.g. `password`, `totp`) into a field at coordinates so the agent can sign in on your behalf. The value is injected directly into the page and never enters the agent’s context. Offered only when a vault is bound to the browser via `vault_id` and can match a credential for the current page.	✓
`select_option`	Pick an option from a native `<select>` dropdown.	✓
`move_mouse_web`	Move the mouse to reveal hovers, tooltips, or menus.	✓
`press_keys_web`	Press keys or keyboard shortcuts.	✓
`scroll_web`	Scroll the page or a nested scrollable container.	✓
`ctrl_f_web`	Jump to the next on-page match of a text query.	✓
`reader_mode`	Extract the page’s main content as clean markdown.	✓
`find_in_page`	Find a text query and jump to the chunk that contains it.		✓
`switch_chunk`	Page forward or backward through the page’s text chunks.		✓
`wait_web`	Pause for the page to settle (up to 60 seconds).	✓	✓
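
For quick reference, the availability columns of the table can be encoded as a small lookup. This is a sketch for reasoning about the table, not a platform API: the `actions_for` helper and the `vault_match` flag are hypothetical names, and the flag stands in for the condition that a vault is bound via `vault_id` and can match a credential for the current page.

```python
# Actions offered in every mode.
SHARED = frozenset({
    "go_to_web", "go_back_web", "refresh_web",
    "switch_tab_web", "close_tab_web", "wait_web",
})

# Actions offered only in the visual and multimodal modes.
COORDINATE_ONLY = frozenset({
    "click_web", "write", "select_option", "move_mouse_web",
    "press_keys_web", "scroll_web", "ctrl_f_web", "reader_mode",
})

# Actions offered only in text mode.
TEXT_ONLY = frozenset({"find_in_page", "switch_chunk"})


def actions_for(mode: str, vault_match: bool = False) -> frozenset:
    """Return the action names the table lists for a mode (hypothetical helper)."""
    if mode in ("visual", "multimodal"):
        actions = SHARED | COORDINATE_ONLY
        if vault_match:
            # fill_secret_at is conditional on a bound, matching vault.
            actions = actions | {"fill_secret_at"}
        return actions
    if mode == "text":
        return SHARED | TEXT_ONLY
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, `"click_web" in actions_for("text")` evaluates to `False`, which reflects the read-only nature of `text` mode.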


Actions