Allen Institute for AI

MolmoWeb: Open Visual Web Agents from Screenshots Only

Fully open-weight visual agents navigate complex web environments from screenshots alone, reaching 94.7% pass@4 on WebVoyager and matching the state of the art among open models.

MolmoWeb shows you don't need HTML trees or APIs for web agents, just good vision models eating screenshots. Open-weight VLMs click through messy sites at 94.7% pass@4 on WebVoyager, with performance scaling linearly with compute. That bridges a large gap to closed models without infrastructure hacks.
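
For readers new to the metric, here is a minimal sketch of how a pass@4 number is typically computed, assuming the common definition in which a task counts as solved if at least one of its k attempts succeeds. The task names and outcomes are made up for illustration and are not MolmoWeb results, and the paper's exact evaluation protocol may differ.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among the first k attempts."""
    if not attempts:
        raise ValueError("no tasks given")
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Hypothetical outcomes: task id -> success flag for each of 4 attempts.
example = {
    "book_flight": [False, True, False, False],
    "find_recipe": [True, True, True, True],
    "compare_prices": [False, False, False, False],
    "check_weather": [False, False, False, True],
}

print(pass_at_k(example, k=1))  # 0.25
print(pass_at_k(example, k=4))  # 0.75
```

The same agent looks much stronger at k=4 than at k=1, which is why the "pass@4" qualifier matters when comparing headline numbers.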

What changed. Pure-vision web agents now match the best open models (94.7% pass@4 on WebVoyager) by scaling a screenshot-only approach, with no HTML required.

Why it matters. Unlocks deployable web automation for any VLM and kills parser dependencies.

Builder takeaway. Screenshot your browser and feed the image to a VLM: robust tooling without DOM fragility. Paper
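
The takeaway can be sketched as a simple observe-act loop. This is an illustration under stated assumptions, not MolmoWeb's actual implementation: `Action`, `Browser`, `run_agent`, `FakeBrowser`, and `scripted_policy` are all hypothetical names, the browser is a stub, and the "policy" is a fixed script standing in for a real VLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass
class Action:
    kind: str      # "click", "type", or "done" (hypothetical action set)
    x: int = 0
    y: int = 0
    text: str = ""


class Browser(Protocol):
    """Hypothetical interface; a real driver would wrap your automation tool."""
    def screenshot(self) -> bytes: ...
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...


# A policy maps (task, screenshot bytes, past actions) to the next action.
Policy = Callable[[str, bytes, List[Action]], Action]


def run_agent(task: str, browser: Browser, policy: Policy,
              max_steps: int = 10) -> List[Action]:
    history: List[Action] = []
    for _ in range(max_steps):
        image = browser.screenshot()  # pixels are the only observation
        action = policy(task, image, history)
        history.append(action)
        if action.kind == "done":
            break
        if action.kind == "click":
            browser.click(action.x, action.y)
        elif action.kind == "type":
            browser.type_text(action.text)
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
    return history


class FakeBrowser:
    """Stub that records calls instead of driving a real browser."""
    def __init__(self) -> None:
        self.log: List[str] = []

    def screenshot(self) -> bytes:
        return f"fake-png-{len(self.log)}".encode()

    def click(self, x: int, y: int) -> None:
        self.log.append(f"click({x}, {y})")

    def type_text(self, text: str) -> None:
        self.log.append(f"type({text!r})")


def scripted_policy(task: str, image: bytes, history: List[Action]) -> Action:
    # Stand-in for a VLM: click a search box, type the task, then stop.
    script = [Action("click", x=120, y=40), Action("type", text=task), Action("done")]
    return script[len(history)]


browser = FakeBrowser()
history = run_agent("open-weight web agents", browser, scripted_policy)
print(browser.log)  # ["click(120, 40)", "type('open-weight web agents')"]
```

Note that nothing in the loop touches the DOM: swapping `FakeBrowser` for a real driver and `scripted_policy` for a model call leaves `run_agent` unchanged, which is the appeal of the screenshot-only design.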

The Agent Brief
