MolmoWeb: Open Visual Web Agents from Screenshots Only

Fully open-weight visual agents navigate complex web environments from screenshots alone, reaching 94.7% pass@4 on WebVoyager, state of the art among open models.

May 14, 2026, Molmo Team

MolmoWeb shows that web agents don’t need HTML trees or APIs, just capable vision models reading screenshots. Its open-weight VLMs click through messy real-world sites and reach 94.7% pass@4 on WebVoyager, with performance reported to scale linearly with compute. That goes a long way toward closing the gap with closed models, without infrastructure workarounds.
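For readers unfamiliar with the metric, pass@4 is usually read as: a task counts as solved if at least one of four attempts succeeds. Here is a minimal sketch of that reading, assuming each task has a recorded list of attempt outcomes. The function name `pass_at_k` and the toy data are illustrative and do not come from the paper, whose exact evaluation protocol may differ.

```python
from typing import Sequence


def pass_at_k(attempts_per_task: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k attempts succeeded."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts_per_task:
        raise ValueError("need at least one task")
    solved = sum(1 for attempts in attempts_per_task if any(attempts[:k]))
    return solved / len(attempts_per_task)


# Toy data: three tasks, four attempts each (made up for illustration).
toy_results = [
    [False, True, False, False],   # solved on the second attempt
    [False, False, False, False],  # never solved
    [True, True, True, True],      # solved every time
]
print(pass_at_k(toy_results, k=4))
print(pass_at_k(toy_results, k=1))
```

Extra attempts can only help, so under this reading pass@4 is never lower than pass@1, and the two numbers should not be compared directly.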

What changed. Pure-vision web agents now match state-of-the-art open models (94.7% pass@4 on WebVoyager) by scaling up a screenshot-only approach, with no HTML required.

Why it matters. It opens up deployable web automation for any VLM and removes the dependency on HTML parsers.

Builder takeaway. Screenshot your browser and feed the image to a VLM: you get robust tooling without DOM fragility.
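The takeaway can be sketched as a simple observe-act loop: take a screenshot, ask the model for the next action, execute it, repeat. This is a sketch under stated assumptions, not MolmoWeb's actual interface. The action grammar (`click(x, y)`, `type("...")`, `stop()`), the `Action` type, and the `run_agent` signature are all hypothetical, and the browser and model are passed in as plain callables so the example runs with stubs.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Action:
    kind: str                  # "click", "type", or "stop"
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None


# Hypothetical action grammar; a real model defines its own output format.
_CLICK = re.compile(r"^click\((\d+),\s*(\d+)\)$")
_TYPE = re.compile(r'^type\("(.*)"\)$')


def parse_action(reply: str) -> Action:
    """Turn a model reply such as 'click(320, 240)' into an Action."""
    reply = reply.strip()
    if reply == "stop()":
        return Action("stop")
    m = _CLICK.match(reply)
    if m:
        return Action("click", x=int(m.group(1)), y=int(m.group(2)))
    m = _TYPE.match(reply)
    if m:
        return Action("type", text=m.group(1))
    raise ValueError(f"unrecognized action: {reply!r}")


def run_agent(
    task: str,
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[str, bytes], str],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Screenshot-only loop: no DOM or HTML is ever read."""
    history: List[Action] = []
    for _ in range(max_steps):
        image = take_screenshot()                  # pixels are the only observation
        action = parse_action(ask_model(task, image))
        history.append(action)
        if action.kind == "stop":
            break
        execute(action)                            # e.g. mouse click at (x, y)
    return history


# Stub wiring so the sketch runs without a browser or a model.
scripted = iter(["click(320, 240)", 'type("molmo")', "stop()"])
executed: List[Action] = []
history = run_agent(
    task="search for molmo",
    take_screenshot=lambda: b"fake-png-bytes",
    ask_model=lambda task, image: next(scripted),
    execute=executed.append,
)
print([a.kind for a in history])
```

In a real setup, `take_screenshot` and `execute` could be backed by a browser automation library (for example, Playwright's `page.screenshot()` and `page.mouse.click(x, y)`), and `ask_model` by whichever VLM endpoint you use. Keeping them as injected callables means the loop itself never touches the DOM.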

