MolmoWeb: Open Visual Web Agents from Screenshots Only
Fully open-weight visual agents navigate complex web environments from screenshots alone, reaching 94.7% pass@4 on WebVoyager, state of the art among open models.
MolmoWeb shows that web agents don't need HTML trees or APIs: a capable vision model reading screenshots is enough. Open-weight VLMs click through messy sites and reach 94.7% pass@4 on WebVoyager, with performance scaling linearly with compute. That bridges much of the gap to closed models without infrastructure hacks.
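For readers unfamiliar with the metric, here is a minimal sketch of pass@4, assuming the usual reading: a task counts as solved if at least one of its four attempts succeeds. The function name and the outcome data are illustrative, not taken from the paper, and the paper's exact protocol may differ.

```python
from typing import Sequence


def pass_at_k(attempts_per_task: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where at least one attempt succeeded.

    Each inner sequence holds the success flags for the k attempts
    made on one task (k = 4 for pass@4).
    """
    if not attempts_per_task:
        raise ValueError("need at least one task")
    solved = sum(1 for attempts in attempts_per_task if any(attempts))
    return solved / len(attempts_per_task)


# Made-up outcomes for four tasks, four attempts each (not real results).
outcomes = [
    [False, True, False, False],   # solved on the 2nd attempt
    [False, False, False, False],  # never solved
    [True, True, True, True],      # solved every time
    [False, False, False, True],   # solved on the last attempt
]
score = pass_at_k(outcomes)
print(f"pass@4 = {score:.2f}")
```

Note that pass@4 is more forgiving than single-attempt accuracy, so the two numbers should not be compared directly.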
What changed. Vision-only web agents now match the state of the art among open models (94.7% pass@4 on WebVoyager) by scaling on screenshots, with no HTML required.
Why it matters. It unlocks deployable web automation for any VLM and removes the dependency on HTML parsers.
Builder takeaway. Screenshot the browser and feed the image to a VLM: you get robust tooling without DOM fragility.
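The takeaway above amounts to an observe-decide-act loop over screenshots. The following is a rough sketch of that loop, not MolmoWeb's actual implementation: `Action`, `run_episode`, and the `capture`, `policy`, and `execute` callables are all hypothetical names. In practice `capture` would wrap a browser screenshot call and `policy` would wrap a VLM request; here both are stubbed so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical action format: a kind plus pixel coordinates or text."""
    kind: str          # "click", "type", or "stop"
    x: int = 0
    y: int = 0
    text: str = ""


def run_episode(
    task: str,
    capture: Callable[[], bytes],
    policy: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Screenshot-only loop: no HTML or DOM is ever read."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                       # observe: pixels only
        action = policy(task, screenshot, history)   # decide: the VLM goes here
        history.append(action)
        if action.kind == "stop":
            break
        execute(action)                              # act: click or type in the browser
    return history


# Stubbed demo: a scripted "model" stands in for the VLM.
scripted = iter([
    Action("click", x=120, y=340),
    Action("type", text="molmo"),
    Action("stop"),
])
executed: List[Action] = []
history = run_episode(
    task="search for molmo",
    capture=lambda: b"fake-png-bytes",
    policy=lambda task, shot, hist: next(scripted),
    execute=executed.append,
)
print([a.kind for a in history])
```

Keeping the browser and the model behind plain callables means either side can be swapped without touching the loop, which is the practical benefit of having no DOM dependency.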