Reference

Benchmarks

The public benchmarks most often cited for agent quality. We summarize what each measures and take saturation seriously: once top scores cluster near the ceiling, a benchmark stops discriminating between models.

| Benchmark | Measures | Current leader |
| --- | --- | --- |
| GAIA | Multi-step reasoning with tools and retrieval | GPT-5 |
| OSWorld | Cross-application desktop tasks | Claude (Computer Use) |
| SWE-bench Verified | Real-world bug-fix resolution | Claude Sonnet 4.5 (78.4%) |
| WebArena | Goal-directed web tasks across realistic sites | Claude (Computer Use) |
| τ-bench | Tool use under user-style policies | Claude Sonnet 4.5 |