Benchmarks
The public benchmarks most often cited for agent quality. We summarize what each measures and treat saturation seriously.
| Benchmark | Measures | Current leader |
|---|---|---|
| GAIA | Multi-step reasoning with tools and retrieval | GPT-5 |
| OSWorld | Cross-application desktop tasks | Claude (Computer Use) |
| SWE-bench Verified | Real-world bug-fix resolution | Claude Sonnet 4.5 (78.4%) |
| WebArena | Goal-directed web tasks across realistic sites | Claude (Computer Use) |
| τ-bench | Tool-use under user-style policies | Claude Sonnet 4.5 |