Benchmark

SWE-bench Verified

Human-validated subset of SWE-bench, the canonical coding-agent benchmark.

Measures: Real-world bug-fix resolution
Current leader: Claude Sonnet 4.5 (78.4%)

SWE-bench Verified is the current canonical eval for coding agents. As of April 2026, it is approaching saturation; see the related news coverage.