benchmarks
Coverage, reference pages, tools, and guides connected to this topic.
SWE-bench Verified hits 78%, prompting calls for a harder coding eval
Top coding agents now resolve more than three of every four tasks in SWE-bench Verified, reigniting debate over whether the benchmark still discriminates between systems.