Evaluation
Coverage, reference pages, tools, and guides connected to this topic.
- SWE-bench Verified hits 78%, prompting calls for a harder coding eval
  Top coding agents now resolve more than three of every four tasks in SWE-bench Verified, reigniting debate over whether the benchmark still discriminates between systems.
- The case for replay-based agent evaluation
  Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets: captured user sessions scored against a held-out outcome.
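
The replay idea reduces to a small loop: feed each captured session back through the agent and compare its output to the outcome recorded in production, which the agent never sees. Below is a minimal sketch of that loop under stated assumptions; `ReplayCase`, `evaluate_replays`, and exact-match scoring are hypothetical stand-ins for illustration, not the paper's actual schema or protocol.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReplayCase:
    """One captured user session (hypothetical schema)."""
    session_id: str
    user_turns: list[str]     # the session's user inputs, replayed in order
    held_out_outcome: str     # outcome observed in production, hidden from the agent

def evaluate_replays(agent: Callable[[list[str]], str],
                     cases: list[ReplayCase]) -> float:
    """Replay each session through the agent and score its final answer
    against the held-out outcome; returns the pass rate over all cases."""
    passed = 0
    for case in cases:
        prediction = agent(case.user_turns)   # agent sees only the replayed turns
        if prediction.strip() == case.held_out_outcome.strip():
            passed += 1
    return passed / len(cases) if cases else 0.0
```

In practice the comparison would rarely be exact string match; a real harness would plug in a task-appropriate scorer (unit tests, rubric grading, outcome checks), but the replay-then-score structure stays the same.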