Topic

evaluation

Coverage, reference pages, tools, and guides connected to this topic.

  1. SWE-bench Verified hits 78%, prompting calls for a harder coding eval

    Top coding agents now resolve more than three of every four tasks in SWE-bench Verified, reigniting debate over whether the benchmark still discriminates between systems.

  2. The case for replay-based agent evaluation

    Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets: captured user sessions replayed through the agent and scored against held-out outcomes.
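
    The replay idea can be sketched in a few lines. Everything here is a hypothetical illustration, not the paper's implementation: `Session`, `replay_score`, and the toy agent are made-up names, and real replay sets would capture richer session state than a list of user turns.

    ```python
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Session:
        """A captured user session: recorded inputs plus a held-out outcome."""
        user_turns: list[str]      # messages the user sent, in order
        expected_outcome: str      # held-out label, hidden from the agent

    def replay_score(sessions: list[Session],
                     agent: Callable[[list[str]], str]) -> float:
        """Replay each session through the agent; score against the held-out outcome."""
        hits = 0
        for s in sessions:
            final = agent(s.user_turns)   # the agent sees only the recorded turns
            hits += (final == s.expected_outcome)
        return hits / len(sessions)

    # A trivial agent that echoes the last user turn, for illustration only.
    echo_agent = lambda turns: turns[-1]
    sessions = [
        Session(["reset my password", "done"], "done"),
        Session(["cancel my order"], "cancelled"),
    ]
    print(replay_score(sessions, echo_agent))  # 0.5
    ```

    The point of the structure is that the outcome stays out of the agent's view during replay, so the score reflects behavior on real traffic rather than on benchmark-shaped tasks.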