Topic

evaluation

Coverage, reference pages, tools, and guides connected to this topic.

  1. SWE-bench Verified hits 78%, prompting calls for a harder coding eval

    Top coding agents now resolve more than three of every four tasks in SWE-bench Verified, reigniting debate over whether the benchmark still discriminates between systems.

  2. The case for replay-based agent evaluation

    Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets: captured user sessions replayed through the agent and scored against held-out outcomes.
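
    The replay idea can be sketched in a few lines. Everything here is a hypothetical illustration, not the paper's implementation: `Session`, `replay_score`, and the toy agent are made-up names, and real replay sets would capture richer session state than a list of user turns.

    ```python
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Session:
        """A captured user session: recorded inputs plus a held-out outcome."""
        user_turns: list[str]      # messages the user sent, in order
        expected_outcome: str      # held-out label, hidden from the agent

    def replay_score(sessions: list[Session],
                     agent: Callable[[list[str]], str]) -> float:
        """Replay each session through the agent; score against the held-out outcome."""
        hits = 0
        for s in sessions:
            final = agent(s.user_turns)   # the agent sees only the recorded turns
            hits += (final == s.expected_outcome)
        return hits / len(sessions)

    # A trivial agent that echoes the last user turn, for illustration only.
    echo_agent = lambda turns: turns[-1]
    sessions = [
        Session(["reset my password", "done"], "done"),
        Session(["cancel my order"], "cancelled"),
    ]
    print(replay_score(sessions, echo_agent))  # 0.5
    ```

    The point of the structure is that the outcome stays out of the agent's view during replay, so the score reflects behavior on real traffic rather than on benchmark-shaped tasks.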