AuditRepairBench: A Paired Execution-Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair
New benchmark exposes ranking instability in agent repair leaderboards under evaluator reconfiguration, enabling more reliable evaluation of AI agents' debugging capabilities.
Agentic AI leaderboards for repair and debugging tasks suffer severe ranking instability when evaluators are reconfigured, as the new AuditRepairBench dataset demonstrates. The 300-scenario corpus, spanning six risk categories, pairs execution traces to isolate the sources of that instability, and it finds that methods which incorporate evaluator feedback during repair selection drive much of the volatility. Submitted to NeurIPS 2026’s Evaluation and Directions Track, the benchmark gives builders tools to audit leaderboard trustworthiness.
What changed. AuditRepairBench provides the first paired-trace benchmark for dissecting evaluator-induced reordering in agent repair leaderboards.
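To make the paired-trace idea concrete, below is a minimal, hypothetical audit in Python: it ranks the same set of methods under two evaluator configurations and measures how much the leaderboard reorders using Kendall's tau. The score values, field names, and the choice of Kendall's tau are illustrative assumptions, not the paper's actual schema or metric.

```python
# Hypothetical sketch: auditing leaderboard stability across paired
# evaluator configurations. AuditRepairBench's real schema and metric
# are not reproduced here; everything below is an assumed example.
from scipy.stats import kendalltau

# Per-method success rates on the same scenarios, scored once under each
# evaluator configuration (the "paired traces"). Values are made up.
scores = {
    "method_a": {"eval_v1": 0.72, "eval_v2": 0.58},
    "method_b": {"eval_v1": 0.69, "eval_v2": 0.66},
    "method_c": {"eval_v1": 0.61, "eval_v2": 0.63},
}

def ranking(config: str) -> list[str]:
    """Leaderboard order (best first) under one evaluator configuration."""
    return sorted(scores, key=lambda m: scores[m][config], reverse=True)

rank_v1, rank_v2 = ranking("eval_v1"), ranking("eval_v2")

# Kendall's tau over each method's rank position in the two leaderboards:
# tau near 1.0 means a stable ranking, near 0 or below means heavy reordering.
pos_v1 = [rank_v1.index(m) for m in scores]
pos_v2 = [rank_v2.index(m) for m in scores]
tau, _ = kendalltau(pos_v1, pos_v2)
print(f"{rank_v1=} {rank_v2=} tau={tau:.2f}")
```

With the made-up scores above, reconfiguring the evaluator flips the top method and drives tau negative, which is exactly the kind of reordering the paired traces are designed to expose.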
Why it matters. Reliable benchmarks are foundational for advancing agentic AI, especially in safety-critical repair tasks where misleading rankings can steer deployment toward flawed systems.
For builders, this signals a need to redesign repair architectures so that internal candidate selection is decoupled from external evaluators, ensuring performance that holds up beyond the leaderboard (a minimal sketch follows below). Read the paper.
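As one illustration of that decoupling, here is a hypothetical Python sketch in which candidate patches are ranked solely by agent-internal checks, so reconfiguring the external evaluator cannot change which repair is selected. All names and the scoring scheme are illustrative assumptions, not a design the paper prescribes.

```python
# Hypothetical sketch of evaluator-decoupled repair selection: candidates
# are ranked only by internal signals (here, a local check suite), and the
# external evaluator is never consulted during selection.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    patch: str
    internal_score: float = 0.0  # fraction of local checks passed

def select_repair(
    candidates: list[Candidate],
    local_checks: list[Callable[[str], bool]],
) -> Candidate:
    """Pick a patch using only agent-internal checks.

    Crucially, no call to the benchmark's evaluator happens here, so
    reconfiguring that evaluator cannot change which patch is chosen.
    """
    for cand in candidates:
        passed = sum(check(cand.patch) for check in local_checks)
        cand.internal_score = passed / len(local_checks)
    return max(candidates, key=lambda c: c.internal_score)

# Toy usage: two string-level checks stand in for a real local test suite.
checks = [lambda p: "fix" in p, lambda p: len(p) > 10]
best = select_repair([Candidate("fix overflow bug"), Candidate("noop")], checks)
print(best.patch)  # -> "fix overflow bug"
```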