SWE-bench Verified hits 78%, prompting calls for a harder coding eval
Top coding agents now resolve more than three of every four tasks on SWE-bench Verified, reigniting debate over whether the benchmark still discriminates between systems.
Two coding agents crossed the 78% mark on SWE-bench Verified this week, sharpening questions about whether the benchmark can still rank frontier systems. The Princeton team that maintains the suite has not commented on a successor, but several research labs have begun publishing their own private extensions.
What changed. SWE-bench Verified is no longer separating the top tier of coding agents. Two systems are within 1.2 points of each other, both above 78%.
Why it matters. Without a discriminating eval, vendor claims drift back toward demo videos. That hurts buyers, and ultimately hurts research budgets that depend on credible external scoring.
Builder takeaway. Stop relying on a single public score for vendor selection. Run a domain-specific replay set on at least 50 of your own bug fixes to compare candidates.
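One way to act on that takeaway is to score each candidate agent on the same set of in-house bug-fix replays and check whether the observed gap survives resampling, since 50 tasks leave wide error bars. The sketch below is hypothetical: the task format (one boolean per replay, True = resolved) and the function names are assumptions, not part of any vendor's tooling.

```python
import random

def resolve_rate(results):
    """Fraction of replay tasks a candidate resolved (results: list of bools)."""
    return sum(results) / len(results)

def paired_bootstrap_win_frac(a, b, n_boot=10_000, seed=0):
    """Resample the same task indices for both candidates (paired bootstrap)
    and return the fraction of resamples in which A resolves strictly more
    tasks than B. Values near 0.5 mean the replay set cannot separate them."""
    rng = random.Random(seed)
    n = len(a)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(a[i] for i in idx) > sum(b[i] for i in idx):
            wins += 1
    return wins / n_boot

# Hypothetical outcomes on 50 in-house bug-fix replays.
candidate_a = [True] * 40 + [False] * 10   # 80% resolved
candidate_b = [True] * 38 + [False] * 12   # 76% resolved
print(resolve_rate(candidate_a), resolve_rate(candidate_b))
print(paired_bootstrap_win_frac(candidate_a, candidate_b))
```

A 4-point gap on 50 tasks often comes out near a coin flip under this check, which is exactly the argument for building your own replay set rather than reading two public scores 1.2 points apart as a ranking.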