SWE-bench Verified hits 78%, prompting calls for a harder coding eval
Top coding agents now resolve more than three of every four tasks on SWE-bench Verified, reigniting debate over whether the benchmark still discriminates between systems.
Two coding agents crossed the 78% mark on SWE-bench Verified this week, sharpening questions about whether the benchmark can still rank frontier systems. The Princeton team that maintains the suite has not commented on a successor, but several research labs have begun publishing their own private extensions.
What changed. SWE-bench Verified is no longer separating the top tier of coding agents. Two systems are within 1.2 points of each other, both above 78%.
Why it matters. Without a discriminating eval, vendor claims drift back toward demo videos. That hurts buyers, and ultimately hurts research budgets that depend on credible external scoring.
Builder takeaway. Stop relying on a single public score for vendor selection. Run a domain-specific replay set on at least 50 of your own bug fixes to compare candidates.
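One way to act on that takeaway is to score each candidate agent on the same set of in-house bug-fix replays and check whether the observed gap survives resampling, since 50 tasks leave wide error bars. The sketch below is hypothetical: the task format (one boolean per replay, True = resolved) and the function names are assumptions, not part of any vendor's tooling.

```python
import random

def resolve_rate(results):
    """Fraction of replay tasks a candidate resolved (results: list of bools)."""
    return sum(results) / len(results)

def paired_bootstrap_win_frac(a, b, n_boot=10_000, seed=0):
    """Resample the same task indices for both candidates (paired bootstrap)
    and return the fraction of resamples in which A resolves strictly more
    tasks than B. Values near 0.5 mean the replay set cannot separate them."""
    rng = random.Random(seed)
    n = len(a)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(a[i] for i in idx) > sum(b[i] for i in idx):
            wins += 1
    return wins / n_boot

# Hypothetical outcomes on 50 in-house bug-fix replays.
candidate_a = [True] * 40 + [False] * 10   # 80% resolved
candidate_b = [True] * 38 + [False] * 12   # 76% resolved
print(resolve_rate(candidate_a), resolve_rate(candidate_b))
print(paired_bootstrap_win_frac(candidate_a, candidate_b))
```

A 4-point gap on 50 tasks often comes out near a coin flip under this check, which is exactly the argument for building your own replay set rather than reading two public scores 1.2 points apart as a ranking.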