SWE-bench Verified hits 78%, prompting calls for a harder coding eval

Top coding agents now resolve more than three of every four tasks in SWE-bench Verified, reigniting debate over whether the benchmark still discriminates between systems.

Two coding agents crossed the 78% mark on SWE-bench Verified this week, prompting renewed debate about whether the benchmark remains useful for ranking frontier systems. The Princeton team that maintains the suite has not commented on a successor, but several research labs have begun building private extensions of their own.

What changed. SWE-bench Verified is no longer separating the top tier of coding agents. Two systems are within 1.2 points of each other, both above 78%.

Why it matters. Without a discriminating eval, vendor claims drift back toward demo videos. That hurts buyers, and ultimately hurts research budgets that depend on credible external scoring.

Builder takeaway. Stop relying on a single public score for vendor selection. Build a domain-specific replay set from at least 50 of your own historical bug fixes and score candidate agents on how many they resolve.
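A replay harness of that kind can be very small. The sketch below is illustrative only: `ReplayTask`, `resolution_rate`, and the toy candidate are hypothetical names, and the `check` callback stands in for whatever really validates a patch in your setup (typically re-running the regression test that the historical fix made pass).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReplayTask:
    """One historical bug fix: identified by task_id, with a check that
    returns True if a candidate's patch resolves the bug (e.g. the
    original regression test passes again)."""
    task_id: str
    check: Callable[[str], bool]

def resolution_rate(tasks: list[ReplayTask],
                    propose_patch: Callable[[str], str]) -> float:
    """Fraction of replay tasks the candidate agent resolves."""
    resolved = sum(1 for t in tasks if t.check(propose_patch(t.task_id)))
    return resolved / len(tasks)

# Toy stand-in for a real agent: "resolves" even-numbered tasks only.
def toy_candidate(task_id: str) -> str:
    return "patch" if int(task_id) % 2 == 0 else ""

# 50 tasks, each counting a patch as resolved if it is non-trivial.
tasks = [ReplayTask(str(i), check=lambda patch: patch == "patch")
         for i in range(50)]
print(f"{resolution_rate(tasks, toy_candidate):.0%}")  # prints 50%
```

In practice the `check` step would clone the pre-fix commit, apply the agent's patch, and run the fix's tests in a sandbox; the scoring logic itself stays this simple.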

The Agent Brief

Three things in agentic AI, every Tuesday.

What changed, what matters, what builders should do next. No hype. No paid placement.
