Decoupled planner-critic agents outperform monolithic planners on long tasks
Splitting planning and critique into specialized models with structured exchange yields a 14-point lift on multi-day research tasks.
A decoupled architecture — a smaller planner generates a tree of candidate steps, a larger critic prunes — outperforms monolithic planners by 14 points on a multi-day research benchmark while reducing total token cost by 28%.
What changed. Empirical validation that role specialization (planner vs. critic) beats a single high-capacity model running both jobs.
Why it matters. This is a cost-quality Pareto improvement. Most teams default to “biggest model everywhere” and leave value on the table.
Builder takeaway. Try a small planner + frontier critic on your hardest workloads. Expect to spend a week tuning the exchange protocol before seeing the gain.