Reflexion, three years on: what self-critique still buys you
A meta-analysis of 41 papers building on Reflexion-style self-critique loops finds modest, durable gains in coding and tool-use, and diminishing returns in open-ended reasoning.
A new meta-analysis aggregates results from 41 papers that extend the original Reflexion self-critique loop. The headline: gains are real, but narrower than first reported.
What changed. The analysis compares results within consistent benchmark families, which isolates the Reflexion lift from confounding factors such as better base models, larger context windows, and tool upgrades.
Why it matters. Self-critique remains a high-leverage pattern in coding and tool-use tasks (+6 to +11 points), but adds little or no value in open-ended creative reasoning tasks once the underlying model is strong enough.
Builder takeaway. Apply self-critique selectively. Use it on tasks with verifiable intermediate signals (test runs, type checks, schema validation). Skip it for free-form writing or planning, where the critic has no ground-truth signal to check against.
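One way to picture the selective pattern is a loop that only critiques when a verifier exists. The sketch below is illustrative and is not taken from the meta-analysis or from the original Reflexion code. The names `solve`, `Verdict`, `json_schema_check`, and `toy_generate` are hypothetical, and `toy_generate` is a stand-in for a real model call. The verifier plays the role of the ground-truth signal, here a simple schema check on JSON output.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""


# Hypothetical interfaces: a generator sees the task plus accumulated
# feedback notes; a verifier returns a pass/fail verdict with feedback.
Generator = Callable[[str, List[str]], str]
Verifier = Callable[[str], Verdict]


def solve(
    task: str,
    generate: Generator,
    verify: Optional[Verifier] = None,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Return (candidate, retries). Critique only when a verifier exists."""
    notes: List[str] = []
    candidate = generate(task, notes)
    if verify is None:
        # No ground-truth signal (free-form writing, planning): skip the loop.
        return candidate, 0
    for attempt in range(max_rounds):
        verdict = verify(candidate)
        if verdict.ok:
            return candidate, attempt
        # Feed the verifiable failure back into the next generation.
        notes.append(verdict.feedback)
        candidate = generate(task, notes)
    # Retry budget exhausted; the last candidate is returned unverified.
    return candidate, max_rounds


def json_schema_check(candidate: str) -> Verdict:
    """Toy schema validation: output must be a JSON object with a 'name' key."""
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError as exc:
        return Verdict(False, f"invalid JSON: {exc.msg}")
    if not isinstance(data, dict) or "name" not in data:
        return Verdict(False, "missing required key 'name'")
    return Verdict(True)


def toy_generate(task: str, notes: List[str]) -> str:
    """Stand-in for a model call: wrong at first, fixed once feedback arrives."""
    return '{"name": "demo"}' if notes else '{"nam": "demo"}'


if __name__ == "__main__":
    print(solve("emit a record", toy_generate, json_schema_check))
    print(solve("write a poem", toy_generate, None))
```

Passing `verify=None` is the "skip it" branch from the takeaway: without a checkable signal the first draft is returned as-is. In a real system the verifier might be a test run or a type check instead.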