Reflexion, three years on: what self-critique still buys you

A meta-analysis of 41 papers building on Reflexion-style self-critique loops finds modest, durable gains in coding and tool-use, and diminishing returns in open-ended reasoning.

A new meta-analysis aggregates results from 41 papers that extend the original Reflexion self-critique loop. The headline: gains are real, but narrower than first reported.

What changed. The analysis compares results within consistent benchmark families, which separates the Reflexion lift from confounding factors (better base models, larger context windows, tool upgrades).

Why it matters. Self-critique remains a high-leverage pattern in coding and tool-use tasks (+6 to +11 points), but adds little or no value in open-ended creative reasoning tasks once the underlying model is strong enough.

Builder takeaway. Apply self-critique selectively. Use it on tasks with verifiable intermediate signals (test runs, type checks, schema validation). Skip it for free-form writing or planning, where the critic has no ground-truth signal.
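One way to picture the takeaway is a retry loop that only engages when a verifier is available. The sketch below is illustrative, not the method from any of the surveyed papers: `solve`, `syntax_check`, and `stub_model` are hypothetical names, the "model" is a stub that fixes its output once it sees feedback, and Python's built-in `compile` stands in for a real signal such as a test run or type check.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces, assumed for this sketch:
# a generator maps (task, feedback so far) to an attempt;
# a verifier returns None on success or an error message on failure.
Generate = Callable[[str, List[str]], str]
Verify = Callable[[str], Optional[str]]


def solve(
    task: str,
    generate: Generate,
    verify: Optional[Verify] = None,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Return the final attempt and the number of generate calls used."""
    feedback: List[str] = []
    attempt = generate(task, feedback)
    calls = 1

    if verify is None:
        # No ground-truth signal (free-form writing, planning): skip critique.
        return attempt, calls

    while calls < max_rounds:
        error = verify(attempt)
        if error is None:
            break  # verifier is satisfied, stop early
        feedback.append(error)  # the critique is grounded in the verifier's output
        attempt = generate(task, feedback)
        calls += 1
    return attempt, calls


def syntax_check(code: str) -> Optional[str]:
    """Stand-in verifier: does the attempt at least parse as Python?"""
    try:
        compile(code, "<attempt>", "exec")
        return None
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg} (line {exc.lineno})"


def stub_model(task: str, feedback: List[str]) -> str:
    """Stub generator: returns broken code first, valid code once it has feedback."""
    if feedback:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b) return a + b"


if __name__ == "__main__":
    # Verifiable task: the loop retries until the syntax check passes.
    print(solve("write add()", stub_model, verify=syntax_check))
    # No verifier: a single pass, no critique round.
    print(solve("write a poem", stub_model))
```

The design choice worth noting is that the critique text comes from the verifier rather than from the model's own opinion of its output, which matches the cases where the meta-analysis reports durable gains.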

Three things in agentic AI, every Tuesday.