Research, summarized
Plain-language summaries of the papers, evaluations, and surveys that matter for agent builders. Each post calls out what changed, why it matters, and what to do next.
- Northeastern University Apr 18, 2026
Reflexion, three years on: what self-critique still buys you
A meta-analysis of 41 papers building on Reflexion-style self-critique loops finds modest, durable gains in coding and tool use, and diminishing returns in open-ended reasoning.
- Stanford NLP Apr 14, 2026
Long-horizon memory: survey of seven architectures, ranked by recall and cost
Compares episodic, semantic, hybrid, and graph-based memory across realistic 30-day agent simulations. Hybrid stores win on recall; graph stores win on cost stability.
- DeepMind Apr 8, 2026
Six failure modes in tool-using agents, and the patterns that fix them
An empirical taxonomy of agent tool-use failures across 4,000 traces from production deployments. Schema drift and silent partial failures dominate.
- MIT CSAIL Apr 4, 2026
Decoupled planner-critic agents outperform monolithic planners on long tasks
Splitting planning and critique into specialized models with structured exchange yields a 14-point lift on multi-day research tasks.
- UC Berkeley Mar 30, 2026
The case for replay-based agent evaluation
Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets: captured user sessions, each scored against a held-out outcome.