Research, summarized

Plain-language summaries of the papers, evaluations, and surveys that matter for agent builders. Each post calls out what changed, why it matters, and what to do next.

Northeastern University Apr 18, 2026

Reflexion, three years on: what self-critique still buys you

A meta-analysis of 41 papers building on Reflexion-style self-critique loops finds modest, durable gains in coding and tool use, but diminishing returns in open-ended reasoning.

Wei Liu, Maya Patel, Jonas Vogt | deep | self-critique, reflexion, meta-analysis
Stanford NLP Apr 14, 2026

Long-horizon memory: survey of seven architectures, ranked by recall and cost

Compares episodic, semantic, hybrid, and graph-based memory across realistic 30-day agent simulations. Hybrid stores win on recall; graph stores win on cost stability.

A. Chen, P. Banerjee, L. Karras | deep | memory, long-horizon, survey
DeepMind Apr 8, 2026

Six failure modes in tool-using agents, and the patterns that fix them

An empirical taxonomy of agent tool-use failures, drawn from 4,000 traces of production deployments. Schema drift and silent partial failures dominate.

R. Okafor, S. Kim | intermediate | tool-use, failure-modes, production
MIT CSAIL Apr 4, 2026

Decoupled planner-critic agents outperform monolithic planners on long tasks

Splitting planning and critique into specialized models with structured exchange yields a 14-point lift on multi-day research tasks.

I. Tanaka, M. Eaton | deep | planning, critic, architecture
UC Berkeley Mar 30, 2026

The case for replay-based agent evaluation

Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets: captured user sessions scored against a held-out outcome.

G. Vasquez, T. Hammond | intermediate | evaluation, replay, production
