Research

Research, summarized

Plain-language summaries of the papers, evaluations, and surveys that matter for agent builders. Each post calls out what changed, why it matters, and what to do next.

  1. Northeastern University

    Reflexion, three years on: what self-critique still buys you

    A meta-analysis of 41 papers building on Reflexion-style self-critique loops finds modest but durable gains in coding and tool use, and diminishing returns in open-ended reasoning.

    Wei Liu, Maya Patel, Jonas Vogt · deep · self-critique, reflexion, meta-analysis
  2. Stanford NLP

    Long-horizon memory: survey of seven architectures, ranked by recall and cost

    Compares episodic, semantic, hybrid, and graph-based memory across realistic 30-day agent simulations. Hybrid stores win on recall; graph stores win on cost stability.

    A. Chen, P. Banerjee, L. Karras · deep · memory, long-horizon, survey
  3. DeepMind

    Six failure modes in tool-using agents, and the patterns that fix them

    An empirical taxonomy of agent tool-use failures across 4,000 traces from production deployments. Schema drift and silent partial failure dominate.

    R. Okafor, S. Kim · intermediate · tool-use, failure-modes, production
  4. MIT CSAIL

    Decoupled planner-critic agents outperform monolithic planners on long tasks

    Splitting planning and critique into specialized models with structured exchange yields a 14-point lift on multi-day research tasks.

    I. Tanaka, M. Eaton · deep · planning, critic, architecture
  5. UC Berkeley

    The case for replay-based agent evaluation

    Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets: captured user sessions re-run and scored against held-out outcomes.

    G. Vasquez, T. Hammond · intermediate · evaluation, replay, production