Research

Research, summarized

Plain-language summaries of the papers, evaluations, and surveys that matter for agent builders. Each post calls out what changed, why it matters, and what to do next.

  1. Independent

    Small Language Models are the Future of Agentic AI

    A position paper arguing that small language models are often a better fit than large ones for agentic systems because they are cheaper, easier to deploy, and operationally better matched to repetitive tool-using workflows.

    intermediate | agentic-ai, small-language-models, deployment, tool-use, cost-efficiency
  3. Independent

    Anemoi: Agent-to-Agent Coordination via Consensus-Based Planning

    An agentic framework replacing centralized coordination with direct agent-to-agent communication for scalable multi-agent task solving on GAIA.

    Elvis Saravia, Team Anemoi | intermediate | multi-agent, coordination, planning
  4. Allen Institute for AI

    MolmoWeb: Open Visual Web Agents from Screenshots Only

    Fully open-weight visual agents navigate complex web environments from screenshots alone, reaching a state-of-the-art 94.7% on WebVoyager.

    Molmo Team | intermediate | tool-use, web-agents, eval
  5. Independent

    ToolCUA: Enhancing Tool-Use Reliability in Open-Source Agents

    New SOTA for comparable-scale models on OSWorld-MCP via improved tool comprehension and usage accuracy.

    ToolCUA Authors | intermediate | tool-use, reliability, eval
  6. GAIA Consortium

    AgentEval: Comprehensive Benchmark Suite for Tool-Use and Multi-Agent Systems

    AgentEval provides 10 new benchmarks targeting failure modes in tool-calling, memory retention, and inter-agent coordination.

    Team GAIA, Benchmark Collective | intermediate | evaluation, benchmarks
  7. Stanford AI Lab

    ToolGuard: Sandboxed Execution for Reliable Agent Tool-Use

    ToolGuard provides production-grade sandboxing that catches 97% of tool misuse while preserving 95% of legitimate calls.

    Sarah Kim, Raj Patel, Emily Zhang | intermediate | tool-use, safety, sandbox
  8. Independent

    Fine-tuning LLM Agents without Fine-tuning LLMs: Skill Transfer via Memory Augmentation

    Memory architecture enables zero-shot skill transfer across agents, achieving 87.88% on GAIA validation without model updates.

    Carlos Ruiz, Diana Kim | intermediate | memory, long-horizon, transfer
  9. UC Berkeley BAIR

    MemoryBank: Hierarchical Memory for Scalable Long-Horizon Agentic Workflows

    New memory architecture enables agents to maintain coherence over 100+ step workflows by compressing episodic traces into queryable knowledge graphs.

    Arjun Patel, Sophia Lee | intermediate | memory-architectures, long-horizon, scalability
  10. Anthropic

    SafeHandoff: Verifiable Multi-Agent Coordination with Formal Guarantees

    Framework for multi-agent systems ensures 99.7% handoff success through critic-verified protocols and sandboxed execution.

    Michael Zhang, Rachel Kim, Ethan Wu | intermediate | multi-agent, coordination, safety, verifier
  11. Stanford University

    Dynamic In-Context Example Selection for Reliable Agentic Reasoning

    A theoretically grounded method for agents to dynamically select optimal in-context examples during reasoning, boosting reliability across diverse tasks.

    Jane Zhang, Michael Chen | intermediate | planning, eval
  12. Google DeepMind

    ToolMemory: Long-Term Memory Management for Agentic Workflows

    Framework enabling agents to maintain tool-specific memory across extended conversations, pruning irrelevance while preserving critical knowledge.

    Alex Rivera, Sarah Kim | intermediate | memory, tool-use
  13. Multiple

    Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills

    Comprehensive survey examining how agentic AI systems adapt through post-training, memory architectures, and skill acquisition for long-horizon task execution.

    Survey authors | intermediate | memory-architectures, long-horizon-tasks, adaptation, in-context-learning, skill-composition
  14. Multiple

    How Agentic AI Changes the Economics of Enterprise Software

    Research on how agentic coding systems reshape make-or-buy decisions by dramatically reducing development timelines and CAPEX for enterprise applications.

    Peng et al., Jimenez et al. | intermediate | agent-evaluation, benchmarks, deployment-patterns, tool-use, real-world-applications
  15. Multiple

    Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

    Framework for automating adversarial testing of agentic systems using AI-driven red teaming agents that generate workflows from 45+ attacks, 450+ transforms, and 130+ scorers.

    Red teaming research team | intermediate | safety, red-teaming, adversarial-testing, multi-agent-systems, automation
  16. Coveo

    10 Agentic Commerce Research Papers Shaping the Future of Enterprise Product Discovery

    Meta-analysis of 2025 agentic commerce research, including empirical findings on agent purchasing behavior, position bias, and the modular retrieval-first architectures that enable reliable shopping agents.

    Coveo Research Team | intermediate | agent-evaluation, tool-use, multi-step-tasks, failure-modes, orchestration, retrieval-augmented
  17. Torq

    Agentic Coding for SecOps: Torq Agentic Builder

    Production-grade agentic AI system for security operations that transforms natural language intent into executable agents through contextual analysis, planning, and automated testing.

    Torq Team | intermediate | agent-engineering, planning, testing, deployment-patterns, tool-use, safety
  18. Academic consortium

    The Adoption and Usage of AI Agents

    Comprehensive empirical study of agentic AI system adoption patterns, market sizing, and real-world deployment challenges across enterprise and consumer segments.

    Multiple authors | intermediate | adoption, market-analysis, deployment-patterns, tool-use, multi-step-actions
  19. Independent

    AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair

    New benchmark exposes ranking instability in agent repair leaderboards due to evaluator reconfiguration, enabling more reliable evaluation of AI agent debugging capabilities.

    Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song | intermediate | agent evaluation, benchmarks, repair
  20. Stanford AI Lab

    AgentEval: A Comprehensive Benchmark for Evaluating Long-Horizon Agentic Workflows with Real-World Failure Modes

    New benchmark reveals critical gaps in agent tool-use reliability and proposes verifier architectures to boost success rates by 28% on multi-step tasks.

    Elena Vasquez, Raj Patel, Lila Chen | intermediate | agent-evaluation, tool-use, planning, verifiers
  21. Independent

    Agentic AI for Robot Control: Flexible but still Fragile

    Research on LLM-based agentic control systems for robots reveals architecture patterns for reasoning and execution, but exposes brittleness under real-world constraints.

    Anonymous | intermediate | robot-control, tool-use, planning, failure-modes, real-world-deployment, observability
  22. UC Berkeley

    REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?

    New benchmark evaluates whether agentic AI can reliably assess reproducibility in social science papers, revealing key strengths and failure modes.

    Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, Daniel Kang | intermediate | agent evaluation, benchmarks
  23. Independent

    ACON: Optimizing Context Compression for Long-horizon LLM Agents

    A new method for compressing context in long-horizon LLM agents to reduce token overhead while maintaining planning performance.

    Research Team | intermediate | context-compression, long-horizon-planning, memory-efficiency, token-optimization
  24. Stanford HCI

    CriticFlow: Multi-Agent Verifier Orchestration for Robust Long-Horizon Agent Planning

    New multi-agent verification framework dramatically improves planning reliability in long-horizon tasks through dynamic critic handoff and failure prediction.

    Lila Chen, David Park, Maria Gonzalez | intermediate | planning, multi-agent, verification, long-horizon
  25. Independent

    Anemoi Agent: A2A Communication for Scalable Multi-Agent Coordination

    Agent-to-agent communication server replaces context-stuffing with direct coordination, achieving 52.73% accuracy on GAIA with smaller models.

    Unknown | intermediate | multi-agent-coordination, agent-communication, cost-optimization, planning, benchmark-evaluation
  26. Stanford HCI

    CriticLM: A Verifier for Reliable Agentic Planning

    New benchmark and LLM-based critic architecture that catches 73% more planning errors in long-horizon agent tasks than prior verification methods.

    Sarah Chen, David Park, Emily Zhang | intermediate | planning, verifier, eval, reliability
  27. Northeastern University

    Reflexion, three years on: what self-critique still buys you

    A meta-analysis of 41 papers building on Reflexion-style self-critique loops finds modest, durable gains in coding and tool-use, and diminishing returns in open-ended reasoning.

    Wei Liu, Maya Patel, Jonas Vogt | deep | self-critique, reflexion, meta-analysis
  28. Stanford NLP

    Long-horizon memory: survey of seven architectures, ranked by recall and cost

    Compares episodic, semantic, hybrid, and graph-based memory across realistic 30-day agent simulations. Hybrid stores win on recall; graph stores win on cost stability.

    A. Chen, P. Banerjee, L. Karras | deep | memory, long-horizon, survey
  29. DeepMind

    Six failure modes in tool-using agents, and the patterns that fix them

    An empirical taxonomy of agent tool-use failures across 4,000 traces from production deployments. Schema drift and silent partial-failure dominate.

    R. Okafor, S. Kim | intermediate | tool-use, failure-modes, production
  30. MIT CSAIL

    Decoupled planner-critic agents outperform monolithic planners on long tasks

    Splitting planning and critique into specialized models with structured exchange yields a 14-point lift on multi-day research tasks.

    I. Tanaka, M. Eaton | deep | planning, critic, architecture
  31. UC Berkeley

    The case for replay-based agent evaluation

    Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets — captured user sessions scored against a held-out outcome.

    G. Vasquez, T. Hammond | intermediate | evaluation, replay, production
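
For builders who want a feel for the replay-based evaluation idea in the last entry, here is a minimal sketch. It is not the paper's method. The `ReplaySession` structure, the `replay_score` function, the exact-match scoring rule, and the toy agent are all assumptions made for illustration. The one idea taken from the summary is that captured sessions are replayed and scored against an outcome the agent never sees.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReplaySession:
    """One captured user session (hypothetical structure, not from the paper)."""
    session_id: str
    user_turns: List[str]    # inputs replayed to the agent, in order
    held_out_outcome: str    # known outcome, hidden from the agent during replay


def replay_score(agent: Callable[[List[str]], str],
                 sessions: List[ReplaySession]) -> float:
    """Replay every session and return the fraction whose outcome matches.

    Assumption: outcomes are short labels compared case-insensitively.
    A real scorer would likely need a richer comparison.
    """
    if not sessions:
        return 0.0
    hits = 0
    for s in sessions:
        predicted = agent(s.user_turns)  # the agent only sees the captured inputs
        if predicted.strip().lower() == s.held_out_outcome.strip().lower():
            hits += 1
    return hits / len(sessions)


def toy_agent(turns: List[str]) -> str:
    """Stand-in agent: issues a refund if any turn mentions one, else escalates."""
    if any("refund" in t.lower() for t in turns):
        return "refund_issued"
    return "escalated"


sessions = [
    ReplaySession("s1", ["I want a refund for order 12"], "refund_issued"),
    ReplaySession("s2", ["My invoice is wrong"], "escalated"),
    ReplaySession("s3", ["Cancel my plan"], "plan_cancelled"),
]
score = replay_score(toy_agent, sessions)
print(f"replay pass rate: {score:.2f}")
```

The toy agent matches two of the three held-out outcomes. The third session is the kind of miss a static benchmark without a cancellation case would not surface.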