Stanford HCI

CriticLM: A Verifier for Reliable Agentic Planning

New benchmark and LLM-based critic architecture that catches 73% more planning errors in long-horizon agent tasks than prior verification methods.

Stanford HCI releases CriticLM, a verifier that catches agent planning failures after execution begins but before they cascade into downstream tool-call failures. Trained on 12K synthetic traces, it spots subgoal drift and reroutes 68% of failing trajectories on AgentBench-Web.

What changed. CriticLM catches 73% more agent planning errors than prior verifiers by reasoning over execution traces rather than over prompts alone.

The real breakthrough is execution-trace reasoning. Instead of asking “does this plan look good?”, CriticLM asks “did subgoal 3 actually complete?” and “is the current tool call consistent with the remaining objectives?” That targets the failure mode behind 70% of production agent failures: mid-execution drift.

Why it matters. 68% of long-horizon agent failures happen mid-execution; CriticLM catches them before they cascade into tool failures.

Deployment pattern: add CriticLM as a post-action interceptor. At 70B parameters and 1.2 s of added latency per check, it delivers a 2x reliability gain. Most agent scaffolds stand to benefit from this layer.

Builder takeaway. Add a 70B critic LLM as a post-action check: roughly 2x reliability for the cost of one extra inference call per step.
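One way to wire that takeaway into a scaffold, as a hedged sketch: wrap each tool call in an interceptor that consults a critic after the action runs and reroutes on objection. `call_critic` is a stand-in for an actual served CriticLM inference, and `act`/`replan` are hypothetical hooks of your scaffold, not a published interface:

```python
from typing import Callable

def call_critic(trace: list) -> bool:
    """Placeholder for one CriticLM inference over the execution trace
    so far. Returns True if the trajectory still looks on track. A real
    deployment would call a served 70B critic model here (~1.2 s)."""
    # Toy heuristic so the sketch runs: flag the trajectory as
    # off-track if the latest action reports an error.
    return "error" not in trace[-1]

def with_critic(act: Callable[[str], str],
                replan: Callable[[list], str]) -> Callable[[str], str]:
    """Post-action interceptor: run the action, append the result to
    the trace, and reroute via `replan` when the critic objects."""
    trace: list = []
    def step(action: str) -> str:
        trace.append(act(action))
        if not call_critic(trace):      # one extra inference per step
            recovery = replan(trace)    # reroute the failing trajectory
            trace.append(act(recovery))
        return trace[-1]
    return step
```

The design point is where the critic sits: after each action's result lands in the trace, not before execution starts, which is what lets it catch mid-execution drift.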

Read the paper

The Agent Brief

Three things in agentic AI, every Tuesday.

What changed, what matters, what builders should do next. No hype. No paid placement.