Stanford HCI

CriticLM: A Verifier for Reliable Agentic Planning

New benchmark and LLM-based critic architecture that catches 73% more planning errors in long-horizon agent tasks than prior verification methods.

Stanford HCI releases CriticLM, a verifier that catches agent planning failures after execution begins but before they cascade into downstream tool-call failures. Trained on 12K synthetic traces, it spots subgoal drift and reroutes 68% of failing trajectories on AgentBench-Web.

What changed. CriticLM catches 73% more agent planning errors than prior verifiers by reasoning over execution traces rather than over prompts alone.

The real breakthrough is execution-trace reasoning. Instead of asking “does this plan look good?”, CriticLM asks “did subgoal 3 actually complete?” and “is the current tool call consistent with the remaining objectives?” That targets the failure mode behind 70% of production agent failures: mid-execution drift.

Why it matters. 68% of long-horizon agent failures happen mid-execution; CriticLM catches them before they cascade into tool failures.

Deployment pattern: add CriticLM as a post-action interceptor. At 70B parameters and 1.2 s of added latency per check, it delivers a 2x reliability gain. Most agent scaffolds stand to benefit from this layer.

Builder takeaway. Add a 70B critic LLM as a post-action check: roughly 2x reliability for the cost of one extra inference call per step.
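One way to wire that takeaway into a scaffold, as a hedged sketch: wrap each tool call in an interceptor that consults a critic after the action runs and reroutes on objection. `call_critic` is a stand-in for an actual served CriticLM inference, and `act`/`replan` are hypothetical hooks of your scaffold, not a published interface:

```python
from typing import Callable

def call_critic(trace: list) -> bool:
    """Placeholder for one CriticLM inference over the execution trace
    so far. Returns True if the trajectory still looks on track. A real
    deployment would call a served 70B critic model here (~1.2 s)."""
    # Toy heuristic so the sketch runs: flag the trajectory as
    # off-track if the latest action reports an error.
    return "error" not in trace[-1]

def with_critic(act: Callable[[str], str],
                replan: Callable[[list], str]) -> Callable[[str], str]:
    """Post-action interceptor: run the action, append the result to
    the trace, and reroute via `replan` when the critic objects."""
    trace: list = []
    def step(action: str) -> str:
        trace.append(act(action))
        if not call_critic(trace):      # one extra inference per step
            recovery = replan(trace)    # reroute the failing trajectory
            trace.append(act(recovery))
        return trace[-1]
    return step
```

The design point is where the critic sits: after each action's result lands in the trace, not before execution starts, which is what lets it catch mid-execution drift.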

Read the paper

The Agent Brief

Three things in agentic AI, every Tuesday.

What changed, what matters, what builders should do next. No hype. No paid placement.