CriticLM: A Verifier for Reliable Agentic Planning
A new benchmark and LLM-based critic architecture that catches 73% more planning errors in long-horizon agent tasks than prior verification methods.
Stanford HCI drops CriticLM, a verifier that catches agent planning failures after execution begins but before they cascade into tool-call disasters. Trained on 12K synthetic traces, it spots subgoal drift and reroutes 68% of failing trajectories on AgentBench-Web.
What changed. Verifiers now catch 73% more agent planning errors by reasoning over execution traces rather than over the prompt and plan alone.
The real breakthrough: execution-trace reasoning. Instead of asking “does this plan look good?”, CriticLM asks “did subgoal 3 actually complete?” and “is the current tool call consistent with the remaining objectives?” That targets the failure mode the team says kills 70% of production agents: mid-execution drift.
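CriticLM itself is not released as code in this piece, so the following is only a minimal sketch of what those two trace-level questions look like when made concrete; `Subgoal`, `ToolCall`, `Trace`, and the helper functions are hypothetical names, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class Subgoal:
    description: str
    completed: bool = False

@dataclass
class ToolCall:
    tool: str
    argument: str

@dataclass
class Trace:
    subgoals: list          # ordered plan subgoals
    calls: list = field(default_factory=list)  # tool calls made so far

def incomplete_subgoals(trace: Trace, up_to: int) -> list:
    """The 'did subgoal 3 actually complete?' check: return any subgoal
    before index `up_to` that the trace never marked completed."""
    return [g for g in trace.subgoals[:up_to] if not g.completed]

def call_consistent(call: ToolCall, remaining_tools: set) -> bool:
    """The 'is the current tool call consistent with remaining
    objectives?' check: the proposed call must use a tool that some
    remaining subgoal still expects."""
    return call.tool in remaining_tools
```

A trace verifier runs checks like these over the live trajectory instead of scoring the plan text once up front, which is what lets it notice drift after execution has begun.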
Why it matters. 68% of long-horizon agent failures happen mid-execution; this catches them before they cascade into downstream tool failures.
Deployment pattern: Add CriticLM as a post-action interceptor. 70B model, 1.2s latency, 2x reliability gain. Every agent scaffold now needs this layer.
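The interceptor pattern is simple to wire into an existing loop. A hedged sketch, assuming a generic `critic` callable that returns a verdict and feedback; `agent_step`, `replan`, and the verdict strings are illustrative stand-ins, not CriticLM's actual interface.

```python
def run_with_critic(agent_step, critic, replan, state, max_steps=10):
    """Post-action interceptor: after every action, ask the critic
    whether the trajectory is still on track; on a 'drift' verdict,
    reroute via `replan` instead of letting the error cascade into
    later tool calls."""
    history = []
    for _ in range(max_steps):
        action, state, done = agent_step(state)   # one agent action
        history.append(action)
        verdict, feedback = critic(history, state)  # one extra inference call
        if verdict == "drift":
            state = replan(state, feedback)       # reroute the trajectory
        if done:
            break
    return state, history
```

The critic call is the only added cost per step, which is where the "one extra inference call" framing comes from; the 1.2s figure quoted above would be that call's latency with a 70B critic.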
Builder takeaway. Add a 70B critic LLM after each action: it buys a 2x reliability gain for the cost of one extra inference call.