UC Berkeley

The case for replay-based agent evaluation

Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets: captured user sessions scored against held-out outcomes.

The authors argue that replay-based evaluation — capturing real user sessions and scoring agent candidates against a held-out outcome — is the most reliable signal for production deployments. Static benchmarks miss approximately half the failure modes observed in production traces.
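
The paper frames this at the level of methodology rather than code; as a minimal sketch of the scoring loop, assuming a hypothetical candidate_agent.run() API and a simple exact match on outcome labels (none of these names come from the paper), it might look like this:

```python
# Hypothetical sketch of replay-based scoring; names and structures are
# illustrative, not taken from the paper.
from dataclasses import dataclass

@dataclass
class ReplaySession:
    inputs: list[str]   # captured (and redacted) user turns
    outcome: str        # held-out outcome label, e.g. "resolved" or "escalated"

def score_candidate(candidate_agent, replay_set: list[ReplaySession]) -> float:
    """Replay each captured session through the candidate agent and compare
    its final outcome to the held-out label; return the pass rate."""
    if not replay_set:
        return 0.0
    passed = sum(
        1 for session in replay_set
        if candidate_agent.run(session.inputs) == session.outcome  # hypothetical agent API
    )
    return passed / len(replay_set)
```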

What changed. A practical framework for building replay sets, including consent capture, redaction, and outcome labeling.
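
One way to read those three ingredients is as fields on each captured record; the schema below is a hypothetical illustration of that framing, not the paper's format:

```python
# Hypothetical replay-set record; field names are illustrative, not from the paper.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ReplayRecord:
    session_id: str
    captured_at: datetime
    consent_granted: bool     # explicit user consent recorded at capture time
    transcript_redacted: str  # transcript after PII redaction
    outcome_label: str        # held-out outcome assigned during labeling
    labeler: str              # who or what produced the label (human, heuristic, ...)

def is_usable(record: ReplayRecord) -> bool:
    """A record enters the replay set only with consent and a redacted transcript."""
    return record.consent_granted and bool(record.transcript_redacted)
```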

Why it matters. Replay sets close the loop between research and ops. They let you ship upgrades with quantifiable confidence.

Builder takeaway. Carve out 1-2% of production traffic for evaluation capture, and build a redaction pipeline before you collect data you cannot afford to leak.
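
As a sketch of that takeaway (the redact and store callables are placeholders, not a specific library), deterministic sampling keeps the capture rate near the target and redaction runs before anything is persisted:

```python
# Hypothetical capture gate: sample roughly 2% of sessions and redact before storing.
import hashlib

CAPTURE_RATE = 0.02  # fraction of production traffic to capture

def should_capture(session_id: str) -> bool:
    """Deterministic hash-based sampling, so a given session is always in or out."""
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 10_000
    return bucket < CAPTURE_RATE * 10_000

def capture(session_id: str, transcript: str, redact, store) -> None:
    """Redact first, then persist; raw transcripts never reach storage."""
    if should_capture(session_id):
        store(session_id, redact(transcript))
```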

The Agent Brief

Three things in agentic AI, every Tuesday.

What changed, what matters, what builders should do next. No hype. No paid placement.