replay
Coverage, reference pages, tools, and guides connected to this topic.
- Build a replay-based eval set in a weekend
  How to capture, redact, and score real production sessions to evaluate agent candidates.
- The case for replay-based agent evaluation
  Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets: captured user sessions scored against a held-out outcome.