Term

Replay-based evaluation

Scoring agent candidates against captured real-world sessions with held-out outcomes.

Replay-based evaluation is the most reliable signal for production deployments. Static benchmarks miss roughly half the failure modes seen in production traces.