Replay-based evaluation
Scoring agent candidates against captured real-world sessions with held-out outcomes.
Replay-based evaluation is the most reliable signal for production deployments. Static benchmarks miss roughly half the failure modes seen in production traces.