Build a replay-based eval set in a weekend
How to capture, redact, and score real production sessions to evaluate agent candidates.
Prereqs
- A production agent already serving real traffic
- Durable storage for session logs
Replay-based eval sets close the loop between research and ops: every candidate is scored on the exact sessions your users actually ran, not on synthetic benchmarks. You can build one over a weekend.
What you need
50-200 real user sessions, captured with consent. Inputs redacted. Outcomes labeled.
Step 1: Capture
Log every tool call, model call, and final outcome. Attach a session ID to every record, plus a stable user hash so sessions can be grouped per user without storing raw identities.
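A minimal capture sketch, assuming a JSONL log; the `log_event` helper and its field names are illustrative, not a prescribed schema. The user hash is a keyed HMAC so raw IDs never reach the log:

```python
import hashlib
import hmac
import json
import time
import uuid

# Hypothetical server-side secret. Rotating it breaks cross-run user joins,
# so treat it like any other long-lived credential.
HASH_KEY = b"replace-with-a-real-secret"

def stable_user_hash(user_id: str) -> str:
    """HMAC the user ID so sessions group per user without storing identity."""
    return hmac.new(HASH_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]

def log_event(session_id: str, kind: str, payload: dict, log_file) -> None:
    """Append one JSON line per event: tool call, model call, or outcome."""
    record = {
        "session_id": session_id,
        "ts": time.time(),
        "kind": kind,          # "tool_call" | "model_call" | "outcome"
        "payload": payload,
    }
    log_file.write(json.dumps(record) + "\n")

# Usage: one session ID per user session, one line per event.
session_id = str(uuid.uuid4())
with open("sessions.jsonl", "a") as f:
    log_event(session_id, "tool_call",
              {"user": stable_user_hash("user-123"),
               "tool": "search", "args": {"q": "refund policy"}}, f)
```

A keyed HMAC beats a bare SHA-256 of the ID here: the space of user IDs is small enough to brute-force, so an unkeyed hash would not actually anonymize anything.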
Step 2: Redact
Run a PII scrubber before anything is written to storage. Never relax this discipline; retroactive cleanup is never complete.
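To make the step concrete, here is a deliberately simple regex scrubber. The patterns are illustrative and nowhere near exhaustive; treat this as a floor, with a dedicated PII tool layered on top:

```python
import re

# Illustrative patterns only: emails, common phone formats, and 13-16 digit
# card-like numbers. A real scrubber needs far broader coverage.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"(?:\+?\d{1,2}[\s.-])?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"), "<PHONE>"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD>"),
]

def scrub(text: str) -> str:
    """Replace PII matches with typed placeholders before anything is stored."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

assert scrub("Reach me at jane@example.com or 555-867-5309") == \
    "Reach me at <EMAIL> or <PHONE>"
```

Typed placeholders (`<EMAIL>`, `<PHONE>`) rather than blanks keep redacted sessions replayable: the agent still sees that an email was present, just not which one.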
Step 3: Label outcomes
Three-class labels are enough: success, partial, failure. Coarse labels are fast to apply, and two labelers will actually agree on them; resist the temptation to invent six-point scales.
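The label store can be as small as an enum keyed by session ID; a sketch, where the enum names mirror the labels above and the dict is just one possible storage shape:

```python
from enum import Enum

class Outcome(str, Enum):
    SUCCESS = "success"  # user goal fully met
    PARTIAL = "partial"  # some progress, but the user finished manually
    FAILURE = "failure"  # wrong answer, error, or abandoned session

# One label per session, keyed by the session ID from the capture step.
# "session-001" is a placeholder ID, not a real session.
labels = {"session-001": Outcome.PARTIAL}
```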
Step 4: Score
Replay every candidate against the set. Track three metrics: success rate against the labeled outcomes, mean tool calls per session, and p95 end-to-end latency.
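A sketch of the scoring side, assuming each replay run writes one JSONL line per session with `outcome`, `tool_calls`, and `latency_s` fields (names are assumptions, not a standard):

```python
import json
import statistics

def p95(values: list[float]) -> float:
    """95th percentile via statistics.quantiles (n=20 gives 5% steps)."""
    return statistics.quantiles(values, n=20)[18]  # 19th of 19 cut points

def score(results_path: str) -> dict:
    """Aggregate one candidate's replay results from a JSONL file."""
    outcomes, tool_calls, latencies = [], [], []
    with open(results_path) as f:
        for line in f:
            r = json.loads(line)
            outcomes.append(r["outcome"])
            tool_calls.append(r["tool_calls"])
            latencies.append(r["latency_s"])
    return {
        "success_rate": outcomes.count("success") / len(outcomes),
        "mean_tool_calls": statistics.mean(tool_calls),
        "p95_latency_s": p95(latencies),
    }
```

Run it once per candidate and compare the three numbers side by side. With only 50-200 sessions the p95 is noisy, so treat small latency deltas between candidates with suspicion.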