Build a replay-based eval set in a weekend
How to capture, redact, and score real production sessions to evaluate agent candidates.
Prereqs
- A production agent already serving real traffic
- Durable storage for session logs
Replay-based eval sets close the loop between research and ops: every candidate is scored on the exact sessions your users actually ran, not on synthetic benchmarks. You can build one over a weekend.
What you need
50-200 real user sessions, captured with consent. Inputs redacted. Outcomes labeled.
Step 1: Capture
Log every tool call, model call, and final outcome. Attach a session ID to every record, plus a stable user hash so sessions can be grouped per user without storing raw identities.
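A minimal capture sketch, assuming a JSONL log; the `log_event` helper and its field names are illustrative, not a prescribed schema. The user hash is a keyed HMAC so raw IDs never reach the log:

```python
import hashlib
import hmac
import json
import time
import uuid

# Hypothetical server-side secret. Rotating it breaks cross-run user joins,
# so treat it like any other long-lived credential.
HASH_KEY = b"replace-with-a-real-secret"

def stable_user_hash(user_id: str) -> str:
    """HMAC the user ID so sessions group per user without storing identity."""
    return hmac.new(HASH_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]

def log_event(session_id: str, kind: str, payload: dict, log_file) -> None:
    """Append one JSON line per event: tool call, model call, or outcome."""
    record = {
        "session_id": session_id,
        "ts": time.time(),
        "kind": kind,          # "tool_call" | "model_call" | "outcome"
        "payload": payload,
    }
    log_file.write(json.dumps(record) + "\n")

# Usage: one session ID per user session, one line per event.
session_id = str(uuid.uuid4())
with open("sessions.jsonl", "a") as f:
    log_event(session_id, "tool_call",
              {"user": stable_user_hash("user-123"),
               "tool": "search", "args": {"q": "refund policy"}}, f)
```

A keyed HMAC beats a bare SHA-256 of the ID here: the space of user IDs is small enough to brute-force, so an unkeyed hash would not actually anonymize anything.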
Step 2: Redact
Run a PII scrubber before anything is written to storage. Never relax this discipline; retroactive cleanup is never complete.
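To make the step concrete, here is a deliberately simple regex scrubber. The patterns are illustrative and nowhere near exhaustive; treat this as a floor, with a dedicated PII tool layered on top:

```python
import re

# Illustrative patterns only: emails, common phone formats, and 13-16 digit
# card-like numbers. A real scrubber needs far broader coverage.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"(?:\+?\d{1,2}[\s.-])?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"), "<PHONE>"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD>"),
]

def scrub(text: str) -> str:
    """Replace PII matches with typed placeholders before anything is stored."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

assert scrub("Reach me at jane@example.com or 555-867-5309") == \
    "Reach me at <EMAIL> or <PHONE>"
```

Typed placeholders (`<EMAIL>`, `<PHONE>`) rather than blanks keep redacted sessions replayable: the agent still sees that an email was present, just not which one.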
Step 3: Label outcomes
Three-class labels are enough: success, partial, failure. Coarse labels are fast to apply, and two labelers will actually agree on them; resist the temptation to invent six-point scales.
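The label store can be as small as an enum keyed by session ID; a sketch, where the enum names mirror the labels above and the dict is just one possible storage shape:

```python
from enum import Enum

class Outcome(str, Enum):
    SUCCESS = "success"  # user goal fully met
    PARTIAL = "partial"  # some progress, but the user finished manually
    FAILURE = "failure"  # wrong answer, error, or abandoned session

# One label per session, keyed by the session ID from the capture step.
# "session-001" is a placeholder ID, not a real session.
labels = {"session-001": Outcome.PARTIAL}
```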
Step 4: Score
Replay every candidate against the set. Track three metrics: success rate against the labeled outcomes, mean tool calls per session, and p95 end-to-end latency.
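A sketch of the scoring side, assuming each replay run writes one JSONL line per session with `outcome`, `tool_calls`, and `latency_s` fields (names are assumptions, not a standard):

```python
import json
import statistics

def p95(values: list[float]) -> float:
    """95th percentile via statistics.quantiles (n=20 gives 5% steps)."""
    return statistics.quantiles(values, n=20)[18]  # 19th of 19 cut points

def score(results_path: str) -> dict:
    """Aggregate one candidate's replay results from a JSONL file."""
    outcomes, tool_calls, latencies = [], [], []
    with open(results_path) as f:
        for line in f:
            r = json.loads(line)
            outcomes.append(r["outcome"])
            tool_calls.append(r["tool_calls"])
            latencies.append(r["latency_s"])
    return {
        "success_rate": outcomes.count("success") / len(outcomes),
        "mean_tool_calls": statistics.mean(tool_calls),
        "p95_latency_s": p95(latencies),
    }
```

Run it once per candidate and compare the three numbers side by side. With only 50-200 sessions the p95 is noisy, so treat small latency deltas between candidates with suspicion.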