Guide · 6-8 hours

Build a replay-based eval set in a weekend

How to capture, redact, and score real production sessions to evaluate agent candidates.

Prereqs
  • Production agent running
  • Storage for logs

Replay sets close the loop between research and ops. Build one over a weekend.

What you need

50-200 real user sessions with consent. Outcomes labeled. Inputs redacted.

Step 1: Capture

Log every tool call, model call, and final outcome. Add a session ID and a stable user hash.
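A minimal capture sketch, assuming JSONL session logs; the record shape, file name, and salt value are illustrative, not a prescribed schema:

```python
import hashlib
import json
import time
import uuid

def stable_user_hash(user_id: str, salt: str = "replay-set-v1") -> str:
    """Hash the raw user ID so sessions can be grouped without storing PII."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]

def log_event(log_file, session_id: str, user_hash: str, kind: str, payload: dict) -> None:
    """Append one tool call, model call, or final outcome as a JSON line."""
    record = {
        "session_id": session_id,
        "user_hash": user_hash,
        "ts": time.time(),
        "kind": kind,  # "tool_call" | "model_call" | "outcome"
        "payload": payload,
    }
    log_file.write(json.dumps(record) + "\n")

session_id = str(uuid.uuid4())
user_hash = stable_user_hash("alice@example.com")
with open("sessions.jsonl", "a") as f:
    log_event(f, session_id, user_hash, "tool_call",
              {"tool": "search", "args": {"q": "refund policy"}})
```

The stable hash lets you deduplicate and sample by user later without the log ever containing a raw identifier.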

Step 2: Redact

Run a PII scrubber before anything hits storage. Redaction has to happen at capture time; retroactive cleanup is never complete.
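A minimal scrubber sketch. The regex patterns below are illustrative assumptions; a production pipeline should use a dedicated PII-detection library rather than hand-rolled regexes:

```python
import re

# Order matters: more specific patterns (SSN) run before broader ones (PHONE),
# so each span gets the most precise placeholder.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Typed placeholders like `[EMAIL]` keep redacted transcripts readable during labeling, unlike blanking spans entirely.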

Step 3: Label outcomes

Three-class labels are enough: success, partial, failure. Resist the temptation to invent six-point scales.
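The three-class scheme can be pinned down as an enum so labels stay consistent across annotators; the field names here are illustrative:

```python
from enum import Enum

class Outcome(Enum):
    SUCCESS = "success"  # user goal fully achieved
    PARTIAL = "partial"  # some progress, but follow-up was needed
    FAILURE = "failure"  # wrong answer, error, or abandonment

def label_session(session_id: str, outcome: Outcome, note: str = "") -> dict:
    """Attach a three-class label to a captured session."""
    return {"session_id": session_id, "outcome": outcome.value, "note": note}
```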

Step 4: Score

Replay every candidate against the full set. Track three metrics per candidate: success rate on the labeled outcomes, mean tool calls per session, and p95 end-to-end latency.
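The three metrics can be computed from the replay results in a few lines. A sketch, assuming each replayed session yields a dict with `outcome`, `tool_calls`, and `latency_s` keys (an assumed shape, not a fixed schema):

```python
import math
from statistics import mean

def score_candidate(results: list[dict]) -> dict:
    """Summarize one candidate's replay run over the eval set."""
    latencies = sorted(r["latency_s"] for r in results)
    # p95 via nearest-rank: the smallest value >= 95% of the sample.
    rank = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "success_rate": sum(r["outcome"] == "success" for r in results) / len(results),
        "mean_tool_calls": mean(r["tool_calls"] for r in results),
        "p95_latency_s": latencies[rank],
    }
```

Comparing candidates then reduces to comparing these three numbers side by side, which is exactly what a weekend-sized eval set buys you.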
