Topic

production

Coverage, reference pages, tools, and guides connected to this topic.

  1. Build a replay-based eval set in a weekend

    How to capture, redact, and score real production sessions to evaluate agent candidates.

  2. Six failure modes in tool-using agents, and the patterns that fix them

    An empirical taxonomy of agent tool-use failures, drawn from 4,000 traces from production deployments. Schema drift and silent partial failure dominate.

  3. The case for replay-based agent evaluation

    Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets: captured user sessions scored against a held-out outcome.