Benchmark

τ-bench

Tool-use evaluation across realistic transactional workflows.

Measures

Tool use with simulated users under domain-specific policies

Current leader

Claude Sonnet 4.5

τ-bench is purpose-built to evaluate tool use under realistic policy constraints, which makes it the closest public benchmark to enterprise transactional workloads.