Benchmark

τ-bench

Tool-use evaluation across realistic transactional workflows.

Measures: Tool-use under user-style policies
Current leader: Claude Sonnet 4.5

τ-bench is purpose-built to evaluate tool use under realistic policy constraints, which makes it the closest public benchmark to enterprise transactional workloads.