τ-bench
Tool-use evaluation across realistic transactional workflows.
τ-bench is purpose-built for evaluating tool use under realistic policy constraints, which makes it the closest public benchmark to enterprise transactional workloads.