benchmark
Coverage, reference pages, tools, and guides connected to this topic.
-
GAIA
Benchmark for general-purpose AI assistants.
-
OSWorld
Computer-use benchmark spanning OS, browser, and productivity apps.
-
SWE-bench Verified
Human-validated subset of SWE-bench, the canonical coding-agent benchmark.
-
WebArena
Web-navigation benchmark for browser agents.
-
τ-bench
Tool-use evaluation across realistic transactional workflows.