Topic

benchmark

Coverage, reference pages, tools, and guides connected to this topic.

  1. GAIA

    Benchmark for general-purpose AI assistants.

  2. OSWorld

    Computer-use benchmark spanning OS, browser, and productivity apps.

  3. SWE-bench Verified

    Human-validated subset of SWE-bench, the canonical coding-agent benchmark.

  4. WebArena

    Web-navigation benchmark for browser agents.

  5. τ-bench

    Tool-use evaluation across realistic transactional workflows.