Topic

benchmark

Coverage, reference pages, tools, and guides connected to this topic.

  1. CLI Agents Drive 30% Faster Code Shipping for Developers

    Command-line AI agents are replacing traditional IDEs, with developers reporting that they ship code 30% faster.

  2. GAIA

    Benchmark for general AI assistants.

  3. OSWorld

    Computer-use benchmark spanning OS, browser, and productivity apps.

  4. SWE-bench Verified

    Verified subset of SWE-bench, the canonical coding-agent benchmark.

  5. WebArena

    Web-navigation benchmark for browser agents.

  6. τ-bench

    Tool-use evaluation across realistic transactional workflows.