evals
Coverage, reference pages, tools, and guides connected to this topic.
-
Startups race to build self-evolving training sandboxes for agents
A new wave of startups is building synthetic, self-evolving environments to continuously train and stress-test agentic AI systems.
-
Prime Intellect plans 'GitHub for agent training environments'
Prime Intellect outlined a vision for a shared repository of synthetic, self-evolving RL environments designed specifically to train and benchmark autonomous agents.
-
Enterprise GenAI pilots still struggle to deliver ROI, MIT says
A widely discussed MIT report argues that most enterprise GenAI pilots are failing to produce measurable returns, with integration and process fit emerging as the key issues.
-
Harvard study finds LLMs beat ER doctors on some diagnoses
A Harvard-led study reported that at least one large language model outperformed human emergency room doctors at diagnosing real-world cases, underscoring agent potential in clinical workflows.
-
Agentic AI defense takes center stage at RSA with Google Cloud updates
At RSAC, Google Cloud emphasized agentic AI for security operations, integrating live threat intelligence into automated defensive agents.
-
Anthropic ships Claude Code security tools for safer coding agents
Anthropic released Claude Code security enhancements aimed at reducing vulnerabilities introduced by coding agents that read, modify, and execute real codebases.
-
Collibra launches AI Command Center to monitor production agents
Collibra introduced AI Command Center to oversee AI systems and agents, tracking their ownership, decisions, and risk, with integrated testing via Giskard.
-
CES 2026 showcases AI safety and observability breakthroughs
Fox News highlights 10 showstopping CES innovations focused on AI safety tools and observability for deployed systems.