BreakingAgent

Independent intelligence on agentic AI: what is changing, what matters, and what builders should do next.
https://breakingagent.com/
Language: en-us

Anthropic moves Computer Use out of beta, ships native sandbox primitive
https://breakingagent.com/news/anthropic-computer-use-ga/
Claude's screen-grounded agent loop graduates with new tool-use primitives, an isolated sandbox, and tighter rate-limit policy for production deployments.
Wed, 22 Apr 2026 09:30:00 GMT
Tags: anthropic, computer-use, browser-agents, sandbox

OpenAI ships Swarm 2 with built-in handoff tracing and per-agent budgets
https://breakingagent.com/news/openai-swarm-2-multi-agent/
Swarm 2 introduces a structured handoff log, hard token budgets per agent, and an interoperability shim for LangGraph and CrewAI.
Sun, 19 Apr 2026 16:05:00 GMT
Tags: openai, multi-agent, orchestration, tracing

[Research] Reflexion, three years on: what self-critique still buys you
https://breakingagent.com/research/reflexion-revisited/
A meta-analysis of 41 papers building on Reflexion-style self-critique loops finds modest, durable gains in coding and tool-use, and diminishing returns in open-ended reasoning.
Sat, 18 Apr 2026 10:00:00 GMT
Tags: self-critique, reflexion, meta-analysis

Google opens Gemini Agent SDK with first-party MCP server registry
https://breakingagent.com/news/google-gemini-agent-sdk/
The Agent SDK ships with a curated MCP registry, native long-running task support, and managed memory tied to Vertex AI.
Wed, 15 Apr 2026 11:00:00 GMT
Tags: google, gemini, mcp, sdk

[Research] Long-horizon memory: survey of seven architectures, ranked by recall and cost
https://breakingagent.com/research/long-horizon-memory-survey/
Compares episodic, semantic, hybrid, and graph-based memory across realistic 30-day agent simulations. Hybrid stores win on recall; graph stores win on cost stability.
Tue, 14 Apr 2026 09:30:00 GMT
Tags: memory, long-horizon, survey

SWE-bench Verified hits 78%, prompting calls for a harder coding eval
https://breakingagent.com/news/swe-bench-verified-saturated/
Top coding agents now resolve more than three of every four tasks in SWE-bench Verified, reigniting debate over whether the benchmark still discriminates between systems.
Sun, 12 Apr 2026 08:00:00 GMT
Tags: benchmarks, evaluation, coding-agents

EU AI Office issues draft guidance on autonomous agent disclosures
https://breakingagent.com/news/eu-ai-act-agent-guidance/
The draft requires clear disclosure when agents act on a user's behalf in regulated transactions, plus an audit log requirement for high-risk deployments.
Thu, 09 Apr 2026 14:25:00 GMT
Tags: regulation, eu-ai-act, governance, compliance

[Research] Six failure modes in tool-using agents, and the patterns that fix them
https://breakingagent.com/research/tool-use-failure-modes/
An empirical taxonomy of agent tool-use failures across 4,000 traces from production deployments. Schema drift and silent partial-failure dominate.
Wed, 08 Apr 2026 13:15:00 GMT
Tags: tool-use, failure-modes, production

[Research] Decoupled planner-critic agents outperform monolithic planners on long tasks
https://breakingagent.com/research/planner-critic-decoupling/
Splitting planning and critique into specialized models with structured exchange yields a 14-point lift on multi-day research tasks.
Sat, 04 Apr 2026 10:00:00 GMT
Tags: planning, critic, architecture

[Research] The case for replay-based agent evaluation
https://breakingagent.com/research/agent-eval-replay-sets/
Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets: captured user sessions scored against a held-out outcome.
Mon, 30 Mar 2026 08:45:00 GMT
Tags: evaluation, replay, production