<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>BreakingAgent</title><description>Independent intelligence on agentic AI: what is changing, what matters, and what builders should do next.</description><link>https://breakingagent.com/</link><language>en-us</language><item><title>Anthropic moves Computer Use out of beta, ships native sandbox primitive</title><link>https://breakingagent.com/news/anthropic-computer-use-ga/</link><guid isPermaLink="true">https://breakingagent.com/news/anthropic-computer-use-ga/</guid><description>Claude&apos;s screen-grounded agent loop graduates with new tool-use primitives, an isolated sandbox, and tighter rate-limit policy for production deployments.</description><pubDate>Wed, 22 Apr 2026 09:30:00 GMT</pubDate><category>anthropic</category><category>computer-use</category><category>browser-agents</category><category>sandbox</category></item><item><title>OpenAI ships Swarm 2 with built-in handoff tracing and per-agent budgets</title><link>https://breakingagent.com/news/openai-swarm-2-multi-agent/</link><guid isPermaLink="true">https://breakingagent.com/news/openai-swarm-2-multi-agent/</guid><description>Swarm 2 introduces a structured handoff log, hard token budgets per agent, and an interoperability shim for LangGraph and CrewAI.</description><pubDate>Sun, 19 Apr 2026 16:05:00 GMT</pubDate><category>openai</category><category>multi-agent</category><category>orchestration</category><category>tracing</category></item><item><title>[Research] Reflexion, three years on: what self-critique still buys you</title><link>https://breakingagent.com/research/reflexion-revisited/</link><guid isPermaLink="true">https://breakingagent.com/research/reflexion-revisited/</guid><description>A meta-analysis of 41 papers building on Reflexion-style self-critique loops finds modest, durable gains in coding and tool-use, and diminishing returns in open-ended reasoning.</description><pubDate>Sat, 18 Apr 2026 10:00:00 GMT</pubDate>
<category>self-critique</category><category>reflexion</category><category>meta-analysis</category></item><item><title>Google opens Gemini Agent SDK with first-party MCP server registry</title><link>https://breakingagent.com/news/google-gemini-agent-sdk/</link><guid isPermaLink="true">https://breakingagent.com/news/google-gemini-agent-sdk/</guid><description>The Agent SDK ships with a curated MCP registry, native long-running task support, and managed memory tied to Vertex AI.</description><pubDate>Wed, 15 Apr 2026 11:00:00 GMT</pubDate><category>google</category><category>gemini</category><category>mcp</category><category>sdk</category></item><item><title>[Research] Long-horizon memory: survey of seven architectures, ranked by recall and cost</title><link>https://breakingagent.com/research/long-horizon-memory-survey/</link><guid isPermaLink="true">https://breakingagent.com/research/long-horizon-memory-survey/</guid><description>Compares episodic, semantic, hybrid, and graph-based memory across realistic 30-day agent simulations.
Hybrid stores win on recall; graph stores win on cost stability.</description><pubDate>Tue, 14 Apr 2026 09:30:00 GMT</pubDate><category>memory</category><category>long-horizon</category><category>survey</category></item><item><title>SWE-bench Verified hits 78%, prompting calls for a harder coding eval</title><link>https://breakingagent.com/news/swe-bench-verified-saturated/</link><guid isPermaLink="true">https://breakingagent.com/news/swe-bench-verified-saturated/</guid><description>Top coding agents now resolve more than three of every four tasks in SWE-bench Verified, reigniting debate over whether the benchmark still discriminates between systems.</description><pubDate>Sun, 12 Apr 2026 08:00:00 GMT</pubDate><category>benchmarks</category><category>evaluation</category><category>coding-agents</category></item><item><title>EU AI Office issues draft guidance on autonomous agent disclosures</title><link>https://breakingagent.com/news/eu-ai-act-agent-guidance/</link><guid isPermaLink="true">https://breakingagent.com/news/eu-ai-act-agent-guidance/</guid><description>The draft requires clear disclosure when agents act on a user&apos;s behalf in regulated transactions, plus an audit log requirement for high-risk deployments.</description><pubDate>Thu, 09 Apr 2026 14:25:00 GMT</pubDate><category>regulation</category><category>eu-ai-act</category><category>governance</category><category>compliance</category></item><item><title>[Research] Six failure modes in tool-using agents, and the patterns that fix them</title><link>https://breakingagent.com/research/tool-use-failure-modes/</link><guid isPermaLink="true">https://breakingagent.com/research/tool-use-failure-modes/</guid><description>An empirical taxonomy of agent tool-use failures across 4,000 traces from production deployments. 
Schema drift and silent partial-failure dominate.</description><pubDate>Wed, 08 Apr 2026 13:15:00 GMT</pubDate><category>tool-use</category><category>failure-modes</category><category>production</category></item><item><title>[Research] Decoupled planner-critic agents outperform monolithic planners on long tasks</title><link>https://breakingagent.com/research/planner-critic-decoupling/</link><guid isPermaLink="true">https://breakingagent.com/research/planner-critic-decoupling/</guid><description>Splitting planning and critique into specialized models with structured exchange yields a 14-point lift on multi-day research tasks.</description><pubDate>Sat, 04 Apr 2026 10:00:00 GMT</pubDate><category>planning</category><category>critic</category><category>architecture</category></item><item><title>[Research] The case for replay-based agent evaluation</title><link>https://breakingagent.com/research/agent-eval-replay-sets/</link><guid isPermaLink="true">https://breakingagent.com/research/agent-eval-replay-sets/</guid><description>Static benchmarks miss the failure modes that matter in production. This paper argues for replay sets — captured user sessions scored against a held-out outcome.</description><pubDate>Mon, 30 Mar 2026 08:45:00 GMT</pubDate><category>evaluation</category><category>replay</category><category>production</category></item></channel></rss>