ToolCUA: Enhancing Tool-Use Reliability in Open-Source Agents
New SOTA for comparable-scale models on OSWorld-MCP via improved tool comprehension and usage accuracy.
ToolCUA targets the Achilles heel of agents: actually using tools right. By drilling comprehension-usage-action pipelines, it gives open models a 66% relative lift on OSWorld-MCP desktop tasks, reaching a new SOTA of 46.85% among comparable-scale models. The result exposes how generic finetuning leaves tool reliability on the table.
What changed. A 66% relative lift, to 46.85% on OSWorld-MCP, from tool-specific training.
Why it matters. Tool failures kill 50%+ of agent runs, and this work targets exactly that failure mode.
Builder takeaway. Don’t just RLHF agents; decompose tool skills into CUA modules.
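For readers who want to sanity-check the headline numbers, here is a minimal sketch of the arithmetic. It assumes the 66% figure is a relative lift over a single baseline score, which the summary implies but does not state outright. The function name `implied_baseline` is hypothetical, and the baseline it prints is derived from the two quoted figures rather than reported anywhere in this summary.

```python
def implied_baseline(final_score: float, relative_lift: float) -> float:
    """Back out the starting score from a final score and a relative lift.

    Assumes final_score = baseline * (1 + relative_lift).
    """
    return final_score / (1.0 + relative_lift)


# Figures quoted in the summary: 46.85% final score, 66% relative lift.
final_score = 46.85
baseline = implied_baseline(final_score, 0.66)
absolute_gain = final_score - baseline

print(f"implied baseline: {baseline:.2f}%")
print(f"absolute gain: {absolute_gain:.2f} points")
```

Under that assumption the starting point works out to roughly 28%, so the gain is about 18.6 percentage points. That is a useful distinction when comparing against results reported as absolute gains.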