Breaking

Microsoft launches MAI Transcribe, Voice, and Image for agent stacks

Microsoft’s MAI team released three in-house AI models—Transcribe, Voice, and Image—aimed at powering end-to-end multimodal agents across its ecosystem.

Microsoft’s MAI superintelligence team has rolled out three in-house models (MAI Transcribe One, MAI Voice One, and MAI Image Two), positioned as direct competitors to OpenAI’s speech and image offerings. MAI Transcribe One is a speech-to-text model covering 25 languages, reportedly 2.5x faster than Microsoft’s previous Azure “fast” tier and ranked first globally on the FLEURS benchmark, ahead of OpenAI’s Whisper. MAI Voice One generates 60 seconds of natural-sounding audio in under one second, supports custom voice creation from just a few seconds of sample audio, and is priced at $22 per million characters. MAI Image Two debuted as a top-three image model family on the arena.ai leaderboard and is already integrated into Bing, PowerPoint, and Copilot, with pricing at $5 per million input tokens. All three are available via Microsoft Foundry and the MAI playground.
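To put the quoted figures in budget terms, here is a rough back-of-the-envelope sketch. The two prices and the 60-seconds-in-under-one-second claim come from the reporting above. The function names and the sample workload sizes are illustrative assumptions, not part of any Microsoft SDK or rate card.

```python
# Figures as quoted in this brief; treat them as reported, not as an official rate card.
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0


def voice_cost_usd(characters: int) -> float:
    """Estimated MAI Voice One cost for synthesizing `characters` of text."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_input_cost_usd(input_tokens: int) -> float:
    """Estimated MAI Image Two cost for `input_tokens` of prompt input."""
    return input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS


def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """How many seconds of audio are produced per second of compute time."""
    return audio_seconds / wall_seconds


if __name__ == "__main__":
    # Hypothetical workload: 50,000 characters of speech, 200,000 image input tokens.
    print(f"voice: ${voice_cost_usd(50_000):.2f}")
    print(f"image input: ${image_input_cost_usd(200_000):.2f}")
    # 60 s of audio in at most 1 s means at least 60x real time.
    print(f"speed: at least {realtime_factor(60, 1):.0f}x real time")
```

At these rates, a hypothetical agent that speaks 50,000 characters would spend about $1.10 on voice. The image figure covers input tokens only, since the brief does not quote an output price.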

What changed. Microsoft is no longer relying solely on OpenAI for speech and image capabilities: it has shipped its own MAI Transcribe, Voice, and Image models, optimized for speed, quality, and integration across its productivity and Copilot platforms.

Why it matters. For agent builders inside the Microsoft ecosystem, this creates a vertically integrated multimodal stack that can handle listening, speaking, and seeing within a single vendor environment, simplifying security, billing, and deployment.

Builder takeaway. If your agents live in Copilot Studio, Azure, or Microsoft 365, you can now wire up voice-enabled, audiovisual agents by chaining MAI Transcribe, Voice, and Image, and you should reassess your dependency on OpenAI or third-party APIs for these capabilities.
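The chaining described above can be sketched as a single agent turn: listen, decide, speak, and optionally draw. This is a minimal outline under stated assumptions. The stage functions are hypothetical placeholders that you would back with whatever client calls Microsoft Foundry actually exposes, and none of the names below come from a Microsoft SDK.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stage signatures; each would wrap a real model call in practice.
Transcribe = Callable[[bytes], str]   # audio in -> transcript (speech-to-text)
Respond = Callable[[str], str]        # transcript -> reply text (the agent's logic)
Speak = Callable[[str], bytes]        # reply text -> audio out (text-to-speech)
Draw = Callable[[str], bytes]         # prompt -> image bytes (image generation)


@dataclass
class AgentTurn:
    transcript: str
    reply_text: str
    reply_audio: bytes
    image: Optional[bytes] = None


def run_turn(
    audio_in: bytes,
    transcribe: Transcribe,
    respond: Respond,
    speak: Speak,
    draw: Optional[Draw] = None,
) -> AgentTurn:
    """Run one listen -> respond -> speak turn, with an optional image step."""
    transcript = transcribe(audio_in)
    reply_text = respond(transcript)
    reply_audio = speak(reply_text)
    # The image stage is optional; skip it when no generator is supplied.
    image = draw(reply_text) if draw is not None else None
    return AgentTurn(transcript, reply_text, reply_audio, image)
```

Passing the stages in as plain callables keeps the turn logic independent of any one vendor, which makes it straightforward to swap a third-party speech or image API for an MAI model, or back again, while reassessing dependencies.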
