Microsoft launches MAI Transcribe, Voice, and Image for agent stacks

Microsoft’s MAI superintelligence team has rolled out three in-house models, MAI Transcribe One, MAI Voice One, and MAI Image Two, positioned as direct competitors to OpenAI’s speech and image offerings. MAI Transcribe One is a speech-to-text model covering 25 languages; it is reportedly 2.5x faster than Microsoft’s previous Azure “fast” tier and ranks first globally on the FLEURS benchmark, ahead of OpenAI’s Whisper. MAI Voice One generates 60 seconds of natural-sounding audio in under one second and supports custom voice creation from just a few seconds of sample audio; it is priced at $22 per million characters. MAI Image Two debuted as a top-three image model family on the arena.ai leaderboard and is already integrated into Bing, PowerPoint, and Copilot, with pricing at $5 per million input tokens. All three are available via Microsoft Foundry and the MAI playground.

What changed. Microsoft is no longer relying solely on OpenAI for speech and image capabilities: it has shipped its own MAI Transcribe, Voice, and Image models, optimized for speed, quality, and integration across its productivity and Copilot platforms.

Why it matters. For agent builders inside the Microsoft ecosystem, this creates a vertically integrated multimodal stack that covers transcription, speech synthesis, and image generation within a single vendor environment, which can simplify security, billing, and deployment.

Builder takeaway. If your agents live in Copilot Studio, Azure, or Microsoft 365, you can now wire up voice-enabled, audiovisual agents by chaining MAI Transcribe, Voice, and Image, and you should reassess your dependency on OpenAI or third-party APIs for these capabilities.
