Microsoft Launches Maia 200: An AI Inference Accelerator Optimized for FP4 and FP8 in Azure Datacenters

Microsoft has unveiled the Maia 200, an in-house AI accelerator designed to handle inference workloads in Azure datacenters. The new chip aims to make token generation for large language models and other reasoning workloads faster and more cost-effective, combining specialized compute, a layered memory hierarchy, and a high-speed Ethernet network.

The case for a dedicated inference chip stems from the differing demands of training and inference. Training emphasizes heavy cross-accelerator communication over long-running jobs, while inference is about maximizing the number of tokens generated per second while minimizing latency and cost per token. Microsoft claims that the Maia 200 delivers roughly 30% better performance per dollar than the previous hardware in its lineup.
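
As a rough illustration of what “performance per dollar” means for inference, the sketch below computes tokens generated per dollar of accelerator time and compares two configurations. The throughput and pricing numbers are hypothetical placeholders chosen only to show the arithmetic; Microsoft has not published the figures behind its claim.

```python
# Back-of-envelope illustration of "performance per dollar" for inference.
# All numbers below are hypothetical placeholders, not published Maia 200 figures.

def tokens_per_dollar(tokens_per_second: float, hourly_cost_usd: float) -> float:
    """Tokens generated per dollar of accelerator time."""
    return tokens_per_second * 3600 / hourly_cost_usd

baseline = tokens_per_dollar(tokens_per_second=1_000, hourly_cost_usd=4.00)  # hypothetical previous generation
maia_200 = tokens_per_dollar(tokens_per_second=1_200, hourly_cost_usd=3.70)  # hypothetical Maia 200 instance

improvement = maia_200 / baseline - 1
print(f"baseline: {baseline:,.0f} tokens/$   maia 200: {maia_200:,.0f} tokens/$   "
      f"improvement: {improvement:.0%}")
```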

This new chip is part of a diverse Azure ecosystem and will support a range of models, including OpenAI’s latest GPT 5.2. It will also power applications such as Microsoft Foundry and Microsoft 365 Copilot. The Microsoft Superintelligence team plans to use Maia 200 for synthetic data generation and reinforcement learning to improve its in-house models.

In terms of specifications, the Maia 200 is built on TSMC’s advanced 3-nanometer process and features over 140 billion transistors. Its native FP8 and FP4 tensor cores deliver more than 10 petaFLOPS of FP4 compute and over 5 petaFLOPS of FP8, all within a 750 W power envelope. The chip pairs 216 GB of HBM3e memory, offering roughly 7 TB per second of bandwidth, with 272 MB of on-chip SRAM.
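
Those headline numbers allow a quick sanity check on decode throughput. When generating tokens one at a time, speed is often limited by how fast the model’s weights can be streamed out of HBM, so the roughly 7 TB/s of bandwidth sets an approximate ceiling. The sketch below works through that arithmetic for a hypothetical 70-billion-parameter model at FP16, FP8, and FP4; the model size and the weight-streaming assumption are illustrative simplifications, not Maia 200 benchmarks.

```python
# Rough memory-bandwidth ceiling for single-batch decode on one accelerator.
# Hardware numbers come from the article; the 70B model size and the assumption
# that decode is bound by weight streaming are illustrative simplifications.

HBM_BANDWIDTH_TBPS = 7.0   # ~7 TB/s HBM3e bandwidth (from the article)
HBM_CAPACITY_GB = 216      # 216 GB HBM3e capacity (from the article)

def decode_ceiling_tokens_per_s(params_billion: float, bits_per_weight: int) -> float:
    """Upper bound on tokens/s if every generated token reads all weights once."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return HBM_BANDWIDTH_TBPS * 1e12 / weight_bytes

for bits in (16, 8, 4):                 # FP16 vs FP8 vs FP4 weights
    size_gb = 70 * bits / 8             # hypothetical 70B-parameter model
    fits = "fits in HBM" if size_gb <= HBM_CAPACITY_GB else "exceeds HBM"
    print(f"FP{bits:>2}: {size_gb:6.1f} GB ({fits}), "
          f"decode ceiling ≈ {decode_ceiling_tokens_per_s(70, bits):4.0f} tokens/s")
```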

The Maia 200’s architecture is organized into tiles and clusters, each functioning as an independent unit of compute and local storage. This design keeps data movement efficient, reducing the bottlenecks that can slow inference. The chip also incorporates a custom Network on Chip that manages traffic between components, ensuring that large bulk transfers do not block smaller, latency-sensitive messages.
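
The tile-and-cluster layout is fundamentally about locality: each tile works on a slice of the problem out of nearby SRAM rather than repeatedly reaching back to HBM. The blocked matrix multiply below illustrates the general idea in plain Python; it is a conceptual analogy, not Maia 200’s actual programming model, and the tile size is arbitrary.

```python
# Conceptual illustration of tile-based computation: each output tile is built
# from small blocks that would fit in a tile's local SRAM, so slow memory is
# touched in large, regular chunks rather than element by element.
# This is an analogy for the idea, not Maia 200's actual execution model.
import numpy as np

def blocked_matmul(a: np.ndarray, b: np.ndarray, tile: int = 128) -> np.ndarray:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):          # one output tile per (i, j) block
        for j in range(0, n, tile):
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=a.dtype)
            for p in range(0, k, tile):  # stream blocks of the shared dimension
                acc += a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
            out[i:i + tile, j:j + tile] = acc
    return out

a, b = np.random.rand(512, 512), np.random.rand(512, 512)
assert np.allclose(blocked_matmul(a, b), a @ b)
```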

Additionally, the Maia 200 features an integrated network interface card (NIC) that provides high-speed Ethernet connectivity of up to 1.4 TB per second in each direction. This allows deployments to scale to as many as 6,144 accelerators, supporting large multi-accelerator inference workloads.
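
To get a feel for what 1.4 TB per second per direction means in practice, the back-of-envelope sketch below estimates transfer times for a few payload sizes over a single link. The payload choices and the assumption of a fully saturated, overhead-free link are illustrative only.

```python
# Back-of-envelope link-transfer times for the integrated Ethernet NIC.
# The 1.4 TB/s per-direction figure is from the article; payload sizes and the
# assumption of a saturated, overhead-free link are illustrative assumptions.

LINK_TBPS = 1.4  # per-direction bandwidth (from the article)

def transfer_ms(payload_gb: float) -> float:
    """Milliseconds to move a payload over one direction of the link."""
    return payload_gb * 1e9 / (LINK_TBPS * 1e12) * 1e3

# Illustrative payloads: an activation slab, FP4 weights of a 70B model, a full HBM image.
for payload_gb in (1, 35, 216):
    print(f"{payload_gb:>4} GB -> {transfer_ms(payload_gb):7.2f} ms")
```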

From a system integration standpoint, the Maia 200 adheres to the same standards as Azure GPU servers, supporting both air and liquid cooling options. This flexibility allows for mixed deployments of GPUs and Maia accelerators within the same datacenter footprint. The chip also integrates seamlessly with Azure’s control plane, enabling efficient firmware management and maintenance without disrupting ongoing AI workloads.

Overall, the Maia 200 represents a significant step forward in AI inference technology, aiming to set a new standard for performance and cost-efficiency in cloud-based AI applications.