NVIDIA has unveiled Nemotron-Elastic-12B, a model designed to simplify how AI development teams handle large language models by offering three sizes (12 billion, 9 billion, and 6 billion parameters) from a single training run. This approach eliminates the need to train each model size separately, saving both time and compute.
Traditionally, AI systems require multiple model sizes to serve various needs. Larger models are typically used for heavy server workloads, while smaller models are better suited for devices with limited processing power or tight latency requirements. This often means that teams have to train or distill each model size separately, which can be costly in terms of both storage and processing power.
Nemotron Elastic changes this by combining the three sizes into one flexible framework. Training starts from a 12-billion-parameter parent model, and an elastic training method lets the smaller variants be extracted from it without additional optimization. Because the smaller models are nested inside the parent and share its weights and metadata, overall training costs and memory requirements drop significantly.
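To make the weight-sharing idea concrete, here is a minimal sketch (not NVIDIA's code) of how nested variants can reuse one set of parent parameters: a narrower sub-model is simply a slice of the same stored tensors, so no extra copies or retraining runs are needed. The layer widths below are illustrative, not the real Nemotron Elastic configuration.

```python
import torch
import torch.nn as nn

class ElasticLinear(nn.Module):
    """A linear layer whose smaller variants are slices of the parent weights."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Full-size (parent) parameters; sub-models index into these directly.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor, width: int | None = None) -> torch.Tensor:
        # Keeping only the first `width` output channels yields a smaller
        # variant that reuses the parent's stored weights with no duplication.
        w = self.weight if width is None else self.weight[:width]
        b = self.bias if width is None else self.bias[:width]
        return nn.functional.linear(x, w, b)

layer = ElasticLinear(in_features=4096, out_features=4096)
x = torch.randn(2, 4096)
full = layer(x)               # full-width path (the "12B-like" parent)
small = layer(x, width=2048)  # nested, narrower variant, same stored weights
print(full.shape, small.shape)  # torch.Size([2, 4096]) torch.Size([2, 2048])
```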
The Nemotron Elastic architecture is a hybrid of Mamba-2 and Transformer layers. A dynamic masking system adjusts the model's width and depth to fit a given deployment budget, allowing resources to be allocated efficiently per task, and a router module selects the configuration to use for each budget.
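The sketch below illustrates the budget-conditioned masking idea described above: given a parameter budget, a configuration is chosen and binary masks zero out the unused depth and width of the shared weights. The budget table, layer counts, and widths here are assumed placeholder values, not NVIDIA's published settings, and the real router is a learned module rather than a lookup.

```python
import torch

# Hypothetical mapping from a parameter budget to (kept layers, kept width).
BUDGET_CONFIGS = {
    "12B": {"layers": 40, "width": 4096},
    "9B":  {"layers": 32, "width": 3584},
    "6B":  {"layers": 24, "width": 3072},
}

def build_masks(budget: str, max_layers: int = 40, max_width: int = 4096):
    """Return binary depth and width masks for the requested budget."""
    cfg = BUDGET_CONFIGS[budget]
    depth_mask = torch.zeros(max_layers)
    depth_mask[: cfg["layers"]] = 1.0   # keep the first N layers
    width_mask = torch.zeros(max_width)
    width_mask[: cfg["width"]] = 1.0    # keep the first W channels
    return depth_mask, width_mask

depth_mask, width_mask = build_masks("6B")
print(int(depth_mask.sum()), int(width_mask.sum()))  # 24 3072
```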
NVIDIA trains the model in two stages. The first stage uses shorter contexts with a large batch size; the second increases the context length and shifts the sampling to favor the largest model. NVIDIA reports strong reasoning performance across all three model sizes with this schedule.
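As a rough illustration of such a two-stage curriculum, the sketch below samples which nested budget to optimize at each step, with stage two extending the context and skewing sampling toward the largest model. The context lengths, batch sizes, and probabilities are assumed values for illustration only, not NVIDIA's published hyperparameters.

```python
import random

STAGES = [
    {"name": "stage1", "context_len": 8192,  "batch_size": 1024,
     "budget_probs": {"12B": 1/3, "9B": 1/3, "6B": 1/3}},
    {"name": "stage2", "context_len": 49152, "batch_size": 256,
     "budget_probs": {"12B": 0.5, "9B": 0.25, "6B": 0.25}},
]

def sample_budget(stage: dict) -> str:
    """Pick which nested variant to optimize on the next training step."""
    budgets, probs = zip(*stage["budget_probs"].items())
    return random.choices(budgets, weights=probs, k=1)[0]

for stage in STAGES:
    picks = [sample_budget(stage) for _ in range(6)]
    print(stage["name"], stage["context_len"], picks)
```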
Benchmark tests have demonstrated that the Nemotron Elastic model performs comparably to its predecessors, achieving high accuracy on reasoning-heavy tasks like MATH 500 and AIME 2025. Notably, the 12B variant matches the performance of the original Nemotron-Nano-V2-12B model while also providing the flexibility of smaller models.
In terms of efficiency, Nemotron Elastic requires far fewer training tokens than traditional approaches: only 110 billion tokens to derive the smaller models, compared with the hundreds of billions typically needed to train or distill them individually. The savings extend to deployment, where all three model sizes together require only 24 GB of storage, versus 42 GB for separately stored models.
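The 24 GB figure is consistent with the nesting described earlier: if the 9B and 6B variants live inside the 12B checkpoint, the deployment footprint is just that of the largest model. A quick sanity check, assuming BF16 weights at 2 bytes per parameter (an assumption about the stored precision, not a stated detail):

```python
# Storage for all three nested sizes equals the 12B parent alone,
# assuming BF16 weights (2 bytes per parameter).
BYTES_PER_PARAM = 2
params_12b = 12e9
nested_total_gb = params_12b * BYTES_PER_PARAM / 1e9
print(f"{nested_total_gb:.0f} GB")  # ~24 GB for all three nested sizes
```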
Overall, the release of Nemotron-Elastic-12B marks a notable step forward in AI model development, making it easier and more cost-effective for teams to deploy models of varying sizes. The approach streamlines both training and deployment while keeping accuracy competitive across the supported sizes and applications.