NVIDIA AI Brings Nemotron-Nano-3-30B to NVFP4 Using Quantization-Aware Distillation (QAD) for Enhanced Inference Efficiency

NVIDIA has just launched Nemotron-Nano-3-30B-A3B-NVFP4, a 30-billion-parameter reasoning model that runs in a 4-bit floating-point format known as NVFP4. What's impressive is that it maintains accuracy close to its higher-precision BF16 baseline.

The new model uses a hybrid architecture that combines Mamba2 and Transformer layers with a Mixture-of-Experts (MoE) design. It was quantized with Quantization-Aware Distillation (QAD), a method tailored specifically to NVFP4. Together, these choices let Nemotron-Nano reach up to four times faster inference on NVIDIA's latest Blackwell B200 hardware.

So, what exactly is Nemotron-Nano-3-30B-A3B-NVFP4? It is a quantized version of an earlier release, Nemotron-Nano-3-30B-A3B-BF16, which the NVIDIA team built from the ground up as a versatile model for reasoning and chat. The network has 30 billion total parameters across 52 layers, including 23 Mamba2 and MoE layers. Each MoE layer contains 128 experts, with six active for every token processed, so only around 3.5 billion parameters are active per token (the "A3B" in the model name).
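To make the routing arithmetic concrete, here is a minimal PyTorch sketch of top-k MoE routing using the 128-expert, 6-active configuration described above. The hidden size and expert MLP shape are illustration values only, not Nemotron's actual dimensions.

```python
import torch

# Minimal sketch of top-k MoE routing with the configuration described above:
# 128 experts per MoE layer, 6 active per token. HIDDEN and the expert MLP
# shape are made up for the example.
NUM_EXPERTS, TOP_K, HIDDEN = 128, 6, 512

router = torch.nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Sequential(
        torch.nn.Linear(HIDDEN, 2 * HIDDEN),
        torch.nn.GELU(),
        torch.nn.Linear(2 * HIDDEN, HIDDEN),
    )
    for _ in range(NUM_EXPERTS)
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, HIDDEN). Each token is processed by its TOP_K best experts."""
    weights, idx = torch.topk(router(x).softmax(dim=-1), TOP_K, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
    out = torch.zeros_like(x)
    for e in range(NUM_EXPERTS):  # dispatch tokens expert by expert
        token_ids, slot = (idx == e).nonzero(as_tuple=True)
        if token_ids.numel():
            out[token_ids] += weights[token_ids, slot].unsqueeze(1) * experts[e](x[token_ids])
    return out

y = moe_forward(torch.randn(4, HIDDEN))  # only 6 of 128 expert MLPs run per token
```

Since only six of the 128 expert MLPs execute for each token, per-token compute scales with the roughly 3.5 billion active parameters rather than the full 30 billion.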

The training process involved pre-training on 25 trillion tokens under a tuned learning-rate schedule, followed by a three-stage post-training pipeline: supervised fine-tuning across a range of data types, reinforcement learning, and finally quantization to NVFP4.

The NVFP4 format is noteworthy because it is a 4-bit floating-point format built for efficiency. Compared with FP8, NVFP4 offers two to three times the arithmetic throughput and reduces memory usage by about 1.8 times. Its design pairs a small block size (16 elements) with a two-level scaling scheme, a per-block scale plus a per-tensor scale, which helps maintain accuracy while using less memory.
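To make the dual-scaling idea concrete, here is a simplified NumPy sketch of NVFP4-style fake quantization: values snap to the 4-bit E2M1 grid, each 16-element block carries its own scale, and a per-tensor scale keeps block scales within FP8 E4M3 range. This is an illustration of the scheme, not NVIDIA's actual kernel; in particular, the block scale is kept in full precision here rather than rounded to E4M3.

```python
import numpy as np

# Simplified sketch of NVFP4-style quantization: 4-bit E2M1 values,
# 16-element micro-blocks, and two levels of scaling (per-block plus
# per-tensor). Real NVFP4 stores the block scale in FP8 E4M3.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable magnitudes
BLOCK = 16
E4M3_MAX = 448.0  # largest FP8 E4M3 value

def fake_quant_nvfp4(x: np.ndarray) -> np.ndarray:
    """Quantize-dequantize a 1-D array whose length is a multiple of BLOCK."""
    # Per-tensor scale, chosen so every per-block scale fits the E4M3 range.
    tensor_scale = max(np.abs(x).max() / (6.0 * E4M3_MAX), 1e-12)
    out = np.empty_like(x)
    for i in range(0, x.size, BLOCK):
        blk = x[i:i + BLOCK]
        # Two-level scaling: the stored per-block scale is relative to the
        # per-tensor scale; their product maps the block into [-6, 6].
        block_scale = np.abs(blk).max() / 6.0 / tensor_scale
        scale = max(block_scale * tensor_scale, 1e-12)
        mags = np.abs(blk) / scale
        q = E2M1[np.abs(mags[:, None] - E2M1[None, :]).argmin(axis=1)]  # nearest grid value
        out[i:i + BLOCK] = np.sign(blk) * q * scale
    return out

x = np.random.randn(64)
print("max abs error:", np.abs(x - fake_quant_nvfp4(x)).max())
```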

One of the key innovations in this release is the shift from traditional Quantization-Aware Training (QAT) to QAD. Where QAT minimizes the task loss directly on the quantized model, QAD uses a frozen BF16 copy of the model as a teacher guiding the NVFP4 student: the student is trained to match the teacher's output distribution, recovering accuracy without needing to replicate the entire training process.
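Below is a minimal sketch of what a QAD training step could look like, assuming the student's forward pass already applies NVFP4 fake quantization to its weights. The model, optimizer, and helper names are hypothetical; only the frozen-teacher distillation objective reflects the description above.

```python
import torch
import torch.nn.functional as F

# Minimal QAD training step, per the description above: a frozen BF16 teacher
# guides the quantized student via KL divergence on logits, replacing the
# task loss. Names are hypothetical stand-ins, not NVIDIA's implementation.

def ste_round(x: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator: round in the forward pass, identity gradient backward."""
    return x + (torch.round(x) - x).detach()

def qad_step(student, teacher, batch, optimizer, T: float = 1.0) -> float:
    with torch.no_grad():
        teacher_logits = teacher(batch)      # frozen BF16 teacher, never updated
    student_logits = student(batch)          # forward runs through fake-quantized weights
    loss = T * T * F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    )
    optimizer.zero_grad()
    loss.backward()                          # gradients reach the underlying
    optimizer.step()                         # high-precision weights via the STE
    return loss.item()
```

The temperature-scaled KL with the `T * T` factor is the standard distillation formulation; because the objective only matches the teacher's output distribution, the student can be aligned on a far smaller corpus than the original training run.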

Benchmarks show the quantized model holds up remarkably well, retaining up to 99.4% of the BF16 version's accuracy. Across a range of reasoning and coding benchmarks, the NVFP4 model lost less accuracy than earlier quantization methods, making it a strong choice for developers.

In conclusion, NVIDIA's release of Nemotron-Nano-3-30B-A3B-NVFP4 marks a significant step forward in AI model efficiency and performance. With its hybrid architecture and distillation-based quantization recipe, it promises to extend the capabilities of reasoning models while keeping resource usage low.