
NVIDIA AI Unveils TiDAR: A Hybrid Diffusion Autoregressive Framework for Enhanced LLM Inference Throughput

NVIDIA researchers have introduced TiDAR, a language model architecture designed to speed up text generation without sacrificing quality. The model combines two approaches, diffusion and autoregressive sampling, so that it can draft and verify tokens within a single forward pass. The main goal is to exploit processing power that otherwise sits idle on modern GPUs during decoding while maintaining high output quality.

Autoregressive models typically generate one token per forward pass, which leaves much of a GPU's compute idle during decoding. TiDAR addresses this with what the team calls "free token slots": extra positions appended to the input so the model can predict several tokens in the same pass without significantly increasing latency (a rough sketch follows). Pure diffusion models, by contrast, have struggled to maintain quality when generating multiple tokens at once, because they often sample those tokens independently, which hurts coherence and factual accuracy.
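To make the "free token slots" idea concrete, here is a minimal sketch, not the paper's implementation, of appending mask-token slots to the context so that one forward pass also yields logits for several future positions. The Hugging-Face-style `model(...).logits` interface and the `mask_id` token are assumptions made for illustration.

```python
# Illustrative sketch only: append k mask-token "slots" to the context so a
# single forward pass also produces logits for k future positions.
import torch

def forward_with_free_slots(model, input_ids, mask_id, k=4):
    """Run one forward pass over the context plus k appended mask slots.

    Returns the next-token logits (ordinary autoregressive position) and the
    logits for the k drafted slots, all obtained from the same pass.
    """
    batch = input_ids.shape[0]
    slots = torch.full((batch, k), mask_id,
                       dtype=input_ids.dtype, device=input_ids.device)
    extended = torch.cat([input_ids, slots], dim=1)        # [B, T + k]
    logits = model(extended).logits                        # [B, T + k, vocab]
    next_token_logits = logits[:, input_ids.shape[1] - 1]  # standard AR position
    draft_logits = logits[:, input_ids.shape[1]:]          # the k "free" slots
    return next_token_logits, draft_logits
```

Because the extra slots ride along in the same pass, the marginal cost of drafting them stays small whenever the GPU is not already compute-bound.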

To tackle this, TiDAR arranges the token sequence into three sections: accepted tokens, previously drafted tokens, and masked slots for future predictions. This layout supports a structured attention mechanism in which the model drafts and verifies tokens in parallel within the same forward pass.
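One plausible reading of that structured attention is a hybrid mask: causal attention over the accepted and drafted tokens, and bidirectional attention among the masked slots. The sketch below builds such a mask; the paper's exact masking rules may differ.

```python
import torch

def hybrid_attention_mask(n_prefix, n_draft, n_mask):
    """Boolean mask (True = may attend) for one TiDAR-style forward pass.

    Sequence layout: [accepted prefix | previously drafted | mask slots]
    - prefix and drafted tokens use ordinary causal attention,
    - mask slots attend to everything before them and to each other
      (bidirectional within their block), enabling parallel drafting.
    This is an illustration, not the paper's exact masking scheme.
    """
    total = n_prefix + n_draft + n_mask
    mask = torch.tril(torch.ones(total, total)).bool()  # causal base
    start = n_prefix + n_draft
    mask[start:, start:] = True  # bidirectional block over the mask slots
    return mask

print(hybrid_attention_mask(3, 2, 3).int())
```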

During training, TiDAR doubles the sequence length so it can learn from both the original input and a masked (corrupted) copy in the same pass: the autoregressive loss is computed on the clean copy and the diffusion loss on the corrupted one. This lets the model balance the two objectives, improving overall performance.
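A minimal sketch of such a combined objective is shown below, assuming a model that accepts the concatenated [clean | corrupted] sequence and returns per-position logits. The masking rate, loss weighting, and attention handling here are placeholders rather than the paper's recipe.

```python
import torch
import torch.nn.functional as F

def tidar_style_loss(model, input_ids, mask_id, mask_prob=0.5, lam=1.0):
    """Illustrative combined objective over a doubled sequence (assumption,
    not the paper's exact formulation)."""
    # Corrupted copy: randomly replace tokens with the mask id.
    corrupt = input_ids.clone()
    noise = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    corrupt[noise] = mask_id

    # One forward pass over the doubled [clean | corrupted] sequence.
    doubled = torch.cat([input_ids, corrupt], dim=1)
    logits = model(doubled).logits
    T = input_ids.shape[1]

    # Autoregressive loss on the clean half: predict the next token.
    ar_loss = F.cross_entropy(
        logits[:, :T - 1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1))

    # Diffusion-style loss: recover the original tokens at masked positions.
    diff_logits = logits[:, T:]
    diff_loss = F.cross_entropy(diff_logits[noise], input_ids[noise])

    return ar_loss + lam * diff_loss
```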

At inference time, TiDAR operates in two steps per pass. It first drafts candidate tokens from the masked slots, then verifies them against the autoregressive logits produced in the same pass: drafts that agree with the autoregressive prediction are accepted, and the model falls back to the autoregressive output where they disagree, in the spirit of speculative decoding.
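A greedy version of that accept-or-fall-back rule might look like the following; the acceptance rule used in practice can be more involved (e.g. sampling-based), so treat this as an illustration only.

```python
import torch

def verify_drafts(draft_tokens, ar_logits):
    """Greedy acceptance sketch: keep drafted tokens while they match the
    model's own autoregressive argmax, then take the autoregressive token at
    the first disagreement and discard the rest of the draft.

    draft_tokens: [k] drafted token ids for the next k positions
    ar_logits:    [k, vocab] autoregressive logits at those same positions
    """
    ar_choice = ar_logits.argmax(dim=-1)  # what the AR path would emit
    accepted = []
    for i in range(draft_tokens.shape[0]):
        if draft_tokens[i] == ar_choice[i]:
            accepted.append(draft_tokens[i].item())  # draft agrees: accept it
        else:
            accepted.append(ar_choice[i].item())     # disagreement: use AR token
            break                                    # later drafts are discarded
    return accepted
```

Every accepted draft token is an output position the model did not have to spend a separate forward pass on, which is where the throughput gain comes from.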

The researchers trained TiDAR at several model sizes, including 1.5 billion, 4 billion, and 8 billion parameters, on large datasets. They found that TiDAR achieved strong results on several benchmarks, including coding and math tasks, while generating more tokens per forward pass than standard autoregressive decoding.

In terms of speed, TiDAR demonstrated significant improvements. The 1.5 billion parameter model decoded 4.71 times faster than its autoregressive counterpart, and the 8 billion parameter model saw a 5.91 times speedup. This efficiency comes without a substantial drop in quality, making TiDAR a strong competitor to both autoregressive models and other diffusion-based approaches.

Overall, TiDAR represents a notable advance in language model inference, offering faster text generation with little loss in quality. By exploiting free GPU capacity and combining diffusion drafting with autoregressive verification, NVIDIA aims to push the boundaries of what is possible in natural language processing.