Meta AI Unveils ‘NATURALREASONING’: A Comprehensive Multi-Domain Dataset Featuring 2.8 Million Questions to Boost LLM Reasoning Skills

Meta AI, in collaboration with researchers from New York University, has unveiled a groundbreaking dataset named NATURALREASONING, designed to significantly enhance the reasoning capabilities of large language models (LLMs). This dataset, which comprises 2.8 million reasoning questions, draws from a variety of disciplines, including Mathematics, Physics, Computer Science, and Economics, aiming to provide a more robust resource for evaluating and improving LLM performance in real-world reasoning tasks.

The announcement comes at a time when LLMs such as OpenAI’s o1 and DeepSeek’s R1 are posting strong results on traditional problem-solving benchmarks. Existing datasets, however, often fall short in offering diverse, open-ended reasoning assessments, limiting how comprehensively these models’ true potential can be evaluated: conventional benchmarks tend to focus on narrow problem-solving tasks rather than the broader range of reasoning scenarios encountered in practice.

Previous efforts to enhance LLM reasoning have relied heavily on synthetic data generation and self-training. While techniques such as STaR and MetaMath have shown promise, they depend on existing high-quality datasets, which restricts their scalability. Self-training approaches, though nominally unsupervised, often still require substantial human input, making them resource-intensive and costly.

In contrast, NATURALREASONING takes a novel approach by utilizing backtranslation from pretraining corpora to create authentic reasoning questions. This method not only ensures the dataset reflects real-world problems but also integrates a mix of verifiable and open-ended queries, including theorem proving. Such diversity is crucial for training algorithms that can enhance LLMs’ reasoning abilities beyond mere verification tasks.
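To make the backtranslation idea concrete, here is a minimal sketch of how a reasoning question might be synthesized from a corpus passage. The prompt wording, model choice, and use of the `openai` client are illustrative assumptions, not Meta’s actual pipeline; any instruction-tuned LLM could stand in.

```python
# Minimal sketch: backtranslating a reasoning question from a corpus passage.
# The prompt, model name, and OpenAI client are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Read the passage below and write one self-contained reasoning question "
    "that the passage could help answer. Do not refer to the passage itself.\n\n"
    "Passage:\n{passage}\n\nQuestion:"
)

def backtranslate_question(passage: str, model: str = "gpt-4o-mini") -> str:
    """Turn a pretraining-corpus passage into a standalone reasoning question."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(passage=passage)}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
```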

The dataset’s effectiveness is demonstrated through two main strategies, knowledge distillation and supervised finetuning, both of which show better scaling trends than existing datasets. For instance, to target a specific reasoning benchmark such as GPQA, the researchers sampled benchmark questions and retrieved similar, decontaminated questions from NATURALREASONING using cosine similarity between question embeddings; the retrieved questions were then clustered into groups for more focused training.
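A rough sketch of that retrieval step, assuming a sentence-transformers encoder and an arbitrary similarity cutoff for decontamination (neither the embedding model nor the threshold is specified in the announcement), might look like this:

```python
# Illustrative sketch: retrieve NATURALREASONING questions similar to a
# target benchmark, dropping near-duplicates that would leak the benchmark.
# The encoder and the 0.95 cutoff are assumptions, not the paper's values.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_similar(benchmark_qs, corpus_qs, k=1000, cutoff=0.95):
    """Rank corpus questions by cosine similarity to benchmark questions."""
    bench = encoder.encode(benchmark_qs, normalize_embeddings=True)
    corpus = encoder.encode(corpus_qs, normalize_embeddings=True)
    sims = corpus @ bench.T            # cosine similarity on unit vectors
    best = sims.max(axis=1)            # nearest benchmark question per item
    ranked = np.argsort(-best)         # most similar first
    ranked = [i for i in ranked if best[i] < cutoff]  # decontaminate
    return [corpus_qs[i] for i in ranked[:k]]
```

The retained questions could then be grouped, for example with k-means over the same embeddings, to form the focused training clusters described above.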

Evaluation results indicate that models trained on NATURALREASONING outperform the Llama3.1-8B-Instruct baseline even with a training set of only 1.5 million examples. Datasets such as OpenMathInstruct-2 and WebInstruct struggle to match this performance, particularly when generalizing across benchmarks: math-specific datasets post strong results on math-related tasks but do not maintain consistent performance in broader reasoning contexts.

In summary, the introduction of NATURALREASONING marks a significant step forward in advancing the capabilities of LLMs. By offering a diverse range of reasoning questions across multiple domains, the dataset aims to support the development of more capable AI systems. The research highlights the potential for improved reasoning performance through knowledge distillation and unsupervised self-training, paving the way for future advances in AI reasoning. The dataset is publicly available for researchers and practitioners to access and study.