---
license: apache-2.0
---

# DensingLaw-ScalingBench

This dataset was created to enable more accurate performance evaluation of Large Language Models (LLMs). It addresses the limitations of traditional evaluation methods, which often focus solely on the final answer, by providing detailed, GPT-4o-generated reasoning steps (Chain-of-Thought) for each instance in the benchmark test sets.

This dataset is released as part of our paper, **Densing Law of LLMs**.

<div align="center">
<img src="assets/densinglaw.png" width="600"/>
</div>

<!-- <div align="center">
English | [简体中文]()
</div> -->

<div align="center">

[📜 Paper](https://arxiv.org/pdf/2412.04315)
<!-- | [💻 Github Repo]() -->

</div>

## 💡 Overview

When evaluating Large Language Models (LLMs), especially on complex tasks such as mathematical problems or multiple-choice questions, comparing only the final answers is often insufficient. A model might arrive at the correct answer for the wrong reasons, or make a minor calculation error in an otherwise correct reasoning process. The vast majority of traditional benchmark datasets lack annotations for these essential reasoning steps.

To address this gap, we propose a more robust evaluation framework. As stated in our paper:

> It is important to note that most datasets do not provide reasoning steps for each instance. For both two types of tasks, we use GPT-4o to generate reasoning steps for all test instances. These approaches allow us to better estimate the model’s performance by considering the specific requirements and formats of different tasks.

This dataset is the direct result of that work. We leveraged the GPT-4o model to generate high-quality reasoning steps for all test instances in MMLU, BBH, MATH, MBPP, and HumanEval, enabling researchers to conduct a deeper and more equitable analysis of a model's logical reasoning capabilities.

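For orientation, the snippet below shows one way to load and inspect the data with the Hugging Face `datasets` library. It is a minimal sketch: the repository id, split name, and field names (`question`, `reasoning`, `answer`) are illustrative assumptions rather than the documented schema, so check the dataset files or the dataset viewer for the actual layout.

```python
from datasets import load_dataset

# "DensingLaw-ScalingBench" stands in for the full repository id
# (namespace/name) on the Hugging Face Hub; the split name is also an
# assumption -- adjust both to match this repository.
ds = load_dataset("DensingLaw-ScalingBench", split="test")

example = ds[0]
print(example.get("question"))   # original benchmark question (assumed field)
print(example.get("reasoning"))  # GPT-4o generated chain-of-thought (assumed field)
print(example.get("answer"))     # reference final answer (assumed field)
```
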
## 🎯 Motivation and Creation Process

### Motivation

Our core research objective is to define and calculate the "density" of LLMs, which is the ratio of their effective parameter size to their actual parameter size. This requires a precise evaluation of model performance on specific downstream tasks. We identified shortcomings in traditional evaluation methods for several key task types:

1. **Multiple-Choice Questions**: Calculating the loss only on the correct option's token ignores the model's understanding and analysis of the problem itself.

2. **Complex Problems (e.g., Mathematics)**: These tasks often require a complete chain of reasoning to arrive at a solution. Evaluating the entire process reflects a model's true capabilities better than evaluating the single-token answer (see the sketch after this list).

3. **Code Generation Problems**: For programming tasks, evaluating code solely on functional correctness (i.e., whether it passes unit tests) is insufficient. This overlooks crucial aspects of code quality, such as algorithmic **efficiency** (e.g., `O(n log n)` vs. `O(n^2)`), **readability**, and adherence to best practices. A model might generate a brute-force solution that passes all tests but is highly inefficient. Assessing the model's high-level plan or algorithmic logic provides a more comprehensive measure of its coding intelligence.

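To make points 1 and 2 concrete, here is a minimal sketch of scoring a model by its loss on a reference reasoning chain rather than on the answer token alone. The model name, example strings, and masking scheme are illustrative assumptions and not necessarily the exact procedure used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder causal LM; substitute the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def reasoning_loss(question: str, reasoning: str) -> float:
    """Average negative log-likelihood of the reasoning tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + reasoning, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # mask the question; only reasoning tokens are scored
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss
    return loss.item()

print(reasoning_loss("Q: What is 12 * 7?\nA:", " 12 * 7 = 84. The answer is 84."))
```

Masking the question tokens with `-100` means the loss is averaged only over the reasoning and answer tokens, so a model is credited for assigning high probability to the entire solution path rather than to the final token alone.
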
### 🔧 Generation Process

To construct this dataset, we followed these steps:

1. **Base Datasets**: We selected MMLU, BBH, MATH, MBPP, and HumanEval as our foundation.
2. **Prompt Engineering**: For each test question, we designed appropriate prompts to elicit detailed reasoning.
3. **Reasoning Generation**: We used the **GPT-4o** API to generate coherent, step-by-step reasoning that leads to a final answer (a minimal sketch follows this list).
4. **Integration**: We integrated these generated reasoning steps with the original questions and answers to create the new, augmented data instances.

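Below is a minimal sketch of step 3, using the `openai` Python client to request step-by-step reasoning for a single test instance. The prompt wording and the helper function are illustrative assumptions, not the exact prompts used to build this dataset, and an `OPENAI_API_KEY` is assumed to be set in the environment.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_reasoning(question: str) -> str:
    """Ask GPT-4o for step-by-step reasoning that ends with a final answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Solve the problem with clear, step-by-step reasoning, "
                           "then state the final answer on the last line.",
            },
            {"role": "user", "content": question},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content

print(generate_reasoning("If 3x + 5 = 20, what is x?"))
```

The generated text is then paired with the original question and reference answer (step 4) to form an augmented instance.
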
## ✅ Supported Tasks

This dataset is designed to enhance the evaluation of tasks such as:

* Mathematical Reasoning
* Code Reasoning
* Multiple-Choice Question Answering

## ⚠️ Disclaimer

* The reasoning steps included in this dataset were automatically generated by **GPT-4o**. While we have made efforts to ensure their quality, we cannot guarantee that every reasoning process is entirely correct or flawless.
* For any given problem, the solution provided by GPT-4o represents only one of many possible reasoning paths and should not be considered the sole "correct" method.
* We encourage users to treat these reasoning steps as "soft" labels or references for evaluating a model's logical capabilities, rather than as absolute ground truth.

## 📜 License

This dataset is released under the Apache 2.0 license.

## 📚 Citation

If you use this dataset in your research, please cite our paper:

```bibtex
@misc{xiao2024densinglawllms,
      title={Densing Law of LLMs},
      author={Chaojun Xiao and Jie Cai and Weilin Zhao and Guoyang Zeng and Biyuan Lin and Jie Zhou and Zhi Zheng and Xu Han and Zhiyuan Liu and Maosong Sun},
      year={2024},
      eprint={2412.04315},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2412.04315},
}
```