# DensingLaw-ScalingBench

This dataset was created to enable a more accurate performance evaluation of Large Language Models (LLMs). It addresses the limitations of traditional evaluation methods, which often focus solely on the final answer, by providing detailed, GPT-4o-generated reasoning steps (Chain-of-Thought) for each instance in benchmark test sets.

This dataset is released as part of our paper, **`Densing Law of LLMs`**.

<div align="center">
<img src="assets/densinglaw.png" width="600"/>
</div>

<!-- <div align="center">
English | [简体中文]()
</div> -->

<div align="center">

[📜 Paper](https://arxiv.org/pdf/2412.04315)
<!-- | [💻 Github Repo]() -->

</div>

## 💡 Overview

When evaluating Large Language Models (LLMs), especially on complex tasks like mathematical problems or multiple-choice questions, comparing only the final answers is often insufficient. A model might arrive at the correct answer for the wrong reasons, or make a minor calculation error in an otherwise correct reasoning process. The vast majority of traditional benchmark datasets lack annotations for these essential reasoning steps.

To address this gap, we propose a more robust evaluation framework. As stated in our paper:

> It is important to note that most datasets do not provide reasoning steps for each instance. For both two types of tasks, we use GPT-4o to generate reasoning steps for all test instances. These approaches allow us to better estimate the model’s performance by considering the specific requirements and formats of different tasks.

This dataset is the direct result of that work. We leveraged the powerful `GPT-4o` model to generate high-quality reasoning steps for all test instances in `MMLU, BBH, MATH, MBPP, HUMAN-EVAL`, enabling researchers to conduct a deeper and more equitable analysis of a model's logical reasoning capabilities.

## 🎯 Motivation and Creation Process

### Motivation

Our core research objective is to define and calculate the "density" of LLMs, which is the ratio of their effective parameter size to their actual parameter size. This process requires a precise evaluation of model performance on specific downstream tasks. We identified shortcomings in traditional evaluation methods for several key task types:

1. **Multiple-Choice Questions**: Calculating loss based only on the correct option's token ignores the model's understanding and analysis of the problem itself (see the scoring sketch after this list).

2. **Complex Problems (e.g., Mathematics)**: These tasks often require a complete chain of reasoning to arrive at a solution. Evaluating the entire process is more reflective of a model's true capabilities than evaluating the single-token answer.

3. **Code Generation Problems**: For programming tasks, evaluating code solely on functional correctness (i.e., whether it passes unit tests) is insufficient. This overlooks crucial aspects of code quality, such as algorithmic **efficiency** (e.g., `O(n log n)` vs. `O(n^2)`), **readability**, and adherence to best practices. A model might generate a brute-force solution that passes all tests but is highly inefficient. Assessing the model's high-level plan or algorithmic logic provides a more comprehensive measure of its coding intelligence.
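
To make the idea of process-level scoring concrete, the sketch below scores a reasoning-augmented instance with a Hugging Face causal LM by averaging the loss over the GPT-4o-generated reasoning chain and the final answer, conditioned on the question, rather than over the answer token alone. The model name and the `question`/`reasoning`/`answer` fields are illustrative placeholders, not the exact setup or schema used in the paper.

```python
# Illustrative sketch only: score the reasoning chain + final answer under the model,
# instead of scoring just the correct option's token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model being evaluated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def reasoning_loss(question: str, reasoning: str, answer: str) -> float:
    """Average per-token loss over the reasoning steps and final answer, given the question."""
    prompt_ids = tokenizer(question + "\n", return_tensors="pt").input_ids
    target_ids = tokenizer(reasoning + "\nAnswer: " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore the prompt; score only reasoning + answer
    with torch.no_grad():
        out = model(input_ids, labels=labels)
    return out.loss.item()
```

A lower loss means the model assigns higher probability to the full solution path, which is the kind of signal this dataset is intended to expose.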

### 🔧 Generation Process

To construct this dataset, we followed these steps:

1. **Base Datasets**: We selected `MMLU, BBH, MATH, MBPP, HUMAN-EVAL` as our foundation.
2. **Prompt Engineering**: For each test question, we designed appropriate prompts to elicit detailed reasoning.
3. **Reasoning Generation**: We used the **GPT-4o** API to generate coherent, step-by-step reasoning that leads to a final answer (a minimal sketch follows this list).
4. **Integration**: We integrated these generated reasoning steps with the original questions and answers to create the new, augmented data instances.
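
As a rough illustration of step 3, the snippet below shows how step-by-step reasoning could be requested from GPT-4o with the OpenAI Python SDK (v1+). The prompt wording here is a simplified stand-in, not the exact task-specific prompts used to build this dataset.

```python
# Illustrative sketch of the reasoning-generation step; the prompt is a
# simplified stand-in for the task-specific prompts described above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_reasoning(question: str) -> str:
    """Request step-by-step reasoning that ends with a final answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Solve the problem step by step, then state the final answer on the last line."},
            {"role": "user", "content": question},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content
```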

## ✅ Supported Tasks

This dataset is designed to enhance the evaluation of tasks such as:

* Mathematical Reasoning
* Code Reasoning
* Multiple-Choice Question Answering

## ⚠️ Disclaimer

* The reasoning steps included in this dataset were automatically generated by **GPT-4o**. While we have made efforts to ensure their quality, we cannot guarantee that every reasoning process is entirely correct or flawless.
* For any given problem, the solution provided by GPT-4o represents only one of many possible reasoning paths and should not be considered the sole "correct" method.
* We encourage users to treat these reasoning steps as "soft" labels or references for evaluating a model's logical capabilities, rather than as absolute ground truth.

## 📜 License

This dataset is released under the `Apache 2.0` license.

## 📚 Citation

If you use this dataset in your research, please cite our paper:

```bibtex
@misc{xiao2024densinglawllms,
  title={Densing Law of LLMs},
  author={Chaojun Xiao and Jie Cai and Weilin Zhao and Guoyang Zeng and Biyuan Lin and Jie Zhou and Zhi Zheng and Xu Han and Zhiyuan Liu and Maosong Sun},
  year={2024},
  eprint={2412.04315},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2412.04315},
}
```