---
license: apache-2.0
---

# DensingLaw-ScalingBench

This dataset was created to enable more accurate performance evaluation of Large Language Models (LLMs). It addresses the limitations of traditional evaluation methods, which often focus solely on the final answer, by providing detailed, GPT-4o-generated reasoning steps (Chain-of-Thought) for each instance in the benchmark test sets.

This dataset is released as part of our paper, **Densing Law of LLMs**.

<div align="center">
<img src="assets/densinglaw.png" width="600"/>
</div>

<!-- <div align="center">
English | [简体中文]()
</div> -->

<div align="center">

[📜 Paper](https://arxiv.org/pdf/2412.04315)
<!-- | [💻 Github Repo]() -->

</div>

## 💡 Overview

When evaluating Large Language Models (LLMs), especially on complex tasks such as mathematical problems or multiple-choice questions, comparing only the final answers is often insufficient. A model might arrive at the correct answer for the wrong reasons, or make a minor calculation error in an otherwise correct reasoning process. The vast majority of traditional benchmark datasets lack annotations for these essential reasoning steps.

To address this gap, we propose a more robust evaluation framework. As stated in our paper:

> It is important to note that most datasets do not provide reasoning steps for each instance. For both two types of tasks, we use GPT-4o to generate reasoning steps for all test instances. These approaches allow us to better estimate the model’s performance by considering the specific requirements and formats of different tasks.

This dataset is the direct result of that work. We leveraged the GPT-4o model to generate high-quality reasoning steps for all test instances in MMLU, BBH, MATH, MBPP, and HumanEval, enabling researchers to conduct a deeper and more equitable analysis of a model's logical reasoning capabilities.

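For orientation, the snippet below shows one way to load and inspect the data with the Hugging Face `datasets` library. It is a minimal sketch: the repository id, split name, and field names (`question`, `reasoning`, `answer`) are illustrative assumptions rather than the documented schema, so check the dataset files or the dataset viewer for the actual layout.

```python
from datasets import load_dataset

# "DensingLaw-ScalingBench" stands in for the full repository id
# (namespace/name) on the Hugging Face Hub; the split name is also an
# assumption -- adjust both to match this repository.
ds = load_dataset("DensingLaw-ScalingBench", split="test")

example = ds[0]
print(example.get("question"))   # original benchmark question (assumed field)
print(example.get("reasoning"))  # GPT-4o generated chain-of-thought (assumed field)
print(example.get("answer"))     # reference final answer (assumed field)
```
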
## 🎯 Motivation and Creation Process

### Motivation

Our core research objective is to define and calculate the "density" of LLMs, which is the ratio of their effective parameter size to their actual parameter size. This requires a precise evaluation of model performance on specific downstream tasks. We identified shortcomings in traditional evaluation methods for several key task types:

1. **Multiple-Choice Questions**: Calculating the loss only on the correct option's token ignores the model's understanding and analysis of the problem itself.

2. **Complex Problems (e.g., Mathematics)**: These tasks often require a complete chain of reasoning to arrive at a solution. Evaluating the entire process reflects a model's true capabilities better than evaluating the single-token answer (see the sketch after this list).

3. **Code Generation Problems**: For programming tasks, evaluating code solely on functional correctness (i.e., whether it passes unit tests) is insufficient. This overlooks crucial aspects of code quality, such as algorithmic **efficiency** (e.g., `O(n log n)` vs. `O(n^2)`), **readability**, and adherence to best practices. A model might generate a brute-force solution that passes all tests but is highly inefficient. Assessing the model's high-level plan or algorithmic logic provides a more comprehensive measure of its coding intelligence.

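To make points 1 and 2 concrete, here is a minimal sketch of scoring a model by its loss on a reference reasoning chain rather than on the answer token alone. The model name, example strings, and masking scheme are illustrative assumptions and not necessarily the exact procedure used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder causal LM; substitute the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def reasoning_loss(question: str, reasoning: str) -> float:
    """Average negative log-likelihood of the reasoning tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + reasoning, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # mask the question; only reasoning tokens are scored
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss
    return loss.item()

print(reasoning_loss("Q: What is 12 * 7?\nA:", " 12 * 7 = 84. The answer is 84."))
```

Masking the question tokens with `-100` means the loss is averaged only over the reasoning and answer tokens, so a model is credited for assigning high probability to the entire solution path rather than to the final token alone.
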
### 🔧 Generation Process

To construct this dataset, we followed these steps:

1. **Base Datasets**: We selected MMLU, BBH, MATH, MBPP, and HumanEval as our foundation.
2. **Prompt Engineering**: For each test question, we designed appropriate prompts to elicit detailed reasoning.
3. **Reasoning Generation**: We used the **GPT-4o** API to generate coherent, step-by-step reasoning that leads to a final answer (a minimal sketch follows this list).
4. **Integration**: We integrated these generated reasoning steps with the original questions and answers to create the new, augmented data instances.

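Below is a minimal sketch of step 3, using the `openai` Python client to request step-by-step reasoning for a single test instance. The prompt wording and the helper function are illustrative assumptions, not the exact prompts used to build this dataset, and an `OPENAI_API_KEY` is assumed to be set in the environment.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_reasoning(question: str) -> str:
    """Ask GPT-4o for step-by-step reasoning that ends with a final answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Solve the problem with clear, step-by-step reasoning, "
                           "then state the final answer on the last line.",
            },
            {"role": "user", "content": question},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content

print(generate_reasoning("If 3x + 5 = 20, what is x?"))
```

The generated text is then paired with the original question and reference answer (step 4) to form an augmented instance.
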
## ✅ Supported Tasks

This dataset is designed to enhance the evaluation of tasks such as:

* Mathematical Reasoning
* Code Reasoning
* Multiple-Choice Question Answering

## ⚠️ Disclaimer

* The reasoning steps included in this dataset were automatically generated by **GPT-4o**. While we have made efforts to ensure their quality, we cannot guarantee that every reasoning process is entirely correct or flawless.
* For any given problem, the solution provided by GPT-4o represents only one of many possible reasoning paths and should not be considered the sole "correct" method.
* We encourage users to treat these reasoning steps as "soft" labels or references for evaluating a model's logical capabilities, rather than as absolute ground truth.

## 📜 License

This dataset is released under the Apache 2.0 license.

## 📚 Citation

If you use this dataset in your research, please cite our paper:

```bibtex
@misc{xiao2024densinglawllms,
      title={Densing Law of LLMs},
      author={Chaojun Xiao and Jie Cai and Weilin Zhao and Guoyang Zeng and Biyuan Lin and Jie Zhou and Zhi Zheng and Xu Han and Zhiyuan Liu and Maosong Sun},
      year={2024},
      eprint={2412.04315},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2412.04315},
}
```