caijie12138 committed
Commit 9dd08d9 · verified · 1 Parent(s): dcc19a3

Add files using upload-large-folder tool

Files changed (1):
  README.md +31 -46
README.md CHANGED
@@ -1,77 +1,63 @@
- # DensingLaw-ScalingBench

- This dataset was created to enable a more accurate performance evaluation of Large Language Models (LLMs). It addresses the limitations of traditional evaluation methods—which often focus solely on the final answer—by providing detailed, GPT-4o-generated reasoning steps (Chain-of-Thought) for each instance in benchmark test sets.
-
- This dataset is released as part of our paper, **`Densing Law of LLMs`**.

  <div align="center">
- <img src="assets/densinglaw.png" width="600"/>
  </div>

- <!-- <div align="center">
- English | [简体中文]()
- </div> -->
-
  <div align="center">

- [📜 Paper](https://arxiv.org/pdf/2412.04315)
- <!-- | [💻 Github Repo]() -->
-
- </div>

  ## 💡 Overview

- When evaluating Large Language Models (LLMs), especially on complex tasks like mathematical problems or multiple-choice questions, comparing only the final answers is often insufficient. A model might arrive at the correct answer for the wrong reasons, or make a minor calculation error in an otherwise correct reasoning process. The vast majority of traditional benchmark datasets lack annotations for these essential reasoning steps.
-
- To address this gap, we propose a more robust evaluation framework. As stated in our paper:
-
- > It is important to note that most datasets do not provide reasoning steps for each instance. For both types of tasks, we use GPT-4o to generate reasoning steps for all test instances. These approaches allow us to better estimate the model's performance by considering the specific requirements and formats of different tasks.

- This dataset is the direct result of that work. We leveraged the powerful `GPT-4o` model to generate high-quality reasoning steps for all test instances in `MMLU`, `BBH`, `MATH`, `MBPP`, and `HUMAN-EVAL`, enabling researchers to conduct a deeper and more equitable analysis of a model's logical reasoning capabilities.
 
- ## 🎯 Motivation and Creation Process

- ### Motivation

- Our core research objective is to define and calculate the "density" of LLMs, which is the ratio of their effective parameter size to their actual parameter size. This process requires a precise evaluation of model performance on specific downstream tasks. We identified shortcomings in traditional evaluation methods for several key task types:

- 1. **Multiple-Choice Questions**: Calculating loss based only on the correct option's token ignores the model's understanding and analysis of the problem itself.

- 2. **Complex Problems (e.g., Mathematics)**: These tasks often require a complete chain of reasoning to arrive at a solution. Evaluating the entire process is more reflective of a model's true capabilities than evaluating the single-token answer.

- 3. **Code Generation Problems**: For programming tasks, evaluating code solely on functional correctness (i.e., whether it passes unit tests) is insufficient. This overlooks crucial aspects of code quality, such as algorithmic **efficiency** (e.g., `O(n log n)` vs. `O(n^2)`), **readability**, and adherence to best practices. A model might generate a brute-force solution that passes all tests but is highly inefficient. Assessing the model's high-level plan or algorithmic logic provides a more comprehensive measure of its coding intelligence.
 
- ### 🔧 Generation Process

- To construct this dataset, we followed these steps:

- 1. **Base Datasets**: We selected `MMLU`, `BBH`, `MATH`, `MBPP`, and `HUMAN-EVAL` as our foundation.
- 2. **Prompt Engineering**: For each test question, we designed appropriate prompts to elicit detailed reasoning.
- 3. **Reasoning Generation**: We used the **GPT-4o** API to generate coherent, step-by-step reasoning that leads to a final answer.
- 4. **Integration**: We integrated these generated reasoning steps with the original questions and answers to create the new, augmented data instances.
 
- ## Supported Tasks

- This dataset is designed to enhance the evaluation of tasks such as:

- * Mathematical Reasoning
- * Code Reasoning
- * Multiple-Choice Question Answering
-
- ## ⚠️ Disclaimer
-
- * The reasoning steps included in this dataset were automatically generated by **GPT-4o**. While we have made efforts to ensure their quality, we cannot guarantee that every reasoning process is entirely correct or flawless.
- * For any given problem, the solution provided by GPT-4o represents only one of many possible reasoning paths and should not be considered the sole "correct" method.
- * We encourage users to treat these reasoning steps as "soft" labels or references for evaluating a model's logical capabilities, rather than as absolute ground truth.
 
  ## 📜 License

- This dataset is released under the `Apache 2.0` license.

  ## 📚 Citation

- If you use this dataset in your research, please cite our paper:

  ```bibtex
  @misc{xiao2024densinglawllms,
@@ -84,4 +70,3 @@ If you use this dataset in your research, please cite our paper:
  url={https://arxiv.org/abs/2412.04315},
  }
  ```
-
 
+ # DensingLaw-ScalingModels

+ This repository contains a series of reference models of varying sizes, released as part of our paper, **`Densing Law of LLMs`**. These models were trained to establish a robust scaling law, which serves as a foundational component for calculating the "density" of other Large Language Models (LLMs).

  <div align="center">
+ <img src="assets/densinglaw.png" width="600"/>
  </div>

  <div align="center">

+ [📜 Paper](https://arxiv.org/abs/2412.04315) | [🤗 Hugging Face Models](https://huggingface.co/openbmb/DensingLaw-ScalingModels)

+ </div>

  ## 💡 Overview

+ The core contribution of our paper is the concept of **LLM Density** ($\rho$), defined as the ratio of a model's *effective* parameter size ($\hat{N}$) to its *actual* parameter size ($N$). To accurately determine a model's effective size, we must first establish a reliable "ruler": a scaling law that maps training compute to performance on downstream tasks.
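
+ Written out, the ratio defined above reads as follows (the symbol $\mathcal{M}$ for the model under evaluation is our notation):

+ ```latex
+ % LLM density: effective parameter size over actual parameter size
+ \rho(\mathcal{M}) = \frac{\hat{N}(\mathcal{M})}{N(\mathcal{M})}
+ ```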
 
+ The models in this repository serve as that "ruler". We trained a series of six models, ranging from **5 million to 800 million parameters**, on a consistent dataset. By measuring their loss on various benchmarks, we fitted a precise scaling function. This function allows us to take any other LLM, measure its performance, and infer its effective parameter size by seeing where it lands on our reference scale.

+ These models are released to allow researchers to verify our results, build upon our work, and use this established scale for their own model evaluations.
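
+ For convenience, here is a minimal loading sketch. It assumes the checkpoints are standard `transformers`-compatible causal LMs stored in per-model subfolders; the subfolder name below is our guess, not a confirmed path, so check the repository's file listing for the actual layout.

+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ REPO = "openbmb/DensingLaw-ScalingModels"
+ SUBFOLDER = "0.1B-S3"  # hypothetical subfolder name; verify against the repo
+
+ tokenizer = AutoTokenizer.from_pretrained(REPO, subfolder=SUBFOLDER)
+ model = AutoModelForCausalLM.from_pretrained(REPO, subfolder=SUBFOLDER)
+
+ # Sanity check: the count should match Table 1 below.
+ print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
+ ```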
 
+ ## 🔬 The Models

+ We trained six models with architectures designed for scaling. The detailed hyperparameters are listed below.

+ #### Table 1: Detailed Hyper-parameters of Models for Loss Estimation

+ | Name | # Params | Batch Size | n_layer | d_model | d_ffn | n_head | n_kv |
+ | :---------- | :---------- | :--------- | :------ | :------ | :---- | :----- | :--- |
+ | 0.005B (S1) | 5,247,232 | 32 | 8 | 256 | 640 | 4 | 1 |
+ | 0.03B (S2) | 31,470,080 | 32 | 12 | 512 | 1,280 | 8 | 2 |
+ | 0.1B (S3) | 106,196,736 | 64 | 18 | 768 | 1,920 | 12 | 3 |
+ | 0.2B (S4) | 245,416,960 | 128 | 24 | 1,024 | 2,560 | 64 | 16 |
+ | 0.4B (S5) | 476,852,480 | 256 | 30 | 1,280 | 3,200 | 64 | 20 |
+ | 0.8B (S6) | 828,225,024 | 512 | 36 | 1,536 | 3,840 | 64 | 24 |
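
+ As a rough cross-check of the parameter counts, the sketch below counts non-embedding parameters under assumptions that are ours, not the paper's: SwiGLU-gated FFN (three weight matrices), RMSNorm, grouped-query attention with head dimension `d_model // n_head`, and no bias terms. Under these assumptions it reproduces S1–S3 exactly; S4–S6 evidently use a different head layout, so treat it as illustrative only.

+ ```python
+ def count_params(n_layer, d_model, d_ffn, n_head, n_kv):
+     """Approximate non-embedding parameter count for a GQA transformer."""
+     head_dim = d_model // n_head
+     attn = 2 * d_model * d_model            # query and output projections
+     attn += 2 * d_model * n_kv * head_dim   # grouped key/value projections
+     ffn = 3 * d_model * d_ffn               # gate, up, and down projections
+     norms = 2 * d_model                     # two RMSNorms per layer
+     return n_layer * (attn + ffn + norms) + d_model  # plus the final norm
+
+ # Reproduces Table 1 exactly for the three smallest models:
+ print(count_params(8, 256, 640, 4, 1))     # 5,247,232   (S1)
+ print(count_params(12, 512, 1280, 8, 2))   # 31,470,080  (S2)
+ print(count_params(18, 768, 1920, 12, 3))  # 106,196,736 (S3)
+ ```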
 
+ ### Training Data

+ As stated in our paper, all reference models were trained on the **training corpus of MiniCPM-3-4B** (Hu et al., 2024) to ensure consistency.

+ ## 🎯 Research Context: The Densing Law

+ Our framework for calculating LLM density involves a two-step estimation process, which is visualized below.

+ 1. **Loss Estimation ($f_1$)**: We first establish the relationship between training compute (approximated as $C \approx 6ND$) and conditional loss ($\mathcal{L}$) on downstream tasks. The models released in this repository are the data points used to fit this curve ($\mathcal{L} = f_1(C)$).
+ 2. **Performance Estimation ($f_2$)**: We then map the relationship between this loss ($\mathcal{L}$) and a more intuitive performance metric ($S$), such as accuracy ($S = f_2(\mathcal{L})$).
 
+ By combining these, we can determine the effective compute, and therefore the effective parameter size, for any model based on its performance.
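
+ To make the two-step procedure concrete, here is a minimal end-to-end sketch. The functional forms (a saturating power law for $f_1$, a sigmoid for $f_2$) and every number in it are illustrative assumptions on our part, not the paper's exact parameterizations.

+ ```python
+ import numpy as np
+ from scipy.optimize import curve_fit
+
+ def f1(c, a, alpha, l0):
+     """Loss vs. compute (c in units of 1e18 FLOPs): saturating power law."""
+     return a * c ** (-alpha) + l0
+
+ def f2(l, s_max, k, l_mid):
+     """Downstream score vs. loss: sigmoid."""
+     return s_max / (1.0 + np.exp(k * (l - l_mid)))
+
+ # Synthetic stand-ins for the six reference models' measurements.
+ c_ref = np.logspace(-1.0, 1.3, 6)      # ~1e17 .. 2e19 FLOPs, normalized
+ l_ref = f1(c_ref, 1.0, 0.3, 1.8)       # pretend measured losses
+ s_ref = f2(l_ref, 0.9, 4.0, 2.3)       # pretend measured scores
+
+ p1, _ = curve_fit(f1, c_ref, l_ref, p0=[1.0, 0.2, 1.5])
+ p2, _ = curve_fit(f2, l_ref, s_ref, p0=[1.0, 3.0, 2.0])
+
+ def density(score, n_params, n_tokens):
+     """rho = N_hat / N: invert f2, then f1, then use C ~ 6*N*D."""
+     s_max, k, l_mid = p2
+     loss = l_mid + np.log(s_max / score - 1.0) / k      # f2^{-1}(S)
+     a, alpha, l0 = p1
+     c_eff = (a / (loss - l0)) ** (1.0 / alpha) * 1e18   # f1^{-1}(L), FLOPs
+     return (c_eff / (6.0 * n_tokens)) / n_params
+
+ # Example: a 1B-parameter model trained on 1T tokens that scores 0.75.
+ print(f"density ~ {density(0.75, 1.0e9, 1.0e12):.2f}")
+ ```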
 
+ <div align="center">
+ <img src="assets/fig2.png" width="800"/>
+ <p><b>Figure 2:</b> Results for the (a) loss estimation and (b) performance estimation processes. The purple line represents our fitted scaling law, derived from the reference models (colored dots).</p>
+ </div>

  ## 📜 License

+ This work is released under the `Apache 2.0` license.

  ## 📚 Citation

+ If you use our models or the Densing Law concept in your research, please cite our paper:

  ```bibtex
  @misc{xiao2024densinglawllms,
  ...
  url={https://arxiv.org/abs/2412.04315},
  }
  ```