caijie12138 committed
Commit 9dd08d9 · verified · 1 Parent(s): dcc19a3

Add files using upload-large-folder tool

Files changed (1):
  README.md +31 -46
README.md CHANGED
@@ -1,77 +1,63 @@
- # DensingLaw-ScalingBench

- This dataset was created to enable a more accurate performance evaluation of Large Language Models (LLMs). It addresses the limitations of traditional evaluation methods—which often focus solely on the final answer—by providing detailed, GPT-4o-generated reasoning steps (Chain-of-Thought) for each instance in benchmark test sets.
-
- This dataset is released as part of our paper, **`Densing Law of LLMs`**.

  <div align="center">
- <img src="assets/densinglaw.png" width="600"/>
  </div>

- <!-- <div align="center">
- English | [简体中文]()
- </div> -->
-
  <div align="center">

- [📜 Paper](https://arxiv.org/pdf/2412.04315)
- <!-- | [💻 Github Repo]() -->
-
- </div>

  ## 💡 Overview

- When evaluating Large Language Models (LLMs), especially on complex tasks like mathematical problems or multiple-choice questions, comparing only the final answers is often insufficient. A model might arrive at the correct answer for the wrong reasons, or make a minor calculation error in an otherwise correct reasoning process. The vast majority of traditional benchmark datasets lack annotations for these essential reasoning steps.
-
- To address this gap, we propose a more robust evaluation framework. As stated in our paper:
-
- > It is important to note that most datasets do not provide reasoning steps for each instance. For both types of tasks, we use GPT-4o to generate reasoning steps for all test instances. These approaches allow us to better estimate the model's performance by considering the specific requirements and formats of different tasks.

- This dataset is the direct result of that work. We leveraged the powerful `GPT-4o` model to generate high-quality reasoning steps for all test instances in `MMLU`, `BBH`, `MATH`, `MBPP`, and `HUMAN-EVAL`, enabling researchers to conduct a deeper and more equitable analysis of a model's logical reasoning capabilities.
 
- ## 🎯 Motivation and Creation Process

- ### Motivation

- Our core research objective is to define and calculate the "density" of LLMs, which is the ratio of their effective parameter size to their actual parameter size. This process requires a precise evaluation of model performance on specific downstream tasks. We identified shortcomings in traditional evaluation methods for several key task types:

- 1. **Multiple-Choice Questions**: Calculating loss based only on the correct option's token ignores the model's understanding and analysis of the problem itself.

- 2. **Complex Problems (e.g., Mathematics)**: These tasks often require a complete chain of reasoning to arrive at a solution. Evaluating the entire process is more reflective of a model's true capabilities than evaluating the single-token answer.

- 3. **Code Generation Problems**: For programming tasks, evaluating code solely on functional correctness (i.e., whether it passes unit tests) is insufficient. This overlooks crucial aspects of code quality, such as algorithmic **efficiency** (e.g., `O(n log n)` vs. `O(n^2)`), **readability**, and adherence to best practices. A model might generate a brute-force solution that passes all tests but is highly inefficient. Assessing the model's high-level plan or algorithmic logic provides a more comprehensive measure of its coding intelligence.
 
- ### 🔧 Generation Process

- To construct this dataset, we followed these steps:

- 1. **Base Datasets**: We selected `MMLU`, `BBH`, `MATH`, `MBPP`, and `HUMAN-EVAL` as our foundation.
- 2. **Prompt Engineering**: For each test question, we designed appropriate prompts to elicit detailed reasoning.
- 3. **Reasoning Generation**: We used the **GPT-4o** API to generate coherent, step-by-step reasoning that leads to a final answer.
- 4. **Integration**: We integrated these generated reasoning steps with the original questions and answers to create the new, augmented data instances.
 
- ## Supported Tasks

- This dataset is designed to enhance the evaluation of tasks such as:

- * Mathematical Reasoning
- * Code Reasoning
- * Multiple-Choice Question Answering
-
- ## ⚠️ Disclaimer
-
- * The reasoning steps included in this dataset were automatically generated by **GPT-4o**. While we have made efforts to ensure their quality, we cannot guarantee that every reasoning process is entirely correct or flawless.
- * For any given problem, the solution provided by GPT-4o represents only one of many possible reasoning paths and should not be considered the sole "correct" method.
- * We encourage users to treat these reasoning steps as "soft" labels or references for evaluating a model's logical capabilities, rather than as absolute ground truth.
 
  ## 📜 License

- This dataset is released under the `Apache 2.0` license.

  ## 📚 Citation

- If you use this dataset in your research, please cite our paper:

  ```bibtex
  @misc{xiao2024densinglawllms,
@@ -84,4 +70,3 @@ If you use this dataset in your research, please cite our paper:
  url={https://arxiv.org/abs/2412.04315},
  }
  ```
-
 
+ # DensingLaw-ScalingModels

+ This repository contains a series of reference models of varying sizes, released as part of our paper, **`Densing Law of LLMs`**. These models were trained to establish a robust scaling law, which serves as a foundational component for calculating the "density" of other Large Language Models (LLMs).

  <div align="center">
+ <img src="assets/densinglaw.png" width="600"/>
  </div>

  <div align="center">

+ [📜 Paper](https://arxiv.org/abs/2412.04315) | [🤗 Hugging Face Models](https://huggingface.co/openbmb/DensingLaw-ScalingModels)

+ </div>

  ## 💡 Overview

+ The core contribution of our paper is the concept of **LLM Density** ($\rho$), defined as the ratio of a model's *effective* parameter size ($\hat{N}$) to its *actual* parameter size ($N$). To accurately determine a model's effective size, we must first establish a reliable "ruler": a scaling law that maps training compute to performance on downstream tasks.
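
+ Written out, the ratio defined above reads as follows (the symbol $\mathcal{M}$ for the model under evaluation is our notation):

+ ```latex
+ % LLM density: effective parameter size over actual parameter size
+ \rho(\mathcal{M}) = \frac{\hat{N}(\mathcal{M})}{N(\mathcal{M})}
+ ```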
 
+ The models in this repository serve as that "ruler". We trained a series of six models, ranging from **5 million to 800 million parameters**, on a consistent dataset. By measuring their loss on various benchmarks, we fitted a precise scaling function. This function allows us to take any other LLM, measure its performance, and infer its effective parameter size by seeing where it lands on our reference scale.

+ These models are released to allow researchers to verify our results, build upon our work, and use this established scale for their own model evaluations.
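
+ For convenience, here is a minimal loading sketch. It assumes the checkpoints are standard `transformers`-compatible causal LMs stored in per-model subfolders; the subfolder name below is our guess, not a confirmed path, so check the repository's file listing for the actual layout.

+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ REPO = "openbmb/DensingLaw-ScalingModels"
+ SUBFOLDER = "0.1B-S3"  # hypothetical subfolder name; verify against the repo
+
+ tokenizer = AutoTokenizer.from_pretrained(REPO, subfolder=SUBFOLDER)
+ model = AutoModelForCausalLM.from_pretrained(REPO, subfolder=SUBFOLDER)
+
+ # Sanity check: the count should match Table 1 below.
+ print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
+ ```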
 
+ ## 🔬 The Models

+ We trained six models with architectures designed for scaling. The detailed hyperparameters are listed below.

+ #### Table 1: Detailed Hyper-parameters of Models for Loss Estimation

+ | Name | # Params | Batch Size | n_layer | d_model | d_ffn | n_head | n_kv |
+ | :---------- | :---------- | :--------- | :------ | :------ | :---- | :----- | :--- |
+ | 0.005B (S1) | 5,247,232 | 32 | 8 | 256 | 640 | 4 | 1 |
+ | 0.03B (S2) | 31,470,080 | 32 | 12 | 512 | 1,280 | 8 | 2 |
+ | 0.1B (S3) | 106,196,736 | 64 | 18 | 768 | 1,920 | 12 | 3 |
+ | 0.2B (S4) | 245,416,960 | 128 | 24 | 1,024 | 2,560 | 64 | 16 |
+ | 0.4B (S5) | 476,852,480 | 256 | 30 | 1,280 | 3,200 | 64 | 20 |
+ | 0.8B (S6) | 828,225,024 | 512 | 36 | 1,536 | 3,840 | 64 | 24 |
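
+ As a rough cross-check of the parameter counts, the sketch below counts non-embedding parameters under assumptions that are ours, not the paper's: SwiGLU-gated FFN (three weight matrices), RMSNorm, grouped-query attention with head dimension `d_model // n_head`, and no bias terms. Under these assumptions it reproduces S1–S3 exactly; S4–S6 evidently use a different head layout, so treat it as illustrative only.

+ ```python
+ def count_params(n_layer, d_model, d_ffn, n_head, n_kv):
+     """Approximate non-embedding parameter count for a GQA transformer."""
+     head_dim = d_model // n_head
+     attn = 2 * d_model * d_model            # query and output projections
+     attn += 2 * d_model * n_kv * head_dim   # grouped key/value projections
+     ffn = 3 * d_model * d_ffn               # gate, up, and down projections
+     norms = 2 * d_model                     # two RMSNorms per layer
+     return n_layer * (attn + ffn + norms) + d_model  # plus the final norm
+
+ # Reproduces Table 1 exactly for the three smallest models:
+ print(count_params(8, 256, 640, 4, 1))     # 5,247,232   (S1)
+ print(count_params(12, 512, 1280, 8, 2))   # 31,470,080  (S2)
+ print(count_params(18, 768, 1920, 12, 3))  # 106,196,736 (S3)
+ ```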
 
+ ### Training Data

+ As stated in our paper, all reference models were trained on the **training corpus of MiniCPM-3-4B** (Hu et al., 2024) to ensure consistency.

+ ## 🎯 Research Context: The Densing Law

+ Our framework for calculating LLM density involves a two-step estimation process, which is visualized below.

+ 1. **Loss Estimation ($f_1$)**: We first establish the relationship between training compute (approximated as $C \approx 6ND$) and conditional loss ($\mathcal{L}$) on downstream tasks. The models released in this repository are the data points used to fit this curve ($\mathcal{L} = f_1(C)$).
+ 2. **Performance Estimation ($f_2$)**: We then map the relationship between this loss ($\mathcal{L}$) and a more intuitive performance metric ($S$), such as accuracy ($S = f_2(\mathcal{L})$).
 
+ By combining these, we can determine the effective compute, and therefore the effective parameter size, for any model based on its performance.
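
+ To make the two-step procedure concrete, here is a minimal end-to-end sketch. The functional forms (a saturating power law for $f_1$, a sigmoid for $f_2$) and every number in it are illustrative assumptions on our part, not the paper's exact parameterizations.

+ ```python
+ import numpy as np
+ from scipy.optimize import curve_fit
+
+ def f1(c, a, alpha, l0):
+     """Loss vs. compute (c in units of 1e18 FLOPs): saturating power law."""
+     return a * c ** (-alpha) + l0
+
+ def f2(l, s_max, k, l_mid):
+     """Downstream score vs. loss: sigmoid."""
+     return s_max / (1.0 + np.exp(k * (l - l_mid)))
+
+ # Synthetic stand-ins for the six reference models' measurements.
+ c_ref = np.logspace(-1.0, 1.3, 6)      # ~1e17 .. 2e19 FLOPs, normalized
+ l_ref = f1(c_ref, 1.0, 0.3, 1.8)       # pretend measured losses
+ s_ref = f2(l_ref, 0.9, 4.0, 2.3)       # pretend measured scores
+
+ p1, _ = curve_fit(f1, c_ref, l_ref, p0=[1.0, 0.2, 1.5])
+ p2, _ = curve_fit(f2, l_ref, s_ref, p0=[1.0, 3.0, 2.0])
+
+ def density(score, n_params, n_tokens):
+     """rho = N_hat / N: invert f2, then f1, then use C ~ 6*N*D."""
+     s_max, k, l_mid = p2
+     loss = l_mid + np.log(s_max / score - 1.0) / k      # f2^{-1}(S)
+     a, alpha, l0 = p1
+     c_eff = (a / (loss - l0)) ** (1.0 / alpha) * 1e18   # f1^{-1}(L), FLOPs
+     return (c_eff / (6.0 * n_tokens)) / n_params
+
+ # Example: a 1B-parameter model trained on 1T tokens that scores 0.75.
+ print(f"density ~ {density(0.75, 1.0e9, 1.0e12):.2f}")
+ ```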
 
+ <div align="center">
+ <img src="assets/fig2.png" width="800"/>
+ <p><b>Figure 2:</b> Results for the (a) loss estimation and (b) performance estimation processes. The purple line represents our fitted scaling law, derived from the reference models (colored dots).</p>
+ </div>

  ## 📜 License

+ This work is released under the `Apache 2.0` license.

  ## 📚 Citation

+ If you use our models or the Densing Law concept in your research, please cite our paper:

  ```bibtex
  @misc{xiao2024densinglawllms,
  ...
  url={https://arxiv.org/abs/2412.04315},
  }
  ```