caijie12138 committed on
Commit f222b28 · verified · 1 Parent(s): dcc9128

Add files using upload-large-folder tool

Files changed (1): README.md +7 -16
README.md CHANGED
@@ -60,7 +60,6 @@ citation: |
   url={https://arxiv.org/abs/2412.04315},
 }
 ---
-
 # DensingLaw-ScalingModels
 
 This repository contains a series of reference models of varying sizes, released as part of our paper, **`Densing Law of LLMs`**. These models were trained to establish a robust scaling law, which serves as a foundational component for calculating the "density" of other Large Language Models (LLMs).
@@ -75,7 +74,7 @@ This repository contains a series of reference models of varying sizes, released
 
 ## 💡 Overview
 
-The core contribution of our paper is the concept of **LLM Density** ($\rho$), defined as the ratio of a model's *effective* parameter size ($/ghat{N}$) to its *actual* parameter size ($N$). To accurately determine a model's effective size, we must first establish a reliable "ruler"—a scaling law that maps training compute to performance on downstream tasks.
+The core contribution of our paper is the concept of **LLM Density** \\(\rho\\), defined as the ratio of a model's *effective* parameter size \\(\hat{N}\\) to its *actual* parameter size \\(N\\). To accurately determine a model's effective size, we must first establish a reliable "ruler"—a scaling law that maps training compute to performance on downstream tasks.
 
 The models in this repository serve as that "ruler". We trained a series of six models, ranging from **5 million to 800 million parameters**, on a consistent dataset. By measuring their loss on various benchmarks, we fitted a precise scaling function. This function allows us to take any other LLM, measure its performance, and infer its effective parameter size by seeing where it lands on our reference scale.
 
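To make the density definition concrete: a minimal sketch of the computation, with made-up sizes (in the paper, the effective size \\(\hat{N}\\) is inferred from the fitted scaling law, not assumed).

```python
def density(effective_params: float, actual_params: float) -> float:
    """LLM density: rho = N_hat / N."""
    return effective_params / actual_params

# Hypothetical example: a 3B-parameter model whose benchmark scores
# match the 4.5B point on the reference scaling curve.
n_hat = 4.5e9  # effective parameter size (assumed for illustration)
n = 3.0e9      # actual parameter size
print(f"rho = {density(n_hat, n):.2f}")  # rho = 1.50
```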
@@ -87,14 +86,14 @@ We trained six models with architectures designed for scaling. The detailed hype
 
 #### Table 1: Detailed Hyper-parameters of Models for Loss Estimation
 
-| Name | \# Para | BS | n_layer | d | d_ffn | n_head | n_kv |
+| Name | # Para | BS | n_layer | d | d_ffn | n_head | n_kv |
 | :----- | :------------ | :-- | :------ | :---- | :---- | :----- | :--- |
 | 0.005B (S1) | 5,247,232 | 32 | 8 | 256 | 640 | 4 | 1 |
 | 0.03B (S2) | 31,470,080 | 32 | 12 | 512 | 1,280 | 8 | 2 |
 | 0.1B (S3) | 106,196,736 | 64 | 18 | 768 | 1,920 | 12 | 3 |
-| 0.2B (S4) | 245,416,960 | 128 | 24 | 1,024 | 2,560 | 64 | 16 |
-| 0.4B (S5) | 476,852,480 | 256 | 30 | 1,280 | 3,200 | 64 | 20 |
-| 0.8B (S6) | 828,225,024 | 512 | 36 | 1,536 | 3,840 | 64 | 24 |
+| 0.2B (S4) | 245,416,960 | 128 | 24 | 1,024 | 2,560 | 16 | 2 |
+| 0.4B (S5) | 476,852,480 | 256 | 30 | 1,280 | 3,200 | 20 | 2 |
+| 0.8B (S6) | 828,225,024 | 512 | 36 | 1,536 | 3,840 | 24 | 3 |
 
 ### Training Data
 
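The corrected rows can be sanity-checked arithmetically. The sketch below assumes a Llama-style decoder (grouped-query attention, gated FFN, RMSNorm weights) and counts non-embedding parameters only; under those assumptions it reproduces every value in the `# Para` column, while the pre-fix `n_head`/`n_kv` values do not, which is presumably what this commit corrects.

```python
ROWS = [  # (name, n_layer, d, d_ffn, n_head, n_kv) from Table 1
    ("S1", 8, 256, 640, 4, 1),
    ("S2", 12, 512, 1280, 8, 2),
    ("S3", 18, 768, 1920, 12, 3),
    ("S4", 24, 1024, 2560, 16, 2),
    ("S5", 30, 1280, 3200, 20, 2),
    ("S6", 36, 1536, 3840, 24, 3),
]

def non_embedding_params(n_layer, d, d_ffn, n_head, n_kv):
    head_dim = d // n_head
    attn = 2 * d * d + 2 * d * head_dim * n_kv  # Q/O plus grouped K/V
    ffn = 3 * d * d_ffn                         # gate, up, down matrices
    norms = (2 * n_layer + 1) * d               # two per layer + final norm
    return n_layer * (attn + ffn) + norms

for name, *hp in ROWS:
    print(name, non_embedding_params(*hp))
# S1 5247232, S2 31470080, S3 106196736,
# S4 245416960, S5 476852480, S6 828225024 -- matching "# Para"
```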
@@ -103,18 +102,10 @@ As stated in our paper, all reference models were trained on the **training corp
 ## 🎯 Research Context: The Densing Law
 Our framework for calculating LLM density involves a two-step estimation process, which is visualized below.
 
-1. **Loss Estimation (\\( f_1 \\))**: We first establish the relationship between training compute (approximated as \\(C \approx 6ND\\) and conditional loss (\\(\mathcal L\\)) on downstream tasks. The models released in this repository are the data points used to fit this curve (\\(\mathcal{L} = f_1(C)\\)).
-2. **Performance Estimation (\\(f_2\\))**: We then map the relationship between this loss (\\(\mathcal{L}\\)) and a more intuitive performance metric (\\(S\\)), such as accuracy (\\(S = f_2(\mathcal{L})\\)).
+1. **Loss Estimation \\(f_1\\)**: We first establish the relationship between training compute (approximated as \\(C \approx 6ND\\)) and conditional loss \\(\mathcal{L}\\) on downstream tasks. The models released in this repository are the data points used to fit this curve \\(\mathcal{L} = f_1(C)\\).
+2. **Performance Estimation \\(f_2\\)**: We then map the relationship between this loss \\(\mathcal{L}\\) and a more intuitive performance metric \\(S\\), such as accuracy \\(S = f_2(\mathcal{L})\\).
 
 By combining these, we can determine the effective compute, and therefore the effective parameter size, for any model based on its performance.
-
-Our framework for calculating LLM density involves a two-step estimation process, which is visualized below.
-
-1. **Loss Estimation ($f_1$)**: We first establish the relationship between training compute (approximated as $C \approx 6ND$) and conditional loss ($\mathcal{L}$) on downstream tasks. The models released in this repository are the data points used to fit this curve ($\mathcal{L} = f_1(C)$).
-2. **Performance Estimation ($f_2$)**: We then map the relationship between this loss ($\mathcal{L}$) and a more intuitive performance metric ($S$), such as accuracy ($S = f_2(\mathcal{L})$).
-
-By combining these, we can determine the effective compute, and therefore the effective parameter size, for any model based on its performance.
-
 <div align="center">
 <img src="assets/fig.jpeg" width="800"/>
 </div>
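A minimal numeric sketch of that inversion, assuming a simple power-law form for \\(f_1\\). The functional form and all constants below are illustrative assumptions, not the paper's fitted values.

```python
# Assumed power-law form for f1: L = a * C**(-b). The coefficients are
# hypothetical; in the paper they are fitted to the six reference models.
a, b = 860.0, 0.12

def f1(C: float) -> float:
    """Loss as a function of training compute C."""
    return a * C ** (-b)

def f1_inverse(L: float) -> float:
    """Effective compute implied by a measured loss."""
    return (a / L) ** (1.0 / b)

# Evaluate an external model: measure its conditional loss, invert f1
# to get effective compute, then use C ~ 6*N*D to recover the effective
# parameter size at an assumed reference data scale D.
measured_loss = 2.1
D = 1.0e12                       # reference training tokens (assumed)
C_eff = f1_inverse(measured_loss)
N_eff = C_eff / (6 * D)
print(f"effective compute: {C_eff:.3e} FLOPs")
print(f"effective params:  {N_eff:.3e}")
```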
 