---
# License identifier; must be a license ID supported by Hugging Face
license: apache-2.0

# Primary languages supported by the models
language:
- en
- zh

# Deep learning library used
library_name: transformers

# Tags for search and classification
tags:
- text-generation
- scaling-laws
- densing-law
- reference-models

# Datasets used for evaluation in the paper
datasets:
- mmlu
- big-bench-hard
- math
- mbpp
- human-eval

# Evaluation metrics used in the paper
metrics:
- loss
- accuracy

# Pipeline tag for the inference API and widget
pipeline_tag: text-generation

# model-index listing every model variant contained in this repository
model-index:
- name: DensingLaw-ScalingModel-0.005B
  results: []
- name: DensingLaw-ScalingModel-0.03B
  results: []
- name: DensingLaw-ScalingModel-0.1B
  results: []
- name: DensingLaw-ScalingModel-0.2B
  results: []
- name: DensingLaw-ScalingModel-0.4B
  results: []
- name: DensingLaw-ScalingModel-0.8B
  results: []

# BibTeX entry for citing the paper
citation: |
  @misc{xiao2024densinglawllms,
    title={Densing Law of LLMs},
    author={Chaojun Xiao and Jie Cai and Weilin Zhao and Guoyang Zeng and Biyuan Lin and Jie Zhou and Zhi Zheng and Xu Han and Zhiyuan Liu and Maosong Sun},
    year={2024},
    eprint={2412.04315},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2412.04315},
  }
---

# DensingLaw-ScalingModels

This repository contains a series of reference models of varying sizes, released as part of our paper, **`Densing Law of LLMs`**. These models were trained to establish a robust scaling law, which serves as a foundational component for calculating the "density" of other Large Language Models (LLMs).

## 💡 Overview

The core contribution of our paper is the concept of **LLM Density** ($\rho$), defined as the ratio of a model's *effective* parameter size ($\hat{N}$) to its *actual* parameter size ($N$). To accurately determine a model's effective size, we must first establish a reliable "ruler": a scaling law that maps training compute to performance on downstream tasks.
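
Spelled out, the density of a model $\mathcal{M}$ is:

$$
\rho(\mathcal{M}) = \frac{\hat{N}(\mathcal{M})}{N(\mathcal{M})}
$$

A density above 1 means the model performs like a model larger than its actual parameter count.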

The models in this repository serve as that "ruler". We trained a series of six models, ranging from **5 million to 800 million parameters**, on a consistent dataset. By measuring their loss on various benchmarks, we fitted a precise scaling function. This function allows us to take any other LLM, measure its performance, and infer its effective parameter size by seeing where it lands on our reference scale.

The architectures were designed for scaling; the detailed hyper-parameters of each model are listed in Table 1.

#### Table 1: Detailed Hyper-parameters of Models for Loss Estimation

| Name | \# Para | BS | n_layer | d | d_ffn | n_head | n_kv |
| :----- | :------------ | :-- | :------ | :---- | :---- | :----- | :--- |
| 0.005B (S1) | 5,247,232 | 32 | 8 | 256 | 640 | 4 | 1 |
| 0.03B (S2) | 31,470,080 | 32 | 12 | 512 | 1,280 | 8 | 2 |
| 0.1B (S3) | 106,196,736 | 64 | 18 | 768 | 1,920 | 12 | 3 |
| 0.2B (S4) | 245,416,960 | 128 | 24 | 1,024 | 2,560 | 64 | 16 |
| 0.4B (S5) | 476,852,480 | 256 | 30 | 1,280 | 3,200 | 64 | 20 |
| 0.8B (S6) | 828,225,024 | 512 | 36 | 1,536 | 3,840 | 64 | 24 |
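
As a sanity check on how the columns relate to the `# Para` column, the sketch below counts non-embedding parameters under one common set of architectural assumptions: grouped-query attention with `n_kv` key/value heads, a gated FFN with three `d × d_ffn` projections, and RMSNorm weights. These assumptions are ours for illustration, not a statement of the models' exact architecture; they reproduce the three smallest rows exactly, while the larger rows differ somewhat.

```python
def approx_param_count(n_layer, d, d_ffn, n_head, n_kv):
    """Non-embedding parameter count for an assumed decoder-only transformer:
    grouped-query attention (n_kv key/value heads), a gated FFN (three
    d x d_ffn projections), and RMSNorm weights (two per layer plus a final
    one). Embeddings and biases are excluded. Illustrative only."""
    head_dim = d // n_head
    attn = 2 * d * d + 2 * d * (n_kv * head_dim)  # Q and O, plus K and V
    ffn = 3 * d * d_ffn                           # gate, up, down projections
    norms = (2 * n_layer + 1) * d                 # per-layer and final norms
    return n_layer * (attn + ffn) + norms

# Reproduces the first three rows of Table 1 under these assumptions:
print(approx_param_count(n_layer=8,  d=256, d_ffn=640,  n_head=4,  n_kv=1))   # 5247232
print(approx_param_count(n_layer=12, d=512, d_ffn=1280, n_head=8,  n_kv=2))   # 31470080
print(approx_param_count(n_layer=18, d=768, d_ffn=1920, n_head=12, n_kv=3))   # 106196736
```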

### Training Data

As stated in our paper, all reference models were trained on the same **training corpus**, so the fitted scaling law reflects differences in model scale rather than in training data.

## 🎯 Research Context: The Densing Law

Our framework for calculating LLM density involves a two-step estimation process, which is visualized below.

1. **Loss Estimation ($f_1$)**: We first establish the relationship between training compute (approximated as $C \approx 6ND$) and conditional loss ($\mathcal{L}$) on downstream tasks. The models released in this repository are the data points used to fit this curve ($\mathcal{L} = f_1(C)$).
2. **Performance Estimation ($f_2$)**: We then map the relationship between this loss ($\mathcal{L}$) and a more intuitive performance metric ($S$), such as accuracy ($S = f_2(\mathcal{L})$).

By combining these, we can determine the effective compute, and therefore the effective parameter size, for any model based on its performance.
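
As an illustration of how the two estimators compose, the sketch below inverts a toy loss curve to recover effective compute, then applies $C \approx 6ND$ to get $\hat{N}$ and the density $\rho$. The power-law form of `f1` and all constants are placeholders, not the fitted values from the paper; in the full pipeline, $f_2^{-1}$ would first map a benchmark score back to a loss.

```python
def f1(compute, a=2.0, b=0.1):
    """Toy loss curve L = f1(C); a placeholder power law, not the paper's fit."""
    return a * compute ** (-b)

def f1_inverse(loss, a=2.0, b=0.1):
    """Recover the effective training compute C from a measured loss."""
    return (a / loss) ** (1.0 / b)

def effective_param_size(measured_loss, train_tokens):
    """Invert C ~= 6 * N * D to obtain the effective parameter size N-hat."""
    return f1_inverse(measured_loss) / (6.0 * train_tokens)

def density(measured_loss, train_tokens, actual_params):
    """Density rho = N-hat / N for a model with `actual_params` parameters."""
    return effective_param_size(measured_loss, train_tokens) / actual_params
```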

<div align="center">
<img src="assets/fig.jpeg" width="800"/>
</div>