Files changed (1)
README.md +151 -2
README.md CHANGED
@@ -2,9 +2,12 @@
  license: apache-2.0
  datasets:
  - HPLT/HPLT3.0
- - allenai/MADLAD-400
  - allenai/olmo-mix-1124
  - HuggingFaceFW/finepdfs
+ - HuggingFaceTB/finemath
+ - LLM360/MegaMath
+ - HuggingFaceTB/stack-edu
+ - HuggingFaceFW/finepdfs-edu
  language:
  - nb
  - nn
@@ -18,13 +21,159 @@ tags:
  - HPLT
  ---

+ # NorOLMo
+
  ![](https://hplt-project.org/_next/static/media/logo-hplt.8765d2d4.svg)

  This is a base (not instruction-tuned) large language model, continually pre-trained on Norwegian data starting from the English [OLMo2-13B](https://huggingface.co/allenai/OLMo-2-1124-13B) model.

- Our training data mixture included [HPLTv3](https://huggingface.co/datasets/HPLT/HPLT3.0) Bokmål and Nynorsk, FinePDF Bokmål and Nynorsk, MADLAD400 Norwegian, OLMo-Mix, Northern Sami dataset.
  The model was trained for 33 000 steps on around 300 billion tokens. Intermediate checkpoints are published here as branches.
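
+ Since intermediate checkpoints are published as branches, a specific checkpoint can be loaded through the standard `transformers` `revision` argument. A minimal sketch; the repository id and branch name below are placeholders, so check this repository's branch list for the real ones:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # NOTE: repo id and branch name are placeholders (not stated in this card);
+ # see the repository's "Files and versions" tab for the actual branch names.
+ repo_id = "HPLT/norolmo-13b"      # hypothetical repository id
+ checkpoint_branch = "step24000"   # hypothetical intermediate-checkpoint branch
+
+ tokenizer = AutoTokenizer.from_pretrained(repo_id)
+
+ # Final model (default branch):
+ model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")
+
+ # An intermediate checkpoint, selected via its branch:
+ # model = AutoModelForCausalLM.from_pretrained(repo_id, revision=checkpoint_branch)
+
+ prompt = "Noreg er eit land i"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=30)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```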
+ ## Data Details
+
+ ### Stage 1 (24 000 steps -- 200B tokens)
+
+ Data:
+ - [HPLTv3](https://huggingface.co/datasets/HPLT/HPLT3.0) Bokmål, Nynorsk, Faroese, Icelandic, Danish, Swedish
+ - FinePDFs Bokmål, Nynorsk, Faroese, Icelandic, Danish, Swedish
+ - OLMo-Mix
+ - Northern Sami
+
+ Data Splits:
+ | Data | Percentage | Unique Tokens | Total Tokens | Number of Documents | Average Document Length |
+ | ------------------------ | ---------- | ------------- | ------------ | ------------------- | ----------------------- |
+ | HPLT Bokmål | 39.57 | 39.8B | 79.7B | 36.5M | 1 092 |
+ | HPLT Nynorsk | 4.95 | 1.2B | 10.0B | 1.5M | 826 |
+ | HPLT Faroese | 0.46 | 0.2B | 0.9B | 0.3M | 711 |
+ | HPLT Icelandic | 2.50 | 5.0B | 5.0B | 4.3M | 1 173 |
+ | HPLT Swedish | 12.09 | 92.1B | 24.4B | 97.7M | 942 |
+ | HPLT Danish | 12.12 | 50.1B | 24.4B | 52.5M | 954 |
+ | FinePDFs Bokmål | 8.36 | 8.4B | 16.8B | 1.5M | 5 604 |
+ | FinePDFs Nynorsk | 1.15 | 0.3B | 2.3B | 92.8K | 3 117 |
+ | FinePDFs Faroese | 0.17 | 87.1M | 0.3B | 20.8K | 4 196 |
+ | FinePDFs Icelandic | 1.60 | 3.2B | 3.2B | 0.4M | 8 855 |
+ | FinePDFs Swedish | 2.48 | 18.9B | 5.0B | 4.1M | 4 574 |
+ | FinePDFs Danish | 2.45 | 10.1B | 4.9B | 2.4M | 4 190 |
+ | Northern Sami | 0.18 | 46.4M | 0.4B | 0.2M | 288 |
+ | Wiki (OLMo-Mix) | 0.02 | 0.2B | 40.3M | 36.5M | 667 |
+ | Alg. Stack (OLMo-Mix) | 0.04 | 0.6B | 80.5M | 36.5M | 4 201 |
+ | Open Web Math (OLMo-Mix) | 0.04 | 0.6B | 80.5M | 36.5M | 4 199 |
+ | ArXiv (OLMo-Mix) | 0.05 | 1.0B | 0.1B | 36.5M | 5 210 |
+ | PeS2o (OLMo-Mix) | 0.15 | 2.5B | 0.3B | 36.5M | 1 641 |
+ | DCLM (OLMo-Mix) | 9.50 | 48.3B | 19.1B | 36.5M | 1 377 |
+ | StarCoder (OLMo-Mix) | 2.10 | 30.5B | 4.2B | 36.5M | 1 293 |
+
+ > [!NOTE]
+ > The number of documents is the total number of unique documents in each source, not the number of documents seen during training.
+
+ > [!NOTE]
+ > We took only a portion of DCLM as our unique data.
+
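+ The unique-vs-total columns imply how often each source is repeated: total tokens divided by unique tokens is roughly the number of epochs over that source. A small illustrative calculation, with values copied from the table above (not from the training config):
+
+ ```python
+ # Illustrative only: approximate epochs per source implied by the Stage 1
+ # table (total tokens / unique tokens); values copied from the table above.
+ stage1 = {
+     "HPLT Bokmål":     (39.8, 79.7),  # (unique, total) in billions of tokens
+     "HPLT Nynorsk":    (1.2, 10.0),
+     "FinePDFs Bokmål": (8.4, 16.8),
+     "DCLM (OLMo-Mix)": (48.3, 19.1),  # subsampled: under one epoch
+ }
+ for name, (unique, total) in stage1.items():
+     print(f"{name}: ~{total / unique:.1f} epochs")
+ ```
+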
+ ### Stage 2 (6 000 steps -- 50B tokens)
+
+ Data:
+ - [HPLTv3](https://huggingface.co/datasets/HPLT/HPLT3.0) (filtered) Bokmål, Nynorsk, Icelandic, Danish, Swedish
+ - FinePDFs-Edu Bokmål, Nynorsk, Icelandic, Danish, Swedish, English
+ - FinePDFs Faroese
+ - Northern Sami
+ - Stack-Edu
+ - MegaMath Web-Pro
+ - FineMath 4+
+ - InfiWebMath 4+
+
+ Data Splits:
+ | Data | Percentage | Unique Tokens | Total Tokens | Number of Documents | Average Document Length |
+ | ------------------------ | ---------- | ------------- | ------------ | ------------------- | ----------------------- |
+ | HPLT Bokmål | 45.78 | 23.0B | 23.0B | 19.0M | 1 215 |
+ | HPLT Nynorsk | 7.84 | 1.0B | 3.9B | 1.0M | 1 003 |
+ | HPLT Icelandic | 6.87 | 3.5B | 3.5B | 2.7M | 1 268 |
+ | HPLT Swedish | 4.90 | 2.5B | 2.5B | 3.6M | 3 403 |
+ | HPLT Danish | 7.73 | 3.9B | 3.9B | 4.1M | 2 950 |
+ | FinePDFs-Edu Bokmål | 2.24 | 1.1B | 1.1B | 0.2M | 6 897 |
+ | FinePDFs-Edu Nynorsk | 0.28 | 35.8M | 0.1B | 9.7K | 3 681 |
+ | FinePDFs Faroese | 0.69 | 87.1M | 0.3B | 20.8K | 4 196 |
+ | FinePDFs-Edu Icelandic | 0.53 | 0.3B | 0.3B | 40.1K | 6 598 |
+ | FinePDFs-Edu Swedish | 5.80 | 2.9B | 2.9B | 0.4M | 6 755 |
+ | FinePDFs-Edu Danish | 2.97 | 1.5B | 1.5B | 0.3M | 5 833 |
+ | FinePDFs-Edu English | 7.00 | 7.2B | 3.5B | 1.1M | 6 280 |
+ | Northern Sami | 0.37 | 46.4M | 0.2B | 0.2M | 288 |
+ | Stack-Edu | 5.00 | 12.8B | 2.5B | 15.0M | 856 |
+ | MegaMath Web-Pro | 0.84 | 13.7B | 0.4B | 15.0M | 917 |
+ | FineMath 4+ | 0.62 | 10.1B | 0.3B | 6.7M | 1 512 |
+ | InfiWebMath 4+ | 0.54 | 8.9B | 0.3B | 6.3M | 1 417 |
+
+ ### Stage 2-continued (3 000 steps -- 25B tokens)
+
+ The same data mixture as Stage 2, but with half the total tokens.
+
+ ## Training Details
+
+ ### Stage 1
+
+ | Hyperparameter | Value |
+ | ------------------------ | ----------------- |
+ | Embedding train steps | 1 000 |
+ | Warmup steps | 2 000 |
+ | Total train steps | 24 000 |
+ | Learning rate schedule | Warmup + constant |
+ | Learning rate | 3e-4 |
+ | Weight decay | 1e-1 |
+ | Sequence length | 4 096 |
+ | Batch size | 2 048 |
+ | RoPE theta | 500 000 |
+ | Clip grad | 1.0 |
+ | Adam epsilon | 1e-8 |
+ | Adam beta_1 | 0.9 |
+ | Adam beta_2 | 0.95 |
+ | RMSNorm epsilon | 1e-6 |
+ | Z-loss ratio | 1e-5 |
+ | Diffusion loss ratio | 2e-2 |
+
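+ As a sanity check on the stated token budget: each step processes batch size × sequence length tokens, so Stage 1 comes out at 2 048 × 4 096 × 24 000 ≈ 201B tokens, matching the "200B tokens" quoted above.
+
+ ```python
+ # Stage 1 token budget implied by the hyperparameters above.
+ tokens_per_step = 2_048 * 4_096           # batch size * sequence length
+ total_tokens = tokens_per_step * 24_000   # total train steps
+ print(f"~{total_tokens / 1e9:.0f}B tokens")  # ~201B
+ ```
+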
+ ### Stage 2
+
+ | Hyperparameter | Value |
+ | ------------------------ | ----------------- |
+ | Decay steps | 6 000 |
+ | Total train steps | 6 000 |
+ | Learning rate schedule | Linear decay |
+ | Initial learning rate | 3e-4 |
+ | Final learning rate | 0 |
+ | Weight decay | 1e-1 |
+ | Sequence length | 16 384 |
+ | Batch size | 512 |
+ | RoPE theta | 2 000 000 |
+ | Clip grad | 1.0 |
+ | Adam epsilon | 1e-8 |
+ | Adam beta_1 | 0.9 |
+ | Adam beta_2 | 0.95 |
+ | RMSNorm epsilon | 1e-6 |
+ | Z-loss ratio | 1e-5 |
+ | Diffusion loss ratio | 2e-2 |
+
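+ For reference, the z-loss row refers to the PaLM-style auxiliary loss that penalizes the log of the softmax normalizer to keep output logits well-scaled. A minimal sketch (not the training code), assuming the "Z-loss ratio" is the coefficient on the squared log-normalizer:
+
+ ```python
+ import torch
+
+ # PaLM-style z-loss sketch: penalize log(Z), the log of the softmax
+ # normalizer, so output logits stay well-scaled. The "Z-loss ratio"
+ # above is assumed to be the coefficient (1e-5).
+ def z_loss(logits: torch.Tensor, coef: float = 1e-5) -> torch.Tensor:
+     log_z = torch.logsumexp(logits, dim=-1)  # per-token log normalizer
+     return coef * (log_z ** 2).mean()
+ ```
+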
+ ### Stage 2-continued
+
+ | Hyperparameter | Value |
+ | ------------------------ | --------------------- |
+ | Warmup steps | 100 |
+ | Decay steps | 2 900 |
+ | Total train steps | 3 000 |
+ | Learning rate schedule | Warmup + linear decay |
+ | Max learning rate | 3e-4 |
+ | Final learning rate | 0 |
+ | Weight decay | 1e-1 |
+ | Sequence length | 16 384 |
+ | Batch size | 512 |
+ | RoPE theta | 2 000 000 |
+ | Clip grad | 1.0 |
+ | Adam epsilon | 1e-8 |
+ | Adam beta_1 | 0.9 |
+ | Adam beta_2 | 0.95 |
+ | RMSNorm epsilon | 1e-6 |
+ | Z-loss ratio | 1e-5 |
+ | Diffusion loss ratio | 2e-2 |
+
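+ Read together, the three tables describe a single piecewise learning-rate schedule over the full 33 000 steps. A sketch of that schedule as implied by the tables (not the actual training code):
+
+ ```python
+ # Piecewise learning-rate schedule implied by the three stage tables above.
+ MAX_LR = 3e-4
+
+ def lr(step: int) -> float:
+     if step < 2_000:       # Stage 1: linear warmup over 2 000 steps
+         return MAX_LR * step / 2_000
+     if step < 24_000:      # Stage 1: constant at 3e-4
+         return MAX_LR
+     if step < 30_000:      # Stage 2: linear decay 3e-4 -> 0 over 6 000 steps
+         return MAX_LR * (30_000 - step) / 6_000
+     if step < 30_100:      # Stage 2-continued: 100-step re-warmup
+         return MAX_LR * (step - 30_000) / 100
+     return MAX_LR * (33_000 - step) / 2_900  # final linear decay to 0
+ ```
+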
+ ## Acknowledgements
+
  Training was conducted as a part of the [HPLT project](https://hplt-project.org/).

  _This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]_