davda54 committed
Commit 0b0f849 · verified · 1 Parent(s): 30a0995

Update README.md

Files changed (1): README.md (+15 −13)
README.md CHANGED
@@ -8,6 +8,8 @@ datasets:
 - LLM360/MegaMath
 - HuggingFaceTB/stack-edu
 - HuggingFaceFW/finepdfs-edu
+- cis-lmu/Glot500
+- ltg/saami-web
 language:
 - nb
 - nn
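The two dataset IDs added to the front matter are plain Hub repositories, so they should be loadable with the `datasets` library. A minimal sketch, assuming default splits; the Glot500 config name for North Saami is an assumption, not something this diff states:

```python
from datasets import load_dataset

# The two corpora newly listed in the card's front matter.
# NOTE: "sme_Latn" is an assumed Glot500 config name for North Saami;
# check the dataset page for the exact config before relying on it.
glot500_sme = load_dataset("cis-lmu/Glot500", "sme_Latn", split="train")
saami_web = load_dataset("ltg/saami-web", split="train")

print(glot500_sme)
print(saami_web)
```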
@@ -27,7 +29,7 @@ tags:
 
 This is a base (not instruction-tuned) large language model, continually pre-trained on Norwegian data starting from the English [OLMo2-13B](https://huggingface.co/allenai/OLMo-2-1124-13B) model.
 
-The model was trained for 33 000 steps on around 300 billion tokens. Intermediate checkpoints are published here as branches.
+The model was trained for 33 000 steps on around 275 billion tokens. Intermediate checkpoints are published here as branches.
 
 ## Data Details
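Since the card says intermediate checkpoints are published as branches, they can be addressed on the Hub with the `revision` argument of `from_pretrained`. A minimal sketch; the repository ID and branch name below are placeholders, since the diff does not show them:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "org/model-name"  # placeholder -- use this model's actual repo ID
revision = "step20000"      # placeholder -- list the repo's branches for real names

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
```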
 
@@ -37,7 +39,7 @@ Data
 - [HPLTv3](https://huggingface.co/datasets/HPLT/HPLT3.0) Bokmål, Nynorsk, Faroese, Icelandic, Danish, Swedish
 - FinePDFs Bokmål, Nynorsk, Faroese, Icelandic, Danish, Swedish
 - OLMo-Mix
-- Northern Sami
+- Northern Sami (cis-lmu/Glot500, ltg/saami-web, SIKOR North Saami corpus)
 
 Data Splits
 | Data | Percentage | Unique Tokens | Total Tokens | Number of Documents | Average Document Length |
@@ -55,19 +57,19 @@ Data Splits
 | FinePDFs Swedish | 2.48 | 18.9B | 5.0B | 4.1M | 4 574 |
 | FinePDFs Danish | 2.45 | 10.1B | 4.9B | 2.4M | 4 190 |
 | Northern Sami | 0.18 | 46.4M | 0.4B | 0.2M | 288 |
-| Wiki (OLMo-Mix) | 0.02 | 0.2B | 40.3M | 36.5M | 667 |
-| Alg. Stack (OLMo-Mix) | 0.04 | 0.6B | 80.5M | 36.5M | 4 201 |
-| Open Web Math (OLMo-Mix) | 0.04 | 0.6B | 80.5M | 36.5M | 4 199 |
-| ArXiv (OLMo-Mix) | 0.05 | 1.0B | 0.1B | 36.5M | 5 210 |
-| PeS2o (OLMo-Mix) | 0.15 | 2.5B | 0.3B | 36.5M | 1 641 |
-| DCLM (OLMo-Mix) | 9.50 | 48.3B | 19.1B | 36.5M | 1 377 |
-| StarCoder (OLMo-Mix) | 2.10 | 30.5B | 4.2B | 36.5M | 1 293 |
+| Wiki (OLMo-Mix) | 0.02 | 0.2B | 40.3M | 0.3M | 667 |
+| Alg. Stack (OLMo-Mix) | 0.04 | 0.6B | 80.5M | 0.1M | 4 201 |
+| Open Web Math (OLMo-Mix) | 0.04 | 0.6B | 80.5M | 0.1M | 4 199 |
+| ArXiv (OLMo-Mix) | 0.05 | 1.0B | 0.1B | 0.2M | 5 210 |
+| PeS2o (OLMo-Mix) | 0.15 | 2.5B | 0.3B | 1.6M | 1 641 |
+| DCLM (OLMo-Mix) | 9.50 | 48.3B | 19.1B | 35.1M | 1 377 |
+| StarCoder (OLMo-Mix) | 2.10 | 30.5B | 4.2B | 23.6M | 1 293 |
 
 > [!NOTE]
 > The number of documents represents the total unique number of documents, not the documents used during training.
 
 > [!NOTE]
-> We only took a portion of DCLM as our unique data.
+> We only took a portion of OLMo-Mix as our unique data.
 
 ### Stage 2 (6 000 steps -- 50B tokens)
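A useful way to read the corrected rows: total training tokens divided by unique tokens gives the effective number of epochs per split, which makes both notes concrete (a ratio below 1 means only a portion of the source was sampled). A small check using three rows from the table above:

```python
# (unique_tokens, total_training_tokens) taken from the Stage 1 table
rows = {
    "Northern Sami":        (46.4e6, 0.4e9),
    "DCLM (OLMo-Mix)":      (48.3e9, 19.1e9),
    "StarCoder (OLMo-Mix)": (30.5e9, 4.2e9),
}
for name, (unique, total) in rows.items():
    # >1 epoch: the split was repeated during training; <1: only a portion sampled
    print(f"{name}: {total / unique:.2f} epochs")
# Northern Sami: ~8.62 (low-resource data heavily repeated)
# DCLM: ~0.40, StarCoder: ~0.14 (portions of OLMo-Mix, as the note says)
```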
 
@@ -75,7 +77,7 @@ Data
 - [HPLTv3](https://huggingface.co/datasets/HPLT/HPLT3.0) (filtered) Bokmål, Nynorsk, Icelandic, Danish, Swedish
 - FinePDFs-Edu Bokmål, Nynorsk, Icelandic, Danish, Swedish, English
 - FinePDFs Faroese
-- Northern Sami
+- Northern Sami (cis-lmu/Glot500, ltg/saami-web, SIKOR North Saami corpus)
 - Stack-Edu
 - MegaMath Web-Pro
 - FineMath 4+
@@ -139,7 +141,7 @@ Same data as for stage 2 but using half the total tokens.
 | Initial learning rate | 3e-4 |
 | Final learning rate | 0 |
 | Weight decay | 1e-1 |
-| Sequence length | 16 364 |
+| Sequence length | 16 384 |
 | Batch size | 512 |
 | RoPE theta | 2 000 000 |
 | Clip grad | 1.0 |
@@ -161,7 +163,7 @@ Same data as for stage 2 but using half the total tokens.
 | Max learning rate | 3e-4 |
 | Final learning rate | 0 |
 | Weight decay | 1e-1 |
-| Sequence length | 16 364 |
+| Sequence length | 16 384 |
 | Batch size | 512 |
 | RoPE theta | 2 000 000 |
 | Clip grad | 1.0 |
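The sequence-length fix (16 364 → 16 384, i.e. 2^14) in both hyperparameter tables also squares with the corrected token count at the top of the diff; a quick arithmetic check:

```python
seq_len = 16_384    # corrected sequence length (2**14)
batch_size = 512    # sequences per optimizer step
steps = 33_000      # total steps across both stages, per the card

tokens_per_step = seq_len * batch_size           # 8,388,608 ~ 8.4M
print(f"{tokens_per_step * steps / 1e9:.0f}B")   # ~277B, i.e. "around 275 billion"
print(f"{tokens_per_step * 6_000 / 1e9:.1f}B")   # Stage 2: ~50.3B, matching its heading
```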
 