Update README.md

README.md CHANGED

@@ -8,6 +8,8 @@ datasets:
 - LLM360/MegaMath
 - HuggingFaceTB/stack-edu
 - HuggingFaceFW/finepdfs-edu
+- cis-lmu/Glot500
+- ltg/saami-web
 language:
 - nb
 - nn
@@ -27,7 +29,7 @@ tags:
 
 This is a base (not instruction-tuned) large language model, continually pre-trained on Norwegian data starting from the English [OLMo2-13B](https://huggingface.co/allenai/OLMo-2-1124-13B) model.
 
-The model was trained for 33 000 steps on around
+The model was trained for 33 000 steps on around 275 billion tokens. Intermediate checkpoints are published here as branches.
 
 ## Data Details
 
@@ -37,7 +39,7 @@ Data
 - [HPLTv3](https://huggingface.co/datasets/HPLT/HPLT3.0) Bokmål, Nynorsk, Faroese, Icelandic, Danish, Swedish
 - FinePDFs Bokmål, Nynorsk, Faroese, Icelandic, Danish, Swedish
 - OLMo-Mix
-- Northern Sami
+- Northern Sami (cis-lmu/Glot500, ltg/saami-web, SIKOR North Saami corpus)
 
 Data Splits
 | Data | Percentage | Unique Tokens | Total Tokens | Number of Documents | Average Document Length |
@@ -55,19 +57,19 @@ Data Splits
 | FinePDFs Swedish | 2.48 | 18.9B | 5.0B | 4.1M | 4 574 |
 | FinePDFs Danish | 2.45 | 10.1B | 4.9B | 2.4M | 4 190 |
 | Northern Sami | 0.18 | 46.4M | 0.4B | 0.2M | 288 |
-| Wiki (OLMo-Mix) | 0.02 | 0.2B | 40.3M |
-| Alg. Stack (OLMo-Mix) | 0.04 | 0.6B | 80.5M |
-| Open Web Math (OLMo-Mix) | 0.04 | 0.6B | 80.5M |
-| ArXiv (OLMo-Mix) | 0.05 | 1.0B | 0.1B |
-| PeS2o (OLMo-Mix) | 0.15 | 2.5B | 0.3B |
-| DCLM (OLMo-Mix) | 9.50 | 48.3B | 19.1B |
-| StarCoder (OLMo-Mix) | 2.10 | 30.5B | 4.2B |
+| Wiki (OLMo-Mix) | 0.02 | 0.2B | 40.3M | 0.3M | 667 |
+| Alg. Stack (OLMo-Mix) | 0.04 | 0.6B | 80.5M | 0.1M | 4 201 |
+| Open Web Math (OLMo-Mix) | 0.04 | 0.6B | 80.5M | 0.1M | 4 199 |
+| ArXiv (OLMo-Mix) | 0.05 | 1.0B | 0.1B | 0.2M | 5 210 |
+| PeS2o (OLMo-Mix) | 0.15 | 2.5B | 0.3B | 1.6M | 1 641 |
+| DCLM (OLMo-Mix) | 9.50 | 48.3B | 19.1B | 35.1M | 1 377 |
+| StarCoder (OLMo-Mix) | 2.10 | 30.5B | 4.2B | 23.6M | 1 293 |
 
 > [!NOTE]
 > The number of documents represents the total unique number of documents, not the documents used during training.
 
 > [!NOTE]
-> We only took a portion of
+> We only took a portion of OLMo-Mix as our unique data.
 
 ### Stage 2 (6 000 steps -- 50B tokens)
 
@@ -75,7 +77,7 @@ Data
 - [HPLTv3](https://huggingface.co/datasets/HPLT/HPLT3.0) (filtered) Bokmål, Nynorsk, Icelandic, Danish, Swedish
 - FinePDFs-Edu Bokmål, Nynorsk, Icelandic, Danish, Swedish, English
 - FinePDFs Faroese
-- Northern Sami
+- Northern Sami (cis-lmu/Glot500, ltg/saami-web, SIKOR North Saami corpus)
 - Stack-Edu
 - MegaMath Web-Pro
 - FineMath 4+
@@ -139,7 +141,7 @@ Same data as for stage 2 but using half the total tokens.
 | Initial learning rate | 3e-4 |
 | Final learning rate | 0 |
 | Weight decay | 1e-1 |
-| Sequence length | 16
+| Sequence length | 16 384 |
 | Batch size | 512 |
 | RoPE theta | 2 000 000 |
 | Clip grad | 1.0 |
@@ -161,7 +163,7 @@ Same data as for stage 2 but using half the total tokens.
 | Max learning rate | 3e-4 |
 | Final learning rate | 0 |
 | Weight decay | 1e-1 |
-| Sequence length | 16
+| Sequence length | 16 384 |
 | Batch size | 512 |
 | RoPE theta | 2 000 000 |
 | Clip grad | 1.0 |