davda54 committed
Commit 0b0f849 · verified · 1 Parent(s): 30a0995

Update README.md

Files changed (1): README.md (+15 −13)
README.md CHANGED
@@ -8,6 +8,8 @@ datasets:
 - LLM360/MegaMath
 - HuggingFaceTB/stack-edu
 - HuggingFaceFW/finepdfs-edu
+- cis-lmu/Glot500
+- ltg/saami-web
 language:
 - nb
 - nn
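The two dataset IDs added to the front matter are plain Hub repositories, so they should be loadable with the `datasets` library. A minimal sketch, assuming default splits; the Glot500 config name for North Saami is an assumption, not something this diff states:

```python
from datasets import load_dataset

# The two corpora newly listed in the card's front matter.
# NOTE: "sme_Latn" is an assumed Glot500 config name for North Saami;
# check the dataset page for the exact config before relying on it.
glot500_sme = load_dataset("cis-lmu/Glot500", "sme_Latn", split="train")
saami_web = load_dataset("ltg/saami-web", split="train")

print(glot500_sme)
print(saami_web)
```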
@@ -27,7 +29,7 @@ tags:
 
 This is a base (not instruction-tuned) large language model, continually pre-trained on Norwegian data starting from the English [OLMo2-13B](https://huggingface.co/allenai/OLMo-2-1124-13B) model.
 
-The model was trained for 33 000 steps on around 300 billion tokens. Intermediate checkpoints are published here as branches.
+The model was trained for 33 000 steps on around 275 billion tokens. Intermediate checkpoints are published here as branches.
 
 ## Data Details
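Since the card says intermediate checkpoints are published as branches, they can be addressed on the Hub with the `revision` argument of `from_pretrained`. A minimal sketch; the repository ID and branch name below are placeholders, since the diff does not show them:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "org/model-name"  # placeholder -- use this model's actual repo ID
revision = "step20000"      # placeholder -- list the repo's branches for real names

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
```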
 
@@ -37,7 +39,7 @@ Data
 - [HPLTv3](https://huggingface.co/datasets/HPLT/HPLT3.0) Bokmål, Nynorsk, Faroese, Icelandic, Danish, Swedish
 - FinePDFs Bokmål, Nynorsk, Faroese, Icelandic, Danish, Swedish
 - OLMo-Mix
-- Northern Sami
+- Northern Sami (cis-lmu/Glot500, ltg/saami-web, SIKOR North Saami corpus)
 
 Data Splits
 | Data | Percentage | Unique Tokens | Total Tokens | Number of Documents | Average Document Length |
@@ -55,19 +57,19 @@ Data Splits
 | FinePDFs Swedish | 2.48 | 18.9B | 5.0B | 4.1M | 4 574 |
 | FinePDFs Danish | 2.45 | 10.1B | 4.9B | 2.4M | 4 190 |
 | Northern Sami | 0.18 | 46.4M | 0.4B | 0.2M | 288 |
-| Wiki (OLMo-Mix) | 0.02 | 0.2B | 40.3M | 36.5M | 667 |
-| Alg. Stack (OLMo-Mix) | 0.04 | 0.6B | 80.5M | 36.5M | 4 201 |
-| Open Web Math (OLMo-Mix) | 0.04 | 0.6B | 80.5M | 36.5M | 4 199 |
-| ArXiv (OLMo-Mix) | 0.05 | 1.0B | 0.1B | 36.5M | 5 210 |
-| PeS2o (OLMo-Mix) | 0.15 | 2.5B | 0.3B | 36.5M | 1 641 |
-| DCLM (OLMo-Mix) | 9.50 | 48.3B | 19.1B | 36.5M | 1 377 |
-| StarCoder (OLMo-Mix) | 2.10 | 30.5B | 4.2B | 36.5M | 1 293 |
+| Wiki (OLMo-Mix) | 0.02 | 0.2B | 40.3M | 0.3M | 667 |
+| Alg. Stack (OLMo-Mix) | 0.04 | 0.6B | 80.5M | 0.1M | 4 201 |
+| Open Web Math (OLMo-Mix) | 0.04 | 0.6B | 80.5M | 0.1M | 4 199 |
+| ArXiv (OLMo-Mix) | 0.05 | 1.0B | 0.1B | 0.2M | 5 210 |
+| PeS2o (OLMo-Mix) | 0.15 | 2.5B | 0.3B | 1.6M | 1 641 |
+| DCLM (OLMo-Mix) | 9.50 | 48.3B | 19.1B | 35.1M | 1 377 |
+| StarCoder (OLMo-Mix) | 2.10 | 30.5B | 4.2B | 23.6M | 1 293 |
 
 > [!NOTE]
 > The number of documents represents the total unique number of documents, not the documents used during training.
 
 > [!NOTE]
-> We only took a portion of DCLM as our unique data.
+> We only took a portion of OLMo-Mix as our unique data.
 
 ### Stage 2 (6 000 steps -- 50B tokens)
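A useful way to read the corrected rows: total training tokens divided by unique tokens gives the effective number of epochs per split, which makes both notes concrete (a ratio below 1 means only a portion of the source was sampled). A small check using three rows from the table above:

```python
# (unique_tokens, total_training_tokens) taken from the Stage 1 table
rows = {
    "Northern Sami":        (46.4e6, 0.4e9),
    "DCLM (OLMo-Mix)":      (48.3e9, 19.1e9),
    "StarCoder (OLMo-Mix)": (30.5e9, 4.2e9),
}
for name, (unique, total) in rows.items():
    # >1 epoch: the split was repeated during training; <1: only a portion sampled
    print(f"{name}: {total / unique:.2f} epochs")
# Northern Sami: ~8.62 (low-resource data heavily repeated)
# DCLM: ~0.40, StarCoder: ~0.14 (portions of OLMo-Mix, as the note says)
```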
 
@@ -75,7 +77,7 @@ Data
 - [HPLTv3](https://huggingface.co/datasets/HPLT/HPLT3.0) (filtered) Bokmål, Nynorsk, Icelandic, Danish, Swedish
 - FinePDFs-Edu Bokmål, Nynorsk, Icelandic, Danish, Swedish, English
 - FinePDFs Faroese
-- Northern Sami
+- Northern Sami (cis-lmu/Glot500, ltg/saami-web, SIKOR North Saami corpus)
 - Stack-Edu
 - MegaMath Web-Pro
 - FineMath 4+
@@ -139,7 +141,7 @@ Same data as for stage 2 but using half the total tokens.
 | Initial learning rate | 3e-4 |
 | Final learning rate | 0 |
 | Weight decay | 1e-1 |
-| Sequence length | 16 364 |
+| Sequence length | 16 384 |
 | Batch size | 512 |
 | RoPE theta | 2 000 000 |
 | Clip grad | 1.0 |
@@ -161,7 +163,7 @@ Same data as for stage 2 but using half the total tokens.
 | Max learning rate | 3e-4 |
 | Final learning rate | 0 |
 | Weight decay | 1e-1 |
-| Sequence length | 16 364 |
+| Sequence length | 16 384 |
 | Batch size | 512 |
 | RoPE theta | 2 000 000 |
 | Clip grad | 1.0 |
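The sequence-length fix (16 364 → 16 384, i.e. 2^14) in both hyperparameter tables also squares with the corrected token count at the top of the diff; a quick arithmetic check:

```python
seq_len = 16_384    # corrected sequence length (2**14)
batch_size = 512    # sequences per optimizer step
steps = 33_000      # total steps across both stages, per the card

tokens_per_step = seq_len * batch_size           # 8,388,608 ~ 8.4M
print(f"{tokens_per_step * steps / 1e9:.0f}B")   # ~277B, i.e. "around 275 billion"
print(f"{tokens_per_step * 6_000 / 1e9:.1f}B")   # Stage 2: ~50.3B, matching its heading
```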
 