Files changed (1)
  1. README.md +109 -7
README.md CHANGED
@@ -3,26 +3,128 @@ license: apache-2.0
  datasets:
  - HPLT/HPLT3.0
  - allenai/olmo-mix-1124
- - allenai/MADLAD-400
  - HuggingFaceFW/finepdfs
  language:
- - fi
  base_model:
  - allenai/OLMo-2-1124-13B
  library_name: transformers
  tags:
- - finnish
- - suomi
  - HPLT
  ---

  ![](https://hplt-project.org/_next/static/media/logo-hplt.8765d2d4.svg)

  This is a base (not instruction-tuned) large language model, continually pre-trained on Finnish data starting from the English [OLMo2-13B](https://huggingface.co/allenai/OLMo-2-1124-13B) model.

- Our training data mixture included [HPLTv3](https://huggingface.co/datasets/HPLT/HPLT3.0) Finnish, FinePDF Finnish, MADLAD400 Finnish, OLMo-Mix.
- The model was trained for 16 000 steps on around 150 billion tokens. Intermediate checkpoints are published here as branches.

  Training was conducted as a part of the [HPLT project](https://hplt-project.org/).

- _This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]_
 
  datasets:
  - HPLT/HPLT3.0
  - allenai/olmo-mix-1124
  - HuggingFaceFW/finepdfs
+ - HuggingFaceTB/finemath
+ - LLM360/MegaMath
+ - HuggingFaceTB/stack-edu
+ - HuggingFaceFW/finepdfs-edu
  language:
+ - nb
+ - nn
+ - 'no'
  base_model:
  - allenai/OLMo-2-1124-13B
  library_name: transformers
  tags:
+ - norwegian
+ - norsk
  - HPLT
  ---

+ # NorOLMo
+
  ![](https://hplt-project.org/_next/static/media/logo-hplt.8765d2d4.svg)

+ This is a base (not instruction-tuned) large language model, continually pre-trained on Norwegian data starting from the English [OLMo2-13B](https://huggingface.co/allenai/OLMo-2-1124-13B) model.

+ The model was trained for 20 000 steps on around 170 billion tokens. Intermediate checkpoints are published here as branches.
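+
+ Since the intermediate checkpoints are published as branches of this repository, a specific training step can be loaded by passing the branch name as `revision`. A minimal loading sketch with `transformers`; the repository id and branch name below are placeholders, not confirmed names:
+
+ ```python
+ # Minimal loading sketch: swap in the actual repository id and branch name
+ # from this repository's "Files and versions" tab.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ repo_id = "HPLT/NorOLMo-13B"   # placeholder repository id
+ revision = "main"              # or an intermediate-checkpoint branch, e.g. "step16000" (hypothetical name)
+
+ tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
+ model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision, torch_dtype="auto")
+
+ # Base (not instruction-tuned) model: plain text continuation, no chat template.
+ inputs = tokenizer("Norge er", return_tensors="pt")  # "Norge er" = "Norway is"
+ print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
+ ```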
+
+ ## Data Details
+
+ ### Stage 1 (16 000 steps -- 135B tokens)
+
+ Data:
+ - [HPLTv3](https://huggingface.co/datasets/HPLT/HPLT3.0) Norwegian
+ - FinePDFs Norwegian
+ - OLMo-Mix
+
+ Data Splits:
+ | Data | Percentage | Unique Tokens | Total Tokens | Number of Documents | Average Document Length |
+ | ------------------------ | ---------- | ------------- | ------------ | ------------------- | ----------------------- |
+ | HPLT Norwegian | 69.75 | 46.8B | 93.6B | 36.5M | 944 |
+ | FinePDFs Norwegian | 14.45 | 9.7B | 19.4B | 1.5M | 4 895 |
+ | Wiki (OLMo-Mix) | 0.02 | 0.2B | 26.8M | 0.3M | 690 |
+ | Alg. Stack (OLMo-Mix) | 0.04 | 0.6B | 53.7M | 0.1M | 4 291 |
+ | Open Web Math (OLMo-Mix) | 0.04 | 0.6B | 53.7M | 0.1M | 4 291 |
+ | ArXiv (OLMo-Mix) | 0.05 | 1.1B | 67.1M | 0.2M | 5 318 |
+ | PeS2o (OLMo-Mix) | 0.15 | 2.6B | 0.2B | 1.6M | 1 692 |
+ | DCLM (OLMo-Mix) | 9.50 | 49.7B | 12.8B | 35.1M | 1 416 |
+ | StarCoder (OLMo-Mix) | 2.10 | 31.5B | 8.1B | 23.6M | 1 333 |
+
+ > [!NOTE]
+ > The number of documents is the total number of unique documents, not the number of documents actually seen during training.
+
+ > [!NOTE]
+ > Only a portion of OLMo-Mix was used as our unique data.
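+
+ The ratio of total to unique tokens gives the approximate number of epochs made over each source, which is what the two notes above are getting at. An illustrative calculation from the rounded figures in the table (not an official script):
+
+ ```python
+ # Epochs per source = total training tokens / unique tokens (Stage 1 table, in billions).
+ stage1 = {
+     "HPLT Norwegian": (46.8, 93.6),      # (unique, total): repeated for ~2 epochs
+     "FinePDFs Norwegian": (9.7, 19.4),
+     "DCLM (OLMo-Mix)": (49.7, 12.8),     # subsampled: well under 1 epoch
+     "StarCoder (OLMo-Mix)": (31.5, 8.1),
+ }
+ for name, (unique_b, total_b) in stage1.items():
+     print(f"{name}: {total_b / unique_b:.2f} epochs")
+ ```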
+
+ ### Stage 2 (4 000 steps -- 35B tokens)
+
+ Data:
+ - [HPLTv3](https://huggingface.co/datasets/HPLT/HPLT3.0) (filtered) Norwegian
+ - FinePDFs-Edu Norwegian
+ - FinePDFs-Edu English
+ - Stack-Edu
+ - MegaMath Web-Pro
+ - FineMath 4+
+ - InfiWebMath 4+
+
+ Data Splits:
+ | Data | Percentage | Unique Tokens | Total Tokens | Number of Documents | Average Document Length |
+ | ------------------------ | ---------- | ------------- | ------------ | ------------------- | ----------------------- |
+ | HPLT Norwegian | 40.79 | 3.4B | 13.7B | 3.1M | 1 109 |
+ | FinePDFs-Edu Norwegian | 17.84 | 1.5B | 6.0B | 0.2M | 7 081 |
+ | FinePDFs-Edu English | 15.00 | 7.5B | 5.0B | 1.2M | 6 485 |
+ | Stack-Edu | 15.00 | 13.2B | 5.0B | 15.0M | 880 |
+ | MegaMath Web-Pro | 4.76 | 14.0B | 1.6B | 15.0M | 937 |
+ | FineMath 4+ | 3.51 | 10.4B | 1.2B | 6.7M | 1 545 |
+ | InfiWebMath 4+ | 3.09 | 9.1B | 1.0B | 6.3M | 1 447 |
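+
+ In this table the Percentage column appears to be each source's share of the Stage 2 token budget (roughly 35B tokens as reported above; computed more precisely below from the Stage 2 hyperparameters), so Total Tokens ≈ Percentage × budget. A quick consistency check, illustrative only:
+
+ ```python
+ # Stage 2 token budget = train steps * batch size (sequences) * sequence length (tokens).
+ stage2_budget = 4_000 * 512 * 16_384  # ≈ 33.6B tokens
+ shares = {"HPLT Norwegian": 40.79, "FinePDFs-Edu Norwegian": 17.84,
+           "FinePDFs-Edu English": 15.00, "Stack-Edu": 15.00,
+           "MegaMath Web-Pro": 4.76, "FineMath 4+": 3.51, "InfiWebMath 4+": 3.09}
+ for name, pct in shares.items():
+     # Matches the Total Tokens column above to rounding (e.g. ~13.7B for HPLT Norwegian).
+     print(f"{name}: {pct / 100 * stage2_budget / 1e9:.1f}B tokens")
+ ```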
+
+ ## Training details
+
+ ### Stage 1
+
+ | Hyperparameter | Value |
+ | ------------------------ | ----------------- |
+ | Embedding train steps | 1 000 |
+ | Warmup steps | 2 000 |
+ | Total train steps | 16 000 |
+ | Learning rate schedule | Warmup + constant |
+ | Learning rate | 3e-4 |
+ | Weight decay | 1e-1 |
+ | Sequence length (tokens) | 4 096 |
+ | Batch size (sequences) | 2 048 |
+ | RoPE theta | 500 000 |
+ | Gradient clipping | 1.0 |
+ | Adam epsilon | 1e-8 |
+ | Adam beta_1 | 0.9 |
+ | Adam beta_2 | 0.95 |
+ | RMSNorm epsilon | 1e-6 |
+ | Z-loss ratio | 1e-5 |
+ | Diffusion loss ratio | 2e-2 |
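+
+ The "Warmup + constant" schedule in the table reads as: ramp the learning rate up over the first 2 000 steps, then hold it at 3e-4 for the remainder of the 16 000 steps. A sketch of that schedule is below; the linear shape of the warmup is an assumption, the training framework may ramp differently:
+
+ ```python
+ def stage1_lr(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2_000) -> float:
+     """Warmup + constant learning-rate schedule as described in the Stage 1 table."""
+     if step < warmup_steps:
+         return peak_lr * (step + 1) / warmup_steps  # assumed linear warmup
+     return peak_lr  # held constant until step 16 000
+
+ print(stage1_lr(0), stage1_lr(999), stage1_lr(5_000))  # 1.5e-07 0.00015 0.0003
+ ```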
+
+ ### Stage 2
+
+ | Hyperparameter | Value |
+ | ------------------------ | ----------------- |
+ | Decay steps | 4 000 |
+ | Total train steps | 4 000 |
+ | Learning rate schedule | Linear decay |
+ | Initial learning rate | 3e-4 |
+ | Final learning rate | 0 |
+ | Weight decay | 1e-1 |
+ | Sequence length (tokens) | 16 384 |
+ | Batch size (sequences) | 512 |
+ | RoPE theta | 2 000 000 |
+ | Gradient clipping | 1.0 |
+ | Adam epsilon | 1e-8 |
+ | Adam beta_1 | 0.9 |
+ | Adam beta_2 | 0.95 |
+ | RMSNorm epsilon | 1e-6 |
+ | Z-loss ratio | 1e-5 |
+ | Diffusion loss ratio | 2e-2 |
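+
+ The per-stage token budgets follow from steps × batch size × sequence length, which is a useful sanity check against the headline figures (a back-of-the-envelope check, not an official accounting):
+
+ ```python
+ # Tokens per stage = train steps * batch size (sequences) * sequence length (tokens).
+ stage1_tokens = 16_000 * 2_048 * 4_096   # ≈ 134B, reported as 135B
+ stage2_tokens = 4_000 * 512 * 16_384     # ≈ 34B, reported as 35B
+ print(f"{(stage1_tokens + stage2_tokens) / 1e9:.0f}B tokens total")  # ≈ 168B, "around 170 billion"
+ ```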
+
+ ## Acknowledgements

  Training was conducted as a part of the [HPLT project](https://hplt-project.org/).

+ _This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]_