dvres committed · Commit fde0ffb · verified · 1 Parent(s): e5803ef

Update README.md

  1. README.md +62 -0
README.md CHANGED
@@ -150,6 +150,68 @@ In line with our commitment to transparency, open science, and the sharing of kn
  ## Data and benchmark information

+ Below we list the mixture of datasets used in each training stage. During the CPT stages, **99 %** of the data was used as the training set and the remaining **1 %** as the validation set. During the SFT stages, the train/validation split was **90/10**. The statistics for the CPT stages were computed after the initial documents were tokenized, split into units that fit into the context window, merged using sequence packing, and padded to the full context window.
+
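The preprocessing described above (tokenize, split to the context window, pack, pad) can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the greedy packing strategy, the `pad_id` value, and the function names `split_into_units`, `pack_and_pad` and `count_stats` are assumptions, and tokenization is taken as already done.

```python
from typing import List, Tuple


def split_into_units(tokens: List[int], context_window: int) -> List[List[int]]:
    """Split one tokenized document into units of at most `context_window` tokens."""
    return [tokens[i:i + context_window] for i in range(0, len(tokens), context_window)]


def pack_and_pad(documents: List[List[int]], context_window: int, pad_id: int = 0) -> List[List[int]]:
    """Greedily merge units into sequences, then pad each sequence to the full window.

    Assumption: units are packed in arrival order; a real pipeline may use a
    smarter bin-packing strategy and insert separator tokens between documents.
    """
    sequences: List[List[int]] = []
    current: List[int] = []
    for doc in documents:
        for unit in split_into_units(doc, context_window):
            # Start a new sequence when the next unit would overflow the window.
            if len(current) + len(unit) > context_window:
                sequences.append(current)
                current = []
            current = current + unit
    if current:
        sequences.append(current)
    return [seq + [pad_id] * (context_window - len(seq)) for seq in sequences]


def count_stats(sequences: List[List[int]]) -> Tuple[int, int]:
    """Return (number of tokens, number of sequences) after packing and padding."""
    return sum(len(seq) for seq in sequences), len(sequences)


if __name__ == "__main__":
    docs = [[1] * 5, [2] * 3, [3] * 10]  # three toy "tokenized" documents
    packed = pack_and_pad(docs, context_window=8)
    print(count_stats(packed))
```

Under this scheme every counted sequence has exactly `context_window` tokens, so a corpus's token count is its sequence count multiplied by the window size.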
+ ### Parallel alignment
+
+ | Corpus | Number of tokens | Number of documents | Total percentage | Short description |
+ |-------------------|----------|-------------|------------------|------------------|
+ | DGT | 804847616 | 12281 | 6.3 % | English, Slovene and Croatian texts extracted from the [DGT corpus](https://joint-research-centre.ec.europa.eu/language-technology-resources/dgt-translation-memory_en). Cutoff date: 2025 Vol 5. |
+ | MaCoCu | 430374912 | 6567 | 3.4 % | https://www.clarin.si/repository/xmlui/handle/11356/1813 |
+ | KAS | 31391744 | 479 | 0.2 % | https://www.clarin.si/repository/xmlui/handle/11356/1449 |
+ | Wikipedia | 11529093120 | 175920 | 90.1 % | English Wikipedia retrieved using [wikipedia_markdown](https://huggingface.co/datasets/zidsi/wikipedia_markdown). Translated into Slovene using [GaMS-9B-Translator](https://huggingface.co/GaMS-Beta/GaMS-9B-SFT-Translator-DPO) to create a parallel corpus. |
+ | Total | 12795707392 | 195247 | | |
+
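The "Total percentage" column can be recomputed from the token counts. The snippet below is a small sanity check on the Parallel alignment figures, not part of the original README; the helper name `token_shares` is hypothetical.

```python
# Token counts copied from the Parallel alignment table.
parallel_alignment = {
    "DGT": 804_847_616,
    "MaCoCu": 430_374_912,
    "KAS": 31_391_744,
    "Wikipedia": 11_529_093_120,
}


def token_shares(token_counts: dict) -> dict:
    """Return each corpus's share of the total token count, in percent (1 decimal)."""
    total = sum(token_counts.values())
    return {name: round(100 * count / total, 1) for name, count in token_counts.items()}


total_tokens = sum(parallel_alignment.values())
shares = token_shares(parallel_alignment)

if __name__ == "__main__":
    print(total_tokens)
    for name, share in shares.items():
        print(f"{name}: {share} %")
```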
+ ### Base CPT
+
+ | Corpus | Language | Number of tokens | Number of documents | Total percentage | Short description |
+ |-------------------|------------|----------|-------------|------------------|------------------|
+ | nemotron_pretraining_code | English | 1952120832 | 29787 | 1.9 % | Subsample of [Nemotron-Pretraining-Code-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Code-v1). Git repositories downloaded from Nemotron-Code-Metadata. |
+ | nemotron_math_4_plus | English | 2526937088 | 38558 | 2.5 % | Subsample of 4plus split from [Nemotron-CC-Math-v1](https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1). |
+ | nemotron_math_3 | English | 1210908672 | 18477 | 1.2 % | Subsample of 3 split from [Nemotron-CC-Math-v1](https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1). |
+ | nemotron_pretraining_sft | English | 3718316032 | 56737 | 3.7 % | Subsample of Nemotron-SFT-General split from [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1). |
+ | nemotron_high_quality | English | 10479403008 | 159903 | 10.4 % | Subsample of High-Quality-Synthetic split from [Nemotron-CC-v2](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). Only the examples generated with Qwen3-30B-A3B were considered for selection. |
+ | nemotron_diverse_qa | English | 8631353344 | 131704 | 8.6 % | Subsample of DiverseQA split from [Nemotron-CC-v2](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). |
+ | finepdfs_bos | Bosnian | 4815912960 | 73485 | 4.8 % | Subsample of Bosnian corpus from [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
+ | finepdfs_hrv | Croatian | 9541124096 | 145586 | 9.5 % | Subsample of Croatian corpus from [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
+ | finepdfs_srp | Serbian | 8119844864 | 123899 | 8.0 % | Subsample of Serbian corpus from [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
+ | finepdfs_slv | Slovenian | 5925044224 | 90409 | 5.9 % | Subsample of Slovene corpus from [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
+ | trendi | Slovenian | 1737687040 | 26515 | 1.7 % | https://www.clarin.si/repository/xmlui/handle/11356/2064. Cutoff date: December 2023. |
+ | classla | Slovenian | 4256432128 | 64948 | 4.2 % | https://www.clarin.si/repository/xmlui/handle/11356/1882. 1 million randomly selected documents were rewritten using Gemma 3 27B. |
+ | sl_legal | Slovenian | 1697710080 | 25905 | 1.7 % | Combination of various Slovene legal data (Legal Information System of Slovenia, court practice, Uradni List RS). |
+ | sl_med | Slovenian | 1598095360 | 24385 | 1.6 % | Combination of crawled data, academic works and journals connected to medicine. |
+ | metafida | Slovenian | 4591910912 | 70067 | 4.6 % | https://www.clarin.si/repository/xmlui/handle/11356/1775. The following subcorpora were removed: janes_tweet, janes_forum, janes_news, dgt15_sl, classlawiki_sl and tweet_sl. |
+ | fineweb2 | Slovenian | 13890289664 | 211949 | 13.8 % | Slovene corpus from [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2). |
+ | kas | Slovenian | 2726035456 | 41596 | 2.7 % | https://www.clarin.si/repository/xmlui/handle/11356/1448 |
+ | nuk_combined | Slovenian | 1213267968 | 18513 | 1.2 % | OCR-ed data (Marker, Nanonets, Llama 4 Maverick) from the national library of Slovenia. Mostly old newspapers, some books and scientific journals. |
+ | nuk_doc | Slovenian | 11570774016 | 176556 | 11.5 % | OCR-ed data (Marker, Nanonets, Llama 4 Maverick) from the national library of Slovenia. Mostly old newspapers, some books and scientific journals. |
+ | wikipedia_yugo | Slovenian, Croatian, Bosnian, Serbian | 673775616 | 10281 | 0.7 % | Combination of Slovene, Bosnian, Croatian and Serbian (converted to Latin) Wikipedia. Retrieved using [wikipedia_markdown](https://huggingface.co/datasets/zidsi/wikipedia_markdown). Cutoff date: January 2025. |
+ | Total | | 100876943360 | 1539260 | | |
+
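Because the CPT statistics are computed after packing and padding to the full context window, dividing a corpus's token count by its document count should give the same sequence length for every row. The check below uses a few Base CPT rows; the sequence length it yields is inferred from the published numbers and is not stated in the README, and the helper name `tokens_per_document` is hypothetical.

```python
# (number of tokens, number of documents) copied from a few Base CPT rows.
base_cpt_rows = {
    "nemotron_pretraining_code": (1_952_120_832, 29_787),
    "fineweb2": (13_890_289_664, 211_949),
    "wikipedia_yugo": (673_775_616, 10_281),
    "Total": (100_876_943_360, 1_539_260),
}


def tokens_per_document(tokens: int, documents: int) -> int:
    """Return tokens / documents, raising if the division is not exact."""
    quotient, remainder = divmod(tokens, documents)
    if remainder != 0:
        raise ValueError("token count is not a whole multiple of the document count")
    return quotient


sequence_lengths = {
    name: tokens_per_document(tokens, documents)
    for name, (tokens, documents) in base_cpt_rows.items()
}

if __name__ == "__main__":
    for name, length in sequence_lengths.items():
        print(f"{name}: {length} tokens per packed sequence")
```

A constant quotient across rows is what one would expect if each counted "document" is one packed, padded sequence.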
+ ### Long
+
+ | Corpus | Language | Number of tokens | Number of documents | Total percentage | Short description |
+ |-------------------|------------|----------|-------------|------------------|------------------|
+ | nemotron_math_4_plus | English | 1087373312 | 8296 | 5.4 % | Subsample of 4plus split from [Nemotron-CC-Math-v1](https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1). |
+ | nemotron_pretraining_sft | English | 1231945728 | 9399 | 6.1 % | Subsample of Nemotron-SFT-General split from [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1). |
+ | nemotron_high_quality | English | 2634285056 | 20098 | 13.1 % | Subsample of High-Quality-Synthetic split from [Nemotron-CC-v2](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). Only the examples generated with Qwen3-30B-A3B were considered for selection. |
+ | nemotron_diverse_qa | English | 1237975040 | 9445 | 6.2 % | Subsample of DiverseQA split from [Nemotron-CC-v2](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). |
+ | finepdfs_bos | Bosnian | 1614282752 | 12316 | 8.0 % | Subsample of Bosnian corpus from [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
+ | finepdfs_hrv | Croatian | 2385248256 | 18198 | 11.9 % | Subsample of Croatian corpus from [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
+ | finepdfs_srp | Serbian | 2074345472 | 15826 | 10.3 % | Subsample of Serbian corpus from [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
+ | finepdfs_slv | Slovenian | 1969618944 | 15027 | 9.8 % | Subsample of Slovene corpus from [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
+ | trendi | Slovenian | 610533376 | 4658 | 3.0 % | https://www.clarin.si/repository/xmlui/handle/11356/2064. Time window: January 2024 - July 2025. |
+ | kas_extension | Slovenian | 2256404480 | 17215 | 11.2 % | Final theses from the three Slovene universities for the years 2019-2024. The theses were crawled from university repositories and OCR-ed with Llama 4 Maverick. |
+ | math_sl | Slovenian | 1456078848 | 11109 | 7.2 % | Combination of 3 sources: translation of nemotron_math_4_plus (using [GaMS-9B-Translator](https://huggingface.co/GaMS-Beta/GaMS-9B-SFT-Translator-DPO)) and Llama 4 Maverick OCRs of 2 Slovene math/physics journals: Presek and Obzornik za matematiko in fiziko. |
+ | nemotron_pretraining_sft_translated | Slovenian | 1553858560 | 11855 | 7.7 % | Translations of nemotron_pretraining_sft using [GaMS-9B-Translator](https://huggingface.co/GaMS-Beta/GaMS-9B-SFT-Translator-DPO). |
+ | Total | | 20111949824 | 153442 | | |
+
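The per-corpus rows of the Long stage can be summed to recompute the totals and to see how the mixture splits by language. This is an illustrative check on the table's figures, not something the README provides; the variable and function names are hypothetical.

```python
from collections import defaultdict

# (corpus, language, number of tokens, number of documents) from the Long table.
long_rows = [
    ("nemotron_math_4_plus", "English", 1_087_373_312, 8_296),
    ("nemotron_pretraining_sft", "English", 1_231_945_728, 9_399),
    ("nemotron_high_quality", "English", 2_634_285_056, 20_098),
    ("nemotron_diverse_qa", "English", 1_237_975_040, 9_445),
    ("finepdfs_bos", "Bosnian", 1_614_282_752, 12_316),
    ("finepdfs_hrv", "Croatian", 2_385_248_256, 18_198),
    ("finepdfs_srp", "Serbian", 2_074_345_472, 15_826),
    ("finepdfs_slv", "Slovenian", 1_969_618_944, 15_027),
    ("trendi", "Slovenian", 610_533_376, 4_658),
    ("kas_extension", "Slovenian", 2_256_404_480, 17_215),
    ("math_sl", "Slovenian", 1_456_078_848, 11_109),
    ("nemotron_pretraining_sft_translated", "Slovenian", 1_553_858_560, 11_855),
]


def documents_by_language(rows):
    """Sum the document counts per language."""
    totals = defaultdict(int)
    for _corpus, language, _tokens, documents in rows:
        totals[language] += documents
    return dict(totals)


total_tokens = sum(tokens for _c, _l, tokens, _d in long_rows)
total_documents = sum(documents for _c, _l, _t, documents in long_rows)
per_language = documents_by_language(long_rows)

if __name__ == "__main__":
    print("tokens:", total_tokens, "documents:", total_documents)
    for language, documents in per_language.items():
        print(f"{language}: {100 * documents / total_documents:.1f} %")
```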
+ ### Base instruction-following SFT
+
+ ### Chat and safety tuning
+
+ ## Evaluation
+
  Coming soon!

  ## Usage and Limitations