## Data and benchmark information
We provide a mixture of datasets used during each of the training stages. During the CPT stages, **99 %** of the data was used as a training set and the remaining **1 %** as a validation set; during the SFT stages, the train/validation split was **90/10**. The statistics for the CPT stages were computed after the initial documents were tokenized, split into units that fit into the context window, merged together using sequence packing, and padded to the full context window.
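The CPT preprocessing steps above (tokenize, split into context-window-sized units, pack, pad) can be sketched roughly as follows. This is a minimal illustration with a greedy packing strategy and a plain pad token; the actual pipeline and packing heuristics used for training may differ:

```python
def pack_documents(token_streams, context_window, pad_id=0):
    """Split tokenized documents into context-window-sized units,
    greedily pack them into sequences, and pad to full length.

    Illustrative sketch only; the real pipeline's packing strategy
    is not specified here.
    """
    # 1. Split each tokenized document into chunks that fit the window.
    chunks = []
    for tokens in token_streams:
        for i in range(0, len(tokens), context_window):
            chunks.append(tokens[i:i + context_window])

    # 2. Greedy sequence packing: merge chunks while they still fit.
    packed, current = [], []
    for chunk in chunks:
        if len(current) + len(chunk) > context_window:
            packed.append(current)
            current = []
        current.extend(chunk)
    if current:
        packed.append(current)

    # 3. Pad every packed sequence to the full context window.
    return [seq + [pad_id] * (context_window - len(seq)) for seq in packed]
```

Token and document counts in the tables below are reported on these packed, full-context-window sequences.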

### Parallel alignment

| Corpus | Number of tokens | Number of documents | Total percentage | Short description |
|-------------------|----------|-------------|------------------|------------------|
| DGT | 804847616 | 12281 | 6.3 % | English, Slovene and Croatian texts extracted from the [DGT corpus](https://joint-research-centre.ec.europa.eu/language-technology-resources/dgt-translation-memory_en). Cutoff date: 2025 Vol 5. |
| MaCoCu | 430374912 | 6567 | 3.4 % | https://www.clarin.si/repository/xmlui/handle/11356/1813 |
| KAS | 31391744 | 479 | 0.2 % | https://www.clarin.si/repository/xmlui/handle/11356/1449 |
| Wikipedia | 11529093120 | 175920 | 90.1 % | English Wikipedia retrieved using [wikipedia_markdown](https://huggingface.co/datasets/zidsi/wikipedia_markdown). Translated into Slovene using [GaMS-9B-Translator](https://huggingface.co/GaMS-Beta/GaMS-9B-SFT-Translator-DPO) to create a parallel corpus. |
| Total | 12795707392 | 195247 | | |

### Base CPT

| Corpus | Language | Number of tokens | Number of documents | Total percentage | Short description |
|-------------------|------------|----------|-------------|------------------|------------------|
| nemotron_pretraining_code | English | 1952120832 | 29787 | 1.9 % | Subsample of [Nemotron-Pretraining-Code-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Code-v1). Git repositories downloaded from Nemotron-Code-Metadata. |
| nemotron_math_4_plus | English | 2526937088 | 38558 | 2.5 % | Subsample of the 4plus split from [Nemotron-CC-Math-v1](https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1) |
| nemotron_math_3 | English | 1210908672 | 18477 | 1.2 % | Subsample of the 3 split from [Nemotron-CC-Math-v1](https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1) |
| nemotron_pretraining_sft | English | 3718316032 | 56737 | 3.7 % | Subsample of the Nemotron-SFT-General split from [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) |
| nemotron_high_quality | English | 10479403008 | 159903 | 10.4 % | Subsample of the High-Quality-Synthetic split from [Nemotron-CC-v2](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). Only the examples generated with Qwen3-30B-A3B were considered for selection. |
| nemotron_diverse_qa | English | 8631353344 | 131704 | 8.6 % | Subsample of the DiverseQA split from [Nemotron-CC-v2](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). |
| finepdfs_bos | Bosnian | 4815912960 | 73485 | 4.8 % | Subsample of the Bosnian corpus from [FinePDFS](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
| finepdfs_hrv | Croatian | 9541124096 | 145586 | 9.5 % | Subsample of the Croatian corpus from [FinePDFS](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
| finepdfs_srp | Serbian | 8119844864 | 123899 | 8.0 % | Subsample of the Serbian corpus from [FinePDFS](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
| finepdfs_slv | Slovenian | 5925044224 | 90409 | 5.9 % | Subsample of the Slovene corpus from [FinePDFS](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
| trendi | Slovenian | 1737687040 | 26515 | 1.7 % | https://www.clarin.si/repository/xmlui/handle/11356/2064, Cutoff date: December 2023 |
| classla | Slovenian | 4256432128 | 64948 | 4.2 % | https://www.clarin.si/repository/xmlui/handle/11356/1882, 1 million randomly selected documents were rewritten using Gemma 3 27B |
| sl_legal | Slovenian | 1697710080 | 25905 | 1.7 % | Combination of various Slovene legal data (Legal Information System of Slovenia, court practice, Uradni List RS) |
| sl_med | Slovenian | 1598095360 | 24385 | 1.6 % | Combination of crawled data, academic works and journals connected to medicine |
| metafida | Slovenian | 4591910912 | 70067 | 4.6 % | https://www.clarin.si/repository/xmlui/handle/11356/1775. The following subcorpora were removed: janes_tweet, janes_forum, janes_news, dgt15_sl, classlawiki_sl and tweet_sl |
| fineweb2 | Slovenian | 13890289664 | 211949 | 13.8 % | Slovene corpus from [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) |
| kas | Slovenian | 2726035456 | 41596 | 2.7 % | https://www.clarin.si/repository/xmlui/handle/11356/1448 |
| nuk_combined | Slovenian | 1213267968 | 18513 | 1.2 % | OCR-ed data (Marker, Nanonets, Llama 4 Maverick) from the National Library of Slovenia. Mostly old newspapers, some books and scientific journals |
| nuk_doc | Slovenian | 11570774016 | 176556 | 11.5 % | OCR-ed data (Marker, Nanonets, Llama 4 Maverick) from the National Library of Slovenia. Mostly old newspapers, some books and scientific journals |
| wikipedia_yugo | Slovenian, Croatian, Bosnian, Serbian | 673775616 | 10281 | 0.7 % | Combination of the Slovene, Bosnian, Croatian and Serbian (converted to Latin script) Wikipedias. Retrieved using [wikipedia_markdown](https://huggingface.co/datasets/zidsi/wikipedia_markdown). Cutoff date: January 2025 |
| Total | | 100876943360 | 1539260 | | |

### Long

| Corpus | Language | Number of tokens | Number of documents | Total percentage | Short description |
|-------------------|------------|----------|-------------|------------------|------------------|
| nemotron_math_4_plus | English | 1087373312 | 8296 | 5.4 % | Subsample of the 4plus split from [Nemotron-CC-Math-v1](https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1) |
| nemotron_pretraining_sft | English | 1231945728 | 9399 | 6.1 % | Subsample of the Nemotron-SFT-General split from [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) |
| nemotron_high_quality | English | 2634285056 | 20098 | 13.1 % | Subsample of the High-Quality-Synthetic split from [Nemotron-CC-v2](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). Only the examples generated with Qwen3-30B-A3B were considered for selection. |
| nemotron_diverse_qa | English | 1237975040 | 9445 | 6.2 % | Subsample of the DiverseQA split from [Nemotron-CC-v2](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). |
| finepdfs_bos | Bosnian | 1614282752 | 12316 | 8.0 % | Subsample of the Bosnian corpus from [FinePDFS](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
| finepdfs_hrv | Croatian | 2385248256 | 18198 | 11.9 % | Subsample of the Croatian corpus from [FinePDFS](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
| finepdfs_srp | Serbian | 2074345472 | 15826 | 10.3 % | Subsample of the Serbian corpus from [FinePDFS](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
| finepdfs_slv | Slovenian | 1969618944 | 15027 | 9.8 % | Subsample of the Slovene corpus from [FinePDFS](https://huggingface.co/datasets/HuggingFaceFW/finepdfs). |
| trendi | Slovenian | 610533376 | 4658 | 3.0 % | https://www.clarin.si/repository/xmlui/handle/11356/2064, Time window: January 2024 - July 2025 |
| kas_extension | Slovenian | 2256404480 | 17215 | 11.2 % | Final theses from the three Slovene universities for the years 2019-2024. The theses were crawled from university repositories and OCR-ed with Llama 4 Maverick. |
| math_sl | Slovenian | 1456078848 | 11109 | 7.2 % | Combination of 3 sources: a translation of nemotron_math_4_plus (using [GaMS-9B-Translator](https://huggingface.co/GaMS-Beta/GaMS-9B-SFT-Translator-DPO)) and Llama 4 Maverick OCRs of 2 Slovene math/physics journals: Presek and Obzornik za matematiko in fiziko |
| nemotron_pretraining_sft_translated | Slovenian | 1553858560 | 11855 | 7.7 % | Translations of nemotron_pretraining_sft using [GaMS-9B-Translator](https://huggingface.co/GaMS-Beta/GaMS-9B-SFT-Translator-DPO) |
| Total | | 20111949824 | 153442 | | |

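The Total rows and per-corpus percentages can be cross-checked directly from the token counts. For example, for the Long mixture above (rounding shares to one decimal is our assumption about how the table percentages were produced):

```python
# Token counts copied from the Long mixture table.
long_tokens = {
    "nemotron_math_4_plus": 1087373312,
    "nemotron_pretraining_sft": 1231945728,
    "nemotron_high_quality": 2634285056,
    "nemotron_diverse_qa": 1237975040,
    "finepdfs_bos": 1614282752,
    "finepdfs_hrv": 2385248256,
    "finepdfs_srp": 2074345472,
    "finepdfs_slv": 1969618944,
    "trendi": 610533376,
    "kas_extension": 2256404480,
    "math_sl": 1456078848,
    "nemotron_pretraining_sft_translated": 1553858560,
}

# Sum of the per-corpus counts reproduces the Total row.
total = sum(long_tokens.values())
print(total)  # 20111949824

# Per-corpus share of the mixture, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in long_tokens.items()}
print(shares["finepdfs_hrv"])  # 11.9
```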
### Base instruction-following SFT
### Chat and safety tuning
## Evaluation
Coming soon!
## Usage and Limitations