### Pretraining Data

The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
The initial 1.5 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportions to balance the representation
and give more importance to Spain's co-official languages (Spanish, Catalan, Galician, and Basque). To this end, code and English data
were downsampled to half, Spain's co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
Then, during the following two epochs, the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:

![lang distrib](./images/corpus_languages.png)

The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53.05% of the total tokens.
Following this, Starcoder provides 13.67%, and FineWebEdu (a 350B-token subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing between 1.41% and 1.72%.
These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
The remaining 10% comes from smaller sources in various languages.
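The reweighting described above can be sketched as a simple per-language scaling. In this minimal sketch the source names and base token counts are hypothetical placeholders; only the factors (0.5x for English and code, 2x for Spain's co-official languages) come from the description.

```python
# Minimal sketch of the corpus reweighting described above. The base
# token counts are hypothetical placeholders; only the factors come
# from the text (0.5x English/code, 2x Spain's co-official languages).
BASE_TOKENS = {  # billions of tokens per source (made up)
    "en": 100, "code": 60, "es": 40, "fr": 30,
    "ca": 5, "gl": 1, "eu": 1,
}
FACTORS = {  # sampling factor per language; unlisted languages keep 1.0
    "en": 0.5, "code": 0.5,
    "es": 2.0, "ca": 2.0, "gl": 2.0, "eu": 2.0,
}

def adjusted_shares(base, factors):
    """Apply per-language sampling factors and return each language's
    percentage of the adjusted corpus."""
    weighted = {lang: n * factors.get(lang, 1.0) for lang, n in base.items()}
    total = sum(weighted.values())
    return {lang: 100 * n / total for lang, n in weighted.items()}

shares = adjusted_shares(BASE_TOKENS, FACTORS)
```

Because the shares are renormalized after scaling, halving one source automatically redistributes its weight across all the others.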

Feel free to click the expand button below to see the full list of sources.
</details>

We provide an extensive Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).

<details>
**For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.**

The purpose of creating this dataset is to pre-train the Salamandra family of multilingual models with high performance in a large number of
European languages (35) and code (including 92 different programming languages). In addition, we especially aim to represent the co-official
languages of Spain: Spanish, Catalan, Galician, and Basque. This is why we carry out an oversampling of these languages.
**Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**

This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).

This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
within the framework of the [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
**How many instances are there in total (of each type, if appropriate)?**

The dataset contains a diverse range of instances across multiple languages, with notable adjustments for certain languages. English
represents the largest portion, accounting for 39.31% of the total data. Spanish was upsampled by a factor of 2, bringing its share to 16.12%,
while Catalan (1.97%), Basque (0.24%), and Galician (0.31%) were also upsampled by 2. On the other hand, code-related data was downsampled
by half, making up 5.78% of the total. Other prominent languages include French (6.6%), Russian (5.56%), German (4.79%), and Hungarian
(4.59%), with several additional languages contributing between 1% and 2%, and smaller portions represented by a variety of others.
**Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).**
**Has the dataset been used for any tasks already? If so, please provide a description.**

Pre-train the Salamandra model family.

**What (other) tasks could the dataset be used for?**