Update README.md
README.md (CHANGED)
@@ -42,6 +42,25 @@ language:
 datasets:
 - oscar-corpus/colossal-oscar-1.0
 - HuggingFaceFW/fineweb-edu
+- joelniklaus/eurlex_resources
+- joelniklaus/legal-mc4
+- projecte-aina/CATalog
+- UFRGS/brwac
+- community-datasets/hrwac
+- danish-foundation-models/danish-gigaword
+- HiTZ/euscrawl
+- PleIAs/French-PD-Newspapers
+- PleIAs/French-PD-Books
+- AI-team-UoA/greek_legal_code
+- HiTZ/latxa-corpus-v1.1
+- allenai/peS2o
+- pile-of-law/pile-of-law
+- PORTULAN/parlamento-pt
+- hoskinson-center/proof-pile
+- togethercomputer/RedPajama-Data-1T
+- bigcode/starcoderdata
+- bjoernp/tagesschau-2018-2023
+- EleutherAI/the_pile_deduplicated
 ---

@@ -51,7 +70,7 @@ datasets:
 >
 > The weights will be promptly updated as soon as the training process is complete.

-#
+# ALIA-40b Model Card

 ALIA-40b is a highly multilingual model pre-trained from scratch that will come with its respective base and instruction-tuned variants. This model card corresponds to the 40B base version.

@@ -288,13 +307,13 @@ The pre-training corpus comprises data from 35 European languages and 92 program
 The initial 1.5 training epochs used 2.4 trillion tokens, obtained by manually adjusting the data proportions to balance the representation
 and give more importance to Spain's co-official languages (Spanish, Catalan, Galician, and Basque). To this end, we downsampled code and English data to half,
 oversampled the co-official languages by 2x, and kept the remaining languages in their original proportions.
-
+During the following epochs (training is still in progress), the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
 This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:

 [figure: corpus token distribution]

 The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53.05% of the total tokens.
-Following this, Starcoder provides 13.67%, and
+Following this, Starcoder provides 13.67%, and FineWeb-Edu (a 350B-token subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
 Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing between 1.41% and 1.72%.
 These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
 The remaining 10% comes from smaller sources in various languages.
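The mixture adjustment and token-share figures in the last hunk can be sketched with a few lines of arithmetic. This is a minimal illustration, not ALIA's actual data pipeline: the raw per-group token counts below are hypothetical placeholders, while the sampling weights (0.5x for code and English, 2x for the co-official languages), the 2.68T total, and the percentage shares are taken from the text of the diff.

```python
# Sketch of the mixture reweighting described in the hunk above:
# code and English are downsampled to half, Spain's co-official
# languages are oversampled by 2x, everything else is unchanged.
# The raw token counts here are made-up placeholders.
raw_tokens = {
    "english": 900e9,
    "code": 500e9,
    "spanish": 200e9,
    "catalan": 30e9,
    "galician": 10e9,
    "basque": 8e9,
    "other_languages": 400e9,
}

WEIGHTS = {
    "english": 0.5,   # downsampled to half
    "code": 0.5,      # downsampled to half
    "spanish": 2.0,   # oversampled 2x
    "catalan": 2.0,
    "galician": 2.0,
    "basque": 2.0,
}

def reweight(tokens: dict) -> dict:
    """Apply each group's sampling weight; groups without an entry keep weight 1.0."""
    return {group: count * WEIGHTS.get(group, 1.0) for group, count in tokens.items()}

adjusted = reweight(raw_tokens)

# Token-share arithmetic for the final 2.68T-token corpus,
# using the percentages quoted in the text.
TOTAL_TOKENS = 2.68e12
shares = {
    "Colossal OSCAR": 0.5305,
    "Starcoder": 0.1367,
    "FineWeb-Edu": 0.1024,
    "HPLT": 0.0421,
    "French-PD": 0.0359,
}
tokens_per_source = {name: frac * TOTAL_TOKENS for name, frac in shares.items()}
```

Under these shares, Colossal OSCAR alone accounts for roughly 0.5305 x 2.68T, about 1.42 trillion tokens.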