jsaizant committed · verified
Commit 5aa563b · 1 parent: dbf3c24

Update README.md

Files changed (1)
  1. README.md +22 -3
README.md CHANGED
@@ -42,6 +42,25 @@ language:
 datasets:
 - oscar-corpus/colossal-oscar-1.0
 - HuggingFaceFW/fineweb-edu
+- joelniklaus/eurlex_resources
+- joelniklaus/legal-mc4
+- projecte-aina/CATalog
+- UFRGS/brwac
+- community-datasets/hrwac
+- danish-foundation-models/danish-gigaword
+- HiTZ/euscrawl
+- PleIAs/French-PD-Newspapers
+- PleIAs/French-PD-Books
+- AI-team-UoA/greek_legal_code
+- HiTZ/latxa-corpus-v1.1
+- allenai/peS2o
+- pile-of-law/pile-of-law
+- PORTULAN/parlamento-pt
+- hoskinson-center/proof-pile
+- togethercomputer/RedPajama-Data-1T
+- bigcode/starcoderdata
+- bjoernp/tagesschau-2018-2023
+- EleutherAI/the_pile_deduplicated
 ---
 
 ![](./images/logo_alia_2.png)
@@ -51,7 +70,7 @@ datasets:
 >
 > The weights will be promptly updated as soon as the training process is complete.
 
-# Salmandra ALIA-40b Model Card
+# ALIA-40b Model Card
 
 ALIA-40b is a highly multilingual model pre-trained from scratch that will come with its respective base and instruction-tuned variants. This model card corresponds to the 40B base version.
 
@@ -288,13 +307,13 @@ The pre-training corpus comprises data from 35 European languages and 92 program
 The initial 1.5 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportion to balance the representation
 and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
-Following, during the following epochs (still training), the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
+During the following epochs (still training), the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
 This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
 
 ![lang distrib](./images/corpus_languages.png)
 
 The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53,05% of the total tokens.
-Following this, Starcoder provides 13,67%, and FineWebEdu (350B tokens subset) adds 10,24%. The next largest sources are HPLT at 4,21% and French-PD at 3,59%.
+Following this, Starcoder provides 13,67%, and FineWeb-Edu (350B tokens subset) adds 10,24%. The next largest sources are HPLT at 4,21% and French-PD at 3,59%.
 Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing around 1.72% to 1.41%.
 These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
 The remaining 10% comes from smaller sources in various languages.
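The mixture adjustment the diffed passage describes (code and English downsampled to half, Spain's co-official languages oversampled 2x, everything else unchanged) can be sketched as a simple reweighting. This is an illustrative sketch only: the token counts below are hypothetical placeholders, not the actual ALIA corpus sizes, and `reweight` is not a function from the project's codebase.

```python
# Sketch of the epoch-1 mixture reweighting described in the model card:
# code and English sampled at 0.5x, Spain's co-official languages
# (Spanish, Catalan, Galician, Basque) at 2x, all others at 1x.
# Token counts are hypothetical placeholders, not the real corpus sizes.

RAW_TOKENS = {  # billions of tokens (illustrative values only)
    "en": 900.0,
    "code": 500.0,
    "es": 200.0,
    "ca": 30.0,
    "gl": 10.0,
    "eu": 10.0,
    "other": 350.0,
}

# Sampling weights from the README text; languages not listed default to 1x.
WEIGHTS = {"en": 0.5, "code": 0.5, "es": 2.0, "ca": 2.0, "gl": 2.0, "eu": 2.0}


def reweight(raw, weights):
    """Apply per-source sampling weights and return normalized proportions."""
    sampled = {src: tok * weights.get(src, 1.0) for src, tok in raw.items()}
    total = sum(sampled.values())
    return {src: tok / total for src, tok in sampled.items()}


props = reweight(RAW_TOKENS, WEIGHTS)
for src, p in sorted(props.items(), key=lambda kv: -kv[1]):
    print(f"{src}: {p:.1%}")
```

With these placeholder counts, English drops from roughly 45% of the raw pool to under a third of the sampled mixture, while the co-official languages roughly double their share, which is the qualitative effect the card describes.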