Related paper: The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset (arXiv:2303.03915)
A multilingual generative pretrained transformer with 176B parameters, extended with capacity for Finnish. The model is built on pretrained BLOOM, which is further pretrained for 40B tokens on a combined ROOTS + Finnish dataset (without weighting between the two corpora).
Datasets
We used a combination of several Finnish resources, listed below with their sizes, upsampling weights, and resulting sampling ratios.
Sampling ratios for Finnish
| Dataset | Characters | Raw ratio | Weight | Weighted ratio |
|---|---|---|---|---|
| Parsebank | 35.0B | 16.9% | 1.5 | 22.7% |
| mC4-Fi | 46.3B | 22.4% | 1.0 | 20.0% |
| CC-Fi | 79.6B | 38.5% | 1.0 | 34.4% |
| Fiwiki | 0.8B | 0.4% | 3.0 | 1.0% |
| Lönnrot | 0.8B | 0.4% | 3.0 | 1.0% |
| Yle | 1.6B | 0.8% | 2.0 | 1.4% |
| STT | 2.2B | 1.1% | 2.0 | 1.9% |
| ePub | 13.5B | 6.5% | 1.0 | 5.8% |
| Lehdet | 5.8B | 2.8% | 1.0 | 2.5% |
| Suomi24 | 20.6B | 9.9% | 1.0 | 8.9% |
| Reddit-Fi | 0.7B | 0.4% | 1.0 | 0.3% |
| TOTAL | 207.0B | 100.0% | N/A | 100.0% |
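As an illustration of how the weighted ratios above follow from the character counts and weights, here is a minimal Python sketch that multiplies each dataset's size by its upsampling weight and normalizes. This is a reconstruction for clarity using the numbers from the table, not the project's actual data pipeline:

```python
# Derive weighted sampling ratios from character counts (in billions)
# and per-dataset upsampling weights, as in the table above.
datasets = {
    # name: (characters in billions, sampling weight)
    "Parsebank": (35.0, 1.5),
    "mC4-Fi":    (46.3, 1.0),
    "CC-Fi":     (79.6, 1.0),
    "Fiwiki":    (0.8,  3.0),
    "Lönnrot":   (0.8,  3.0),
    "Yle":       (1.6,  2.0),
    "STT":       (2.2,  2.0),
    "ePub":      (13.5, 1.0),
    "Lehdet":    (5.8,  1.0),
    "Suomi24":   (20.6, 1.0),
    "Reddit-Fi": (0.7,  1.0),
}

# Total size after upsampling: sum of chars * weight over all datasets.
total_weighted = sum(chars * weight for chars, weight in datasets.values())

# Each dataset's weighted sampling ratio is its upsampled size
# divided by the upsampled total (e.g. Parsebank: 52.5 / 231.4 = 22.7%).
for name, (chars, weight) in datasets.items():
    ratio = chars * weight / total_weighted
    print(f"{name:10s} {ratio:6.1%}")
```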
For the continued pretraining as a whole, ROOTS is mixed in with the Finnish data described above.