---
license: mit
---

# umT5 Small

The UMT5 model was proposed in [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.

The abstract from the paper is the following:

*Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However, previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling.*
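
For intuition, the sketch below is one minimal reading of the UniMax allocation idea summarized in the abstract: spread a fixed character budget as uniformly as possible across languages, while capping each language at a maximum number of epochs (repeats) over its corpus. This is an illustrative reconstruction, not the paper's reference implementation; the function name and signature are our own.

```python
# Illustrative UniMax-style budget allocation (not the paper's reference code).
# corpus_chars: characters available per language; budget: total training
# characters; max_epochs: cap on repeats over any single language's corpus.
def unimax_allocation(corpus_chars, budget, max_epochs):
    remaining = sorted(corpus_chars, key=corpus_chars.get)  # smallest corpora first
    remaining_budget = budget
    alloc = {}
    while remaining:
        lang = remaining.pop(0)
        uniform_share = remaining_budget / (len(remaining) + 1)
        # Tail languages are capped at max_epochs repeats of their corpus;
        # head languages split whatever budget is left uniformly.
        alloc[lang] = min(uniform_share, corpus_chars[lang] * max_epochs)
        remaining_budget -= alloc[lang]
    total = sum(alloc.values())
    return {lang: chars / total for lang, chars in alloc.items()}

# Example: the tail language is capped at 4 epochs, so the head language
# absorbs the remaining budget instead of the tail being over-repeated.
print(unimax_allocation({"en": 1e12, "yo": 1e9}, budget=2e11, max_epochs=4))
# -> {'yo': 0.02, 'en': 0.98}
```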

# Integration into Transformers

Overview of the umT5 model integration:

* Transformers integration is ongoing, see this awesome [PR](https://github.com/huggingface/transformers/pull/22626) by @agemagician! A usage sketch is included after this list.
* A conversion script (original T5X checkpoints to Flax) is available [here](https://gist.github.com/stefan-it/5d6a4ec89e7ad97181983881434cb4eb).
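
Once the PR above is merged, usage is expected to mirror the existing T5/mT5 API. The sketch below assumes this and uses a placeholder model id for this repository's checkpoint; neither the id nor the final class names are confirmed here.

```python
# Hypothetical usage sketch: assumes the Transformers integration linked above
# has landed and exposes umT5 through the standard seq2seq auto classes.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "umt5-small"  # placeholder id for this repository's checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# T5-style span-corruption prompt with a sentinel token.
inputs = tokenizer("A <extra_id_0> walks into a bar.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```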