---
license: mit
language:
- af
- am
- ar
- az
- be
- bg
- bn
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- he
- hi
- hmn
- ht
- hu
- hy
- id
- ig
- is
- it
- iw
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- 'no'
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- tr
- uk
- und
- ur
- uz
- vi
- xh
- yi
- yo
- zh
- zu
datasets:
- mc4
---

# MyT5

## Model Details

MyT5 (**My**te **T5**) is a multilingual language model based on the T5 architecture.
The model uses a **m**orphologically-driven **byte** (**MYTE**) representation described in our paper [Limisiewicz et al., 2024](https://arxiv.org/pdf/2403.10691.pdf).

### Model Description

- **Developed by:** Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
- **Funded by:** University of Washington Fellowship, Charles University Grant Agency
- **Model type:** T5
- **Language(s) (NLP):** Multilingual
- **License:** MIT

### Model Sizes

- **[Small](https://huggingface.co/Tomlim/myt5-small)**: 300M parameters
- **[Base](https://huggingface.co/Tomlim/myt5-base)**: 582M parameters
- **[Large](https://huggingface.co/Tomlim/myt5-large)**: 1.2B parameters

### Model Sources

- **[Repository](https://github.com/tomlimi/MYTE)**
- **[Paper](https://arxiv.org/pdf/2403.10691.pdf)**

## How to Get Started with the Model

The snippet below shows the basic usage of the model for multilingual language modeling.
The custom tokenizer is available in the [GitHub](https://github.com/tomlimi/MYTE) repository, in `src/myt5/myt5_tokenizer.py`; we also plan to release it on Hugging Face in the future.

```python
import torch
from transformers import T5ForConditionalGeneration
from src.myt5.myt5_tokenizer import MyT5Tokenizer

MODEL_SIZE = "large"  # small, base, or large

# Repository ids follow the model size links above, e.g. Tomlim/myt5-large.
model = T5ForConditionalGeneration.from_pretrained(f"Tomlim/myt5-{MODEL_SIZE}", use_safetensors=True)
tokenizer = MyT5Tokenizer()

pre_texts = ['"We now have',
             '„Mamy teraz myszy w wieku',
             '"""எங்களிடம் இப்போது']
post_texts = ['4-month-old mice that are non-diabetic that used to be diabetic," he added.',
              '4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.',
              '4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை"" என்று அவர் மேலும் கூறினார்."']

# Encode the prefixes as encoder inputs and the continuations as targets.
inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt")
targets = tokenizer(post_texts, padding="longest", return_tensors="pt")

# Score the continuations; probs are per-position distributions over the MYTE byte vocabulary.
outputs = model(**inputs, labels=targets.input_ids)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
```
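
The same checkpoint can also generate text directly; a minimal sketch continuing the snippet above (assuming `MyT5Tokenizer` inherits the standard `batch_decode` from the Transformers tokenizer base class):

```python
# Greedy decoding of continuations for the three prefixes above.
generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```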

## Training Details

### Training Data

The model was trained on the standard T5 task of restoring corrupted spans in the multilingual mC4 dataset.
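
For illustration, span corruption masks random spans of the input and trains the model to reconstruct them from sentinel markers; a schematic sketch (sentinel notation borrowed from the original T5 setup, whereas MyT5 operates on MYTE byte ids):

```python
# Schematic of the T5 span-corruption objective (illustration, not training code):
corrupted_input = "The quick <extra_id_0> jumps over <extra_id_1> dog."
target_output = "<extra_id_0> brown fox <extra_id_1> the lazy <extra_id_2>"
```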

### Preprocessing

Instead of UTF-8 bytes, we used a morphologically-driven byte (MYTE) representation.
See the description in our [paper](https://arxiv.org/pdf/2403.10691.pdf) for more details.
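
As a quick illustration of the representation, the tokenizer can be used to compare MYTE sequence lengths against raw UTF-8 byte counts (a sketch assuming `MyT5Tokenizer` follows the standard Transformers tokenizer interface; the example strings are arbitrary):

```python
from src.myt5.myt5_tokenizer import MyT5Tokenizer

tokenizer = MyT5Tokenizer()
for text in ["We now have 4-month-old mice.", "Mamy teraz myszy w wieku 4 miesięcy."]:
    myte_ids = tokenizer(text)["input_ids"]
    utf8_bytes = text.encode("utf-8")
    print(f"{len(utf8_bytes)} UTF-8 bytes -> {len(myte_ids)} MYTE ids: {text}")
```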

### Training Hyperparameters

We used the same hyperparameters as in the original ByT5 paper.
The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.

### Computational Infrastructure

Models were trained on TPUs available through the TPU Research Cloud (TRC).
We used a v3-8 TPU to train the small and base models and a v3-32 TPU for the large model.
The training of each instance took:

- **Small**: 90h
- **Base**: 230h
- **Large**: 190h

## Evaluation

MyT5 models are compared with our reimplementation of [ByT5](https://huggingface.co/docs/transformers/model_doc/byt5) models trained for 250,000 steps.

### Language Modeling

We evaluated LM performance on the multi-parallel [FLORES 200](https://arxiv.org/pdf/2207.04672v3.pdf) corpus.
To compare scores across languages and models, we used a normalized metric: Bits-per-English-Byte (BPEB).
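
A sketch of how such a score can be computed from the model's loss, assuming BPEB divides the total bits assigned to a continuation by the UTF-8 byte length of its English parallel (see the paper for the exact definition; the helper below is hypothetical):

```python
import math
import torch

def bits_per_english_byte(model, tokenizer, pre_text, post_text, english_post_text):
    # Total negative log-likelihood (in nats) of the continuation given the prefix.
    inputs = tokenizer([pre_text], return_tensors="pt")
    targets = tokenizer([post_text], return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=targets.input_ids).loss  # mean nats per target id
    total_nats = loss.item() * targets.input_ids.numel()
    # Convert nats to bits and normalize by the English parallel's byte count.
    return total_nats / math.log(2) / len(english_post_text.encode("utf-8"))
```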

#### Results

| Model | Languages | ByT5 BPEB | ByT5 T (ms) | MyT5 BPEB | MyT5 T (ms) |
|-------|-----------|-----------|-------------|-----------|-------------|
| small | All       | 10.1      | 7.0         | 4.6       | 6.7         |
| small | Latin     | 4.6       | 5.9         | 4.2       | 6.6         |
| small | Non-Latin | 18.1      | 8.5         | 5.1       | 6.8         |
| base  | All       | 8.2       | 11.5        | 5.8       | 8.9         |
| base  | Latin     | 4.9       | 9.4         | 5.0       | 8.7         |
| base  | Non-Latin | 13.0      | 14.6        | 6.9       | 9.1         |
| large | All       | 13.4      | 31.8        | 4.6       | 26.7        |
| large | Latin     | 10.1      | 28.1        | 4.0       | 26.6        |
| large | Non-Latin | 18.2      | 37.3        | 5.4       | 27.0        |

Bits-per-English-Byte (BPEB) and inference times T (average per FLORES 200 sentence), averaged over three language groupings.
The inference was run on an A40 GPU core.

### Downstream Tasks

We tested the large model on four end-tasks: question answering, NER, semantic parsing, and machine translation.
The test data come from the XTREME-UP benchmark ([Ruder, Clark et al., 2023](https://arxiv.org/pdf/2305.11938.pdf)), which covers mainly low-resource languages.

#### Fine-tuning

In each task, we fine-tuned on all languages jointly.
We used a learning rate of 1e-3 with square-root decay and a dropout rate of 0.1; a sketch of this setup follows the list below.
The batch size and number of training steps varied across tasks:

- **NER**: 128 examples per batch, 6000 steps
- **QA**: 64 examples per batch, 6500 steps
- **Semantic Parsing**: 64 examples per batch, 1000 steps
- **MT**: 64 examples per batch, 10000 steps
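
A minimal PyTorch sketch of this optimization setup (the choice of AdamW and the step floor are illustrative assumptions; the card does not specify the optimizer):

```python
import math
import torch
from transformers import T5ForConditionalGeneration

# Learning rate 1e-3 with square-root decay and dropout 0.1, as described above.
model = T5ForConditionalGeneration.from_pretrained("Tomlim/myt5-large", dropout_rate=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: 1.0 / math.sqrt(max(step, 1))  # sqrt decay
)
```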

#### Results

| Model      | QA (F1) | NER (F1) | Semantic Parsing (EM) | MT (chrF) |
|------------|---------|----------|-----------------------|-----------|
| Flan-PaLM* | 22.9    | 12.0     | 0.1                   | ---       |
| mT5*       | 59.7    | 74.0     | 21.8                  | ---       |
| ByT5       | 73.2    | 81.5     | 25.1                  | 20.1      |
| MyT5       | 75.3    | 80.8     | 19.6                  | 20.4      |

Inference times per example (ms):

| Model | QA   | NER  | Semantic Parsing | MT   |
|-------|------|------|------------------|------|
| ByT5  | 36.2 | 13.8 | 13.2             | 15.9 |
| MyT5  | 35.6 | 12.6 | 12.4             | 12.6 |

Average results on XTREME-UP tasks across low-resource languages.
The baseline results of mT5 and Flan-PaLM (*, in-context-learning evaluation) are reported in [Ruder, Clark et al., 2023](https://arxiv.org/pdf/2305.11938.pdf).
The reported inference time is an average across evaluation examples; the inference was run on an A40 GPU core.

## Citation

```bibtex
@misc{limisiewicz2024myte,
      title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling},
      author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer},
      year={2024},
      eprint={2403.10691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Model Card Author

[Tomasz Limisiewicz](mailto:limisewicz@ufal.mff.cuni.cz)