alphaedge-ai/mt5-small-mal-32768

text2text-generation


lbourdois committed 10 days ago

Commit 1663a2e · verified · 1 Parent(s): 8536960

Update model card for Malayalam

Files changed (1)

README.md +62 -24


@@ -1,24 +1,62 @@
----
-language: mal
-license: apache-2.0
-tags: [trimmed, mt5, seq2seq]
-base_model: google/mt5-small
-datasets:
-  - Lumberjackk/fineweb-2-trimming
----
-# mt5-small-mal-32768
-Version of [google/mt5-small](https://huggingface.co/google/mt5-small) with a reduced vocabulary for **Malayalam**.
-| | Original | Trimmed |
-|---|---|---|
-| Vocabulary | 250,100 | 32,768 |
-| Parameters | 300,176,768 | 77,616,512 |
-## Usage
-```python
-from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
-tokenizer = T5Tokenizer.from_pretrained("lbourdois/mt5-small-mal-32768")
-model     = AutoModelForSeq2SeqLM.from_pretrained("lbourdois/mt5-small-mal-32768")
-```

+---
+pipeline_tag: text2text-generation
+language: mal
+license: apache-2.0
+tags:
+  - trimmed
+library_name: transformers
+base_model: google/mt5-small
+base_model_relation: quantized
+datasets:
+  - lbourdois/fineweb-2-trimming
+---
+# mt5-small-mal-32768
+This model is a **74.14% smaller** version of [google/mt5-small](https://huggingface.co/google/mt5-small), optimized for the **Malayalam** language by reducing the vocabulary size with the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.
+This trimmed model should perform similarly to the original model on Malayalam while keeping only 32,768 tokens and using a much smaller memory footprint. However, it may perform poorly on other languages, because tokens not commonly used in Malayalam were removed from the vocabulary.
+## Model Statistics
+| Metric | Original | Trimmed | Reduction |
+|--------|----------|---------|-----------|
+| **Vocabulary size** | 250,112 tokens | 32,768 tokens | **86.90%** |
+| **Model size** | 300,176,768 params | 77,616,512 params | **74.14%** |
+![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/mt5-small-32768.png)
+## Mining Dataset Statistics
+- **Number of texts used for mining**: 200,000 texts
+- **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)
+## Usage
+```python
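+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+model_name = "alphaedge-ai/mt5-small-mal-32768"
+# Load with the seq2seq language-modeling head so the model can be used with generate()
+model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name)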
+```
+## Citations
+#### mT5
+```
+@misc{xue2021mt5massivelymultilingualpretrained,
+      title={mT5: A massively multilingual pre-trained text-to-text transformer},
+      author={Linting Xue and Noah Constant and Adam Roberts and Mihir Kale and Rami Al-Rfou and Aditya Siddhant and Aditya Barua and Colin Raffel},
+      year={2021},
+      eprint={2010.11934},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2010.11934},
+}
+```
+#### Trimming blog post
+```
+@misc{hf_blogpost_trimming,
+      title={Introduction to Trimming},
+      author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
+      year={2026},
+      url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
+}
+```
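
The reduction percentages in the new "Model Statistics" table can be reproduced from the raw counts. The sketch below is not part of the commit; it recomputes both percentages and checks whether the drop in parameter count is fully explained by the removed vocabulary rows. The hidden size of 512 and the untied input and output embedding matrices are assumptions about mt5-small that the card does not state, and all variable names are illustrative.

```python
# Figures copied from the "Model Statistics" table in the updated README.
ORIGINAL_VOCAB = 250_112
TRIMMED_VOCAB = 32_768
ORIGINAL_PARAMS = 300_176_768
TRIMMED_PARAMS = 77_616_512

# Assumptions (not stated in the card): hidden size of mt5-small and
# untied input/output embeddings, so each removed token drops two rows.
D_MODEL = 512
EMBEDDING_MATRICES = 2


def reduction_percent(original: int, trimmed: int) -> float:
    """Relative size reduction, expressed as a percentage."""
    return 100.0 * (original - trimmed) / original


vocab_reduction = reduction_percent(ORIGINAL_VOCAB, TRIMMED_VOCAB)
param_reduction = reduction_percent(ORIGINAL_PARAMS, TRIMMED_PARAMS)

# Parameters removed according to the table vs. the embedding rows dropped.
removed_tokens = ORIGINAL_VOCAB - TRIMMED_VOCAB
removed_params = ORIGINAL_PARAMS - TRIMMED_PARAMS
expected_removed_params = removed_tokens * D_MODEL * EMBEDDING_MATRICES

print(f"vocabulary reduction: {vocab_reduction:.2f}%")
print(f"parameter reduction:  {param_reduction:.2f}%")
print(f"embeddings explain the removed parameters: {removed_params == expected_removed_params}")
```

Under these assumptions the removed parameters match the dropped embedding rows exactly, which is consistent with trimming touching only the vocabulary-dependent matrices and leaving the rest of the network unchanged.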