Commit 1663a2e by lbourdois · verified · 1 parent: 8536960

Update model card for Malayalam

  1. README.md +62 -24
README.md CHANGED
@@ -1,24 +1,62 @@
- ---
- language: mal
- license: apache-2.0
- tags: [trimmed, mt5, seq2seq]
- base_model: google/mt5-small
- datasets:
- - Lumberjackk/fineweb-2-trimming
- ---
-
- # mt5-small-mal-32768
-
- Version of [google/mt5-small](https://huggingface.co/google/mt5-small) with a reduced vocabulary for **Malayalam**.
-
- | | Original | Trimmed |
- |---|---|---|
- | Vocabulary | 250,100 | 32,768 |
- | Parameters | 300,176,768 | 77,616,512 |
-
- ## Usage
- ```python
- from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
- tokenizer = T5Tokenizer.from_pretrained("lbourdois/mt5-small-mal-32768")
- model = AutoModelForSeq2SeqLM.from_pretrained("lbourdois/mt5-small-mal-32768")
- ```
+ ---
+ pipeline_tag: fill-mask
+ language: mal
+ license: apache-2.0
+ tags:
+ - trimmed
+ library_name: transformers
+ base_model: google/mt5-small
+ base_model_relation: quantized
+ datasets:
+ - lbourdois/fineweb-2-trimming
+ ---
+
+ # mt5-small-mal-32768
+ This model is a **74.14% smaller** version of [google/mt5-small](https://huggingface.co/google/mt5-small), optimized for the **Malayalam** language by reducing the vocabulary size with the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.
+ This trimmed model should perform similarly to the original on Malayalam while keeping only 32,768 tokens and using much less memory. However, it may perform poorly on other languages, because tokens rarely used in Malayalam were removed from the vocabulary.
+
+ ## Model Statistics
+ | Metric | Original | Trimmed | Reduction |
+ |--------|----------|---------|-----------|
+ | **Vocabulary size** | 250,112 tokens | 32,768 tokens | **86.90%** |
+ | **Model size** | 300,176,768 params | 77,616,512 params | **74.14%** |
+
+ ![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/mt5-small-32768.png)
+
+ ## Mining Dataset Statistics
+ - **Number of texts used for mining**: 200,000 texts
+ - **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)
+
+ ## Usage
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ model_name = "alphaedge-ai/mt5-small-mal-32768"
+ # AutoModel loads the bare mT5 encoder-decoder, without a language-modeling head.
+ model = AutoModel.from_pretrained(model_name)
+ # The tokenizer ships with the trimmed 32,768-token vocabulary.
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ ```
+
+ ## Citations
+
+ #### mT5
+ ```
+ @misc{xue2021mt5massivelymultilingualpretrained,
+ title={mT5: A massively multilingual pre-trained text-to-text transformer},
+ author={Linting Xue and Noah Constant and Adam Roberts and Mihir Kale and Rami Al-Rfou and Aditya Siddhant and Aditya Barua and Colin Raffel},
+ year={2021},
+ eprint={2010.11934},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2010.11934},
+ }
+ ```
+
+ #### Trimming blog post
+ ```
+ @misc{hf_blogpost_trimming,
+ title={Introduction to Trimming},
+ author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
+ year={2026},
+ url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
+ }
+ ```
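
The reduction figures in the new Model Statistics table can be checked with a few lines of arithmetic. The sketch below recomputes both percentages from the counts in the table, then tests whether the parameter drop is explained by the smaller vocabulary alone. That second check rests on two assumptions that the README does not state: a hidden size of 512 for mT5-small, and separate input-embedding and output-projection matrices. The names `D_MODEL`, `VOCAB_MATRICES`, `reduction_pct`, and `vocab_params` are illustrative only.

```python
# Counts copied from the Model Statistics table in the new README.
ORIGINAL_VOCAB = 250_112
TRIMMED_VOCAB = 32_768
ORIGINAL_PARAMS = 300_176_768
TRIMMED_PARAMS = 77_616_512

# Assumptions (not stated in the README): hidden size of mT5-small, and
# untied input-embedding / output-projection matrices.
D_MODEL = 512
VOCAB_MATRICES = 2


def reduction_pct(original: int, trimmed: int) -> float:
    """Relative size reduction in percent, rounded to two decimals."""
    return round(100 * (1 - trimmed / original), 2)


def vocab_params(vocab_size: int) -> int:
    """Parameters that scale with vocabulary size under the assumptions above."""
    return vocab_size * D_MODEL * VOCAB_MATRICES


vocab_reduction = reduction_pct(ORIGINAL_VOCAB, TRIMMED_VOCAB)
param_reduction = reduction_pct(ORIGINAL_PARAMS, TRIMMED_PARAMS)

# Parameters left over once the vocabulary-dependent matrices are removed.
non_vocab_original = ORIGINAL_PARAMS - vocab_params(ORIGINAL_VOCAB)
non_vocab_trimmed = TRIMMED_PARAMS - vocab_params(TRIMMED_VOCAB)

print(f"Vocabulary reduction: {vocab_reduction:.2f}%")  # 86.90%
print(f"Parameter reduction:  {param_reduction:.2f}%")  # 74.14%
print(non_vocab_original == non_vocab_trimmed)           # True
```

Under these assumptions both models have the same 44,062,080 non-vocabulary parameters, which is consistent with trimming touching only the vocabulary-dependent matrices.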