Commit 124df00 by lbourdois (verified) · 1 Parent(s): 7f6e2d7

Update model card for Polish

Files changed (1)
  1. README.md +62 -24
README.md CHANGED
@@ -1,24 +1,62 @@
- ---
- language: pol
- license: apache-2.0
- tags: [trimmed, mt5, seq2seq]
- base_model: google/mt5-base
- datasets:
- - Lumberjackk/fineweb-2-trimming
- ---
-
- # mt5-base-pol-16384
-
- Version of [google/mt5-base](https://huggingface.co/google/mt5-base) with a reduced vocabulary for **Polish**.
-
- | | Original | Trimmed |
- |---|---|---|
- | Vocabulary | 250,100 | 16,384 |
- | Parameters | 582,401,280 | 223,395,072 |
-
- ## Usage
- ```python
- from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
- tokenizer = T5Tokenizer.from_pretrained("lbourdois/mt5-base-pol-16384")
- model = AutoModelForSeq2SeqLM.from_pretrained("lbourdois/mt5-base-pol-16384")
- ```
+ ---
+ pipeline_tag: fill-mask
+ language: pol
+ license: apache-2.0
+ tags:
+ - trimmed
+ library_name: transformers
+ base_model: google/mt5-base
+ base_model_relation: quantized
+ datasets:
+ - lbourdois/fineweb-2-trimming
+ ---
+
+ # mt5-base-pol-16384
+ This model is a **61.64% smaller** version of [google/mt5-base](https://huggingface.co/google/mt5-base), optimized for the **Polish** language by reducing the vocabulary size with the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.
+ With only 16,384 tokens, this trimmed model should perform similarly to the original model on Polish while using much less memory. However, it may not perform well for other languages, as tokens not commonly used in Polish were removed from the vocabulary.
+
+ ## Model Statistics
+ | Metric | Original | Trimmed | Reduction |
+ |--------|----------|---------|-----------|
+ | **Vocabulary size** | 250,112 tokens | 16,384 tokens | **93.45%** |
+ | **Model size** | 582,401,280 params | 223,395,072 params | **61.64%** |
+
+ ![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/mt5-base-16384.png)
+
+ ## Mining Dataset Statistics
+ - **Number of texts used for mining**: 200,000 texts
+ - **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)
+
+ ## Usage
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ model_name = "lbourdois/mt5-base-pol-16384"
+ model = AutoModel.from_pretrained(model_name)
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ ```
+
+ ## Citations
+
+ #### mT5
+ ```
+ @misc{xue2021mt5massivelymultilingualpretrained,
+ title={mT5: A massively multilingual pre-trained text-to-text transformer},
+ author={Linting Xue and Noah Constant and Adam Roberts and Mihir Kale and Rami Al-Rfou and Aditya Siddhant and Aditya Barua and Colin Raffel},
+ year={2021},
+ eprint={2010.11934},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2010.11934},
+ }
+ ```
+
+ #### Trimming blog post
+ ```
+ @misc{hf_blogpost_trimming,
+ title={Introduction to Trimming},
+ author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
+ year={2026},
+ url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
+ }
+ ```
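
The reduction percentages quoted in the updated card can be re-derived from raw counts. The sketch below is a sanity check, not part of the commit. It assumes the original parameter count is 582,401,280 (the figure from the previous version of the card) and that mt5-base has a hidden size of 768 with separate input-embedding and output-projection matrices. The helper name `reduction_pct` is hypothetical.

```python
def reduction_pct(original: int, trimmed: int) -> float:
    """Percentage by which `trimmed` is smaller than `original`."""
    return (1 - trimmed / original) * 100


# Counts taken from the model card (original params from the previous card version).
ORIG_VOCAB, TRIM_VOCAB = 250_112, 16_384
ORIG_PARAMS, TRIM_PARAMS = 582_401_280, 223_395_072

vocab_reduction = reduction_pct(ORIG_VOCAB, TRIM_VOCAB)
param_reduction = reduction_pct(ORIG_PARAMS, TRIM_PARAMS)

# Assumption: each removed token drops one row of width 768 from both the
# input embedding and the output projection (two untied matrices).
HIDDEN_SIZE = 768
removed_params_expected = (ORIG_VOCAB - TRIM_VOCAB) * HIDDEN_SIZE * 2
removed_params_actual = ORIG_PARAMS - TRIM_PARAMS

print(f"Vocabulary reduction: {vocab_reduction:.2f}%")
print(f"Parameter reduction:  {param_reduction:.2f}%")
print(f"Removed params match embedding rows: {removed_params_expected == removed_params_actual}")
```

Under these assumptions the two percentages come out at 93.45% and 61.64%, matching the card, and the parameter difference is fully accounted for by the removed embedding rows.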