alphaedge-ai/mt5-base-pol-16384

text2text-generation


lbourdois committed 19 days ago

Commit 124df00 · verified · 1 Parent(s): 7f6e2d7

Update model card for Polish

Files changed (1)

README.md +62 -24


@@ -1,24 +1,62 @@
----
-language: pol
-license: apache-2.0
-tags: [trimmed, mt5, seq2seq]
-base_model: google/mt5-base
-datasets:
-  - Lumberjackk/fineweb-2-trimming
----
-# mt5-base-pol-16384
-Version of [google/mt5-base](https://huggingface.co/google/mt5-base) with a reduced vocabulary for **Polish**.
-| | Original | Trimmed |
-|---|---|---|
-| Vocabulary | 250,100 | 16,384 |
-| Parameters | 582,401,280 | 223,395,072 |
-## Usage
-```python
-from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
-tokenizer = T5Tokenizer.from_pretrained("lbourdois/mt5-base-pol-16384")
-model     = AutoModelForSeq2SeqLM.from_pretrained("lbourdois/mt5-base-pol-16384")
-```

+---
+pipeline_tag: text2text-generation
+language: pol
+license: apache-2.0
+tags:
+  - trimmed
+library_name: transformers
+base_model: google/mt5-base
+base_model_relation: quantized
+datasets:
+  - lbourdois/fineweb-2-trimming
+---
+# mt5-base-pol-16384
+This model is a **61.64% smaller** version of [google/mt5-base](https://huggingface.co/google/mt5-base), optimized for **Polish** by reducing the vocabulary size with the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.
+With only 16,384 tokens, this trimmed model should perform similarly to the original model on Polish while using much less memory. It may perform poorly on other languages, however, because tokens rarely used in Polish were removed from the vocabulary.
+## Model Statistics
+| Metric | Original | Trimmed | Reduction |
+|--------|----------|---------|-----------|
+| **Vocabulary size** | 250,112 tokens | 16,384 tokens | **93.45%** |
+| **Model size** | 582,401,280 params | 223,395,072 params | **61.64%** |
+![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/mt5-base-16384.png)
+## Mining Dataset Statistics
+- **Number of texts used for mining**: 200,000 texts
+- **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)
+## Usage
+```python
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+model_name = "alphaedge-ai/mt5-base-pol-16384"
+# Load with the seq2seq LM head so the model can generate text
+model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+```
+## Citations
+#### mT5
+```
+@misc{xue2021mt5massivelymultilingualpretrained,
+      title={mT5: A massively multilingual pre-trained text-to-text transformer},
+      author={Linting Xue and Noah Constant and Adam Roberts and Mihir Kale and Rami Al-Rfou and Aditya Siddhant and Aditya Barua and Colin Raffel},
+      year={2021},
+      eprint={2010.11934},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2010.11934},
+}
+```
+#### Trimming blog post
+```
+@misc{hf_blogpost_trimming,
+      title={Introduction to Trimming},
+      author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
+      year={2026},
+      url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
+}
+```
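
The reduction figures in the Model Statistics table can be sanity-checked with a few lines of arithmetic. The sketch below uses the vocabulary sizes from the new table, the trimmed parameter count, and the original parameter count of 582,401,280 given in the previous version of the card. The decomposition of the saved parameters is an assumption that the card does not state: it supposes that mT5-base has a hidden size of 768 and keeps separate input-embedding and output-projection matrices, so each removed token drops two rows of 768 weights.

```python
# Figures taken from the model card tables in the diff above.
ORIGINAL_VOCAB = 250_112
TRIMMED_VOCAB = 16_384
ORIGINAL_PARAMS = 582_401_280  # from the previous version of the card
TRIMMED_PARAMS = 223_395_072

# Assumptions (not stated in the card): hidden size of mT5-base and the
# number of vocabulary-sized matrices (input embeddings + output projection).
D_MODEL = 768
VOCAB_MATRICES = 2


def reduction_pct(original: int, trimmed: int) -> float:
    """Return the relative size reduction as a percentage, rounded to 2 decimals."""
    return round(100 * (1 - trimmed / original), 2)


def params_removed(original_vocab: int, trimmed_vocab: int,
                   d_model: int = D_MODEL, n_matrices: int = VOCAB_MATRICES) -> int:
    """Parameters dropped when rows for removed tokens are deleted from each matrix."""
    return (original_vocab - trimmed_vocab) * d_model * n_matrices


vocab_reduction = reduction_pct(ORIGINAL_VOCAB, TRIMMED_VOCAB)
param_reduction = reduction_pct(ORIGINAL_PARAMS, TRIMMED_PARAMS)
removed = params_removed(ORIGINAL_VOCAB, TRIMMED_VOCAB)

print(f"Vocabulary reduction: {vocab_reduction}%")
print(f"Parameter reduction:  {param_reduction}%")
print(f"Parameters removed:   {removed:,}")
print(f"Matches trimmed size: {ORIGINAL_PARAMS - removed == TRIMMED_PARAMS}")
```

Under these assumptions the whole size difference is accounted for by the vocabulary-sized matrices, which is what one would expect from a method that only removes tokens and leaves the transformer layers untouched.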