Update README.md
Browse files
README.md
CHANGED
|
@@ -22,8 +22,8 @@ A T5 Tokenizer trained for Amharic language.
|
|
| 22 |
<!-- Provide a longer summary of what this model is. -->
|
| 23 |
|
| 24 |
An MT5Tokenizer-based Amharic and English tokenizer trained using [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Wura](https://huggingface.co/datasets/castorini/wura) datasets.
|
| 25 |
-
|
| 26 |
-
|
| 27 |
|
| 28 |
### MT5 Tokenizer vs AmhT5 Tokenizer
|
| 29 |
|
|
@@ -47,7 +47,7 @@ print(tokens)
|
|
| 47 |
# ['▁A', '▁', 'Token', 'izer', '▁train', 'ed', '▁for', '▁Am', 'haric', '▁language', '.']
|
| 48 |
|
| 49 |
|
| 50 |
-
amhT5 = "yonas/
|
| 51 |
TOKENIZER = MT5TokenizerFast.from_pretrained(amhT5, legacy=False)
|
| 52 |
tokens = TOKENIZER.tokenize("ከመዲናዋ በቅርብ ርቀት ላይ በምትገኘው ከተማ")
|
| 53 |
|
|
@@ -60,4 +60,5 @@ tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")
|
|
| 60 |
|
| 61 |
print(len(tokens)) # 7
|
| 62 |
print(tokens)
|
| 63 |
-
# ['▁A', '▁Token', 'izer', '▁trained', '▁for', '▁Amharic', '▁language.']
|
|
|
|
|
|
| 22 |
<!-- Provide a longer summary of what this model is. -->
|
| 23 |
|
| 24 |
An MT5Tokenizer-based Amharic and English tokenizer trained using [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Wura](https://huggingface.co/datasets/castorini/wura) datasets.
|
| 25 |
+
This tokenizer aims to represent Amharic more effectively while retaining strong coverage of English.
|
| 26 |
+
To balance the dataset, only 3 million document samples were used for training. The vocabulary size of this tokenizer is the same as `google/mt5-small`.
|
| 27 |
|
| 28 |
### MT5 Tokenizer vs AmhT5 Tokenizer
|
| 29 |
|
|
|
|
| 47 |
# ['▁A', '▁', 'Token', 'izer', '▁train', 'ed', '▁for', '▁Am', 'haric', '▁language', '.']
|
| 48 |
|
| 49 |
|
| 50 |
+
amhT5 = "yonas/AmhT5-tokenizer"
|
| 51 |
TOKENIZER = MT5TokenizerFast.from_pretrained(amhT5, legacy=False)
|
| 52 |
tokens = TOKENIZER.tokenize("ከመዲናዋ በቅርብ ርቀት ላይ በምትገኘው ከተማ")
|
| 53 |
|
|
|
|
| 60 |
|
| 61 |
print(len(tokens)) # 7
|
| 62 |
print(tokens)
|
| 63 |
+
# ['▁A', '▁Token', 'izer', '▁trained', '▁for', '▁Amharic', '▁language.']
|
| 64 |
+
```
|