How to use rasyosef/bert-amharic-tokenizer with Transformers:
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("rasyosef/bert-amharic-tokenizer", dtype="auto")
This repo contains a WordPiece tokenizer trained on the Amharic subsets of the oscar and mc4 datasets. It works like the BERT tokenizer but was trained from scratch on Amharic text, with a vocabulary size of 30522.
You can load the tokenizer from the Hugging Face Hub as follows.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")
tokenizer.tokenize("የዓለምአቀፉ ነጻ ንግድ መስፋፋት ድህነትን ለማሸነፍ በሚደረገው ትግል አንዱ ጠቃሚ መሣሪያ ሊሆን መቻሉ ብዙ የሚነገርለት ጉዳይ ነው።")
Output:
['የዓለም', '##አቀፉ', 'ነጻ', 'ንግድ', 'መስፋፋት', 'ድህነትን', 'ለማሸነፍ', 'በሚደረገው', 'ትግል', 'አንዱ', 'ጠቃሚ', 'መሣሪያ', 'ሊሆን', 'መቻሉ', 'ብዙ', 'የሚነገርለት', 'ጉዳይ', 'ነው', '።']
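The `##` prefix in the output marks a piece that continues the previous token instead of starting a new word. BERT-style WordPiece tokenizers find these pieces at inference time with a greedy longest-match-first lookup. The following is a minimal pure-Python sketch of that lookup, not this repo's actual implementation: `wordpiece_split` and `toy_vocab` are hypothetical names, the four-entry vocabulary is made up for illustration (the real one has 30522 entries), and the normalization and punctuation splitting a real tokenizer performs are skipped.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one whitespace-delimited word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not begin the word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No vocabulary entry fits here: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen so the first word splits in two.
toy_vocab = {"የዓለም", "##አቀፉ", "ነጻ", "ንግድ"}

tokens = [
    piece
    for word in "የዓለምአቀፉ ነጻ ንግድ".split()
    for piece in wordpiece_split(word, toy_vocab)
]
print(tokens)  # ['የዓለም', '##አቀፉ', 'ነጻ', 'ንግድ']
```

With this toy vocabulary the first word is split into an initial piece and a `##` continuation, while a word with no matching pieces falls back to `[UNK]`.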