---
license: mit
datasets:
- oscar
- mc4
language:
- am
library_name: transformers
---
# Amharic WordPiece Tokenizer
This repo contains a **WordPiece** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets. It is the same as the **BERT** tokenizer but trained from scratch on Amharic text, with a vocabulary size of `30522`.
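For reference, a WordPiece tokenizer of this kind can be trained from scratch with the `tokenizers` library. The sketch below is illustrative only: it uses a tiny toy corpus in place of the actual oscar/mc4 Amharic data, and the variable names are not from this repo.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy corpus standing in for the real oscar/mc4 Amharic subsets.
corpus = [
    "αα α΅α αα α£α΅",
    "α α²α΅ α α α£ α α²α΅",
]

# A WordPiece model with BERT-style special tokens.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=30522,  # same target vocabulary size as the released tokenizer
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Train directly from an in-memory iterator of strings.
tokenizer.train_from_iterator(corpus, trainer)
```

With real data, the resulting tokenizer can then be wrapped in a `transformers` fast tokenizer class and pushed to the Hub.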
# How to use
You can load the tokenizer from the Hugging Face Hub as follows.
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")
tokenizer.tokenize("α¨αααα αα αα» ααα΅ αα΅ααα΅ α΅α αα΅α αααΈαα α αα°α¨αα α΅αα α αα± α αα αα£αͺα« ααα αα»α α₯α α¨αααααα΅ αα³α ααα’")
```
Output:
```python
['α¨ααα', '##α αα', 'αα»', 'ααα΅', 'αα΅ααα΅', 'α΅α αα΅α', 'αααΈαα', 'α αα°α¨αα', 'α΅αα', 'α αα±', 'α αα', 'αα£αͺα«', 'ααα', 'αα»α', 'α₯α', 'α¨αααααα΅', 'αα³α', 'αα', 'α’']
```