---
language:
- am
- ti
license: mit
tags:
- tokenizer
- byte-pair-encoding
- bpe
- geez-script
- amharic
- tigrinya
- low-resource
- nlp
- morphology-aware
- horn-of-africa
datasets:
- HornMT
library_name: transformers
pipeline_tag: token-classification
widget:
- text: "!"
model-index:
- name: Geez BPE Tokenizer
results: []
---
# Geez Tokenizer (`Hailay/geez-tokenizer`)
A **BPE tokenizer** trained specifically for **Geez-script languages**, including **Tigrinya** and **Amharic**. It was trained on monolingual corpora and is designed for morphologically rich, low-resource languages.
## Motivation
Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently. This tokenizer aims to:
- Reduce over-segmentation errors
- Respect morpheme boundaries
- Improve language understanding for downstream tasks like Machine Translation and QA
## Training Details
- **Tokenizer Type**: BPE
- **Vocabulary Size**: 32,000
- **Pre-tokenizer**: Whitespace
- **Normalizer**: NFD → Lowercase → StripAccents
- **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Post-processing**: Template for `[CLS] $A [SEP]` and `[CLS] $A [SEP] $B [SEP]`
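As a rough sketch, a tokenizer with the configuration above could be assembled with the Hugging Face `tokenizers` library as follows. The corpus path is a placeholder, as the actual training script and corpus are not distributed with this repository:

```python
# Sketch: building a BPE tokenizer matching the configuration above.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers, processors

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Normalizer pipeline: NFD -> Lowercase -> StripAccents
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(["geez_corpus.txt"], trainer)  # placeholder corpus file

# Post-processing templates for single sequences and pairs.
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
```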
## Files
- `vocab.json`: Vocabulary file
- `merges.txt`: Merge rules for BPE
- `tokenizer.json`: Full tokenizer config
- `tokenizer_config.json`: Hugging Face-compatible configuration
- `special_tokens_map.json`: Maps for special tokens
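If the repository files are downloaded locally (for example via `huggingface_hub.snapshot_download`), `tokenizer.json` alone is enough to reconstruct the tokenizer with the low-level `tokenizers` API:

```python
# Optional: load the raw tokenizer straight from tokenizer.json.
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer

local_dir = snapshot_download("Hailay/geez-tokenizer")
tok = Tokenizer.from_file(f"{local_dir}/tokenizer.json")
print(tok.encode("ሰላም ዓለም").tokens)  # example Amharic input: "Hello, world"
```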
## Usage
```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")
text = "แจแแฅแ
แ แญแชแฆแแแตแถแฝ แ แณแฉแซ แแญแฎแแแต แแตแฅ แจแฐแแแแ แตแแแ แแแฅแญ แ แแแฐแแแข"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print("Tokens:", tokens)
print("Token IDs:", ids)
## Intended Use
This tokenizer is best suited for:
- Low-resource NLP pipelines
- Machine Translation
- Question Answering
- Named Entity Recognition
- Morphological analysis
## Limitations
- It is optimized for Geez-script languages and may not generalize to others.
- Some compound verbs and morphologically fused words may still require linguistic preprocessing.
- Coverage is currently limited to Amharic and Tigrinya; multilingual code-switching is not supported.
## Evaluation
The tokenizer was evaluated manually on:
- Token coverage of Tigrinya/Amharic corpora
- Morphological preservation
- Reduction of BPE segmentation errors

Quantitative metrics are to be published in an accompanying paper.
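As an illustration only (not the evaluation protocol used for this release), a simple probe for over-segmentation is the tokenizer's fertility, i.e. the average number of tokens per whitespace-delimited word:

```python
# Hypothetical over-segmentation probe: tokens per word ("fertility").
# Lower values suggest less over-segmentation on in-domain text.
def fertility(tokenizer, lines):
    words = sum(len(line.split()) for line in lines)
    tokens = sum(len(tokenizer.tokenize(line)) for line in lines)
    return tokens / max(words, 1)

sample = ["ሰላም ዓለም"]  # replace with held-out Amharic/Tigrinya text
print(f"Tokens per word: {fertility(tokenizer, sample):.2f}")
```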
## License
This tokenizer is licensed under the MIT License.
## Citation
```bibtex
@misc{hailay2025geez,
  title={Geʽez Script Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
  author={Teklehaymanot, Hailay},
  year={2025},
  howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
}
```