---
language:
- am
- ti
license: mit
tags:
- tokenizer
- byte-pair-encoding
- bpe
- geez-script
- amharic
- tigrinya
- low-resource
- nlp
- morphology-aware
- horn-of-africa
datasets:
- HornMT
library_name: transformers
pipeline_tag: token-classification
widget:
- text: "!"
model-index:
- name: Geez BPE Tokenizer
results: []
---
# Geez Tokenizer (`Hailay/geez-tokenizer`)
A **BPE tokenizer** trained specifically for **Geez-script languages**, including **Tigrinya** and **Amharic**. It was trained on monolingual corpora and is designed for morphologically rich, low-resource languages.
## Motivation
Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently. This tokenizer aims to:
- Reduce over-segmentation errors
- Respect morpheme boundaries
- Improve language understanding for downstream tasks like Machine Translation and QA
## Training Details
- **Tokenizer Type**: BPE
- **Vocabulary Size**: 32,000
- **Pre-tokenizer**: Whitespace
- **Normalizer**: NFD → Lowercase → StripAccents
- **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Post-processing**: Template for `[CLS] $A [SEP]` and `[CLS] $A [SEP] $B [SEP]`
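As a rough sketch, a tokenizer with the configuration above could be assembled with the Hugging Face `tokenizers` library as follows. The corpus path is a placeholder, as the actual training script and corpus are not distributed with this repository:

```python
# Sketch: building a BPE tokenizer matching the configuration above.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers, processors

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Normalizer pipeline: NFD -> Lowercase -> StripAccents
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(["geez_corpus.txt"], trainer)  # placeholder corpus file

# Post-processing templates for single sequences and pairs.
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
```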
## Files
- `vocab.json`: Vocabulary file
- `merges.txt`: Merge rules for BPE
- `tokenizer.json`: Full tokenizer config
- `tokenizer_config.json`: Hugging Face-compatible configuration
- `special_tokens_map.json`: Maps for special tokens
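If the repository files are downloaded locally (for example via `huggingface_hub.snapshot_download`), `tokenizer.json` alone is enough to reconstruct the tokenizer with the low-level `tokenizers` API:

```python
# Optional: load the raw tokenizer straight from tokenizer.json.
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer

local_dir = snapshot_download("Hailay/geez-tokenizer")
tok = Tokenizer.from_file(f"{local_dir}/tokenizer.json")
print(tok.encode("ሰላም ዓለም").tokens)  # example Amharic input: "Hello, world"
```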
## Usage
```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")
text = "แจแแฅแ
แ แญแชแฆแแแตแถแฝ แ แณแฉแซ แแญแฎแแแต แแตแฅ แจแฐแแแแ แตแแแ แแแฅแญ แ แแแฐแแแข"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print("Tokens:", tokens)
print("Token IDs:", ids)
## Intended Use
This tokenizer is best suited for:
- Low-resource NLP pipelines
- Machine Translation
- Question Answering
- Named Entity Recognition
- Morphological analysis
## Limitations
- It is optimized for Geez-script languages and may not generalize to others.
- Some compound verbs and morphologically fused words may still require linguistic preprocessing.
- Coverage is currently limited to Amharic and Tigrinya; multilingual code-switching is not supported.
## Evaluation
The tokenizer was evaluated manually on:
- Token coverage of Tigrinya/Amharic corpora
- Morphological preservation
- Reduction of BPE segmentation errors

Quantitative metrics are to be published in an accompanying paper.
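As an illustration only (not the evaluation protocol used for this release), a simple probe for over-segmentation is the tokenizer's fertility, i.e. the average number of tokens per whitespace-delimited word:

```python
# Hypothetical over-segmentation probe: tokens per word ("fertility").
# Lower values suggest less over-segmentation on in-domain text.
def fertility(tokenizer, lines):
    words = sum(len(line.split()) for line in lines)
    tokens = sum(len(tokenizer.tokenize(line)) for line in lines)
    return tokens / max(words, 1)

sample = ["ሰላም ዓለም"]  # replace with held-out Amharic/Tigrinya text
print(f"Tokens per word: {fertility(tokenizer, sample):.2f}")
```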
## License
This tokenizer is licensed under the MIT License.
## Citation
```bibtex
@misc{hailay2025geez,
  title={Geʽez Script Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
  author={Teklehaymanot, Hailay},
  year={2025},
  howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
}
```