🚀 Geez-BBPE Tokenizer (NaolBM/Geez-BBPE)
A highly efficient Byte-Level BPE tokenizer trained specifically for Geez-script languages (Amharic and Tigrinya). With a 25,000-token vocabulary, it achieves 3.67 chars/token on Amharic text, significantly outperforming generic tokenizers.
📊 Benchmark Comparison
| Model | Tokens | Chars/Token |
|---|---|---|
| Geez-BBPE (Ours) | 9 | 4.00 |
| Qwen3 | 47 | 0.77 |
| Llama-3.2 | 87 | 0.41 |
| gemma-3 | 17 | 2.12 |
| gpt-oss | 65 | 0.55 |
Sample: "የኢትዮጵያ ታሪክ በጣም ረጅም እና ባለብዙ ደረጃዎች ነው።" (36 characters)
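The ratios follow directly from the sample's length. A quick sanity check in pure Python, with the token counts copied from the table above:

```python
# Recompute chars/token for the benchmark sample (token counts from the table above).
sample = "የኢትዮጵያ ታሪክ በጣም ረጅም እና ባለብዙ ደረጃዎች ነው።"
token_counts = {"Geez-BBPE": 9, "Qwen3": 47, "Llama-3.2": 87, "gemma-3": 17, "gpt-oss": 65}

print(f"Sample length: {len(sample)} characters")
for model, n_tokens in token_counts.items():
    print(f"{model}: {len(sample) / n_tokens:.2f} chars/token")
```

(The 3.67 figure in the introduction is the corpus-level average; this single sentence happens to come out at 4.00.)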
🧠 Why Geez-BBPE?
Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently, breaking words into meaningless byte sequences.
Geez-BBPE solves this by:
- ✅ Preserving semantic meaning - common words are single tokens
- ✅ Reducing sequence length - fit more text in the same context window
- ✅ Faster inference - fewer tokens = faster generation
- ✅ Better learning - model sees whole words, not fragments
- ✅ Native support - trained on 10M+ authentic Amharic sentences
📚 Training Details
- Tokenizer Type: Byte-Level BPE
- Vocabulary Size: 25,000 (optimized for Geez script)
- Training Data: 10M+ sentences from b1n1yam/amharic-combined-corpus
- Pre-tokenizer: Byte-Level with space preservation
- Special Tokens: `<|startoftext|>`, `<|im_end|>`, `<|pad|>`
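As a quick illustration of the special tokens, the snippet below frames a sample with them. The exact template used in training is not specified here, so this framing is only an assumed example; check `tokenizer_config.json` for the authoritative setup.

```python
# Illustrative only: frame a sample with the tokenizer's special tokens.
# The actual training template is an assumption - check tokenizer_config.json.
START, END = "<|startoftext|>", "<|im_end|>"

text = "ሰላም ለዓለም"  # "Hello, world"
framed = f"{START}{text}{END}"
print(framed)
```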
📁 Files
- tokenizer.json: Full tokenizer config
- tokenizer_config.json: Hugging Face-compatible configuration
- special_tokens_map.json: Maps for special tokens
🚀 Usage
Standalone Usage
```python
from transformers import AutoTokenizer

# Load the Geez-BBPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("NaolBM/Geez-BBPE")

# Tokenize Amharic text efficiently
text = "ኢትዮጵያ ዋና ከተማዋ አዲስ አበባ ናት።"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(f"Tokens ({len(tokens)}): {tokens}")
print(f"Token IDs: {ids}")
```
Extending Other Models with Geez-BBPE
If you want to extend another model's tokenizer (e.g., Qwen or Llama) with Geez-BBPE:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizers
geez = AutoTokenizer.from_pretrained("NaolBM/Geez-BBPE")
target = AutoTokenizer.from_pretrained("your-target-model")

# Collect Geez token IDs whose decoded text is not already a special token
new_token_ids = [
    token_id for token_id in range(geez.vocab_size)
    if geez.decode([token_id]) not in target.all_special_tokens
]

# 🔑 CRITICAL: Decode Geez tokens to readable Amharic before adding!
# decode() converts 'ĠáĬł...' back to " አበባ" or "አዲስ"
tokens_to_add = [geez.decode([token_id]) for token_id in new_token_ids]

# Remove duplicates
decoded_tokens = list(set(tokens_to_add))
print(f"Adding {len(decoded_tokens)} Amharic tokens")

# Add to target tokenizer
target.add_tokens(decoded_tokens)

# Resize the model's embedding matrix to match the enlarged vocabulary
model = AutoModelForCausalLM.from_pretrained("your-target-model")
model.resize_token_embeddings(len(target))
```
❌ Common Mistake to Avoid
```python
# WRONG: Adding raw byte-level tokens (they won't match input text!)
target.add_tokens(list(geez.get_vocab().keys()))
# This adds strings like 'áĬłáĭ²áĪµ' - the tokenizer will NEVER find these in text!

# RIGHT: Decode first, then add
decoded = geez.decode([token_id])  # Converts 'áĬłáĭ²áĪµ' → 'አዲስ'
target.add_tokens([decoded])       # Now matches actual input text!
```
Why this works: Geez-BBPE stores tokens as byte-level strings internally, but the tokenizer looks for actual Amharic text in the input. `decode()` bridges this gap by converting byte representations back to readable characters.
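To see exactly why the decode step matters, here is a self-contained sketch of the GPT-2-style byte-to-unicode table that byte-level BPE tokenizers use internally (reimplemented here purely for illustration; Hugging Face tokenizers ship their own copy of this mapping):

```python
def bytes_to_unicode():
    # GPT-2 style byte -> unicode table: printable bytes map to themselves,
    # the remaining bytes map to code points >= 256 so every byte is visible.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

table = bytes_to_unicode()

# The UTF-8 bytes of an Amharic word become the odd-looking internal form:
word = "አዲስ"
stored = "".join(table[b] for b in word.encode("utf-8"))
print(stored)    # the byte-level string a raw vocab entry contains

# Inverting the table (what decode() effectively does) recovers readable text:
inv = {v: k for k, v in table.items()}
restored = bytes(inv[c] for c in stored).decode("utf-8")
print(restored)  # back to readable Amharic
```

This is why adding raw vocab keys fails: the stored form never appears in real input text, while the decoded form does.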
📊 Intended Use
This tokenizer is best suited for:
- Low-resource NLP pipelines
- Machine Translation
- Question Answering
- Named Entity Recognition
- Morphological analysis
- Continued Pre-Training (CPT) for Amharic/Tigrinya
⚠️ Limitations
- Optimized for Geez-script languages (Amharic, Tigrinya) - may not generalize to others
- Some compound words may still benefit from linguistic preprocessing
- Currently focused on Amharic/Tigrinya; doesn't support multilingual code-switching
✅ Evaluation
The tokenizer was evaluated on:
- Token coverage of Amharic/Tigrinya corpora
- Morphological preservation
- Character-to-token ratio (target: >3.0 chars/token)
- Real-world downstream task performance
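The character-to-token ratio used above is simple to compute yourself; a minimal sketch, where the `tokenize` callable stands in for any tokenizer's `tokenize` method (`str.split` is used only so the example runs standalone):

```python
# Corpus-level chars/token: total characters divided by total tokens.
def chars_per_token(texts, tokenize):
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_chars / total_tokens

# With a real tokenizer you would pass e.g. tokenizer.tokenize instead.
ratio = chars_per_token(["ሰላም ለዓለም", "አዲስ አበባ"], str.split)
print(f"{ratio:.2f} chars/token")
```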
📜 License
This tokenizer is licensed under the MIT License.
📌 Citation
```bibtex
@misc{naol2025geezbbpe,
  title={Geez-BBPE: An Efficient Byte-Level BPE Tokenizer for Amharic and Tigrinya},
  author={Naol},
  year={2026},
  howpublished={\url{https://huggingface.co/NaolBM/Geez-BBPE}},
}
```
🙏 Acknowledgments
- Built with 🤗 Hugging Face Transformers
- Trained on b1n1yam/amharic-combined-corpus
- Optimized for efficient Amharic NLP
- Inspired by the Ethiopian AI community's need for better language tools
⭐ If you find this useful, please star the repo! ⭐
Made with ❤️ for Ethiopian AI