🇪🇹 EthioBBPE - Amharic Biblical Tokenizer
A production-ready byte-level BPE tokenizer trained specifically on Amharic biblical and religious texts, delivering exact reconstruction of complex Ge'ez script, ancient punctuation, and liturgical content.
✨ Features
- ✅ Exact Reconstruction: Lossless round-trip encoding and decoding on all test samples, including ancient Ge'ez punctuation
- ✅ Specialized Vocabulary: Trained on 61,769 lines of Amharic biblical texts (Synaxarium + Canon Bible)
- ✅ Compressed Storage: Gzip compression (level 9) reduces model size by 89.8% (1.3MB → 136KB)
- ✅ Production Ready: Checkpointing, metrics tracking, and comprehensive error handling
- ✅ Ge'ez Script Support: Full support for Ethiopic characters, numerals, and liturgical punctuation marks
🔗 Resources
- Source Code: GitHub Repository
- Issue Tracker: GitHub Issues
- PyPI Package: EthioBBPE on PyPI
📊 Training Data
| Dataset | Source | Texts | Description |
|---|---|---|---|
| Synaxarium | Nexuss0781/synaxarium | 366 | Daily synaxarium readings in Amharic |
| Canon Biblical | Nexuss0781/conon-biblical-am-en | 61,403 | Amharic-English biblical texts |
| Total | - | 61,769 | 15.43 MB combined corpus |
Training Configuration
{
  "vocab_size": 16000,
  "min_frequency": 2,
  "special_tokens": ["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
  "lowercase": false,
  "compression": "gzip (level 9)",
  "checkpointing": true
}
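For reference, here is a minimal training sketch with the Hugging Face tokenizers library that mirrors this configuration. The corpus file paths are hypothetical; the project's actual entry point is scripts/train_tokenizer.py (see Advanced Features below).
from tokenizers.implementations import ByteLevelBPETokenizer
# Byte-level BPE, no lowercasing, no prefix space, matching the config above
tokenizer = ByteLevelBPETokenizer(add_prefix_space=False, lowercase=False)
tokenizer.train(
    files=["data/synaxarium.txt", "data/canon_bible.txt"],  # hypothetical paths
    vocab_size=16000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
)
tokenizer.save("tokenizer.json")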
🎯 Performance Metrics
| Metric | Result |
|---|---|
| Round-trip Reconstruction | ✅ Exact on all test cases |
| Ge'ez Punctuation | ✅ Accurate (1 token for ፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠) |
| Synaxarium Text | ✅ Accurate (66 tokens) |
| Biblical Text | ✅ Accurate (82 tokens) |
| Compression Ratio | 89.8% (1.3MB → 136KB) |
| Training Time | ~17 seconds |
🚀 Quick Start
Installation
pip install tokenizers huggingface_hub
Load from Hugging Face Hub
from tokenizers import Tokenizer
from huggingface_hub import hf_hub_download
# Download and load tokenizer
tokenizer_path = hf_hub_download("Nexuss0781/Ethio-BBPE", "tokenizer.json")
tokenizer = Tokenizer.from_file(tokenizer_path)
# Encode Amharic text
text = "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"
encoded = tokenizer.encode(text)
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
print(f"Decoded: {tokenizer.decode(encoded.ids)}")
Direct File Loading
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")
# Test with ancient Ge'ez punctuation
text = "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"
encoded = tokenizer.encode(text)
print(f"Encoded {len(text)} chars into {len(encoded.ids)} token(s)")
# The full run of ፠ marks encodes as a single token
Using Compressed Vocabulary
import gzip
import json
# Load compressed vocabulary
with gzip.open('models/EthioBBPE/vocab.json.gz', 'rt', encoding='utf-8') as f:
    vocab = json.load(f)
print(f"Vocabulary size: {len(vocab)}")
print(f"Storage saved: ~89.8%")
📝 Example Usage
Encoding Biblical Text
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")
# Synaxarium text
synaxarium = """ሰላም ለኢዮብ ዘኢነበበ ከንቶ ። አመ አኀዞ አበቅ ወአመ አህጎለ ጥሪቶ ።"""
encoded = tokenizer.encode(synaxarium)
print(f"Original: {synaxarium}")
print(f"Tokens: {encoded.tokens}")
print(f"Token count: {len(encoded.ids)}")
print(f"Reconstructed: {tokenizer.decode(encoded.ids)}")
print(f"Perfect match: {synaxarium == tokenizer.decode(encoded.ids)}")
Batch Processing
texts = [
    "በመዠመሪያ፡እግዚአብሔር፡ሰማይንና፡ምድርን፡ፈጠረ።",
    "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ",
    "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"
]
encodings = tokenizer.encode_batch(texts)
for i, enc in enumerate(encodings):
    print(f"Text {i+1}: {len(enc.ids)} tokens")
📁 Model Files
| File | Size | Description |
|---|---|---|
| tokenizer.json | 1.3 MB | Standard tokenizer format |
| vocab.json.gz | 136 KB | Compressed vocabulary (89.8% smaller) |
| config.json | 431 B | Training configuration |
| training_metrics.json | 1.2 KB | Comprehensive training metrics |
| README.md | - | This documentation |
🔬 Technical Details
Architecture
- Type: Byte-level BPE (BBPE), operating on raw UTF-8 bytes (see the illustration after this list)
- Vocabulary Size: 16,000 tokens
- Special Tokens: <pad>, <unk>, <s>, </s>, <mask>
- Minimum Frequency: 2 occurrences
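Because the model works on raw UTF-8 bytes, every Ethiopic character (a 3-byte UTF-8 sequence) is representable from the 256-byte base alphabet even if it never appeared in training; a quick illustration:
# Each Ethiopic character is a 3-byte UTF-8 sequence,
# so byte-level BPE can always fall back to bytes instead of <unk>
print("ሰ".encode("utf-8"))           # b'\xe1\x88\xb0'
print(len("ሰላም".encode("utf-8")))   # 9 bytes for 3 characters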
Preprocessing
- No lowercasing (Ethiopic script has no letter case; embedded Latin-script text is preserved as written)
- No prefix space (optimal for Amharic morphology)
- Unicode normalization enabled
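The configuration records only that normalization is enabled, not which form; assuming NFC, attaching it through the tokenizers API would look like:
from tokenizers import normalizers
# Assumption: NFC; the model card only states that normalization is enabled
tokenizer.normalizer = normalizers.NFC()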
Compression
- Algorithm: Gzip (level 9)
- Original Size: 1.3 MB
- Compressed Size: 136 KB
- Space Saved: 89.8%
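A minimal sketch of producing the compressed file, assuming the uncompressed vocabulary sits at models/EthioBBPE/vocab.json:
import gzip
import shutil
# Recompress vocab.json with gzip at maximum compression (level 9)
with open("models/EthioBBPE/vocab.json", "rb") as src:
    with gzip.open("models/EthioBBPE/vocab.json.gz", "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst)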
🧪 Testing & Validation
All test cases round-trip exactly:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")
test_cases = [
    ("Ge'ez Punctuation", "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"),
    ("Synaxarium", "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"),
    ("Biblical", "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ")
]
for name, text in test_cases:
    encoded = tokenizer.encode(text)
    decoded = tokenizer.decode(encoded.ids)
    assert text == decoded, f"{name} failed!"
    print(f"✅ {name}: Exact ({len(encoded.ids)} tokens)")
📚 Datasets
This tokenizer was trained on two specialized Amharic biblical datasets:
- Synaxarium Dataset: Daily readings from the Ethiopian Orthodox Synaxarium containing lives of saints and biblical narratives
- Canon Biblical Dataset: Comprehensive Amharic-English parallel biblical texts
Both datasets are available on Hugging Face under the Nexuss0781 organization.
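Both can be pulled directly with the datasets library; a sketch assuming each dataset's default configuration (splits and columns follow the respective dataset cards):
from datasets import load_dataset
# Dataset IDs as listed in the Acknowledgments
synaxarium = load_dataset("Nexuss0781/synaxarium")
canon = load_dataset("Nexuss0781/conon-biblical-am-en")
print(synaxarium)
print(canon)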
🛠️ Advanced Features
Checkpointing
Automatic checkpointing during training allows resumption from interruptions:
python scripts/train_tokenizer.py --data_dir ./data --use_checkpoint
Custom Vocabulary Size
python scripts/train_tokenizer.py --data_dir ./data --vocab_size 32000
Alternative Compression
python scripts/train_tokenizer.py --data_dir ./data --save_compressed
# Supports: gzip, bz2, lzma
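Whichever codec is chosen, the compressed vocabulary can be read back with the standard library; a sketch that dispatches on the file extension (the .bz2 and .xz filenames are assumptions):
import bz2
import gzip
import json
import lzma
# Map extensions to the matching stdlib opener
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}
path = "models/EthioBBPE/vocab.json.gz"  # or .bz2 / .xz (assumed naming)
with OPENERS[path[path.rfind("."):]](path, "rt", encoding="utf-8") as f:
    vocab = json.load(f)
print(f"Loaded {len(vocab)} vocabulary entries")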
📄 License
Apache License 2.0 - See LICENSE for details.
🙏 Acknowledgments
- Datasets: Nexuss0781/synaxarium and Nexuss0781/conon-biblical-am-en
- Library: Hugging Face Tokenizers
- Script: Ethiopic (Ge'ez) Unicode block U+1200–U+137F
📬 Contact & Support
- GitHub: nexuss0781/Ethio_BBPE
- Hugging Face: Nexuss0781/Ethio-BBPE
- PyPI: EthioBBPE Package
- Issues: Please open an issue on GitHub for bugs or feature requests
Made with ❤️ for the Amharic NLP Community
Last Updated: May 2026