🇪🇹 EthioBBPE - Amharic Biblical Tokenizer

A production-ready Byte-level BPE tokenizer specifically trained on Amharic biblical and religious texts, achieving accurate reconstruction of complex Ge'ez script, ancient punctuation, and liturgical content.

✨ Features

  • Accurate Reconstruction: Exact round-trip encoding and decoding on all test samples, including ancient Ge'ez punctuation
  • Specialized Vocabulary: Trained on 61,769 lines of Amharic biblical texts (Synaxarium + Canon Bible)
  • Compressed Storage: Gzip compression (level 9) reduces model size by 89.8% (1.3MB → 136KB)
  • Production Ready: Checkpointing, metrics tracking, and comprehensive error handling
  • Ge'ez Script Support: Full support for Ethiopic characters, numerals, and liturgical punctuation marks

📊 Training Data

| Dataset | Source | Texts | Description |
|---|---|---|---|
| Synaxarium | Nexuss0781/synaxarium | 366 | Daily synaxarium readings in Amharic |
| Canon Biblical | Nexuss0781/conon-biblical-am-en | 61,403 | Amharic-English biblical texts |
| Total | - | 61,769 | 15.43 MB combined corpus |

Training Configuration

{
  "vocab_size": 16000,
  "min_frequency": 2,
  "special_tokens": ["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
  "lowercase": false,
  "compression": "gzip (level 9)",
  "checkpointing": true
}
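For reference, here is a minimal sketch of how this configuration maps onto the tokenizers training API. The data file paths are placeholders, and the checkpointing and compression steps are handled by the training script rather than shown here:

# Sketch only: maps the config above onto the Hugging Face tokenizers API.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(lowercase=False)
tokenizer.train(
    files=["data/synaxarium.txt", "data/canon_biblical.txt"],  # hypothetical paths
    vocab_size=16000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
)
tokenizer.save("models/EthioBBPE/tokenizer.json")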

🎯 Performance Metrics

| Metric | Result |
|---|---|
| Round-trip Reconstruction | ✅ Exact on all test samples |
| Ge'ez Punctuation | ✅ Exact (1 token for ፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠) |
| Synaxarium Text | ✅ Exact (66 tokens) |
| Biblical Text | ✅ Exact (82 tokens) |
| Compression Ratio | 89.8% (1.3 MB → 136 KB) |
| Training Time | ~17 seconds |

🚀 Quick Start

Installation

pip install tokenizers huggingface_hub

Load from Hugging Face Hub

from tokenizers import Tokenizer
from huggingface_hub import hf_hub_download

# Download and load tokenizer
tokenizer_path = hf_hub_download("Nexuss0781/Ethio-BBPE", "tokenizer.json")
tokenizer = Tokenizer.from_file(tokenizer_path)

# Encode Amharic text
text = "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"
encoded = tokenizer.encode(text)

print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
print(f"Decoded: {tokenizer.decode(encoded.ids)}")

Direct File Loading

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")

# Test with ancient Ge'ez punctuation
text = "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"
encoded = tokenizer.encode(text)
print(f"Encoded {len(text)} chars into {len(encoded.ids)} token(s)")
# Output: Encoded 18 chars into 1 token(s)

Using Compressed Vocabulary

import gzip
import json

# Load compressed vocabulary
with gzip.open('models/EthioBBPE/vocab.json.gz', 'rt', encoding='utf-8') as f:
    vocab = json.load(f)

print(f"Vocabulary size: {len(vocab)}")
print(f"Storage saved: ~89.8%")

📝 Example Usage

Encoding Biblical Text

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")

# Synaxarium text
synaxarium = """ሰላም ለኢዮብ ዘኢነበበ ከንቶ ። አመ አኀዞ አበቅ ወአመ አህጎለ ጥሪቶ ።"""
encoded = tokenizer.encode(synaxarium)

print(f"Original: {synaxarium}")
print(f"Tokens: {encoded.tokens}")
print(f"Token count: {len(encoded.ids)}")
print(f"Reconstructed: {tokenizer.decode(encoded.ids)}")
print(f"Perfect match: {synaxarium == tokenizer.decode(encoded.ids)}")

Batch Processing

texts = [
    "በመዠመሪያ፡እግዚአብሔር፡ሰማይንና፡ምድርን፡ፈጠረ።",
    "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ",
    "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"
]

encodings = tokenizer.encode_batch(texts)
for i, enc in enumerate(encodings):
    print(f"Text {i+1}: {len(enc.ids)} tokens")

📁 Model Files

| File | Size | Description |
|---|---|---|
| tokenizer.json | 1.3 MB | Standard tokenizer format |
| vocab.json.gz | 136 KB | Compressed vocabulary (89.8% smaller) |
| config.json | 431 B | Training configuration |
| training_metrics.json | 1.2 KB | Comprehensive training metrics |
| README.md | - | This documentation |

🔬 Technical Details

Architecture

  • Type: Byte-level BPE (BBPE)
  • Vocabulary Size: 16,000 tokens
  • Special Tokens: <pad>, <unk>, <s>, </s>, <mask>
  • Minimum Frequency: 2 occurrences
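
A quick way to confirm the special tokens are registered is to look up their ids:

# Print the id assigned to each special token.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")
for token in ["<pad>", "<unk>", "<s>", "</s>", "<mask>"]:
    print(f"{token} -> {tokenizer.token_to_id(token)}")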

Preprocessing

  • No lowercasing (Ethiopic script is unicase, so there is no case to fold; disabling lowercasing also leaves embedded Latin text untouched)
  • No prefix space (the byte-level pre-tokenizer does not prepend a space marker, which suits Amharic word segmentation)
  • Unicode normalization enabled
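
These choices correspond roughly to the following pipeline sketch. NFC is an assumption here; the configuration only records that normalization is enabled:

from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers

tok = Tokenizer(models.BPE())
tok.normalizer = normalizers.NFC()  # Unicode normalization (NFC assumed)
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)  # no prefix space
tok.decoder = decoders.ByteLevel()  # byte-level round-trip decoding
# No Lowercase normalizer is attached, matching the no-lowercasing choice.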

Compression

  • Algorithm: Gzip (level 9)
  • Original Size: 1.3 MB
  • Compressed Size: 136 KB
  • Space Saved: 89.8%
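
The same scheme can be reproduced with the Python standard library. This is an illustration, not the training script's exact code:

import gzip
import os
import shutil

# Compress tokenizer.json at gzip level 9 and report the space saved.
src = "models/EthioBBPE/tokenizer.json"
dst = "models/EthioBBPE/tokenizer.json.gz"
with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=9) as fout:
    shutil.copyfileobj(fin, fout)

saved = 1 - os.path.getsize(dst) / os.path.getsize(src)
print(f"Space saved: {saved:.1%}")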

🧪 Testing & Validation

All test cases achieve accurate reconstruction:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")

test_cases = [
    ("Ge'ez Punctuation", "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"),
    ("Synaxarium", "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"),
    ("Biblical", "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ")
]

for name, text in test_cases:
    encoded = tokenizer.encode(text)
    decoded = tokenizer.decode(encoded.ids)
    assert text == decoded, f"{name} failed!"
    print(f"✅ {name}: Accurate ({len(encoded.ids)} tokens)")

📚 Datasets

This tokenizer was trained on two specialized Amharic biblical datasets:

  1. Synaxarium Dataset: Daily readings from the Ethiopian Orthodox Synaxarium containing lives of saints and biblical narratives
  2. Canon Biblical Dataset: Comprehensive Amharic-English parallel biblical texts

Both datasets are available on Hugging Face under the Nexuss0781 organization.
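
Both datasets can be pulled with the datasets library (assuming their default configurations):

# Load both training datasets from the Hugging Face Hub.
from datasets import load_dataset

synaxarium = load_dataset("Nexuss0781/synaxarium")
canon = load_dataset("Nexuss0781/conon-biblical-am-en")
print(synaxarium)
print(canon)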

🛠️ Advanced Features

Checkpointing

Automatic checkpointing during training allows resumption from interruptions:

python scripts/train_tokenizer.py --data_dir ./data --use_checkpoint

Custom Vocabulary Size

python scripts/train_tokenizer.py --data_dir ./data --vocab_size 32000

Alternative Compression

python scripts/train_tokenizer.py --data_dir ./data --save_compressed
# Supports: gzip, bz2, lzma
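
All three algorithms ship with the Python standard library. A hypothetical helper showing the general shape of such an option (the actual flag handling lives in scripts/train_tokenizer.py):

import bz2
import gzip
import lzma

# Map algorithm names to their standard-library open functions.
OPENERS = {"gzip": gzip.open, "bz2": bz2.open, "lzma": lzma.open}

def save_compressed(path: str, data: bytes, algorithm: str = "gzip") -> None:
    """Write data with the chosen compression algorithm (hypothetical helper)."""
    with OPENERS[algorithm](path, "wb") as f:
        f.write(data)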

📄 License

Apache License 2.0 - See LICENSE for details.

Made with ❤️ for the Amharic NLP Community

Last Updated: May 2026
