🇪🇹 EthioBBPE - Amharic Biblical Tokenizer

A production-ready Byte-level BPE tokenizer specifically trained on Amharic biblical and religious texts, achieving accurate reconstruction of complex Ge'ez script, ancient punctuation, and liturgical content.

✨ Features

  • Accurate Reconstruction: Exact round-trip encoding and decoding on all test samples, including ancient Ge'ez punctuation
  • Specialized Vocabulary: Trained on 61,769 lines of Amharic biblical texts (Synaxarium + Canon Bible)
  • Compressed Storage: Gzip compression (level 9) reduces model size by 89.8% (1.3MB → 136KB)
  • Production Ready: Checkpointing, metrics tracking, and comprehensive error handling
  • Ge'ez Script Support: Full support for Ethiopic characters, numerals, and liturgical punctuation marks

📊 Training Data

| Dataset | Source | Texts | Description |
|---|---|---|---|
| Synaxarium | Nexuss0781/synaxarium | 366 | Daily synaxarium readings in Amharic |
| Canon Biblical | Nexuss0781/conon-biblical-am-en | 61,403 | Amharic-English biblical texts |
| Total | - | 61,769 | 15.43 MB combined corpus |

Training Configuration

{
  "vocab_size": 16000,
  "min_frequency": 2,
  "special_tokens": ["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
  "lowercase": false,
  "compression": "gzip (level 9)",
  "checkpointing": true
}
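For reference, here is a minimal sketch of how this configuration maps onto the tokenizers training API. The data file paths are placeholders, and the checkpointing and compression steps are handled by the training script rather than shown here:

# Sketch only: maps the config above onto the Hugging Face tokenizers API.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(lowercase=False)
tokenizer.train(
    files=["data/synaxarium.txt", "data/canon_biblical.txt"],  # hypothetical paths
    vocab_size=16000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
)
tokenizer.save("models/EthioBBPE/tokenizer.json")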

🎯 Performance Metrics

| Metric | Result |
|---|---|
| Round-trip Reconstruction | ✅ Exact on all test samples |
| Ge'ez Punctuation | ✅ Exact (1 token for ፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠) |
| Synaxarium Text | ✅ Exact (66 tokens) |
| Biblical Text | ✅ Exact (82 tokens) |
| Compression Ratio | 89.8% (1.3 MB → 136 KB) |
| Training Time | ~17 seconds |

🚀 Quick Start

Installation

pip install tokenizers huggingface_hub

Load from Hugging Face Hub

from tokenizers import Tokenizer
from huggingface_hub import hf_hub_download

# Download and load tokenizer
tokenizer_path = hf_hub_download("Nexuss0781/Ethio-BBPE", "tokenizer.json")
tokenizer = Tokenizer.from_file(tokenizer_path)

# Encode Amharic text
text = "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"
encoded = tokenizer.encode(text)

print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
print(f"Decoded: {tokenizer.decode(encoded.ids)}")

Direct File Loading

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")

# Test with ancient Ge'ez punctuation
text = "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"
encoded = tokenizer.encode(text)
print(f"Encoded {len(text)} chars into {len(encoded.ids)} token(s)")
# Output: Encoded 18 chars into 1 token(s)

Using Compressed Vocabulary

import gzip
import json

# Load compressed vocabulary
with gzip.open('models/EthioBBPE/vocab.json.gz', 'rt', encoding='utf-8') as f:
    vocab = json.load(f)

print(f"Vocabulary size: {len(vocab)}")
print(f"Storage saved: ~89.8%")

📝 Example Usage

Encoding Biblical Text

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")

# Synaxarium text
synaxarium = """ሰላም ለኢዮብ ዘኢነበበ ከንቶ ። አመ አኀዞ አበቅ ወአመ አህጎለ ጥሪቶ ።"""
encoded = tokenizer.encode(synaxarium)

print(f"Original: {synaxarium}")
print(f"Tokens: {encoded.tokens}")
print(f"Token count: {len(encoded.ids)}")
print(f"Reconstructed: {tokenizer.decode(encoded.ids)}")
print(f"Perfect match: {synaxarium == tokenizer.decode(encoded.ids)}")

Batch Processing

texts = [
    "በመዠመሪያ፡እግዚአብሔር፡ሰማይንና፡ምድርን፡ፈጠረ።",
    "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ",
    "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"
]

encodings = tokenizer.encode_batch(texts)
for i, enc in enumerate(encodings):
    print(f"Text {i+1}: {len(enc.ids)} tokens")

📁 Model Files

| File | Size | Description |
|---|---|---|
| tokenizer.json | 1.3 MB | Standard tokenizer format |
| vocab.json.gz | 136 KB | Compressed vocabulary (89.8% smaller) |
| config.json | 431 B | Training configuration |
| training_metrics.json | 1.2 KB | Comprehensive training metrics |
| README.md | - | This documentation |

🔬 Technical Details

Architecture

  • Type: Byte-level BPE (BBPE)
  • Vocabulary Size: 16,000 tokens
  • Special Tokens: <pad>, <unk>, <s>, </s>, <mask>
  • Minimum Frequency: 2 occurrences
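
A quick way to confirm the special tokens are registered is to look up their ids:

# Print the id assigned to each special token.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")
for token in ["<pad>", "<unk>", "<s>", "</s>", "<mask>"]:
    print(f"{token} -> {tokenizer.token_to_id(token)}")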

Preprocessing

  • No lowercasing (Ethiopic script is unicase, so there is no case to fold; disabling lowercasing also leaves embedded Latin text untouched)
  • No prefix space (the byte-level pre-tokenizer does not prepend a space marker, which suits Amharic word segmentation)
  • Unicode normalization enabled
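
These choices correspond roughly to the following pipeline sketch. NFC is an assumption here; the configuration only records that normalization is enabled:

from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers

tok = Tokenizer(models.BPE())
tok.normalizer = normalizers.NFC()  # Unicode normalization (NFC assumed)
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)  # no prefix space
tok.decoder = decoders.ByteLevel()  # byte-level round-trip decoding
# No Lowercase normalizer is attached, matching the no-lowercasing choice.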

Compression

  • Algorithm: Gzip (level 9)
  • Original Size: 1.3 MB
  • Compressed Size: 136 KB
  • Space Saved: 89.8%
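
The same scheme can be reproduced with the Python standard library. This is an illustration, not the training script's exact code:

import gzip
import os
import shutil

# Compress tokenizer.json at gzip level 9 and report the space saved.
src = "models/EthioBBPE/tokenizer.json"
dst = "models/EthioBBPE/tokenizer.json.gz"
with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=9) as fout:
    shutil.copyfileobj(fin, fout)

saved = 1 - os.path.getsize(dst) / os.path.getsize(src)
print(f"Space saved: {saved:.1%}")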

🧪 Testing & Validation

All test cases achieve accurate reconstruction:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("models/EthioBBPE/tokenizer.json")

test_cases = [
    ("Ge'ez Punctuation", "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"),
    ("Synaxarium", "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"),
    ("Biblical", "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ")
]

for name, text in test_cases:
    encoded = tokenizer.encode(text)
    decoded = tokenizer.decode(encoded.ids)
    assert text == decoded, f"{name} failed!"
    print(f"✅ {name}: Accurate ({len(encoded.ids)} tokens)")

📚 Datasets

This tokenizer was trained on two specialized Amharic biblical datasets:

  1. Synaxarium Dataset: Daily readings from the Ethiopian Orthodox Synaxarium containing lives of saints and biblical narratives
  2. Canon Biblical Dataset: Comprehensive Amharic-English parallel biblical texts

Both datasets are available on Hugging Face under the Nexuss0781 organization.
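
Both datasets can be pulled with the datasets library (assuming their default configurations):

# Load both training datasets from the Hugging Face Hub.
from datasets import load_dataset

synaxarium = load_dataset("Nexuss0781/synaxarium")
canon = load_dataset("Nexuss0781/conon-biblical-am-en")
print(synaxarium)
print(canon)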

🛠️ Advanced Features

Checkpointing

Automatic checkpointing during training allows resumption from interruptions:

python scripts/train_tokenizer.py --data_dir ./data --use_checkpoint

Custom Vocabulary Size

python scripts/train_tokenizer.py --data_dir ./data --vocab_size 32000

Alternative Compression

python scripts/train_tokenizer.py --data_dir ./data --save_compressed
# Supports: gzip, bz2, lzma
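
All three algorithms ship with the Python standard library. A hypothetical helper showing the general shape of such an option (the actual flag handling lives in scripts/train_tokenizer.py):

import bz2
import gzip
import lzma

# Map algorithm names to their standard-library open functions.
OPENERS = {"gzip": gzip.open, "bz2": bz2.open, "lzma": lzma.open}

def save_compressed(path: str, data: bytes, algorithm: str = "gzip") -> None:
    """Write data with the chosen compression algorithm (hypothetical helper)."""
    with OPENERS[algorithm](path, "wb") as f:
        f.write(data)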

📄 License

Apache License 2.0 - See LICENSE for details.

Made with ❤️ for the Amharic NLP Community

Last Updated: May 2026
