# Oromo-BBPE Tokenizer (NaolBM/Oromo-BBPE)

A highly efficient Byte-Level BPE (BBPE) tokenizer designed specifically for Afaan Oromo, with a compact 12,000-token vocabulary. It achieves 6.13 chars/token on Oromo text, substantially outperforming general-purpose tokenizers.
## Benchmark Comparison

**Test sentence:**
"Dabalataan bu'uuraaleen misoomaa akka daandii qonnaan bultoonni omisha isaanii karaa salphaa ta'een gabaaf akka dhiyeessan carraa uumu himan."
| Model | Tokens | Chars/Token | Efficiency vs. Oromo-BBPE |
|---|---|---|---|
| Oromo-BBPE (Ours) | 23 | 6.13 | baseline |
| Qwen3 | 54 | 2.61 | 2.3x worse |
| Llama-3.2 | 53 | 2.66 | 2.3x worse |
| gemma-3 | 51 | 2.76 | 2.2x worse |
| gpt-oss | 45 | 3.13 | 2.0x worse |
| DeepSeek-V3.2 | 57 | 2.47 | 2.5x worse |
Key Insight: Oromo-BBPE uses 2.3-2.5x fewer tokens than mainstream tokenizers for the same Oromo text, meaning you can fit more than twice the content in the same context window!
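The chars/token and ratio columns follow directly from the raw token counts; a quick sanity check in plain Python (token counts are quoted from the table above, not re-measured here) reproduces them:

```python
# Recompute chars/token and the efficiency ratios from the raw token counts
# reported in the benchmark table (counts are quoted, not re-measured).
text = ("Dabalataan bu'uuraaleen misoomaa akka daandii qonnaan bultoonni "
        "omisha isaanii karaa salphaa ta'een gabaaf akka dhiyeessan carraa "
        "uumu himan.")
token_counts = {
    "Oromo-BBPE": 23, "Qwen3": 54, "Llama-3.2": 53,
    "gemma-3": 51, "gpt-oss": 45, "DeepSeek-V3.2": 57,
}
n_chars = len(text)  # 141 characters
ours = n_chars / token_counts["Oromo-BBPE"]
for name, count in token_counts.items():
    cpt = n_chars / count
    print(f"{name:14s} {count:3d} tokens  {cpt:.2f} chars/token  {ours / cpt:.1f}x")
```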
## Why Oromo-BBPE?
Most tokenizers are trained on English or other high-resource languages, leading to catastrophic fragmentation of Afaan Oromo words. A word like "dubbataman" might become ['d', 'ub', 'bat', 'aman'] - losing all morphological meaning and wasting precious tokens.
Oromo-BBPE solves this by:
- **Semantic preservation** - common Oromo words are single tokens
- **Morphological awareness** - properly handles Oromo affixes and patterns
- **2.3x better efficiency** - fit more Oromo text in the same context window
- **Faster inference** - fewer tokens mean faster generation (up to 2x speedup)
- **Better learning** - the model sees meaningful units, not fragments
- **Compact size** - only a 12K vocabulary (vs. 50K+ for general tokenizers)
## Training Details
- **Tokenizer Type:** Byte-Level BPE (BBPE)
- **Vocabulary Size:** 12,000 (optimized for Afaan Oromo)
- **Training Data:** 400K+ rows from `castorini/afriberta-corpus` (Afaan Oromo section)
- **Pre-tokenizer:** Byte-Level with space preservation (`Ġ` prefix)
- **Special Tokens:** `<|startoftext|>`, `<|endoftext|>`, `<|pad|>`
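The published training script isn't included here, but given the configuration above, a comparable tokenizer can be trained with the Hugging Face `tokenizers` library roughly as follows (the corpus list is a placeholder for the AfriBERTa Oromo split):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Minimal sketch (not the published script): a byte-level BPE tokenizer with
# the configuration listed above -- 12K vocab, Ġ-style space preservation,
# and the three special tokens.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=12_000,
    special_tokens=["<|startoftext|>", "<|endoftext|>", "<|pad|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Stand-in corpus; in practice, stream the castorini/afriberta-corpus Oromo split.
corpus = ["Akkam jirta?", "Galatoomaa!", "Afaan Oromo afaan keenya."]
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("tokenizer.json")
```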
## Files

- `tokenizer.json`: full tokenizer configuration
- `tokenizer_config.json`: Hugging Face-compatible configuration
- `special_tokens_map.json`: special tokens mapping
## Usage

### Standalone Usage
```python
from transformers import AutoTokenizer

# Load the Oromo-BBPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("NaolBM/Oromo-BBPE")

# Tokenize Afaan Oromo text efficiently
text = "Dabalataan bu'uuraaleen misoomaa akka daandii qonnaan bultoonni omisha isaanii karaa salphaa ta'een gabaaf akka dhiyeessan carraa uumu himan."
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(f"Text: {text}")
print(f"Tokens ({len(tokens)}): {tokens}")
print(f"Token IDs: {ids}")
print(f"Efficiency: {len(text)/len(tokens):.2f} chars/token")
```
### Extending Other Models with Oromo-BBPE
Want to add Oromo capability to any existing model (Llama, Qwen, Gemma, etc.)? Here's how:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load both tokenizers
oromo = AutoTokenizer.from_pretrained("NaolBM/Oromo-BBPE")
target = AutoTokenizer.from_pretrained("your-target-model")  # e.g., "meta-llama/Llama-3.2-1B"

# Collect all Oromo token IDs (excluding special tokens)
new_token_ids = [
    tok_id for tok_id in range(oromo.vocab_size)
    if oromo.decode([tok_id]) not in target.all_special_tokens
]

# CRITICAL: decode Oromo tokens to readable Afaan Oromo before adding!
# decode() converts 'Ġqabatte' back to " qabatte" ('Kanaadaan' stays "Kanaadaan")
tokens_to_add = [oromo.decode([tok_id]) for tok_id in new_token_ids]

# Remove duplicates and empty strings
decoded_tokens = [t for t in set(tokens_to_add) if t.strip()]
print(f"Adding {len(decoded_tokens)} Oromo tokens to {target.__class__.__name__}")

# Add to the target tokenizer
target.add_tokens(decoded_tokens)

# Resize the model's embeddings to match the enlarged vocabulary
model = AutoModelForCausalLM.from_pretrained("your-target-model")
model.resize_token_embeddings(len(target))

# The target tokenizer now segments Afaan Oromo properly; note that the new
# embeddings are randomly initialized and still need continued pre-training.
test_text = "Akka Oromoo dhalatte?"
tokens = target.tokenize(test_text)
print(f"Tokens: {tokens}")
```
### Common Mistake to Avoid
```python
# WRONG: adding raw vocab strings -- byte-level forms like 'Ġqabatte'
# won't match anything in plain input text!
target.add_tokens(list(oromo.get_vocab().keys()))

# RIGHT: decode each ID first, then add
decoded = [oromo.decode([tok_id]) for tok_id in range(oromo.vocab_size)]
target.add_tokens([t for t in set(decoded) if t.strip()])  # matches raw input text
```
**Why this works:** Oromo-BBPE stores tokens in their byte-level form (with prefixes like `Ġ` standing in for a leading space), but when extending another tokenizer you need the actual visible text that appears in your training corpus. `decode()` bridges this gap.
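A concrete, dependency-free illustration of that byte-level mapping (the token `"Ġqabatte"` is the example from the comments above; the `Ġ` stand-in comes from GPT-2-style byte encoding, which shifts non-printable bytes into printable Unicode):

```python
# In GPT-2-style byte-level BPE, the space byte (0x20) is remapped to a
# printable stand-in by adding 0x100, which yields 'Ġ' (U+0120).
assert chr(0x20 + 0x100) == "Ġ"

vocab_form = "Ġqabatte"                      # how the token appears in tokenizer.json
surface_form = vocab_form.replace("Ġ", " ")  # what decode() recovers
print(repr(surface_form))                    # ' qabatte' -- matches raw input text
```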
## Intended Use
This tokenizer is ideal for:
- Afaan Oromo language models - foundation for Oromo LLMs
- Machine Translation - English ↔ Oromo translation systems
- Speech-to-Text - Oromo ASR systems
- Named Entity Recognition - Oromo NER pipelines
- Sentiment Analysis - Oromo social media monitoring
- Educational Tools - Oromo language learning applications
- Continued Pre-Training (CPT) - Adding Oromo to existing multilingual models
## Performance Highlights
- 6.13 chars/token on Oromo text (2.5x better than DeepSeek)
- 400K+ rows of authentic Afaan Oromo training data
- 12K compact vocabulary - efficient storage and inference
- Byte-level operation - handles any Unicode character
- Space preservation - perfect for reconstructing original text
## Limitations
- Optimized specifically for Afaan Oromo; may not generalize well to other languages
- Trained primarily on `castorini/afriberta-corpus`, so the domain may be slightly biased toward news/textbook content
- English loanwords and code-switching may show lower efficiency (3.32 chars/token on mixed text)
- Currently focused on monolingual Oromo; does not optimize for multilingual mixing
## Evaluation Methodology
The tokenizer was evaluated on:
- Token efficiency (chars/token ratio) - target > 5.0 chars/token
- Morphological preservation - proper handling of Oromo affixes
- Out-of-vocabulary rate on held-out test data
- Reconstruction accuracy - ability to preserve original text
- Cross-model comparison against 5 major tokenizers
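The efficiency and reconstruction checks are straightforward to reproduce for any tokenizer; a small helper sketch (the whitespace "tokenizer" below is a hypothetical stand-in, just to exercise the helpers):

```python
# Helpers for two of the checks above: chars/token efficiency and exact
# round-trip reconstruction. They work with any tokenize/detokenize pair.
def chars_per_token(text, tokenize):
    tokens = tokenize(text)
    return len(text) / len(tokens)

def reconstructs_exactly(text, tokenize, detokenize):
    return detokenize(tokenize(text)) == text

# Toy whitespace tokenizer, just to exercise the helpers (a real run would
# pass AutoTokenizer.tokenize / convert_tokens_to_string instead).
sample = "carraa uumu himan"
print(f"{chars_per_token(sample, str.split):.2f} chars/token")  # 17 chars / 3 tokens
print(reconstructs_exactly(sample, str.split, " ".join))        # True
```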
## License
This tokenizer is licensed under the MIT License - free for commercial and research use.
## Citation

```bibtex
@misc{naol2026oromobbpe,
  title={Oromo-BBPE: An Efficient Byte-Level BPE Tokenizer for Afaan Oromo},
  author={Naol},
  year={2026},
  howpublished={\url{https://huggingface.co/NaolBM/Oromo-BBPE}},
}
```
## Download

Available on the Hugging Face Hub: [NaolBM/Oromo-BBPE](https://huggingface.co/NaolBM/Oromo-BBPE)
## Acknowledgments

- Built with the Hugging Face Tokenizers library
- Trained on `castorini/afriberta-corpus` (Afaan Oromo section)
- Inspired by the need for better Oromo NLP tools
- Thanks to the Ethiopian NLP community for resources and motivation
- Benchmark comparisons with Qwen, Llama, Gemma, GPT-OSS, and DeepSeek
## Contributing
Contributions welcome! Areas for improvement:
- More diverse training data (social media, spoken Oromo)
- Dialectal variation handling
- Integration with downstream task benchmarks
If you find this tokenizer useful for Oromo NLP, please star the repo!

Made with love for Afaan Oromo and the Ethiopian AI community.

*"Afaan Oromo afaan keenya, teknooloojiin keenya!"* ("Afaan Oromo is our language, our technology!")