πŸš€ Oromo-BBPE Tokenizer (NaolBM/Oromo-BBPE)

A highly efficient Byte-Level BPE tokenizer specifically designed for Afaan Oromo with a compact 12,000 vocabulary size - achieving 6.13 chars/token on Oromo text, dramatically outperforming general-purpose tokenizers.

πŸ“Š Benchmark Comparison

Test Sentence:

"Dabalataan bu'uuraaleen misoomaa akka daandii qonnaan bultoonni omisha isaanii karaa salphaa ta'een gabaaf akka dhiyeessan carraa uumu himan."

Model Tokens Chars/Token Efficiency vs. Oromo-BBPE
Oromo-BBPE (Ours) 23 6.13 baseline
Qwen3 54 2.61 2.3x worse
Llama-3.2 53 2.66 2.3x worse
gemma-3 51 2.76 2.2x worse
gpt-oss 45 3.13 2.0x worse
DeepSeek-V3.2 57 2.47 2.5x worse

Key Insight: Oromo-BBPE uses 2.3-2.5x fewer tokens than mainstream tokenizers for the same Oromo text, meaning you can fit more than twice the content in the same context window!

🧠 Why Oromo-BBPE?

Most tokenizers are trained on English or other high-resource languages, leading to catastrophic fragmentation of Afaan Oromo words. A word like "dubbataman" might become ['d', 'ub', 'bat', 'aman'] - losing all morphological meaning and wasting precious tokens.

Oromo-BBPE solves this by:

  • βœ… Semantic preservation - common Oromo words are single tokens
  • βœ… Morphological awareness - properly handles Oromo affixes and patterns
  • βœ… 2.3x better efficiency - fit more Oromo text in same context window
  • βœ… Faster inference - fewer tokens = faster generation (up to 2x speedup)
  • βœ… Better learning - model sees meaningful units, not fragments
  • βœ… Compact size - only 12K vocabulary (vs 50K+ for general tokenizers)

πŸ“š Training Details

  • Tokenizer Type: Byte-Level BPE (BBPE)
  • Vocabulary Size: 12,000 (optimized for Afaan Oromo)
  • Training Data: 400K+ rows from castorini/afriberta-corpus (Afaan Oromo section)
  • Pre-tokenizer: Byte-Level with space preservation (Δ  prefix)
  • Special Tokens: <|startoftext|>, <|endoftext|>, <|pad|>

πŸ“ Files

  • tokenizer.json: Full tokenizer configuration
  • tokenizer_config.json: Hugging Face-compatible configuration
  • special_tokens_map.json: Special tokens mapping

πŸš€ Usage

Standalone Usage

from transformers import AutoTokenizer

# Load Oromo-BBPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("NaolBM/Oromo-BBPE")

# Tokenize Afaan Oromo text efficiently!
text = "Dabalataan bu'uuraaleen misoomaa akka daandii qonnaan bultoonni omisha isaanii karaa salphaa ta'een gabaaf akka dhiyeessan carraa uumu himan."
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(f"Text: {text}")
print(f"Tokens ({len(tokens)}): {tokens}")
print(f"Token IDs: {ids}")
print(f"Efficiency: {len(text)/len(tokens):.2f} chars/token")

Extending Other Models with Oromo-BBPE

Want to add Oromo capability to any existing model (Llama, Qwen, Gemma, etc.)? Here's how:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizers
oromo = AutoTokenizer.from_pretrained("NaolBM/Oromo-BBPE")
target = AutoTokenizer.from_pretrained("your-target-model")  # e.g., "meta-llama/Llama-3.2-1B"

# Get all Oromo token IDs (excluding special tokens)
new_tokens_ids = [
    id for id in range(oromo.vocab_size)
    if oromo.decode([id]) not in target.all_special_tokens
]

# πŸ”‘ CRITICAL: Decode Oromo tokens to readable Afaan Oromo before adding!
# decode() converts 'Δ qabatte' back to " qabatte" or 'Kanaadaan' to "Kanaadaan"
tokens_to_add = [oromo.decode([id]) for id in new_tokens_ids]

# Remove duplicates and empty strings
decoded_tokens = [t for t in set(tokens_to_add) if t.strip()]
print(f"Adding {len(decoded_tokens)} Oromo tokens to {target.__class__.__name__}")

# Add to target tokenizer
target.add_tokens(decoded_tokens)

# Resize model embeddings
model = AutoModelForCausalLM.from_pretrained("your-target-model")
model.resize_token_embeddings(len(target))

# Now your model understands Afaan Oromo!
test_text = "Akka Oromoo dhalatte?"
tokens = target.tokenize(test_text)
print(f"Tokens: {tokens}")  # Will show proper Oromo segmentation

❌ Common Mistake to Avoid

# WRONG: Adding raw byte tokens (they won't match input text!)
target.add_tokens(list(oromo.get_vocab().keys()))  

# RIGHT: Decode first, then add
decoded = oromo.decode([token_id])  # Converts byte representation β†’ actual Oromo text
target.add_tokens([decoded])  # Now matches what appears in input!

Why this works: Oromo-BBPE stores tokens with byte-level prefixes (like Δ  for spaces), but when extending another tokenizer, you need the actual visible text that appears in your training corpus. decode() bridges this gap.

πŸ“Š Intended Use

This tokenizer is ideal for:

  • Afaan Oromo language models - foundation for Oromo LLMs
  • Machine Translation - English ↔ Oromo translation systems
  • Speech-to-Text - Oromo ASR systems
  • Named Entity Recognition - Oromo NER pipelines
  • Sentiment Analysis - Oromo social media monitoring
  • Educational Tools - Oromo language learning applications
  • Continued Pre-Training (CPT) - Adding Oromo to existing multilingual models

πŸ† Performance Highlights

  • 6.13 chars/token on Oromo text (2.5x better than DeepSeek)
  • 400K+ rows of authentic Afaan Oromo training data
  • 12K compact vocabulary - efficient storage and inference
  • Byte-level operation - handles any Unicode character
  • Space preservation - perfect for reconstructing original text

⚠️ Limitations

  • Optimized specifically for Afaan Oromo - may not generalize well to other languages
  • Trained primarily on castorini/afriberta-corpus - domain may be slightly biased toward news/textbook content
  • English loanwords and code-switching may show lower efficiency (3.32 chars/token in mixed text)
  • Currently focused on monolingual Oromo; doesn't optimize for multilingual mixing

πŸ”¬ Evaluation Methodology

The tokenizer was evaluated on:

  • Token efficiency (chars/token ratio) - target > 5.0 chars/token
  • Morphological preservation - proper handling of Oromo affixes
  • Out-of-vocabulary rate on held-out test data
  • Reconstruction accuracy - ability to preserve original text
  • Cross-model comparison against 5 major tokenizers

πŸ“œ License

This tokenizer is licensed under the MIT License - free for commercial and research use.

πŸ“Œ Citation

@misc{naol2026oromobbpe,
  title={Oromo-BBPE: An Efficient Byte-Level BPE Tokenizer for Afaan Oromo},
  author={Naol},
  year={2026},
  howpublished={\url{https://huggingface.co/NaolBM/Oromo-BBPE}},
}

πŸ’Ύ Download

Available on Hugging Face Hub:

NaolBM/Oromo-BBPE

πŸ™ Acknowledgments

  • Built with πŸ€— Hugging Face Tokenizers library
  • Trained on castorini/afriberta-corpus (Afaan Oromo section)
  • Inspired by the need for better Oromo NLP tools
  • Thanks to the Ethiopian NLP community for resources and motivation
  • Benchmark comparisons with Qwen, Llama, Gemma, GPT-OSS, and DeepSeek

🀝 Contributing

Contributions welcome! Areas for improvement:

  • More diverse training data (social media, spoken Oromo)
  • Dialectal variation handling
  • Integration with downstream task benchmarks

⭐ If you find this useful for Oromo NLP, please star the repo! ⭐

Made with ❀️ for Afaan Oromo and the Ethiopian AI community

πŸ”€ "Afaan Oromo afaan keenya, teknooloojiin keenya!"

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Dataset used to train NaolBM/Oromo-BBPE