🚀 Geez-BBPE Tokenizer (NaolBM/Geez-BBPE)

A highly efficient Byte-Level BPE tokenizer trained specifically for Geez-script languages (Amharic, Tigrinya). With a 25,000-token vocabulary, it achieves 3.67 characters per token on Amharic text, significantly outperforming generic tokenizers.

📊 Benchmark Comparison

| Model | Tokens | Chars/Token |
|---|---|---|
| Geez-BBPE (Ours) | 9 | 4.00 |
| Qwen3 | 47 | 0.77 |
| Llama-3.2 | 87 | 0.41 |
| gemma-3 | 17 | 2.12 |
| gpt-oss | 65 | 0.55 |

Sample: "የኢትዮጵያ ታሪክ በጣም ረጅም እና ባለብዙ ደረጃዎች ነው።" (36 characters)
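The ratios above can be reproduced from the sample sentence and each model's token count (a quick sanity check using only the token counts from the table):

```python
# Chars/token for each model, computed from the 36-character benchmark sample.
sample = "የኢትዮጵያ ታሪክ በጣም ረጅም እና ባለብዙ ደረጃዎች ነው።"
token_counts = {
    "Geez-BBPE (Ours)": 9,
    "Qwen3": 47,
    "Llama-3.2": 87,
    "gemma-3": 17,
    "gpt-oss": 65,
}

print(len(sample))  # → 36 characters (spaces and the Ethiopic full stop included)
for model, n_tokens in token_counts.items():
    print(f"{model}: {len(sample) / n_tokens:.2f} chars/token")
```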

🧠 Why Geez-BBPE?

Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently, breaking words into meaningless byte sequences.

Geez-BBPE solves this by:

  • ✅ Preserving semantic meaning - common words are single tokens
  • ✅ Reducing sequence length - fit more text in the same context window
  • ✅ Faster inference - fewer tokens = faster generation
  • ✅ Better learning - model sees whole words, not fragments
  • ✅ Native support - trained on 10M+ authentic Amharic sentences
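The fragmentation problem is visible from UTF-8 alone: every Ethiopic character (U+1200–U+137F) encodes to three bytes, so a byte-level tokenizer with no Geez merges can fall back to roughly one token per byte. A minimal illustration, standard library only:

```python
# Each Ethiopic character takes 3 bytes in UTF-8, so a tokenizer whose vocab
# contains no Geez merges may emit up to one token per byte on fallback.
word = "ኢትዮጵያ"  # "Ethiopia"
print(len(word))                   # → 5 characters
print(len(word.encode("utf-8")))   # → 15 bytes, i.e. up to 15 fallback tokens
```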

📚 Training Details

  • Tokenizer Type: Byte-Level BPE
  • Vocabulary Size: 25,000 (optimized for Geez script)
  • Training Data: 10M+ sentences from b1n1yam/amharic-combined-corpus
  • Pre-tokenizer: Byte-Level with space preservation
  • Special Tokens: <|startoftext|>, <|im_end|>, <|pad|>

📁 Files

  • tokenizer.json: Full tokenizer config
  • tokenizer_config.json: Hugging Face-compatible configuration
  • special_tokens_map.json: Maps for special tokens

🚀 Usage

Standalone Usage

from transformers import AutoTokenizer

# Load Geez-BBPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("NaolBM/Geez-BBPE")

# Tokenize Amharic text efficiently!
text = "ኢትዮጵያ ዋና ከተማዋ አዲስ አበባ ናት።"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(f"Tokens ({len(tokens)}): {tokens}")
print(f"Token IDs: {ids}")

Extending Other Models with Geez-BBPE

If you want to extend another model's tokenizer (e.g., Qwen or Llama) using Geez-BBPE:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizers
geez = AutoTokenizer.from_pretrained("NaolBM/Geez-BBPE")
target = AutoTokenizer.from_pretrained("your-target-model")

# Collect Geez-BBPE token ids, skipping anything that is a special token in the target
new_token_ids = [
    tok_id for tok_id in range(geez.vocab_size)
    if geez.decode([tok_id]) not in target.all_special_tokens
]

# 🔑 CRITICAL: Decode Geez tokens to readable Amharic before adding!
# decode() converts byte-level strings like 'ĠáĬł...' back to " አበባ" or "አዲስ"
tokens_to_add = [geez.decode([tok_id]) for tok_id in new_token_ids]

# Remove duplicates
decoded_tokens = list(set(tokens_to_add))
print(f"Adding {len(decoded_tokens)} Amharic tokens")

# Add to target tokenizer
target.add_tokens(decoded_tokens)

# Resize model
model = AutoModelForCausalLM.from_pretrained("your-target-model")
model.resize_token_embeddings(len(target))

❌ Common Mistake to Avoid

# WRONG: Adding raw byte-level tokens (they won't match input text!)
target.add_tokens(list(geez.get_vocab().keys()))
# This adds strings like 'áĬłáĭ²áĪµ' - the tokenizer will NEVER find these in text!

# RIGHT: Decode first, then add
decoded = geez.decode([token_id])  # Converts 'áĬłáĭ²áĪµ' → 'አዲስ'
target.add_tokens([decoded])  # Now matches actual input text!

Why this works: Geez-BBPE stores tokens internally as byte-level strings, but the tokenizer matches actual Amharic text in the input. decode() bridges this gap by converting the byte representation back to readable characters.
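The byte-level alphabet behind those odd-looking vocab entries is the standard GPT-2-style byte-to-unicode map used by Byte-Level BPE tokenizers. The sketch below (standard library only, and assuming Geez-BBPE uses this standard map) shows how "አዲስ" becomes 'áĬłáĭ²áĪµ' in the vocab, and how decoding reverses it:

```python
def bytes_to_unicode():
    """GPT-2-style map from every byte 0-255 to a printable unicode character."""
    # Printable bytes map to themselves...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # ...the rest (control chars, etc.) are shifted into a printable range
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte2char = bytes_to_unicode()
char2byte = {c: b for b, c in byte2char.items()}

word = "አዲስ"  # 3 Amharic characters, 9 UTF-8 bytes
mapped = "".join(byte2char[b] for b in word.encode("utf-8"))
print(mapped)    # → áĬłáĭ²áĪµ (the byte-level form stored in the vocab)

restored = bytes(char2byte[c] for c in mapped).decode("utf-8")
print(restored)  # → አዲስ (what decode() gives you, and what add_tokens() needs)
```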

📊 Intended Use

This tokenizer is best suited for:

  • Low-resource NLP pipelines
  • Machine Translation
  • Question Answering
  • Named Entity Recognition
  • Morphological analysis
  • Continued Pre-Training (CPT) for Amharic/Tigrinya

⚠️ Limitations

  • Optimized for Geez-script languages (Amharic, Tigrinya); may not generalize to others
  • Some compound words may still benefit from linguistic preprocessing
  • Currently focused on Amharic/Tigrinya; doesn't support multilingual code-switching

✅ Evaluation

The tokenizer was evaluated on:

  • Token coverage of Amharic/Tigrinya corpora
  • Morphological preservation
  • Character-to-token ratio (target: >3.0 chars/token)
  • Real-world downstream task performance

📜 License

This tokenizer is licensed under the MIT License.

📌 Citation

@misc{naol2025geezbbpe,
  title={Geez-BBPE: An Efficient Byte-Level BPE Tokenizer for Amharic and Tigrinya},
  author={Naol},
  year={2026},
  howpublished={\url{https://huggingface.co/NaolBM/Geez-BBPE}},
}

🙏 Acknowledgments

  • Built with 🤗 Hugging Face Transformers
  • Trained on b1n1yam/amharic-combined-corpus
  • Optimized for efficient Amharic NLP
  • Inspired by the Ethiopian AI community's need for better language tools

⭐ If you find this useful, please star the repo! ⭐

Made with ❤️ for Ethiopian AI
