🚀 Geez-BBPE Tokenizer (NaolBM/Geez-BBPE)

A highly efficient Byte-Level BPE tokenizer trained specifically for Geez-script languages (Amharic, Tigrinya). With a 25,000-token vocabulary, it achieves 3.67 characters per token on Amharic text, significantly outperforming generic tokenizers.

📊 Benchmark Comparison

| Model | Tokens | Chars/Token |
|---|---|---|
| Geez-BBPE (Ours) | 9 | 4.00 |
| Qwen3 | 47 | 0.77 |
| Llama-3.2 | 87 | 0.41 |
| gemma-3 | 17 | 2.12 |
| gpt-oss | 65 | 0.55 |

Sample: "የኢትዮጵያ ታሪክ በጣም ረጅም እና ባለብዙ ደረጃዎች ነው።" (36 characters)
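The ratios above can be reproduced from the sample sentence and each model's token count (a quick sanity check using only the token counts from the table):

```python
# Chars/token for each model, computed from the 36-character benchmark sample.
sample = "የኢትዮጵያ ታሪክ በጣም ረጅም እና ባለብዙ ደረጃዎች ነው።"
token_counts = {
    "Geez-BBPE (Ours)": 9,
    "Qwen3": 47,
    "Llama-3.2": 87,
    "gemma-3": 17,
    "gpt-oss": 65,
}

print(len(sample))  # → 36 characters (spaces and the Ethiopic full stop included)
for model, n_tokens in token_counts.items():
    print(f"{model}: {len(sample) / n_tokens:.2f} chars/token")
```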

🧠 Why Geez-BBPE?

Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently, breaking words into meaningless byte sequences.

Geez-BBPE solves this by:

  • ✅ Preserving semantic meaning - common words are single tokens
  • ✅ Reducing sequence length - fit more text in the same context window
  • ✅ Faster inference - fewer tokens = faster generation
  • ✅ Better learning - model sees whole words, not fragments
  • ✅ Native support - trained on 10M+ authentic Amharic sentences
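The fragmentation problem is visible from UTF-8 alone: every Ethiopic character (U+1200–U+137F) encodes to three bytes, so a byte-level tokenizer with no Geez merges can fall back to roughly one token per byte. A minimal illustration, standard library only:

```python
# Each Ethiopic character takes 3 bytes in UTF-8, so a tokenizer whose vocab
# contains no Geez merges may emit up to one token per byte on fallback.
word = "ኢትዮጵያ"  # "Ethiopia"
print(len(word))                   # → 5 characters
print(len(word.encode("utf-8")))   # → 15 bytes, i.e. up to 15 fallback tokens
```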

📚 Training Details

  • Tokenizer Type: Byte-Level BPE
  • Vocabulary Size: 25,000 (optimized for Geez script)
  • Training Data: 10M+ sentences from b1n1yam/amharic-combined-corpus
  • Pre-tokenizer: Byte-Level with space preservation
  • Special Tokens: <|startoftext|>, <|im_end|>, <|pad|>

📁 Files

  • tokenizer.json: Full tokenizer config
  • tokenizer_config.json: Hugging Face-compatible configuration
  • special_tokens_map.json: Maps for special tokens

🚀 Usage

Standalone Usage

from transformers import AutoTokenizer

# Load Geez-BBPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("NaolBM/Geez-BBPE")

# Tokenize Amharic text efficiently!
text = "ኢትዮጵያ ዋና ከተማዋ አዲስ አበባ ናት።"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(f"Tokens ({len(tokens)}): {tokens}")
print(f"Token IDs: {ids}")

Extending Other Models with Geez-BBPE

If you want to extend another model's tokenizer (e.g., Qwen or Llama) using Geez-BBPE:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizers
geez = AutoTokenizer.from_pretrained("NaolBM/Geez-BBPE")
target = AutoTokenizer.from_pretrained("your-target-model")

# Collect Geez-BBPE token ids, skipping anything that is a special token in the target
new_token_ids = [
    tok_id for tok_id in range(geez.vocab_size)
    if geez.decode([tok_id]) not in target.all_special_tokens
]

# 🔑 CRITICAL: Decode Geez tokens to readable Amharic before adding!
# decode() converts byte-level strings like 'ĠáĬł...' back to " አበባ" or "አዲስ"
tokens_to_add = [geez.decode([tok_id]) for tok_id in new_token_ids]

# Remove duplicates
decoded_tokens = list(set(tokens_to_add))
print(f"Adding {len(decoded_tokens)} Amharic tokens")

# Add to target tokenizer
target.add_tokens(decoded_tokens)

# Resize model
model = AutoModelForCausalLM.from_pretrained("your-target-model")
model.resize_token_embeddings(len(target))

❌ Common Mistake to Avoid

# WRONG: Adding raw byte-level tokens (they won't match input text!)
target.add_tokens(list(geez.get_vocab().keys()))
# This adds strings like 'áĬłáĭ²áĪµ' - the tokenizer will NEVER find these in text!

# RIGHT: Decode first, then add
decoded = geez.decode([token_id])  # Converts 'áĬłáĭ²áĪµ' → 'አዲስ'
target.add_tokens([decoded])  # Now matches actual input text!

Why this works: Geez-BBPE stores tokens internally as byte-level strings, but the tokenizer matches actual Amharic text in the input. decode() bridges this gap by converting the byte representation back to readable characters.
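The byte-level alphabet behind those odd-looking vocab entries is the standard GPT-2-style byte-to-unicode map used by Byte-Level BPE tokenizers. The sketch below (standard library only, and assuming Geez-BBPE uses this standard map) shows how "አዲስ" becomes 'áĬłáĭ²áĪµ' in the vocab, and how decoding reverses it:

```python
def bytes_to_unicode():
    """GPT-2-style map from every byte 0-255 to a printable unicode character."""
    # Printable bytes map to themselves...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # ...the rest (control chars, etc.) are shifted into a printable range
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte2char = bytes_to_unicode()
char2byte = {c: b for b, c in byte2char.items()}

word = "አዲስ"  # 3 Amharic characters, 9 UTF-8 bytes
mapped = "".join(byte2char[b] for b in word.encode("utf-8"))
print(mapped)    # → áĬłáĭ²áĪµ (the byte-level form stored in the vocab)

restored = bytes(char2byte[c] for c in mapped).decode("utf-8")
print(restored)  # → አዲስ (what decode() gives you, and what add_tokens() needs)
```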

📊 Intended Use

This tokenizer is best suited for:

  • Low-resource NLP pipelines
  • Machine Translation
  • Question Answering
  • Named Entity Recognition
  • Morphological analysis
  • Continued Pre-Training (CPT) for Amharic/Tigrinya

⚠️ Limitations

  • Optimized for Geez-script languages (Amharic, Tigrinya); may not generalize to others
  • Some compound words may still benefit from linguistic preprocessing
  • Currently focused on Amharic/Tigrinya; doesn't support multilingual code-switching

✅ Evaluation

The tokenizer was evaluated on:

  • Token coverage of Amharic/Tigrinya corpora
  • Morphological preservation
  • Character-to-token ratio (target: >3.0 chars/token)
  • Real-world downstream task performance

📜 License

This tokenizer is licensed under the MIT License.

📌 Citation

@misc{naol2025geezbbpe,
  title={Geez-BBPE: An Efficient Byte-Level BPE Tokenizer for Amharic and Tigrinya},
  author={Naol},
  year={2026},
  howpublished={\url{https://huggingface.co/NaolBM/Geez-BBPE}},
}

🙏 Acknowledgments

  • Built with 🤗 Hugging Face Transformers
  • Trained on b1n1yam/amharic-combined-corpus
  • Optimized for efficient Amharic NLP
  • Inspired by the Ethiopian AI community's need for better language tools

⭐ If you find this useful, please star the repo! ⭐

Made with ❤️ for Ethiopian AI
