Kabyle Tokenizer for T5
SentencePiece tokenizer trained on 787,648 Kabyle sentences from Tatoeba, designed for T5-style models.
Vocabulary
- Size: 32,000 tokens
- Type: BPE (Byte Pair Encoding)
- Character coverage: 99.99%
Special Tokens
- <unk>: Unknown token
- <pad>: Padding token
- </s>: End of sequence
- <s>: Beginning of sequence
- <mask>: Mask token (for T5)
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("boffire/kabyle-tokenizer-T5")
tokens = tokenizer.tokenize("Aqcic-nni yeɣra adlis.")
print(tokens) # ['▁Aqcic', '-', 'nni', '▁yeɣra', '▁adlis', '.']
Training Data
- Source: Tatoeba Kabyle corpus
- Sentences: 787,648
- Cleaning: Greek and Cyrillic look-alike characters replaced with their Latin equivalents (ε→ɛ, γ→ɣ, Σ→Ɛ, Γ→Ɣ, Ԑ→Ɛ, ԑ→ɛ)
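The character replacements listed under Cleaning can be sketched with a plain translation table. This is a minimal illustration, not the actual cleaning script used for this tokenizer; the names `LOOKALIKE_MAP` and `normalize_kabyle` are hypothetical, and only the six mappings listed above are assumed.

```python
# Map Greek/Cyrillic look-alikes to the Latin letters used in Kabyle.
# Only the six substitutions listed in the model card are included.
LOOKALIKE_MAP = str.maketrans({
    "\u03b5": "\u025b",  # Greek ε    -> Latin ɛ
    "\u03b3": "\u0263",  # Greek γ    -> Latin ɣ
    "\u03a3": "\u0190",  # Greek Σ    -> Latin Ɛ
    "\u0393": "\u0194",  # Greek Γ    -> Latin Ɣ
    "\u0510": "\u0190",  # Cyrillic Ԑ -> Latin Ɛ
    "\u0511": "\u025b",  # Cyrillic ԑ -> Latin ɛ
})


def normalize_kabyle(text: str) -> str:
    """Replace look-alike code points; all other characters pass through."""
    return text.translate(LOOKALIKE_MAP)


# "yeγra" typed with a Greek gamma becomes "yeɣra" with the Latin letter.
sample = "ye\u03b3ra"
cleaned = normalize_kabyle(sample)
print(cleaned == "ye\u0263ra")
```

Applying the same normalization to input text before tokenizing is likely to help, since the look-alike code points were removed from the training data.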
Comparison with T5-original
| Phrase | T5-original tokens | Kabyle-SPM tokens |
|---|---|---|
| Aqcic-nni yeɣra adlis. | 19 | 6 |
| Tettmeslayeḍ taqbaylit? | 18 | 4 |
| Ur zmireɣ ara ad qqimeɣ argaz-a. | ~20 | 10 |
Kabyle Characters Preserved
- ɛ / Ɛ (open e)
- ɣ / Ɣ (gamma)
- č / Č (c with caron)
- ǧ / Ǧ (g with caron)
- ḍ / Ḍ (d with dot below)
- ḥ / Ḥ (h with dot below)
- ṛ / Ṛ (r with dot below)
- ṣ / Ṣ (s with dot below)
- ṭ / Ṭ (t with dot below)
- ẓ / Ẓ (z with dot below)
Limitations
- Optimized for short sentences (Tatoeba style)
- May split rare compound words (e.g., "tebirt" → "teb" + "irt")
- Requires a T5 model whose embedding matrix has been resized to this tokenizer's 32,000-token vocabulary
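The embedding resize mentioned in the last limitation can be sketched as follows. To stay runnable offline, this example builds a tiny randomly initialized T5 from a config instead of loading a checkpoint; the layer sizes are arbitrary assumptions, and 32128 is the vocabulary size of the original T5 checkpoints. In real use you would likely load a pretrained model and pass `len(tokenizer)` instead of the hard-coded value.

```python
from transformers import T5Config, T5ForConditionalGeneration

KABYLE_VOCAB_SIZE = 32_000  # vocabulary size stated in this model card

# Tiny, randomly initialized model for illustration only (sizes are arbitrary).
config = T5Config(
    vocab_size=32128,
    d_model=64,
    d_kv=16,
    d_ff=128,
    num_layers=1,
    num_heads=4,
)
model = T5ForConditionalGeneration(config)

# Resize the embedding matrix so its rows match the tokenizer's vocabulary.
model.resize_token_embeddings(KABYLE_VOCAB_SIZE)

new_rows = model.get_input_embeddings().weight.shape[0]
print(new_rows)
```

Resizing only changes the shape of the embedding matrix. Rows inherited from a pretrained checkpoint would still correspond to the old vocabulary, so the model needs training or fine-tuning with the new tokenizer before its outputs are meaningful.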