Kabyle Tokenizer for T5

SentencePiece tokenizer trained on 787,648 Kabyle sentences from Tatoeba, designed for T5-style models.

Vocabulary

Size: 32,000 tokens
Type: BPE (Byte Pair Encoding)
Character coverage: 99.99%
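
For reference, the settings above correspond roughly to the SentencePiece trainer options sketched below. This is a minimal sketch, not the actual training script used for this tokenizer: the corpus path `kab_sentences.txt` and the model prefix `kabyle_spm` are hypothetical, and special-token options are deliberately left out rather than guessed.

```python
def build_trainer_args(corpus_path, model_prefix):
    """Collect SentencePiece trainer options matching the settings listed above."""
    return {
        "input": corpus_path,          # hypothetical: plain text, one sentence per line
        "model_prefix": model_prefix,  # hypothetical output name (.model / .vocab files)
        "vocab_size": 32000,           # Size: 32,000 tokens
        "model_type": "bpe",           # Type: BPE
        "character_coverage": 0.9999,  # Character coverage: 99.99%
    }


def train(corpus_path="kab_sentences.txt", model_prefix="kabyle_spm"):
    # Imported lazily so the helper above works without sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))
```

Calling `train()` with a real corpus file would write `kabyle_spm.model` and `kabyle_spm.vocab` to the working directory.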

Special Tokens

<unk>: Unknown token
<pad>: Padding token
</s>: End of sequence
<s>: Beginning of sequence
<mask>: Mask token (for T5)

Usage

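from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("boffire/kabyle-tokenizer-T5")

# Split a Kabyle sentence into subword pieces ("▁" marks the start of a word).
tokens = tokenizer.tokenize("Aqcic-nni yeɣra adlis.")
print(tokens)  # ['▁Aqcic', '-', 'nni', '▁yeɣra', '▁adlis', '.']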

Training Data

Source: Tatoeba Kabyle corpus
Sentences: 787,648
Cleaning: Greek and Cyrillic look-alike characters replaced with their Latin Kabyle equivalents (ε→ɛ, γ→ɣ, Σ→Ɛ, Γ→Ɣ, Ԑ→Ɛ, ԑ→ɛ)
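
The cleaning rule above can be sketched as a character translation table. This is an illustrative sketch rather than the script actually used on the corpus. Code points are written out explicitly because the glyphs are nearly indistinguishable on screen; the Cyrillic code points (U+0510, U+0511) are an assumption based on how the characters appear in the list above, and the name `normalize_kabyle` is hypothetical.

```python
# Look-alike -> Latin Kabyle letter, following the mapping listed above.
LOOKALIKES = {
    "\u03b5": "\u025b",  # Greek ε -> Latin ɛ
    "\u03b3": "\u0263",  # Greek γ -> Latin ɣ
    "\u03a3": "\u0190",  # Greek Σ -> Latin Ɛ
    "\u0393": "\u0194",  # Greek Γ -> Latin Ɣ
    "\u0510": "\u0190",  # Cyrillic Ԑ (assumed U+0510) -> Latin Ɛ
    "\u0511": "\u025b",  # Cyrillic ԑ (assumed U+0511) -> Latin ɛ
}
TABLE = str.maketrans(LOOKALIKES)


def normalize_kabyle(text):
    """Replace Greek/Cyrillic look-alikes with the Latin letters used in Kabyle."""
    return text.translate(TABLE)


# A word typed with Greek gamma becomes the same word with Latin gamma.
print(normalize_kabyle("ye\u03b3ra") == "ye\u0263ra")
```

Applying the same normalization to input text before tokenizing keeps look-alike characters from falling outside the vocabulary.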

Comparison with T5-original

Phrase	T5-original tokens	Kabyle-SPM tokens
Aqcic-nni yeɣra adlis.	19	6
Tettmeslayeḍ taqbaylit?	18	4
Ur zmireɣ ara ad qqimeɣ argaz-a.	~20	10
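
To make the table easier to read, the short sketch below turns the counts into reduction factors. The numbers are copied from the table; the third T5-original count is only approximate ("~20"), so its ratio is approximate too. The helper name `reduction_factor` is illustrative.

```python
# (phrase, T5-original token count, Kabyle-SPM token count), copied from the table above.
ROWS = [
    ("Aqcic-nni yeɣra adlis.", 19, 6),
    ("Tettmeslayeḍ taqbaylit?", 18, 4),
    ("Ur zmireɣ ara ad qqimeɣ argaz-a.", 20, 10),  # 20 is approximate ("~20")
]


def reduction_factor(original, kabyle):
    """How many times fewer tokens the Kabyle tokenizer produces."""
    return original / kabyle


for phrase, original, kabyle in ROWS:
    print(f"{phrase}  {reduction_factor(original, kabyle):.1f}x fewer tokens")
```

On these three phrases the Kabyle tokenizer needs roughly 3.2x, 4.5x, and about 2x fewer tokens.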

Kabyle Characters Preserved

ɛ / Ɛ (open e)
ɣ / Ɣ (Latin gamma)
č / Č (c with caron)
ǧ / Ǧ (g with caron)
ḍ / Ḍ (d with dot below)
ḥ / Ḥ (h with dot below)
ṛ / Ṛ (r with dot below)
ṣ / Ṣ (s with dot below)
ṭ / Ṭ (t with dot below)
ẓ / Ẓ (z with dot below)

Limitations

Optimized for short sentences (Tatoeba style)
May split rare compound words (e.g., "tebirt" → "teb" + "irt")
Requires a T5 model whose token embeddings have been resized to this tokenizer's 32,000-token vocabulary
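
The embedding resize mentioned in the last point can be sketched with the `resize_token_embeddings` method from `transformers`. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the demo builds a tiny randomly initialised T5 (so no checkpoint download is needed) instead of a real pretrained model.

```python
def resize_to_vocab(model, vocab_size):
    """Resize a model's token embeddings and return the new number of rows."""
    model.resize_token_embeddings(vocab_size)
    return model.get_input_embeddings().weight.shape[0]


def demo():
    # Imported lazily; requires `transformers` and `torch`.
    from transformers import T5Config, T5ForConditionalGeneration

    # Tiny random T5 with the original T5 vocabulary size (32,128); sizes are arbitrary.
    config = T5Config(vocab_size=32128, d_model=32, d_kv=8, d_ff=64,
                      num_layers=1, num_heads=2)
    model = T5ForConditionalGeneration(config)

    # 32,000 is the vocabulary size stated for this tokenizer.
    return resize_to_vocab(model, 32000)
```

With a real checkpoint, passing `len(tokenizer)` instead of the constant keeps the embedding matrix in step with the loaded tokenizer. Rows for Kabyle tokens carry no useful pretrained meaning after a resize, so the model still needs training or fine-tuning on Kabyle data.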
