Balochi SentencePiece BPE Tokenizer
SentencePiece BPE tokenizer trained on Balochi data. Designed for Balochi written in Perso-Arabic script.
Character Inventory
- 38 standard Balochi letters (exact Unicode codepoints)
- 6 atomic special sequences: ۓ یے ئے ءُ ءِ ءَ
- Meaningful diacritics preserved: Fatha (ـَ) Damma (ـُ) Kasra (ـِ) Shadda (ـّ)
- byte_fallback=True — zero [UNK] tokens guaranteed
Usage
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("hafeez007/balochi-tokenizer")
tokens = tok.tokenize("دبستانانی وانشتءَ پیسر")
print(tokens)
Special Tokens
| Token | ID |
|---|---|
| [PAD] | 0 |
| [UNK] | 1 |
| [CLS] | 2 |
| [SEP] | 3 |
| [MASK] | 4 |
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support