Balochi SentencePiece BPE Tokenizer

SentencePiece BPE tokenizer trained on Balochi data. Designed for Balochi written in Perso-Arabic script.

Character Inventory

  • 38 standard Balochi letters (exact Unicode codepoints)
  • 6 atomic special sequences: ۓ یے ئے ءُ ءِ ءَ
  • Meaningful diacritics preserved: Fatha (ـَ) Damma (ـُ) Kasra (ـِ) Shadda (ـّ)
  • byte_fallback=True — zero [UNK] tokens guaranteed

Usage

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("hafeez007/balochi-tokenizer")
tokens = tok.tokenize("دبستانانی وانشتءَ پیسر")
print(tokens)

Special Tokens

Token ID
[PAD] 0
[UNK] 1
[CLS] 2
[SEP] 3
[MASK] 4
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support