Gemma-7B-it Sindhi Tokenizer Extension (20k Unigram)
Model Details
Model Description
This is a specialized extension of the Gemma-7B-it tokenizer, optimized for the Sindhi language. Developed as part of a Master's thesis research project at Muhammad Ali Jinnah University (MAJU), this model addresses the high "fertility rate" (average number of subword tokens produced per word) that low-resource languages face with general-purpose tokenizers, by expanding the base vocabulary with a custom 20,000-token Unigram SentencePiece model trained on a diverse Sindhi corpus.
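Fertility can be computed as total tokens divided by total words. The sketch below illustrates the idea with hypothetical tokenizations (the actual splits produced by either tokenizer will differ): a base tokenizer may fragment Sindhi words into many small pieces, while the extended vocabulary keeps them closer to whole words.

```python
def fertility(tokenized_words):
    """Average number of subword tokens per word.

    `tokenized_words` maps each word to its token list.
    """
    total_tokens = sum(len(toks) for toks in tokenized_words.values())
    return total_tokens / len(tokenized_words)

# Hypothetical tokenizations, for illustration only: the base tokenizer
# splits each word into single characters, the extended one keeps it whole.
base = {"پرڏيهي": ["پ", "ر", "ڏ", "ي", "ه", "ي"], "کاتي": ["ک", "ا", "ت", "ي"]}
extended = {"پرڏيهي": ["پرڏيهي"], "کاتي": ["کاتي"]}

print(fertility(base))      # 5.0
print(fertility(extended))  # 1.0
```

A lower fertility means shorter input sequences, which directly reduces compute and memory at training and inference time.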
- Developed by: Kashif Ali Turk
- Supervised by: Dr. Tafseer Ahmed
- Model type: Tokenizer Extension / Vocabulary Expansion
- Language(s) (NLP): Sindhi (Primary), English (Base)
- Finetuned from model: google/gemma-7b-it
Uses
Direct Use
This tokenizer is intended to be used as a drop-in replacement for the standard Gemma tokenizer when working with Sindhi text. It is designed for:
- Continual Pre-training (CPT): Aligning new Sindhi embeddings with the base model's latent space.
- Fine-tuning: Training Gemma on Sindhi-specific tasks like translation, summarization, or interpreting Sindhi poetry (e.g., Shah Jo Risalo).
- Efficient Inference: Reducing latency and memory usage for Sindhi-language applications.
Out-of-Scope Use
- This is a tokenizer-only upload. It does not include pre-trained weights for the new Sindhi tokens; these must be initialized (e.g., via mean-pooling) and trained.
- It is not designed for languages other than Sindhi and English.
Bias, Risks, and Limitations
While the tokenizer improves efficiency, users should be aware that:
- The vocabulary is based on a specific Sindhi news and literature corpus; niche dialects or archaic scripts may still face sub-optimal tokenization.
- New token embeddings are initialized with noise/mean values and require training before the model can generate coherent Sindhi text.
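The mean-pooling initialization mentioned above can be sketched conceptually with NumPy. This is a toy illustration, not the model's actual embedding matrix; in practice you would first call `model.resize_token_embeddings(len(tokenizer))` on the Gemma model and then overwrite the new rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the embedding matrix: 8 existing tokens, dimension 4
# (real Gemma-7B uses a 256k vocabulary and dimension 3072).
old_vocab, dim, num_new_tokens = 8, 4, 3
embeddings = rng.normal(size=(old_vocab, dim))

# Mean-pooling initialization: each new Sindhi token starts at the mean of
# the existing embeddings, plus small noise to break symmetry.
mean_vec = embeddings.mean(axis=0)
noise = rng.normal(scale=0.02, size=(num_new_tokens, dim))
extended_embeddings = np.vstack([embeddings, mean_vec + noise])

print(extended_embeddings.shape)  # (11, 4)
```

Starting new rows near the mean of the trained embedding distribution keeps them inside the model's latent space, so continual pre-training converges faster than with purely random initialization.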
How to Get Started with the Model
```python
from transformers import AutoTokenizer

# Load the extended Sindhi tokenizer
tokenizer = AutoTokenizer.from_pretrained("Kashif786/gemma-7b-it-sindhi-tokenizer")

# Sample Sindhi news text
test_text = "اسلام آباد (ويب ڊيسڪ) پرڏيهي کاتي جي ترجمان چيو آهي"
tokens = tokenizer.tokenize(test_text)
print(f"Tokenized Sequence: {tokens}")
```