Gemma-7B-it Sindhi Tokenizer Extension (20k Unigram)

Model Details

Model Description

This is a specialized extension of the Gemma-7B-it tokenizer, optimized for the Sindhi language. Developed as part of a Master’s thesis research project at Muhammad Ali Jinnah University (MAJU), this model addresses the token "fertility" challenge in low-resource NLP (base tokenizers fragment Sindhi words into many subword pieces) by expanding the base vocabulary with a custom 20,000-token Unigram SentencePiece model trained on a diverse Sindhi corpus.
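Fertility here is the average number of subword tokens produced per whitespace-separated word; a lower value on Sindhi text indicates a better-fitting vocabulary. A minimal measurement sketch (the helper name is illustrative, not part of this repository):

```python
def fertility(tokenize, texts):
    # Fertility = total subword tokens / total whitespace words.
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return tokens / words

# Intended usage with a Hugging Face tokenizer (requires transformers):
#   tok = AutoTokenizer.from_pretrained("Kashif786/gemma-7b-it-sindhi-tokenizer")
#   fertility(tok.tokenize, sindhi_sentences)
```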

  • Developed by: Kashif Ali Turk
  • Supervised by: Dr. Tafseer Ahmed
  • Model type: Tokenizer Extension / Vocabulary Expansion
  • Language(s) (NLP): Sindhi (Primary), English (Base)
  • Finetuned from model: google/gemma-7b-it

Uses

Direct Use

This tokenizer is intended to be used as a drop-in replacement for the standard Gemma tokenizer when working with Sindhi text. It is designed for:

  1. Continual Pre-training (CPT): Aligning new Sindhi embeddings with the base model's latent space.
  2. Fine-tuning: Training Gemma on Sindhi-specific tasks like translation, summarization, or interpreting Sindhi poetry (e.g., Shah Jo Risalo).
  3. Efficient Inference: Reducing latency and memory usage for Sindhi-language applications.
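For the drop-in-replacement workflow above, the only model-side change needed before CPT or fine-tuning is resizing Gemma's embedding matrix to cover the extended vocabulary. A minimal sketch (repo IDs taken from this card; the wrapper function is illustrative):

```python
def resize_for_extended_vocab(model, tokenizer):
    # Grow the input/output embedding matrices so every new Sindhi token
    # has a row; the added rows are untrained and must be learned via CPT.
    model.resize_token_embeddings(len(tokenizer))
    return model

# Intended usage (requires transformers and access to the Gemma weights):
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("Kashif786/gemma-7b-it-sindhi-tokenizer")
#   model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it")
#   model = resize_for_extended_vocab(model, tok)
```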

Out-of-Scope Use

  • This is a tokenizer-only upload. It does not include pre-trained weights for the new Sindhi tokens; these must be initialized (e.g., via mean-pooling) and trained.
  • It is not designed for languages other than Sindhi and English.

Bias, Risks, and Limitations

While the tokenizer improves efficiency, users should be aware that:

  • The vocabulary is based on a specific Sindhi news and literature corpus; niche dialects or archaic scripts may still face sub-optimal tokenization.
  • New token embeddings are initialized with noise/mean values and require training before the model can generate coherent Sindhi text.
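The mean-value initialization mentioned above can be sketched with NumPy on the raw embedding matrix (shapes and the function name are illustrative; in practice this would operate on the model's embedding weights after resizing):

```python
import numpy as np

def mean_init_new_rows(weights, num_new):
    # weights: (vocab_size, dim) embedding matrix whose last `num_new` rows
    # were just appended by the resize. Set them to the mean of the
    # pre-trained rows; they still require training afterwards.
    weights[-num_new:] = weights[:-num_new].mean(axis=0, keepdims=True)
    return weights
```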

How to Get Started with the Model

from transformers import AutoTokenizer

# Load the extended Sindhi tokenizer
tokenizer = AutoTokenizer.from_pretrained("Kashif786/gemma-7b-it-sindhi-tokenizer")

# "Islamabad (Web Desk): the Foreign Office spokesperson said ..."
test_text = "اسلام آباد (ويب ڊيسڪ) پرڏيهي کاتي جي ترجمان چيو آهي"
tokens = tokenizer.tokenize(test_text)
print(f"Tokenized sequence: {tokens}")
