---
library_name: transformers
tags:
  - sindhi
  - nlp
  - tokenizer-extension
  - gemma
  - low-resource-languages
  - unigram
language:
  - sd
  - en
base_model: google/gemma-7b-it
---

# Gemma-7B-it Sindhi Tokenizer Extension (20k Unigram)

## Model Details

### Model Description

This is a specialized extension of the Gemma-7B-it tokenizer, optimized for the Sindhi language. Developed as part of a Master's thesis research project at Muhammad Ali Jinnah University (MAJU), it addresses the high token "fertility" (the average number of sub-word tokens produced per word) that Sindhi text suffers under the base vocabulary, a common challenge in low-resource NLP, by expanding that vocabulary with a custom 20,000-token Unigram SentencePiece model trained on a diverse Sindhi corpus.
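The fertility metric above is simply total tokens divided by total whitespace-separated words; a lower value means a more efficient vocabulary for the language. A minimal sketch of the metric (the `char_chunk_tokenize` stand-in is an illustrative assumption, not this tokenizer's behaviour):

```python
def fertility(tokenize, texts):
    """Average number of sub-word tokens produced per whitespace word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    return total_tokens / total_words

# Illustrative stand-in for a real tokenizer: splits each word into
# fixed-size character chunks (NOT the actual Gemma segmentation).
def char_chunk_tokenize(text, size=3):
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]

print(fertility(char_chunk_tokenize, ["tokenizer extension for sindhi"]))  # 2.25
```

Swapping `char_chunk_tokenize` for a real tokenizer's `.tokenize` lets you compare the base and extended vocabularies on the same Sindhi corpus.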

- **Developed by:** Kashif Ali Turk
- **Supervised by:** Dr. Tafseer Ahmed
- **Model type:** Tokenizer Extension / Vocabulary Expansion
- **Language(s) (NLP):** Sindhi (Primary), English (Base)
- **Finetuned from model:** google/gemma-7b-it

## Uses

### Direct Use

This tokenizer is intended to be used as a drop-in replacement for the standard Gemma tokenizer when working with Sindhi text. It is designed for:

  1. Continual Pre-training (CPT): Aligning new Sindhi embeddings with the base model's latent space.
  2. Fine-tuning: Training Gemma on Sindhi-specific tasks like translation, summarization, or interpreting Sindhi poetry (e.g., Shah Jo Risalo).
  3. Efficient Inference: Reducing latency and memory usage for Sindhi-language applications.
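The inference saving in point 3 comes directly from shorter token sequences: KV-cache memory grows linearly with sequence length, so fewer tokens per document means less memory per request. A rough sketch of the arithmetic (the layer/head/dimension defaults follow the published Gemma-7B configuration, but treat them as illustrative assumptions and verify against the actual model config):

```python
def kv_cache_bytes(seq_len, layers=28, kv_heads=16, head_dim=256, dtype_bytes=2):
    """Bytes of KV cache for one sequence: keys + values, per layer, per head."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# If the extended vocabulary cuts a Sindhi document from 1500 to 900 tokens:
saved = kv_cache_bytes(1500) - kv_cache_bytes(900)
print(f"{saved / 2**20:.1f} MiB saved")  # 262.5 MiB saved
```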

### Out-of-Scope Use

- This is a **tokenizer-only** upload. It does not include pre-trained weights for the new Sindhi tokens; these must be initialized (e.g., via mean-pooling) and trained.
- It is not designed for languages other than Sindhi and English.

## Bias, Risks, and Limitations

While the tokenizer improves efficiency, users should be aware that:

- The vocabulary is based on a specific Sindhi news and literature corpus; niche dialects or archaic scripts may still face sub-optimal tokenization.
- New token embeddings are initialized with noise/mean values and require training before the model can generate coherent Sindhi text.
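The mean-pooling initialization mentioned above can be sketched as follows: each new embedding row starts at the mean of the existing rows, plus small Gaussian noise. The matrix sizes and noise scale below are illustrative assumptions, not the thesis's exact settings:

```python
import numpy as np

def extend_embeddings(emb, num_new, noise_std=0.02, seed=0):
    """Append rows for new tokens, each initialised at the mean of the
    existing embeddings plus small Gaussian noise."""
    rng = np.random.default_rng(seed)
    mean_vec = emb.mean(axis=0, keepdims=True)  # (1, dim)
    new_rows = mean_vec + rng.normal(0.0, noise_std, size=(num_new, emb.shape[1]))
    return np.concatenate([emb, new_rows], axis=0)

# Toy stand-in for Gemma's (256000, d) embedding matrix:
base = np.random.default_rng(1).normal(size=(1000, 8))
extended = extend_embeddings(base, num_new=20000)
print(extended.shape)  # (21000, 8)
```

In practice you would call `model.resize_token_embeddings(len(tokenizer))` on the Gemma model and then overwrite the new rows in the same way before continual pre-training.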

## How to Get Started with the Model

```python
from transformers import AutoTokenizer

# Load the extended Sindhi tokenizer
tokenizer = AutoTokenizer.from_pretrained("Kashif786/gemma-7b-it-sindhi-tokenizer")

test_text = "اسلام آباد (ويب ڊيسڪ) پرڏيهي کاتي جي ترجمان چيو آهي"
tokens = tokenizer.tokenize(test_text)
print(f"Tokenized Sequence: {tokens}")
```