---
library_name: transformers
tags:
  - sindhi
  - nlp
  - tokenizer-extension
  - gemma
  - low-resource-languages
  - unigram
language:
  - sd
  - en
base_model: google/gemma-7b-it
---

# Gemma-7B-it Sindhi Tokenizer Extension (20k Unigram)

## Model Details

### Model Description

This is a specialized extension of the Gemma-7B-it tokenizer, optimized for the Sindhi language. Developed as part of a Master's thesis research project at Muhammad Ali Jinnah University (MAJU), it addresses the high token "fertility" (the average number of sub-word tokens produced per word) that Sindhi text suffers under the base vocabulary, a common challenge in low-resource NLP, by expanding that vocabulary with a custom 20,000-token Unigram SentencePiece model trained on a diverse Sindhi corpus.
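The fertility metric above is simply total tokens divided by total whitespace-separated words; a lower value means a more efficient vocabulary for the language. A minimal sketch of the metric (the `char_chunk_tokenize` stand-in is an illustrative assumption, not this tokenizer's behaviour):

```python
def fertility(tokenize, texts):
    """Average number of sub-word tokens produced per whitespace word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    return total_tokens / total_words

# Illustrative stand-in for a real tokenizer: splits each word into
# fixed-size character chunks (NOT the actual Gemma segmentation).
def char_chunk_tokenize(text, size=3):
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]

print(fertility(char_chunk_tokenize, ["tokenizer extension for sindhi"]))  # 2.25
```

Swapping `char_chunk_tokenize` for a real tokenizer's `.tokenize` lets you compare the base and extended vocabularies on the same Sindhi corpus.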

- **Developed by:** Kashif Ali Turk
- **Supervised by:** Dr. Tafseer Ahmed
- **Model type:** Tokenizer Extension / Vocabulary Expansion
- **Language(s) (NLP):** Sindhi (Primary), English (Base)
- **Finetuned from model:** google/gemma-7b-it

## Uses

### Direct Use

This tokenizer is intended to be used as a drop-in replacement for the standard Gemma tokenizer when working with Sindhi text. It is designed for:

  1. Continual Pre-training (CPT): Aligning new Sindhi embeddings with the base model's latent space.
  2. Fine-tuning: Training Gemma on Sindhi-specific tasks like translation, summarization, or interpreting Sindhi poetry (e.g., Shah Jo Risalo).
  3. Efficient Inference: Reducing latency and memory usage for Sindhi-language applications.
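The inference saving in point 3 comes directly from shorter token sequences: KV-cache memory grows linearly with sequence length, so fewer tokens per document means less memory per request. A rough sketch of the arithmetic (the layer/head/dimension defaults follow the published Gemma-7B configuration, but treat them as illustrative assumptions and verify against the actual model config):

```python
def kv_cache_bytes(seq_len, layers=28, kv_heads=16, head_dim=256, dtype_bytes=2):
    """Bytes of KV cache for one sequence: keys + values, per layer, per head."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# If the extended vocabulary cuts a Sindhi document from 1500 to 900 tokens:
saved = kv_cache_bytes(1500) - kv_cache_bytes(900)
print(f"{saved / 2**20:.1f} MiB saved")  # 262.5 MiB saved
```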

### Out-of-Scope Use

- This is a **tokenizer-only** upload. It does not include pre-trained weights for the new Sindhi tokens; these must be initialized (e.g., via mean-pooling) and trained.
- It is not designed for languages other than Sindhi and English.

## Bias, Risks, and Limitations

While the tokenizer improves efficiency, users should be aware that:

- The vocabulary is based on a specific Sindhi news and literature corpus; niche dialects or archaic scripts may still face sub-optimal tokenization.
- New token embeddings are initialized with noise/mean values and require training before the model can generate coherent Sindhi text.
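The mean-pooling initialization mentioned above can be sketched as follows: each new embedding row starts at the mean of the existing rows, plus small Gaussian noise. The matrix sizes and noise scale below are illustrative assumptions, not the thesis's exact settings:

```python
import numpy as np

def extend_embeddings(emb, num_new, noise_std=0.02, seed=0):
    """Append rows for new tokens, each initialised at the mean of the
    existing embeddings plus small Gaussian noise."""
    rng = np.random.default_rng(seed)
    mean_vec = emb.mean(axis=0, keepdims=True)  # (1, dim)
    new_rows = mean_vec + rng.normal(0.0, noise_std, size=(num_new, emb.shape[1]))
    return np.concatenate([emb, new_rows], axis=0)

# Toy stand-in for Gemma's (256000, d) embedding matrix:
base = np.random.default_rng(1).normal(size=(1000, 8))
extended = extend_embeddings(base, num_new=20000)
print(extended.shape)  # (21000, 8)
```

In practice you would call `model.resize_token_embeddings(len(tokenizer))` on the Gemma model and then overwrite the new rows in the same way before continual pre-training.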

## How to Get Started with the Model

```python
from transformers import AutoTokenizer

# Load the extended Sindhi tokenizer
tokenizer = AutoTokenizer.from_pretrained("Kashif786/gemma-7b-it-sindhi-tokenizer")

test_text = "اسلام آباد (ويب ڊيسڪ) پرڏيهي کاتي جي ترجمان چيو آهي"
tokens = tokenizer.tokenize(test_text)
print(f"Tokenized Sequence: {tokens}")
```