---
library_name: transformers
tags:
- sindhi
- nlp
- qwen
- tokenizer-extension
- low-resource-languages
- unigram
language:
- sd
- en
base_model: Qwen/Qwen2.5-7B
---
# Qwen2.5-7B Sindhi Tokenizer Extension (20k Unigram)
## Model Details
### Model Description
This is an optimized tokenizer extension for **Qwen2.5-7B**, specifically engineered to enhance performance for the **Sindhi language**. Developed as part of a Master's thesis research project, this tokenizer expands the native Qwen vocabulary with **20,000 unique Sindhi tokens** derived from a custom SentencePiece Unigram model.
- **Developed by:** Kashif Ali Turk
- **Supervised by:** Dr. Tafseer Ahmed
- **Model type:** Tokenizer Extension / Vocabulary Expansion
- **Language(s) (NLP):** Sindhi (Primary), English (Base)
- **Finetuned from model:** Qwen/Qwen2.5-7B
## Uses
### Direct Use
This tokenizer serves as a drop-in replacement for the default Qwen2.5 tokenizer when processing Sindhi text. It is designed for:
1. **Efficient Tokenization**: Reducing the sequence length of Sindhi text for faster inference and lower memory consumption.
2. **Continual Pre-training**: Providing a structured vocabulary for aligning new Sindhi embeddings.
3. **Advanced NLP Tasks**: Improving model performance on Sindhi-specific summarization, translation, and sentiment analysis.
### Out-of-Scope Use
- This repository contains **tokenizer files only**. It does not include trained model weights for the new tokens; these must be initialized and trained separately.
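Since the repository ships no weights for the added tokens, anyone continuing pre-training must grow the embedding matrix themselves. Below is a minimal NumPy sketch of one common heuristic (mean initialization); the embedding dimension and noise scale are illustrative assumptions, not values from the thesis:

```python
import numpy as np

# Illustrative sizes: base Qwen vocab plus the 20,000 added Sindhi tokens.
base_vocab, added, dim = 151_643, 20_000, 8  # dim is a toy value
rng = np.random.default_rng(0)
emb = rng.normal(size=(base_vocab, dim))

# Mean initialization: start each added token at the mean of the existing
# embeddings, plus small noise so the new rows are not identical.
mean_vec = emb.mean(axis=0)
new_rows = mean_vec + 0.01 * rng.normal(size=(added, dim))
emb = np.vstack([emb, new_rows])
print(emb.shape)  # (171643, 8)
```

With `transformers`, the equivalent step is typically `model.resize_token_embeddings(len(tokenizer))` followed by overwriting the freshly allocated rows before training.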
## How to Get Started with the Model
```python
from transformers import AutoTokenizer
# Load the extended Sindhi tokenizer
tokenizer = AutoTokenizer.from_pretrained("Kashif786/qwen2.5-sindhi-tokenizer")
test_text = "جمال الدين “جوڳو” ولد تاج محمد جمالي"
encoded = tokenizer.encode(test_text)
print(f"Token IDs: {encoded}")
print(f"Token count: {len(encoded)}")
```
## Training Details
### Training Data
The vocabulary was generated using a Sindhi Universal Corpus. The dataset includes:
- Sindhi news archives and digital journalism.
- Traditional Sindhi literature and poetry.
- Web-crawled content to capture contemporary linguistic use.
### Preprocessing
- **Algorithm:** SentencePiece Unigram.
- **Vocab Addition:** 20,000 new tokens added as `added_tokens` to the base Qwen vocabulary.
- **Formatting:** Tiktoken-compatible cleaning to ensure seamless integration with the Qwen architecture.
## Evaluation
### Results (Empirical Comparison)
Based on testing with formal Sindhi biographical text:
| Metric | Original Qwen2.5 | Extended Qwen (This Model) |
|---|---|---|
| Total Vocab Size | 151,643 | 156,998+ |
| Sindhi Token Count | High (Byte-fallback) | Significant Reduction |
| Chars / Token | ~2.0 | ~4.0+ |
| Sequence Compression | 0% (baseline) | ~45-55% reduction |

#### Summary
The extension drastically reduces the "fertility rate" of Sindhi text, allowing the model to process nearly double the information within the same context window compared to the base model.
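The relationship between chars-per-token and sequence compression can be sanity-checked with simple arithmetic. The token counts below are hypothetical values chosen to match the table, not measurements from the thesis:

```python
def chars_per_token(n_chars: int, n_tokens: int) -> float:
    """Average characters covered by one token (higher = denser encoding)."""
    return n_chars / n_tokens

def compression(base_tokens: int, new_tokens: int) -> float:
    """Fraction of sequence length saved relative to the base tokenizer."""
    return 1 - new_tokens / base_tokens

# Hypothetical 400-character Sindhi passage.
base_tokens, new_tokens = 200, 100   # ~2.0 vs ~4.0 chars/token
print(chars_per_token(400, base_tokens))     # 2.0
print(chars_per_token(400, new_tokens))      # 4.0
print(compression(base_tokens, new_tokens))  # 0.5 -> sequences ~50% shorter
```

Doubling chars-per-token from ~2.0 to ~4.0 halves the sequence length, which is exactly the ~50% compression the table reports.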
## Technical Specifications
### Model Architecture and Objective
The extension utilizes a Unigram approach, which is more effective than standard BPE at identifying meaningful subword units in morphologically rich languages like Sindhi.
## Model Card Authors
- Kashif Ali Turk (MSCS Student, MAJU)
## Model Card Contact
- LinkedIn: Kashif Ali Turk