---
library_name: transformers
tags:
- sindhi
- nlp
- qwen
- tokenizer-extension
- low-resource-languages
- unigram
language:
- sd
- en
base_model: Qwen/Qwen2.5-7B
---
# Qwen2.5-7B Sindhi Tokenizer Extension (20k Unigram)
## Model Details
### Model Description
This is an optimized tokenizer extension for **Qwen2.5-7B**, specifically engineered to enhance performance for the **Sindhi language**. Developed as part of a Master's thesis research project, this tokenizer expands the native Qwen vocabulary with **20,000 unique Sindhi tokens** derived from a custom SentencePiece Unigram model.
- **Developed by:** Kashif Ali Turk
- **Supervised by:** Dr. Tafseer Ahmed
- **Model type:** Tokenizer Extension / Vocabulary Expansion
- **Language(s) (NLP):** Sindhi (Primary), English (Base)
- **Finetuned from model:** Qwen/Qwen2.5-7B
## Uses
### Direct Use
This tokenizer serves as a drop-in replacement for the default Qwen2.5 tokenizer when processing Sindhi text. It is designed for:
1. **Efficient Tokenization**: Reducing the sequence length of Sindhi text for faster inference and lower memory consumption.
2. **Continual Pre-training**: Providing a structured vocabulary for aligning new Sindhi embeddings.
3. **Advanced NLP Tasks**: Improving model performance on Sindhi-specific summarization, translation, and sentiment analysis.
### Out-of-Scope Use
- This repository contains **tokenizer files only**. It does not include trained model weights for the new tokens; these must be initialized and trained separately.
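Since the repository ships no weights for the added tokens, anyone continuing pre-training must grow the embedding matrix themselves. Below is a minimal NumPy sketch of one common heuristic (mean initialization); the embedding dimension and noise scale are illustrative assumptions, not values from the thesis:

```python
import numpy as np

# Illustrative sizes: base Qwen vocab plus the 20,000 added Sindhi tokens.
base_vocab, added, dim = 151_643, 20_000, 8  # dim is a toy value
rng = np.random.default_rng(0)
emb = rng.normal(size=(base_vocab, dim))

# Mean initialization: start each added token at the mean of the existing
# embeddings, plus small noise so the new rows are not identical.
mean_vec = emb.mean(axis=0)
new_rows = mean_vec + 0.01 * rng.normal(size=(added, dim))
emb = np.vstack([emb, new_rows])
print(emb.shape)  # (171643, 8)
```

With `transformers`, the equivalent step is typically `model.resize_token_embeddings(len(tokenizer))` followed by overwriting the freshly allocated rows before training.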
## How to Get Started with the Model
```python
from transformers import AutoTokenizer
# Load the extended Sindhi tokenizer
tokenizer = AutoTokenizer.from_pretrained("Kashif786/qwen2.5-sindhi-tokenizer")
test_text = "جمال الدين “جوڳو” ولد تاج محمد جمالي"
encoded = tokenizer.encode(test_text)
print(f"Token IDs: {encoded}")
print(f"Token count: {len(encoded)}")
```
## Training Details
### Training Data
The vocabulary was generated using a Sindhi Universal Corpus. The dataset includes:
- Sindhi news archives and digital journalism.
- Traditional Sindhi literature and poetry.
- Web-crawled content to capture contemporary linguistic use.
### Preprocessing
- **Algorithm:** SentencePiece Unigram.
- **Vocab Addition:** 20,000 new tokens added as `added_tokens` to the base Qwen vocabulary.
- **Formatting:** Tiktoken-compatible cleaning to ensure seamless integration with the Qwen architecture.
## Evaluation
### Results (Empirical Comparison)
Based on testing with formal Sindhi biographical text:
| Metric | Original Qwen2.5 | Extended Qwen (This Model) |
|---|---|---|
| Total Vocab Size | 151,643 | 156,998+ |
| Sindhi Token Count | High (Byte-fallback) | Significant Reduction |
| Chars / Token | ~2.0 | ~4.0+ |
| Sequence Compression | 0% (baseline) | ~45-55% reduction |

#### Summary
The extension drastically reduces the "fertility rate" of Sindhi text, allowing the model to process nearly double the information within the same context window compared to the base model.
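The relationship between chars-per-token and sequence compression can be sanity-checked with simple arithmetic. The token counts below are hypothetical values chosen to match the table, not measurements from the thesis:

```python
def chars_per_token(n_chars: int, n_tokens: int) -> float:
    """Average characters covered by one token (higher = denser encoding)."""
    return n_chars / n_tokens

def compression(base_tokens: int, new_tokens: int) -> float:
    """Fraction of sequence length saved relative to the base tokenizer."""
    return 1 - new_tokens / base_tokens

# Hypothetical 400-character Sindhi passage.
base_tokens, new_tokens = 200, 100   # ~2.0 vs ~4.0 chars/token
print(chars_per_token(400, base_tokens))     # 2.0
print(chars_per_token(400, new_tokens))      # 4.0
print(compression(base_tokens, new_tokens))  # 0.5 -> sequences ~50% shorter
```

Doubling chars-per-token from ~2.0 to ~4.0 halves the sequence length, which is exactly the ~50% compression the table reports.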
## Technical Specifications
### Model Architecture and Objective
The extension utilizes a Unigram approach, which is more effective than standard BPE at identifying meaningful subword units in morphologically rich languages like Sindhi.
## Model Card Authors
- Kashif Ali Turk (MSCS Student, MAJU)
## Model Card Contact
- LinkedIn: Kashif Ali Turk