Kashif786/qwen2.5-sindhi-tokenizer

Kashif786 committed on Feb 21

Commit e1b1f9c (verified) · 1 Parent(s): f0c19d5

Create README.md

README.md +104 -0

	@@ -0,0 +1,104 @@

+---
+library_name: transformers
+tags:
+- sindhi
+- nlp
+- qwen
+- tokenizer-extension
+- low-resource-languages
+- unigram
+language:
+- sd
+- en
+base_model: Qwen/Qwen2.5-7B
+---
+# Qwen2.5-7B Sindhi Tokenizer Extension (20k Unigram)
+## Model Details
+### Model Description
+This is an optimized tokenizer extension for **Qwen2.5-7B**, specifically engineered to enhance performance for the **Sindhi language**. Developed as part of a Master's thesis research project, this model expands the native Qwen vocabulary with **20,000 unique Sindhi tokens** derived from a custom SentencePiece Unigram model.
+- **Developed by:** Kashif Ali Turk
+- **Supervised by:** Dr. Tafseer Ahmed
+- **Model type:** Tokenizer Extension / Vocabulary Expansion
+- **Language(s) (NLP):** Sindhi (Primary), English (Base)
+- **Finetuned from model:** Qwen/Qwen2.5-7B
+## Uses
+### Direct Use
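+This tokenizer replaces the default Qwen2.5 tokenizer when processing Sindhi text. Because it adds new token IDs, the model's embedding matrix must be resized and the new embeddings trained before use (see Out-of-Scope Use). It is designed for: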
+1. **Efficient Tokenization**: Reducing the sequence length of Sindhi text for faster inference and lower memory consumption.
+2. **Continual Pre-training**: Providing a structured vocabulary for aligning new Sindhi embeddings.
+3. **Downstream NLP Tasks**: Giving an adapted model a more compact input representation for Sindhi summarization, translation, and sentiment analysis.
+### Out-of-Scope Use
+- This repository contains **tokenizer files only**. It does not include trained model weights for the new tokens; these must be initialized and trained separately.
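One common way to initialize the embeddings of newly added tokens, sketched here in plain Python, is to average the base-model embeddings of the pieces that the original tokenizer would have produced for the same text. This is an illustrative assumption, not the procedure used in this project: the function name `mean_init_embedding`, the toy 3-dimensional vectors, and the piece IDs are all hypothetical.

```python
from typing import Dict, List, Sequence


def mean_init_embedding(
    piece_ids: Sequence[int],
    base_embeddings: Dict[int, List[float]],
) -> List[float]:
    """Average the base embeddings of the pieces a new token decomposes into."""
    if not piece_ids:
        raise ValueError("a new token must map to at least one base piece")
    dim = len(base_embeddings[piece_ids[0]])
    total = [0.0] * dim
    for pid in piece_ids:
        vec = base_embeddings[pid]
        for i in range(dim):
            total[i] += vec[i]
    return [value / len(piece_ids) for value in total]


# Toy base embedding table (hypothetical IDs and values).
base_embeddings = {
    0: [1.0, 0.0, 2.0],
    1: [3.0, 4.0, 0.0],
    2: [2.0, 2.0, 2.0],
}

# Hypothetical new token whose text the base tokenizer splits into pieces 0 and 1.
new_vector = mean_init_embedding([0, 1], base_embeddings)
print(new_vector)
```

With a real model, the same averaging would be applied row by row to the resized embedding matrix before continual pre-training; random initialization is the simpler alternative.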
+## How to Get Started with the Model
+```python
+from transformers import AutoTokenizer
+
+# Load the extended Sindhi tokenizer from the Hugging Face Hub
+tokenizer = AutoTokenizer.from_pretrained("Kashif786/qwen2.5-sindhi-tokenizer")
+
+# Encode a sample Sindhi sentence and inspect the result
+test_text = "جمال الدين ’جوڳي‘ ولد تاج محمد جمالي"
+encoded = tokenizer.encode(test_text)
+print(f"Token IDs: {encoded}")
+print(f"Token count: {len(encoded)}")
+```
+## Training Details
+### Training Data
+The vocabulary was generated using a **Sindhi Universal Corpus**. The dataset includes:
+* Sindhi news archives and digital journalism.
+* Traditional Sindhi literature and poetry.
+* Web-crawled content to capture contemporary linguistic use.
+### Preprocessing
+* **Algorithm**: SentencePiece Unigram.
+* **Vocab Addition**: 20,000 new tokens added as `added_tokens` to the base Qwen vocabulary.
+* **Formatting**: Tiktoken-compatible cleaning to ensure seamless integration with the Qwen architecture.
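The README does not spell out the cleaning step, so the following is only a guess at what such a filter might look like: it strips the SentencePiece word-boundary marker from each candidate piece, drops empty pieces, and skips anything already present in the base vocabulary. The helper `select_new_tokens` and the sample pieces are hypothetical, and dropping the marker discards word-boundary information, which is a deliberate simplification.

```python
from typing import Iterable, List, Set

SPM_SPACE = "\u2581"  # SentencePiece word-boundary marker


def select_new_tokens(pieces: Iterable[str], base_vocab: Set[str]) -> List[str]:
    """Return cleaned pieces absent from the base vocabulary, in order, without duplicates."""
    selected: List[str] = []
    seen: Set[str] = set()
    for piece in pieces:
        text = piece.replace(SPM_SPACE, "").strip()
        if not text or text in base_vocab or text in seen:
            continue
        seen.add(text)
        selected.append(text)
    return selected


# Hypothetical candidate pieces and a tiny stand-in for the base vocabulary.
candidates = ["\u2581سنڌ", "سنڌ", "\u2581the", "\u2581", "ي"]
base_vocab = {"the", "ي"}
print(select_new_tokens(candidates, base_vocab))
```

A list produced this way could then be passed to the tokenizer's `add_tokens` method in `transformers`, which registers the strings as added tokens.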
+## Evaluation
+### Results (Empirical Comparison)
+Based on testing with formal Sindhi biographical text:
+| Metric | Original Qwen2.5 | Extended Qwen (This Model) |
+| --- | --- | --- |
+| **Total Vocab Size** | 151,643 | **156,998+** |
+| **Sindhi Token Count** | High (Byte-fallback) | **Significant Reduction** |
+| **Chars / Token** | ~2.0 | **~4.0+** |
+| **Sequence Compression** | 0% (baseline) | **~45% - 55% fewer tokens** |
+### Summary
+The extension substantially reduces the fertility (tokens per word) of Sindhi text, allowing the model to fit roughly **twice as much Sindhi text** within the same context window as the base tokenizer.
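The metrics in the table can be computed from simple counts. The sketch below shows the arithmetic; the function names are hypothetical, and the counts in the example are made-up round numbers chosen to mirror the approximate ratios reported above, not measurements from this project.

```python
def chars_per_token(n_chars: int, n_tokens: int) -> float:
    """Average number of characters covered by one token."""
    return n_chars / n_tokens


def fertility(n_tokens: int, n_words: int) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    return n_tokens / n_words


def token_reduction(base_tokens: int, extended_tokens: int) -> float:
    """Fraction of tokens saved by the extended tokenizer relative to the base one."""
    return 1.0 - extended_tokens / base_tokens


# Hypothetical counts for one passage: 40 characters, 8 words.
n_chars, n_words = 40, 8
base_tokens, extended_tokens = 20, 10

print(chars_per_token(n_chars, base_tokens))      # base tokenizer
print(chars_per_token(n_chars, extended_tokens))  # extended tokenizer
print(fertility(base_tokens, n_words), fertility(extended_tokens, n_words))
print(token_reduction(base_tokens, extended_tokens))
```

Halving the token count doubles the characters per token, which is why a reduction of about 50% corresponds to fitting about twice as much text in a fixed context window.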
+## Technical Specifications
+### Model Architecture and Objective
+The added vocabulary comes from a **Unigram** language-model tokenizer, which scores candidate segmentations probabilistically and can identify meaningful subword units in morphologically rich languages like Sindhi more reliably than standard BPE.
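To illustrate how a Unigram model chooses a segmentation, here is a minimal Viterbi sketch: each piece has a log-probability, and the segmentation with the highest total score wins. The vocabulary and probabilities are toy values in Latin script invented for readability; they are not taken from the Sindhi model, and real SentencePiece adds normalization and unknown-piece handling that are omitted here.

```python
import math
from typing import Dict, List, Tuple


def viterbi_segment(text: str, log_probs: Dict[str, float]) -> Tuple[List[str], float]:
    """Return the highest-scoring segmentation of text and its total log-probability."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)
    best_start = [0] * (n + 1)
    best_score[0] = 0.0
    max_len = max(len(piece) for piece in log_probs)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start

    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces: List[str] = []
    end = n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    pieces.reverse()
    return pieces, best_score[n]


# Toy vocabulary with hypothetical probabilities.
toy_probs = {"unigram": 0.02, "un": 0.1, "s": 0.05}
toy_probs.update({ch: 0.01 for ch in "uigram"})
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

pieces, score = viterbi_segment("unigrams", toy_log_probs)
print(pieces, score)
```

Because a single long piece usually scores higher than the product of many short ones, frequent Sindhi words and morphemes in the added vocabulary tend to be emitted as one token instead of several byte-level fragments.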
+## Model Card Authors
+* **Kashif Ali Turk** (MSCS Student, MAJU)
+## Model Card Contact
+* LinkedIn: [Kashif Ali Turk](https://www.linkedin.com/in/kashif-ali-2727a91a5)