Kashif786 committed on
Commit
e1b1f9c
·
verified ·
1 Parent(s): f0c19d5

Create README.md

Files changed (1)
  1. README.md +104 -0
README.md ADDED
@@ -0,0 +1,104 @@
1
+
2
+
3
+ ---
4
+ library_name: transformers
5
+ tags:
6
+ - sindhi
7
+ - nlp
8
+ - qwen
9
+ - tokenizer-extension
10
+ - low-resource-languages
11
+ - unigram
12
+ language:
13
+ - sd
14
+ - en
15
+ base_model: Qwen/Qwen2.5-7B
16
+ ---
17
+
18
+ # Qwen2.5-7B Sindhi Tokenizer Extension (20k Unigram)
19
+
20
+ ## Model Details
21
+
22
+ ### Model Description
23
+
24
+ This repository provides a tokenizer extension for **Qwen2.5-7B** that makes tokenization of the **Sindhi language** more efficient. Developed as part of a Master's thesis research project, it expands the native Qwen vocabulary with **20,000 unique Sindhi tokens** derived from a custom SentencePiece Unigram model.
25
+
26
+ - **Developed by:** Kashif Ali Turk
27
+ - **Supervised by:** Dr. Tafseer Ahmed
28
+ - **Model type:** Tokenizer Extension / Vocabulary Expansion
29
+ - **Language(s) (NLP):** Sindhi (Primary), English (Base)
30
+ - **Finetuned from model:** Qwen/Qwen2.5-7B
31
+
32
+ ## Uses
33
+
34
+ ### Direct Use
35
+
36
+ This tokenizer replaces the default Qwen2.5 tokenizer when processing Sindhi text. It is designed for:
37
+ 1. **Efficient Tokenization**: Reducing the sequence length of Sindhi text for faster inference and lower memory consumption.
38
+ 2. **Continual Pre-training**: Providing a structured vocabulary for aligning new Sindhi embeddings.
39
+ 3. **Advanced NLP Tasks**: Supporting Sindhi-specific summarization, translation, and sentiment analysis once a model has been trained with the extended vocabulary.
40
+
41
+ ### Out-of-Scope Use
42
+
43
+ - This repository contains **tokenizer files only**. It does not include trained embedding weights for the new tokens; the model's embedding matrix must be resized, and the new rows initialized and trained separately.
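One common way to initialize rows for added tokens is to average the base model's embeddings of the pieces that the original tokenizer produced for the same string. This is an assumption for illustration, not a procedure documented by this repository. The sketch below shows the idea with plain Python lists; `mean_init_rows` and its arguments are hypothetical names. With `transformers`, the embedding matrix would first be enlarged with `model.resize_token_embeddings(len(tokenizer))`.

```python
from typing import Dict, List, Sequence


def mean_init_rows(
    base_embeddings: Sequence[Sequence[float]],
    new_token_pieces: Dict[int, List[int]],
) -> Dict[int, List[float]]:
    """Return an initial vector for each new token ID (hypothetical helper).

    base_embeddings: rows of the original embedding matrix, indexed by token ID.
    new_token_pieces: maps each new token ID to the base-tokenizer IDs that the
        same string was split into before the vocabulary was extended.
    """
    dim = len(base_embeddings[0])
    rows: Dict[int, List[float]] = {}
    for new_id, piece_ids in new_token_pieces.items():
        if not piece_ids:
            raise ValueError(f"token {new_id} has no base pieces to average")
        # Average the base vectors component by component.
        rows[new_id] = [
            sum(base_embeddings[p][d] for p in piece_ids) / len(piece_ids)
            for d in range(dim)
        ]
    return rows


# Toy example: a 3-token base vocabulary with 2-dimensional embeddings.
toy_base = [[0.0, 2.0], [4.0, 6.0], [1.0, 1.0]]
# Hypothetical new token 3 was previously encoded as base tokens [0, 1].
toy_rows = mean_init_rows(toy_base, {3: [0, 1]})
print(toy_rows)  # {3: [2.0, 4.0]}
```

Mean initialization only gives the new rows a reasonable starting point; they still need continual pre-training on Sindhi text before the model can use them well.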
44
+
45
+ ## How to Get Started with the Model
46
+
47
+ ```python
48
+ from transformers import AutoTokenizer
49
+
50
+ # Load the extended Sindhi tokenizer
51
+ tokenizer = AutoTokenizer.from_pretrained("Kashif786/qwen2.5-sindhi-tokenizer")
52
+
53
+ test_text = "جمال الدين ’جوڳي‘ ولد تاج محمد جمالي"
54
+ encoded = tokenizer.encode(test_text)
55
+ print(f"Token IDs: {encoded}")
56
+
57
+ ```
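To check the effect on your own text, the token count, characters per token, and fertility (tokens per whitespace-separated word) can be compared between the base and extended tokenizers. The helper below is a minimal sketch with hypothetical names; it only assumes an object with an `encode(text)` method and uses two stand-in tokenizers so that it runs without downloading anything.

```python
from typing import Dict, List, Protocol


class Encoder(Protocol):
    """Anything with an `encode(text)` method that returns token IDs."""

    def encode(self, text: str) -> List[int]: ...


def tokenization_stats(tokenizer: Encoder, text: str) -> Dict[str, float]:
    """Compute simple efficiency metrics for one tokenizer on one text."""
    n_tokens = len(tokenizer.encode(text))
    n_words = len(text.split())
    return {
        "tokens": n_tokens,
        "chars_per_token": len(text) / n_tokens if n_tokens else 0.0,
        "fertility": n_tokens / n_words if n_words else 0.0,
    }


class CharTokenizer:
    """Stand-in for a fragmenting tokenizer: one token per non-space character."""

    def encode(self, text: str) -> List[int]:
        return [ord(c) for c in text if not c.isspace()]


class WordTokenizer:
    """Stand-in for an efficient tokenizer: one token per word."""

    def encode(self, text: str) -> List[int]:
        return list(range(len(text.split())))


sample = "abc de"
char_stats = tokenization_stats(CharTokenizer(), sample)
word_stats = tokenization_stats(WordTokenizer(), sample)
print(char_stats)  # {'tokens': 5, 'chars_per_token': 1.2, 'fertility': 2.5}
print(word_stats)  # {'tokens': 2, 'chars_per_token': 3.0, 'fertility': 1.0}
```

Tokenizers returned by `AutoTokenizer.from_pretrained` expose `encode`, so the same function can be called once with the `Qwen/Qwen2.5-7B` tokenizer and once with the extended tokenizer on the same Sindhi text.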
58
+
59
+ ## Training Details
60
+
61
+ ### Training Data
62
+
63
+ The vocabulary was generated using a **Sindhi Universal Corpus**. The dataset includes:
64
+
65
+ * Sindhi news archives and digital journalism.
66
+ * Traditional Sindhi literature and poetry.
67
+ * Web-crawled content to capture contemporary linguistic use.
68
+
69
+ ### Preprocessing
70
+
71
+ * **Algorithm**: SentencePiece Unigram.
72
+ * **Vocab Addition**: 20,000 new tokens added as `added_tokens` to the base Qwen vocabulary.
73
+ * **Formatting**: Tiktoken-compatible cleaning to ensure seamless integration with the Qwen architecture.
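The exact cleaning steps are not spelled out above. As a rough sketch under stated assumptions, a merge step of this kind might strip the SentencePiece word-boundary marker, drop control pieces, and keep only strings that the base vocabulary does not already contain. The function name, the toy pieces, and the representation of the base vocabulary as a set of plain strings are all hypothetical simplifications.

```python
from typing import Iterable, List, Set

SPM_SPACE = "\u2581"  # SentencePiece marks word boundaries with this character


def select_new_tokens(pieces: Iterable[str], base_vocab: Set[str]) -> List[str]:
    """Filter Unigram pieces down to candidate tokens for the base tokenizer."""
    selected: List[str] = []
    seen: Set[str] = set()
    for piece in pieces:
        token = piece.replace(SPM_SPACE, "").strip()
        # Skip empty strings and control pieces such as "<unk>" or "</s>".
        if not token or (token.startswith("<") and token.endswith(">")):
            continue
        # Skip strings already covered by the base vocabulary or seen earlier.
        if token in base_vocab or token in seen:
            continue
        seen.add(token)
        selected.append(token)
    return selected


toy_pieces = ["<unk>", SPM_SPACE + "سنڌ", "سنڌ", SPM_SPACE + "the", SPM_SPACE, "ي"]
toy_base_vocab = {"the", "ي"}
new_tokens = select_new_tokens(toy_pieces, toy_base_vocab)
print(new_tokens)  # ['سنڌ']
```

With `transformers`, a list like `new_tokens` can be passed to `tokenizer.add_tokens(new_tokens)`, which registers the strings as added tokens and returns how many were actually new.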
74
+
75
+ ## Evaluation
76
+
77
+ ### Results (Empirical Comparison)
78
+
79
+ Based on testing with formal Sindhi biographical text:
80
+
81
+ | Metric | Original Qwen2.5 | Extended Qwen (This Model) |
82
+ | --- | --- | --- |
83
+ | **Total Vocab Size** | 151,643 | **156,998+** |
84
+ | **Sindhi Token Count** | High (byte-level fragments) | **Significantly lower** |
85
+ | **Chars / Token** | ~2.0 | **~4.0+** |
86
+ | **Sequence Compression** | 0% (baseline) | **~45-55% fewer tokens** |
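The "Chars / Token" and "Sequence Compression" rows are linked by simple arithmetic: for a fixed text, the token count is the character count divided by the characters per token, so the relative reduction is `1 - base / extended`. The short sketch below works this through for the approximate table values; the function names are hypothetical.

```python
def sequence_reduction(base_cpt: float, ext_cpt: float) -> float:
    """Fraction of tokens saved when chars/token rises from base_cpt to ext_cpt."""
    if base_cpt <= 0 or ext_cpt <= 0:
        raise ValueError("chars per token must be positive")
    return 1.0 - base_cpt / ext_cpt


def context_gain(base_cpt: float, ext_cpt: float) -> float:
    """How many times more text fits into the same number of tokens."""
    if base_cpt <= 0 or ext_cpt <= 0:
        raise ValueError("chars per token must be positive")
    return ext_cpt / base_cpt


print(sequence_reduction(2.0, 4.0))  # 0.5
print(context_gain(2.0, 4.0))  # 2.0
```

Going from about 2.0 to about 4.0 characters per token therefore corresponds to roughly 50% fewer tokens, which sits inside the reported 45-55% range.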
87
+
88
+ ### Summary
89
+
90
+ The extension substantially lowers the tokenizer's "fertility" (average number of tokens per word) for Sindhi text, so nearly **twice as much text** fits within the same context window compared to the base tokenizer.
91
+
92
+ ## Technical Specifications
93
+
94
+ ### Model Architecture and Objective
95
+
96
+ The extension uses a **Unigram** approach, which can identify more meaningful subword units than standard BPE in morphologically rich languages like Sindhi.
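For intuition, a Unigram model assigns a probability to every piece in its vocabulary and chooses the segmentation whose pieces have the highest combined probability, whereas BPE applies learned merges in a fixed order. The following is a minimal Viterbi sketch of that selection step; the vocabulary and probabilities are invented for illustration and are not taken from this tokenizer.

```python
import math
from typing import Dict, List, Tuple


def unigram_segment(text: str, log_probs: Dict[str, float]) -> Tuple[List[str], float]:
    """Viterbi search for the highest-scoring segmentation of `text`."""
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:end]
    best_start = [0] * (n + 1)          # start index of the last piece used
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces: List[str] = []
    end = n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    pieces.reverse()
    return pieces, best_score[n]


# Toy vocabulary with made-up probabilities (single characters act as fallback).
toy_log_probs = {piece: math.log(p) for piece, p in {
    "un": 0.1, "happy": 0.05, "unhappy": 0.001,
    "u": 0.01, "n": 0.01, "h": 0.01, "a": 0.01, "p": 0.01, "y": 0.01,
}.items()}

pieces, score = unigram_segment("unhappy", toy_log_probs)
print(pieces)  # ['un', 'happy']
```

Here `un` + `happy` wins because its combined probability (0.1 × 0.05) exceeds that of the single piece `unhappy` (0.001) and of any character-level split.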
97
+
98
+ ## Model Card Authors
99
+
100
+ * **Kashif Ali Turk** (MSCS Student, MAJU)
101
+
102
+ ## Model Card Contact
103
+
104
+ * LinkedIn: [Kashif Ali Turk](https://www.linkedin.com/in/kashif-ali-2727a91a5)