Captainsl committed
Commit 09b82a2 · verified · 1 Parent(s): d9ca3df

Update README.md

Files changed (1)
  1. README.md +130 -51

README.md CHANGED
@@ -1,60 +1,139 @@
- ---
- license: mit
- language:
- - si
- base_model:
- - HuggingFaceTB/SmolLM2-1.7B
- library_name: transformers
- tags:
- - Genral
- - text-generation-inference
- ---
- # SinhalaLLM (Fine-tuned SmolLM2 + Sinhala tokenizer)
-
- Model: HuggingFaceTB/SmolLM2-1.7B (base) + LoRA finetune (merged)
- Tokenizer: polyglots/Extended-Sinhala-LLaMA (custom Sinhala tokenizer)
- Language: Sinhala (si)
-
- ## Summary
- This model is a SmolLM2-1.7B base model fine-tuned on Sinhala text (MADLAD_CulturaX_cleaned).
- Finetuning method: 4-bit LoRA finetuning via Unsloth + PEFT; final artifact merged into a standard HF model.
-
- ## Training data
- - Source: polyglots/MADLAD_CulturaX_cleaned (filtered to `lang == "si"`)
- - Preprocessing: cleaned and deduplicated; chunked into sequences of length 256; tokenized with `polyglots/Extended-Sinhala-LLaMA`.
- - Train/validation split: 99% / 1%.
-
- ## Hyperparameters (high-level)
- - Sequence length: 256
- - LoRA rank (r): 16
- - LoRA alpha: 16
- - LoRA dropout: 0.05
- - Optimizer: AdamW fused
- - Learning rate: 2e-4
- - Batch size (effective): per-device batch 8, gradient accumulation 2 (effective 16)
- - Mixed precision: bf16 or fp16 where available
-
- ## Evaluation
- - Quick evaluation performed on a held-out 1% validation sample,
- - Reported metric: perplexity (see run logs in the repo)
-
- ## How to use
- Install transformers and load:
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
- tok = AutoTokenizer.from_pretrained("path_or_repo/sinhala_merged")
- model = AutoModelForCausalLM.from_pretrained("path_or_repo/sinhala_merged", device_map="auto")
- ````

- ## Export / Run locally

- * To run on CPU or inference frameworks you can create a GGUF with `llama.cpp` converters and quantize to Q4 variants.

- ## Limitations and risks

- * Model trained on web-scraped data; it may reproduce harmful content or biases present in the training data.
- * Not safe for high-stakes medical, legal, or safety-critical advice.

  ## License

- Specify dataset and model license here.
+ ---
+ license: mit
+ language:
+ - si
+ base_model:
+ - HuggingFaceTB/SmolLM2-1.7B
+ library_name: transformers
+ tags:
+ - experimental
+ - low-resource-languages
+ - research
+ - proof-of-concept
+ ---
+
+ # Sinhala Language Model Research - SmolLM2 Fine-tuning Attempt
+
+ **⚠️ EXPERIMENTAL MODEL - NOT FOR PRODUCTION USE**
+
+ ## Model Description
+ - **Base Model:** HuggingFaceTB/SmolLM2-1.7B
+ - **Fine-tuning Method:** QLoRA (4-bit quantization with LoRA)
+ - **Target Language:** Sinhala (සිංහල)
+ - **Status:** Research prototype with significant limitations
+
+ ## Research Context
+ This model represents an undergraduate research attempt to adapt SmolLM2-1.7B for Sinhala language generation. It is part of the undergraduate thesis "Developing a Fluent Sinhala Language Model: Enhancing AI's Cultural and Linguistic Adaptability" (NSBM Green University, 2025).
+
+ ## Training Details
+
+ ### Dataset
+ - **Size:** 427,000 raw examples → 406,532 after cleaning
+ - **Sources:**
+   - YouTube comments (32%)
+   - Web scraped content (35%)
+   - Translated instructions (23%)
+   - Curated texts (10%)
+ - **Data Quality:** Mixed (social media, news, translated content)
+ - **Processing:** Custom cleaning pipeline removing URLs, emails, and duplicates (see the sketch below)
+
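+ The cleaning pipeline itself is not included in this card. Below is a minimal sketch of the kind of URL/email stripping and exact-duplicate removal described above; the regexes and function names are illustrative assumptions, not the pipeline actually used.
+
+ ```python
+ # Illustrative cleaning sketch; patterns and thresholds are assumptions.
+ import re
+
+ URL_RE = re.compile(r"https?://\S+|www\.\S+")
+ EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
+
+ def clean_corpus(texts):
+     """Strip URLs/emails, normalise whitespace, and drop exact duplicates."""
+     seen, cleaned = set(), []
+     for text in texts:
+         text = URL_RE.sub(" ", text)
+         text = EMAIL_RE.sub(" ", text)
+         text = re.sub(r"\s+", " ", text).strip()
+         if text and text not in seen:   # exact-match deduplication
+             seen.add(text)
+             cleaned.append(text)
+     return cleaned
+ ```
+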
+ ### Training Configuration
+ - **Hardware:** NVIDIA RTX 4090 (24GB VRAM) via Vast.ai
+ - **Training Time:** 48 hours
+ - **Total Cost:** $19.20 (budget-constrained research)
+ - **Framework:** Unsloth for memory efficiency
+ - **LoRA Parameters:**
+   - Rank (r): 16
+   - Alpha: 16
+   - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
+   - Trainable parameters: 8.4M of 1.7B total (≈0.5%, a 99.5% reduction vs. full fine-tuning)
+
+ ### Hyperparameters
+ - Learning rate: 2e-5
+ - Batch size: 8 (gradient accumulation: 1)
+ - Max sequence length: 2048 (reduced to 512 for memory)
+ - Mixed precision: FP16
+ - Optimizer: adamw_8bit
+
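+ For reference, the following is a minimal sketch of a QLoRA setup matching the settings above, written against the transformers + PEFT APIs rather than the Unsloth wrapper actually used. The NF4 quantization type, zero LoRA dropout, and the output directory are assumptions, and the dataset/Trainer wiring is omitted.
+
+ ```python
+ # Approximate reconstruction of the configuration above (not the original script).
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
+ from peft import LoraConfig, get_peft_model
+
+ bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",  # NF4 assumed
+                          bnb_4bit_compute_dtype=torch.float16)
+ model = AutoModelForCausalLM.from_pretrained(
+     "HuggingFaceTB/SmolLM2-1.7B", quantization_config=bnb, device_map="auto")
+
+ lora = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.0,  # dropout not reported
+                   bias="none", task_type="CAUSAL_LM",
+                   target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                                   "gate_proj", "up_proj", "down_proj"])
+ model = get_peft_model(model, lora)   # ~8.4M trainable parameters
+
+ args = TrainingArguments(output_dir="smollm2-sinhala-qlora",  # placeholder name
+                          per_device_train_batch_size=8,
+                          gradient_accumulation_steps=1,
+                          learning_rate=2e-5, fp16=True,
+                          optim="adamw_bnb_8bit")  # the card's "adamw_8bit"
+ # Tokenization (max length 512) and the Trainer call are omitted from this sketch.
+ ```
+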
+ ## Evaluation Results
+
+ ### Quantitative Metrics
+ - **Perplexity:** 218,443 (target was <50) ❌ (computation sketched below)
+ - **BLEU Score:** 0.0000 ❌
+ - **Training Loss:** 1.847 (converged)
+ - **Task Completion Rate:**
+   - General conversation: 0%
+   - Mathematics: 100% (but output corrupted)
+   - Cultural context: 0%
+   - Instruction following: 33%
+
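+ For context on the perplexity figure: perplexity is the exponential of the average per-token negative log-likelihood on held-out text, so 218,443 is far above what a usable model would score. A generic way to measure it with transformers is sketched below; the model path mirrors the placeholder used in the reproduction snippet further down, and the evaluation text is illustrative, not the thesis test set.
+
+ ```python
+ # Generic perplexity computation (illustrative; not the thesis evaluation script).
+ import math
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model = AutoModelForCausalLM.from_pretrained("path/to/model")   # placeholder path
+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
+ model.eval()
+
+ def perplexity(texts, max_length=512):
+     total_nll, total_tokens = 0.0, 0
+     with torch.no_grad():
+         for text in texts:
+             enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
+             out = model(**enc, labels=enc["input_ids"])
+             n = enc["input_ids"].numel()          # approximate token count
+             total_nll += out.loss.item() * n      # loss is mean NLL per predicted token
+             total_tokens += n
+     return math.exp(total_nll / total_tokens)
+
+ print(perplexity(["ඔබේ නම කුමක්ද?"]))   # sample sentence from this card
+ ```
+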
+ ### Critical Issues Discovered
+ ⚠️ **Tokenizer Incompatibility:** The model exhibits catastrophic tokenizer-model mismatch, generating English vocabulary tokens ("Drum", "Chiefs", "RESP") instead of Sinhala text. This represents a fundamental architectural incompatibility between SmolLM2's tokenizer and Sinhala script.
+
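+ One way to observe the mismatch directly is to check how the base tokenizer fragments Sinhala text: the SmolLM2 vocabulary appears to be trained largely on English, so Sinhala characters fall back to many small byte-level pieces. A quick check (exact token counts will vary) might look like:
+
+ ```python
+ # Inspect how the base tokenizer splits Sinhala text (illustrative check only).
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
+
+ text = "ශ්‍රී ලංකාව"   # "Sri Lanka", reused from the reproduction snippet below
+ tokens = tokenizer.tokenize(text)
+ print(len(text), "characters ->", len(tokens), "tokens")
+ print(tokens)   # expect many byte-level fragments rather than Sinhala word pieces
+ ```
+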
+ ## Sample Outputs (Showing Failure Pattern)
+ ```
+ Input: "ඔබේ නම කුමක්ද?"
+ Expected: "මගේ නම [name] වේ"
+ Actual: "Drum Chiefs RESP frontend(direction..."
+ ```
+
+ ## Research Contributions
+ Despite the technical failure, this research provides:
+ 1. **Dataset:** 427,000 curated Sinhala examples (largest publicly available)
+ 2. **Pipeline:** Reproducible training framework for low-resource languages
+ 3. **Discovery:** Documentation of critical tokenizer challenges for non-Latin scripts
+ 4. **Methodology:** Budget-conscious approach ($30 total) for LLM research
+
+ ## Limitations & Warnings
+ - ❌ **Does NOT generate coherent Sinhala text**
+ - ❌ **Tokenizer fundamentally incompatible with Sinhala**
+ - ❌ **Not suitable for any production use**
+ - ✅ **Useful only as a research artifact and negative-result documentation**
+
+ ## Intended Use
+ This model is shared for:
+ - Academic transparency and reproducibility
+ - Documentation of challenges in low-resource language AI
+ - Foundation for future research improvements
+ - Example of tokenizer-model compatibility issues
+
+ ## Recommendations for Future Work
+ 1. Use multilingual base models (mT5, XLM-R, BLOOM)
+ 2. Develop a Sinhala-specific tokenizer (see the sketch after this list)
+ 3. Increase the dataset to 1M+ examples
+ 4. Consider character-level or byte-level models
+
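+ As a starting point for item 2, a Sinhala-aware tokenizer can be trained from the base tokenizer's algorithm with `train_new_from_iterator`. The sketch below is a hedged illustration; the corpus loader and vocabulary size are placeholder assumptions rather than recommendations from the thesis.
+
+ ```python
+ # Minimal sketch: retrain the base tokenizer's BPE on Sinhala text so the
+ # vocabulary contains Sinhala-native pieces. Corpus and vocab_size are placeholders.
+ from transformers import AutoTokenizer
+
+ base = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
+
+ def sinhala_corpus():
+     # Placeholder: yield batches of cleaned Sinhala text from the training dataset.
+     yield ["ශ්‍රී ලංකාව", "ඔබේ නම කුමක්ද?"]
+
+ sinhala_tokenizer = base.train_new_from_iterator(sinhala_corpus(), vocab_size=32000)
+ sinhala_tokenizer.save_pretrained("sinhala-bpe-tokenizer")   # hypothetical output dir
+ print(sinhala_tokenizer.tokenize("ශ්‍රී ලංකාව"))
+ ```
+
+ Swapping in such a tokenizer also requires resizing and re-training the model's embedding layer, which is part of why multilingual base models (item 1) may be the more practical route.
+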
+ ## How to Reproduce Issues
  ```python
+ # This will demonstrate the tokenizer problem
  from transformers import AutoTokenizer, AutoModelForCausalLM

+ model = AutoModelForCausalLM.from_pretrained("path/to/model")
+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

+ input_text = "ශ්‍රී ලංකාව"
+ inputs = tokenizer(input_text, return_tensors="pt")
+ outputs = model.generate(**inputs, max_length=50)
+ print(tokenizer.decode(outputs[0]))
+ # Output will be gibberish English tokens
+ ```

+ ## Citation
+ ```bibtex
+ @thesis{dharmasiri2025sinhala,
+   title={Developing a Fluent Sinhala Language Model: Enhancing AI's Cultural and Linguistic Adaptability},
+   author={Dharmasiri, H.M.A.H.},
+   year={2025},
+   school={NSBM Green University},
+   note={Undergraduate thesis documenting challenges in low-resource language AI}
+ }
+ ```
+
+ ## Ethical Considerations
+ - Model outputs are not reliable for Sinhala generation
+ - Should not be used for any decision-making
+ - Shared for research transparency only
+
  ## License
+ MIT License - for research and educational purposes