Captainsl committed
Commit 09b82a2 · verified · 1 Parent(s): d9ca3df

Update README.md

Files changed (1)
  1. README.md +130 -51

README.md CHANGED
@@ -1,60 +1,139 @@
- ---
- license: mit
- language:
- - si
- base_model:
- - HuggingFaceTB/SmolLM2-1.7B
- library_name: transformers
- tags:
- - Genral
- - text-generation-inference
- ---
- # SinhalaLLM (Fine-tuned SmolLM2 + Sinhala tokenizer)
-
- Model: HuggingFaceTB/SmolLM2-1.7B (base) + LoRA finetune (merged)
- Tokenizer: polyglots/Extended-Sinhala-LLaMA (custom Sinhala tokenizer)
- Language: Sinhala (si)
-
- ## Summary
- This model is a SmolLM2-1.7B base model fine-tuned on Sinhala text (MADLAD_CulturaX_cleaned).
- Finetuning method: 4-bit LoRA finetuning via Unsloth + PEFT; final artifact merged into a standard HF model.
-
- ## Training data
- - Source: polyglots/MADLAD_CulturaX_cleaned (filtered to `lang == "si"`)
- - Preprocessing: cleaned and deduplicated; chunked into sequences of length 256; tokenized with `polyglots/Extended-Sinhala-LLaMA`.
- - Train/validation split: 99% / 1%.
-
- ## Hyperparameters (high-level)
- - Sequence length: 256
- - LoRA rank (r): 16
- - LoRA alpha: 16
- - LoRA dropout: 0.05
- - Optimizer: AdamW fused
- - Learning rate: 2e-4
- - Batch size (effective): per-device batch 8, gradient accumulation 2 (effective 16)
- - Mixed precision: bf16 or fp16 where available
-
- ## Evaluation
- - Quick evaluation performed on a held-out 1% validation sample,
- - Reported metric: perplexity (see run logs in the repo)
-
- ## How to use
- Install transformers and load:
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
- tok = AutoTokenizer.from_pretrained("path_or_repo/sinhala_merged")
- model = AutoModelForCausalLM.from_pretrained("path_or_repo/sinhala_merged", device_map="auto")
- ````

- ## Export / Run locally

- * To run on CPU or inference frameworks you can create a GGUF with `llama.cpp` converters and quantize to Q4 variants.

- ## Limitations and risks

- * Model trained on web-scraped data; it may reproduce harmful content or biases present in the training data.
- * Not safe for high-stakes medical, legal, or safety-critical advice.

  ## License

- Specify dataset and model license here.
+ ---
+ license: mit
+ language:
+ - si
+ base_model:
+ - HuggingFaceTB/SmolLM2-1.7B
+ library_name: transformers
+ tags:
+ - experimental
+ - low-resource-languages
+ - research
+ - proof-of-concept
+ ---
+
+ # Sinhala Language Model Research - SmolLM2 Fine-tuning Attempt
+
+ **⚠️ EXPERIMENTAL MODEL - NOT FOR PRODUCTION USE**
+
+ ## Model Description
+ - **Base Model:** HuggingFaceTB/SmolLM2-1.7B
+ - **Fine-tuning Method:** QLoRA (4-bit quantization with LoRA)
+ - **Target Language:** Sinhala (සිංහල)
+ - **Status:** Research prototype with significant limitations
+
+ ## Research Context
+ This model represents an undergraduate research attempt to adapt SmolLM2-1.7B for Sinhala language generation. It is part of the undergraduate thesis "Developing a Fluent Sinhala Language Model: Enhancing AI's Cultural and Linguistic Adaptability" (NSBM Green University, 2025).
+
+ ## Training Details
+
+ ### Dataset
+ - **Size:** 427,000 raw examples → 406,532 after cleaning
+ - **Sources:**
+   - YouTube comments (32%)
+   - Web scraped content (35%)
+   - Translated instructions (23%)
+   - Curated texts (10%)
+ - **Data Quality:** Mixed (social media, news, translated content)
+ - **Processing:** Custom cleaning pipeline removing URLs, emails, and duplicates (see the sketch below)
+
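+ The cleaning pipeline itself is not included in this card. Below is a minimal sketch of the kind of URL/email stripping and exact-duplicate removal described above; the regexes and function names are illustrative assumptions, not the pipeline actually used.
+
+ ```python
+ # Illustrative cleaning sketch; patterns and thresholds are assumptions.
+ import re
+
+ URL_RE = re.compile(r"https?://\S+|www\.\S+")
+ EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
+
+ def clean_corpus(texts):
+     """Strip URLs/emails, normalise whitespace, and drop exact duplicates."""
+     seen, cleaned = set(), []
+     for text in texts:
+         text = URL_RE.sub(" ", text)
+         text = EMAIL_RE.sub(" ", text)
+         text = re.sub(r"\s+", " ", text).strip()
+         if text and text not in seen:   # exact-match deduplication
+             seen.add(text)
+             cleaned.append(text)
+     return cleaned
+ ```
+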
+ ### Training Configuration
+ - **Hardware:** NVIDIA RTX 4090 (24GB VRAM) via Vast.ai
+ - **Training Time:** 48 hours
+ - **Total Cost:** $19.20 (budget-constrained research)
+ - **Framework:** Unsloth for memory efficiency
+ - **LoRA Parameters:**
+   - Rank (r): 16
+   - Alpha: 16
+   - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
+   - Trainable parameters: 8.4M of 1.7B total (≈0.5%, a 99.5% reduction vs. full fine-tuning)
+
+ ### Hyperparameters
+ - Learning rate: 2e-5
+ - Batch size: 8 (gradient accumulation: 1)
+ - Max sequence length: 2048 (reduced to 512 for memory)
+ - Mixed precision: FP16
+ - Optimizer: adamw_8bit
+
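+ For reference, the following is a minimal sketch of a QLoRA setup matching the settings above, written against the transformers + PEFT APIs rather than the Unsloth wrapper actually used. The NF4 quantization type, zero LoRA dropout, and the output directory are assumptions, and the dataset/Trainer wiring is omitted.
+
+ ```python
+ # Approximate reconstruction of the configuration above (not the original script).
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
+ from peft import LoraConfig, get_peft_model
+
+ bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",  # NF4 assumed
+                          bnb_4bit_compute_dtype=torch.float16)
+ model = AutoModelForCausalLM.from_pretrained(
+     "HuggingFaceTB/SmolLM2-1.7B", quantization_config=bnb, device_map="auto")
+
+ lora = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.0,  # dropout not reported
+                   bias="none", task_type="CAUSAL_LM",
+                   target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                                   "gate_proj", "up_proj", "down_proj"])
+ model = get_peft_model(model, lora)   # ~8.4M trainable parameters
+
+ args = TrainingArguments(output_dir="smollm2-sinhala-qlora",  # placeholder name
+                          per_device_train_batch_size=8,
+                          gradient_accumulation_steps=1,
+                          learning_rate=2e-5, fp16=True,
+                          optim="adamw_bnb_8bit")  # the card's "adamw_8bit"
+ # Tokenization (max length 512) and the Trainer call are omitted from this sketch.
+ ```
+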
+ ## Evaluation Results
+
+ ### Quantitative Metrics
+ - **Perplexity:** 218,443 (target was <50) ❌ (computation sketched below)
+ - **BLEU Score:** 0.0000 ❌
+ - **Training Loss:** 1.847 (converged)
+ - **Task Completion Rate:**
+   - General conversation: 0%
+   - Mathematics: 100% (but output corrupted)
+   - Cultural context: 0%
+   - Instruction following: 33%
+
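+ For context on the perplexity figure: perplexity is the exponential of the average per-token negative log-likelihood on held-out text, so 218,443 is far above what a usable model would score. A generic way to measure it with transformers is sketched below; the model path mirrors the placeholder used in the reproduction snippet further down, and the evaluation text is illustrative, not the thesis test set.
+
+ ```python
+ # Generic perplexity computation (illustrative; not the thesis evaluation script).
+ import math
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model = AutoModelForCausalLM.from_pretrained("path/to/model")   # placeholder path
+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
+ model.eval()
+
+ def perplexity(texts, max_length=512):
+     total_nll, total_tokens = 0.0, 0
+     with torch.no_grad():
+         for text in texts:
+             enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
+             out = model(**enc, labels=enc["input_ids"])
+             n = enc["input_ids"].numel()          # approximate token count
+             total_nll += out.loss.item() * n      # loss is mean NLL per predicted token
+             total_tokens += n
+     return math.exp(total_nll / total_tokens)
+
+ print(perplexity(["ඔබේ නම කුමක්ද?"]))   # sample sentence from this card
+ ```
+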
+ ### Critical Issues Discovered
+ ⚠️ **Tokenizer Incompatibility:** The model exhibits catastrophic tokenizer-model mismatch, generating English vocabulary tokens ("Drum", "Chiefs", "RESP") instead of Sinhala text. This represents a fundamental architectural incompatibility between SmolLM2's tokenizer and Sinhala script.
+
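+ One way to observe the mismatch directly is to check how the base tokenizer fragments Sinhala text: the SmolLM2 vocabulary appears to be trained largely on English, so Sinhala characters fall back to many small byte-level pieces. A quick check (exact token counts will vary) might look like:
+
+ ```python
+ # Inspect how the base tokenizer splits Sinhala text (illustrative check only).
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
+
+ text = "ශ්‍රී ලංකාව"   # "Sri Lanka", reused from the reproduction snippet below
+ tokens = tokenizer.tokenize(text)
+ print(len(text), "characters ->", len(tokens), "tokens")
+ print(tokens)   # expect many byte-level fragments rather than Sinhala word pieces
+ ```
+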
+ ## Sample Outputs (Showing Failure Pattern)
+ ```
+ Input: "ඔබේ නම කුමක්ද?"
+ Expected: "මගේ නම [name] වේ"
+ Actual: "Drum Chiefs RESP frontend(direction..."
+ ```
+
+ ## Research Contributions
+ Despite the technical failure, this research provides:
+ 1. **Dataset:** 427,000 curated Sinhala examples (largest publicly available)
+ 2. **Pipeline:** Reproducible training framework for low-resource languages
+ 3. **Discovery:** Documentation of critical tokenizer challenges for non-Latin scripts
+ 4. **Methodology:** Budget-conscious approach ($30 total) for LLM research
+
+ ## Limitations & Warnings
+ - ❌ **Does NOT generate coherent Sinhala text**
+ - ❌ **Tokenizer fundamentally incompatible with Sinhala**
+ - ❌ **Not suitable for any production use**
+ - ✅ **Useful only as a research artifact and negative-result documentation**
+
+ ## Intended Use
+ This model is shared for:
+ - Academic transparency and reproducibility
+ - Documentation of challenges in low-resource language AI
+ - Foundation for future research improvements
+ - Example of tokenizer-model compatibility issues
+
+ ## Recommendations for Future Work
+ 1. Use multilingual base models (mT5, XLM-R, BLOOM)
+ 2. Develop a Sinhala-specific tokenizer (see the sketch after this list)
+ 3. Increase the dataset to 1M+ examples
+ 4. Consider character-level or byte-level models
+
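+ As a starting point for item 2, a Sinhala-aware tokenizer can be trained from the base tokenizer's algorithm with `train_new_from_iterator`. The sketch below is a hedged illustration; the corpus loader and vocabulary size are placeholder assumptions rather than recommendations from the thesis.
+
+ ```python
+ # Minimal sketch: retrain the base tokenizer's BPE on Sinhala text so the
+ # vocabulary contains Sinhala-native pieces. Corpus and vocab_size are placeholders.
+ from transformers import AutoTokenizer
+
+ base = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
+
+ def sinhala_corpus():
+     # Placeholder: yield batches of cleaned Sinhala text from the training dataset.
+     yield ["ශ්‍රී ලංකාව", "ඔබේ නම කුමක්ද?"]
+
+ sinhala_tokenizer = base.train_new_from_iterator(sinhala_corpus(), vocab_size=32000)
+ sinhala_tokenizer.save_pretrained("sinhala-bpe-tokenizer")   # hypothetical output dir
+ print(sinhala_tokenizer.tokenize("ශ්‍රී ලංකාව"))
+ ```
+
+ Swapping in such a tokenizer also requires resizing and re-training the model's embedding layer, which is part of why multilingual base models (item 1) may be the more practical route.
+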
+ ## How to Reproduce Issues
  ```python
+ # This will demonstrate the tokenizer problem
  from transformers import AutoTokenizer, AutoModelForCausalLM

+ model = AutoModelForCausalLM.from_pretrained("path/to/model")
+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

+ input_text = "ශ්‍රී ලංකාව"
+ inputs = tokenizer(input_text, return_tensors="pt")
+ outputs = model.generate(**inputs, max_length=50)
+ print(tokenizer.decode(outputs[0]))
+ # Output will be gibberish English tokens
+ ```

+ ## Citation
+ ```bibtex
+ @thesis{dharmasiri2025sinhala,
+   title={Developing a Fluent Sinhala Language Model: Enhancing AI's Cultural and Linguistic Adaptability},
+   author={Dharmasiri, H.M.A.H.},
+   year={2025},
+   school={NSBM Green University},
+   note={Undergraduate thesis documenting challenges in low-resource language AI}
+ }
+ ```
+
+ ## Ethical Considerations
+ - Model outputs are not reliable for Sinhala generation
+ - Should not be used for any decision-making
+ - Shared for research transparency only
+
  ## License
+ MIT License - for research and educational purposes