Captainsl committed on
Commit 7187730 · verified · 1 Parent(s): ea0143e

Create README.md

Files changed (1)
  1. README.md +60 -0
README.md ADDED
@@ -0,0 +1,60 @@
+ ---
+ license: mit
+ language:
+ - si
+ base_model:
+ - HuggingFaceTB/SmolLM2-1.7B
+ library_name: transformers
+ tags:
+ - General
+ - text-generation-inference
+ ---
+ # SinhalaLLM (Fine-tuned SmolLM2 + Sinhala tokenizer)
+
+ - Model: HuggingFaceTB/SmolLM2-1.7B (base) + LoRA fine-tune (merged)
+ - Tokenizer: polyglots/Extended-Sinhala-LLaMA (custom Sinhala tokenizer)
+ - Language: Sinhala (si)
+
+ ## Summary
+ This model is SmolLM2-1.7B (base) fine-tuned on Sinhala text from `polyglots/MADLAD_CulturaX_cleaned`.
+ Fine-tuning method: 4-bit LoRA via Unsloth + PEFT; the adapter was merged into the base weights, so the final artifact is a standard Hugging Face model.
+
+ ## Training data
+ - Source: `polyglots/MADLAD_CulturaX_cleaned` (filtered to `lang == "si"`)
+ - Preprocessing: cleaned and deduplicated; chunked into sequences of length 256; tokenized with `polyglots/Extended-Sinhala-LLaMA`.
+ - Train/validation split: 99% / 1%.
+
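The chunking step can be sketched as follows. This is a minimal illustration, not the author's preprocessing script: it assumes the length of 256 is counted in tokens, that chunks do not overlap, and that a trailing partial chunk is dropped. `chunk_token_ids` is a hypothetical helper name.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], block_size: int = 256) -> List[List[int]]:
    """Split a flat list of token ids into non-overlapping blocks of block_size.

    Assumption: a trailing remainder shorter than block_size is dropped.
    """
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]


# 600 ids give two full blocks of 256; the last 88 ids are dropped.
blocks = chunk_token_ids(list(range(600)))
print(len(blocks), len(blocks[0]))
```

Padding the remainder instead of dropping it is an equally plausible choice; the README does not say which was used.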
+ ## Hyperparameters (high-level)
+ - Sequence length: 256
+ - LoRA rank (r): 16
+ - LoRA alpha: 16
+ - LoRA dropout: 0.05
+ - Optimizer: fused AdamW
+ - Learning rate: 2e-4
+ - Effective batch size: 16 (per-device batch 8 × gradient accumulation 2)
+ - Mixed precision: bf16 or fp16, depending on hardware support
+
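As a quick check of how these settings combine, the arithmetic below derives the effective batch size, an upper bound on tokens per optimizer step, and the LoRA scaling factor. It assumes a single training device (`num_devices` is not stated in the README) and the standard LoRA scaling of alpha divided by rank.

```python
# Values copied from the hyperparameter list above.
per_device_batch = 8
grad_accum_steps = 2
seq_len = 256
lora_r = 16
lora_alpha = 16

# Assumption: one training device; multiply by the device count otherwise.
num_devices = 1

effective_batch = per_device_batch * grad_accum_steps * num_devices
# Upper bound: reached only if every sequence is a full 256 tokens.
tokens_per_step = effective_batch * seq_len
# Standard LoRA scales the adapter update by alpha / r.
lora_scaling = lora_alpha / lora_r

print(effective_batch, tokens_per_step, lora_scaling)
```

With rank and alpha both set to 16, the adapter update is applied at a scale of 1.0.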
+ ## Evaluation
+ - Quick evaluation performed on a held-out 1% validation sample.
+ - Reported metric: perplexity (see run logs in the repo).
+
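For readers comparing numbers, perplexity is conventionally the exponential of the mean per-token negative log-likelihood (the cross-entropy loss in natural-log units). The sketch below shows that definition only; how the run logs actually computed the metric is not stated here, and `perplexity` is a hypothetical helper.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token), natural-log units."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Sanity check: a model that assigns probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(4)] * 10)
print(round(uniform_ppl, 6))
```

Perplexity depends on the tokenizer, so values from this model are not directly comparable with models that use a different vocabulary.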
+ ## How to use
+ Install `transformers` and load the model:
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ # Replace with the local path or Hub repo ID of the merged model.
+ tok = AutoTokenizer.from_pretrained("path_or_repo/sinhala_merged")
+ # device_map="auto" requires the `accelerate` package.
+ model = AutoModelForCausalLM.from_pretrained("path_or_repo/sinhala_merged", device_map="auto")
+ ```
+
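Once `tok` and `model` are loaded as above, text generation could look like the sketch below. `generate_text` is a hypothetical helper, and the decoding settings (greedy decoding, 64 new tokens) are illustrative assumptions rather than values recommended by the author.

```python
def generate_text(model, tok, prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy-decode a continuation of prompt with an already loaded model/tokenizer."""
    # Tokenize and move the tensors to the device the model was placed on.
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    # do_sample=False selects greedy decoding, which keeps the output deterministic.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # The returned sequence contains the prompt tokens followed by the continuation.
    return tok.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `print(generate_text(model, tok, "ශ්‍රී ලංකාව"))` should then print the prompt followed by generated text. Because the model was fine-tuned on plain text rather than chat data, expect a continuation of the prompt rather than an answer to an instruction.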
+ ## Export / Run locally
+
+ * To run on CPU or in other inference frameworks, convert the merged model to GGUF with the `llama.cpp` conversion scripts and quantize it to a Q4 variant.
+
+ ## Limitations and risks
+
+ * The model was trained on web-scraped data; it may reproduce harmful content or biases present in the training data.
+ * Not safe for high-stakes medical, legal, or safety-critical advice.
+
+ ## License
+
+ Specify dataset and model license here.