--- |
|
|
datasets: |
|
|
- polyglots/MADLAD_CulturaX_cleaned |
|
|
language: |
|
|
- si |
|
|
metrics: |
|
|
- precision |
|
|
- recall |
|
|
- f1 |
|
|
base_model: |
|
|
- meta-llama/Meta-Llama-3-8B |
|
|
library_name: peft |
|
|
--- |
|
|
|
|
|
|
|
# Model Card for SinLlama |
|
|
|
|
|
SinLlama is the first large language model specifically extended for Sinhala. It is based on Meta-Llama-3-8B and adapted through tokenizer vocabulary extension and continual pretraining on a 10.7M-sentence Sinhala corpus. SinLlama significantly improves coverage and performance on Sinhala NLP tasks compared with the base and instruct versions of Llama-3-8B.
|
|
|
|
|
**Disclaimer:** This is a base model that has NOT been instruct-tuned, so task-specific fine-tuning is still required before downstream use.
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
SinLlama is a decoder-only large language model designed to improve NLP performance for Sinhala, a low-resource Indo-Aryan language spoken by roughly 20 million people in Sri Lanka. The model was developed by extending the Llama-3-8B tokenizer with Sinhala-specific vocabulary and performing continual pretraining on a cleaned and diverse 10.7M-sentence Sinhala corpus.
|
|
|
|
|
Subsequent fine-tuning on Sinhala classification datasets (news categorization, sentiment analysis, and writing style classification) shows significant improvements over baseline Llama-3-8B models. |
|
|
|
|
|
- **Developed by:** H.W.K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Rishemjit Kaur, Surangika Ranathunga
|
|
- **Funded by:** CSIR - Central Scientific Instruments Organization (India), Emojot (Pvt) Ltd
|
|
- **Shared by:** Polyglots team |
|
|
- **Model type:** Decoder-only autoregressive transformer LLM |
|
|
- **Language(s) (NLP):** Sinhala (සිංහල) |
|
|
- **License:** Same as base model (Meta Llama 3 license) |
|
|
- **Finetuned from model:** meta-llama/Meta-Llama-3-8B |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [Hugging Face - SinLlama v01](https://huggingface.co/polyglots/SinLlama_v01) |
|
|
- **Paper:** [SinLlama: A Large Language Model for Sinhala](https://arxiv.org/abs/2508.09115v2) |
|
|
- **Dataset:** [MADLAD+CulturaX (cleaned Sinhala subset)](https://huggingface.co/datasets/polyglots/MADLAD_CulturaX_cleaned) |
|
|
|
|
|
--- |
|
|
|
|
|
### SinLlama Model Creation |
|
|
 |
|
|
|
|
|
## Uses |
|
|
|
|
|
|
|
|
### Downstream Use |
|
|
- Instruction tuning for Sinhala dialogue systems, text classification, and other downstream tasks
|
|
- Cross-lingual applications involving Sinhala |
|
|
- Educational and research applications in low-resource NLP |
|
|
|
|
|
### Out-of-Scope Use |
|
|
- Applications requiring high accuracy in non-Sinhala languages (performance may degrade due to adaptation focus on Sinhala) |
|
|
- Sensitive domains (e.g., healthcare, legal) without rigorous validation |
|
|
- Malicious generation (hate speech, disinformation) |
|
|
|
|
|
--- |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
- **Bias:** Sinhala corpora may reflect sociocultural biases (e.g., political, gender, religious biases). |
|
|
- **Limitations:** The model may underperform on complex reasoning tasks and in languages other than Sinhala; writing-style classification was observed to be particularly challenging.
|
|
- **Risk:** Misuse in spreading misinformation or biased outputs in Sinhala. |
|
|
|
|
|
### Recommendations |
|
|
Users should carefully evaluate outputs before deployment, especially in sensitive or safety-critical applications. Fine-tuning with task/domain-specific Sinhala data is required for robustness. |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
### Install dependencies |
|
|
```bash
pip install unsloth
pip install datasets==2.21.0
pip install pandas==2.1.4
```
|
|
|
|
|
### Import dependencies |
|
|
```python |
|
|
from unsloth import FastLanguageModel, is_bfloat16_supported |
|
|
from transformers import TextStreamer, AutoTokenizer |
|
|
import torch |
|
|
from datasets import load_dataset, DatasetDict, concatenate_datasets, Dataset |
|
|
from collections import Counter, defaultdict |
|
|
import os |
|
|
import sys |
|
|
|
|
|
from trl import SFTTrainer |
|
|
from transformers import TrainingArguments
|
|
import pandas as pd |
|
|
``` |
|
|
|
|
|
### Set the model loading configuration
|
|
```python |
|
|
max_seq_length = 2048  # choose any; unsloth supports RoPE scaling internally
dtype = None           # None for auto detection; float16 for Tesla T4/V100, bfloat16 for Ampere+
load_in_4bit = False   # set True to use 4-bit quantization and reduce memory usage
model_name = "polyglots/SinLlama_v01"
|
|
``` |
|
|
|
|
|
### Load the model |
|
|
```python |
|
|
model, _ = FastLanguageModel.from_pretrained( |
|
|
model_name = model_name, |
|
|
max_seq_length = max_seq_length, |
|
|
dtype = dtype, |
|
|
load_in_4bit = load_in_4bit, |
|
|
    resize_model_vocab = 139336  # size of the extended Sinhala vocabulary
|
|
) |
|
|
``` |
|
|
|
|
|
### Load our extended tokenizer |
|
|
```python |
|
|
tokenizer = AutoTokenizer.from_pretrained("polyglots/Extended-Sinhala-LLaMA") |
|
|
model.resize_token_embeddings(len(tokenizer)) |
|
|
``` |
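
### Generate text

Since SinLlama is a base model (not instruct-tuned), plain text completion is the simplest way to sanity-check the loaded weights. The snippet below is a minimal sketch; the Sinhala prompt and the generation settings are illustrative and not part of the original release.

```python
# Switch unsloth into its faster inference mode
FastLanguageModel.for_inference(model)

# "ශ්‍රී ලංකාව" ("Sri Lanka") is just an example prompt
inputs = tokenizer("ශ්‍රී ලංකාව", return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer)

_ = model.generate(**inputs, streamer=streamer, max_new_tokens=64)
```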
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **Pretraining:** 10.7M Sinhala sentences (303.9M tokens) from MADLAD-400 and CulturaX, filtered for quality and cleaned.
|
|
- **Fine-tuning:** |
|
|
- Sentiment Analysis (~12.5K samples) |
|
|
- Writing Style Classification (~9K samples) |
|
|
- Sinhala News Category Classification (~3.3K samples) |
|
|
|
|
|
### Training Procedure |
|
|
- **Tokenizer:** Extended Llama-3 tokenizer with Sinhala-specific tokens using `tiktoken`. |
|
|
- **Continual Pretraining:** Performed with the Chinese-LLaMA codebase, with the block size reduced from 1024 to 512 for GPU memory compatibility.
|
|
- **Fine-tuning:** LoRA-based parameter-efficient fine-tuning with Alpaca-style prompts (a minimal sketch follows).
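
As an illustration of the LoRA setup and Alpaca-style prompt formatting, the sketch below reuses the unsloth API from the Getting Started section; the rank, target modules, prompt wording, and column names are assumed defaults, not the exact values reported by the authors.

```python
# Attach LoRA adapters to the (continually pretrained) SinLlama model.
# All hyperparameters below are illustrative, not the authors' settings.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing=True,
)

# Alpaca-style template for the classification tasks (assumed column names)
alpaca_prompt = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
)

def format_example(example):
    text = alpaca_prompt.format(example["instruction"], example["input"], example["label"])
    return {"text": text + tokenizer.eos_token}
```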
|
|
|
|
|
#### Training Hyperparameters |
|
|
- Mixed precision (fp16/bf16) training, as shown in the sketch after this list
|
|
- LoRA adapters for efficient fine-tuning |
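
A hedged sketch of the mixed-precision training setup, reusing `SFTTrainer`, `TrainingArguments`, `Dataset`, and `is_bfloat16_supported` from the imports above (and assuming the trl version installed alongside unsloth at release time); batch size, learning rate, and epoch count are placeholders rather than reported values.

```python
# Tiny placeholder dataset; in practice this is the Alpaca-formatted task data
train_dataset = Dataset.from_dict(
    {"text": ["### Instruction:\nClassify the sentiment.\n\n### Input:\n...\n\n### Response:\nPositive"]}
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),  # fall back to fp16 on pre-Ampere GPUs
        bf16=is_bfloat16_supported(),      # prefer bf16 on Ampere or newer
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```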
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data |
|
|
- Sinhala sentiment, writing style, and news categorization datasets. |
|
|
- Splits: 80/10/10 with stratified sampling. |
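
As a hedged illustration of an 80/10/10 stratified split, the sketch below uses scikit-learn's `train_test_split`; the placeholder texts and labels, and the use of scikit-learn itself, are assumptions for demonstration only.

```python
from sklearn.model_selection import train_test_split

# Placeholder examples and labels standing in for one of the classification datasets
texts = ["උදාහරණ වාක්‍යය {}".format(i) for i in range(100)]
labels = ["positive" if i % 2 == 0 else "negative" for i in range(100)]

# 80% train, then split the remaining 20% evenly into dev and test
train_x, rest_x, train_y, rest_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
dev_x, test_x, dev_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42
)
```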
|
|
|
|
|
### Metrics |
|
|
- Precision, Recall, F1-score |
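
A minimal sketch of how these scores can be computed with scikit-learn; macro averaging and the placeholder labels are assumptions, since the exact averaging scheme is not stated here.

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder gold labels and predictions for one classification task
y_true = ["politics", "sports", "politics", "business", "sports"]
y_pred = ["politics", "sports", "business", "business", "sports"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Precision={precision:.2f}  Recall={recall:.2f}  F1={f1:.2f}")
```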
|
|
|
|
|
### Results |
|
|
|
|
|
| Model | Writing Style F1 | News F1 | Sentiment F1 | |
|
|
|-------------------------|-----------------|---------|--------------| |
|
|
| Llama-3-8B base | 24.50 | 19.03 | 36.29 | |
|
|
| Llama-3-8B base finetuned | 49.45 | 61.14 | 59.35 | |
|
|
| Llama-3-8B instruct finetuned | 42.25 | 47.81 | 68.78 | |
|
|
| **SinLlama finetuned** | **58.89** | **86.40** | **72.47** | |
|
|
|
|
|
**Summary:** SinLlama outperforms both base and instruct Llama-3-8B when fine-tuned, especially on news categorization and sentiment tasks.
|
|
|
|
|
--- |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
- **Hardware Type:** GPUs (not specified, likely A100-class) |
|
|
- **Hours used:** Not reported |
|
|
- **Cloud Provider:** CSIR & Emojot infrastructure
|
|
- **Compute Region:** India & Sri Lanka |
|
|
- **Carbon Emitted:** Not reported |
|
|
|
|
|
--- |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
- Decoder-only transformer (Llama-3-8B backbone) |
|
|
- Autoregressive pretraining objective |
|
|
- Sinhala vocabulary-extended tokenizer |
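
To illustrate the vocabulary-extended tokenizer, the sketch below loads the extended tokenizer referenced in the Getting Started section; the example sentence is illustrative only.

```python
from transformers import AutoTokenizer

# Extended tokenizer released alongside SinLlama
tokenizer = AutoTokenizer.from_pretrained("polyglots/Extended-Sinhala-LLaMA")

sentence = "මෙය සිංහල වාක්‍යයකි."  # "This is a Sinhala sentence."
print(tokenizer.tokenize(sentence))  # Sinhala-specific tokens from the extended vocabulary
print(len(tokenizer))                # total vocabulary size (139336 per this card)
```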
|
|
|
|
|
### Compute Infrastructure |
|
|
- **Hardware:** GPUs provided by CSIR-CSIO and Emojot
|
|
- **Software:** Hugging Face `transformers`, PEFT, LoRA, `tiktoken` |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
**BibTeX:** |
|
|
```bibtex |
|
|
@article{aravinda2025sinllama,
  title={SinLlama -- A Large Language Model for Sinhala},
  author={Aravinda, H W K and Sirajudeen, Rashad and Karunathilake, Samith and de Silva, Nisansa and Ranathunga, Surangika and Kaur, Rishemjit},
  journal={arXiv preprint arXiv:2508.09115},
  year={2025}
}
|
|
``` |
|
|
|
|
|
**APA:** |
|
|
Aravinda, H. W. K., Sirajudeen, R., Karunathilake, S., de Silva, N., Kaur, R., & Ranathunga, S. (2025). *SinLlama -- A Large Language Model for Sinhala*. arXiv preprint arXiv:2508.09115. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Card Authors |
|
|
- Based on information from the SinLlama authors
|
|
|
|
|
## Model Card Contact |
|
|
- [polyglots on Hugging Face](https://huggingface.co/polyglots) |
|
|
|
|
|
### Framework versions |
|
|
- PEFT 0.13.2 |
|
|
- Transformers (latest at time of release) |