---
datasets:
- polyglots/MADLAD_CulturaX_cleaned
language:
- si
metrics:
- precision
- recall
- f1
base_model:
- meta-llama/Meta-Llama-3-8B
library_name: peft
---
# Model Card for SinLlama
SinLlama is the first large language model specifically extended for Sinhala. It is based on Meta-Llama-3-8B and adapted through tokenizer vocabulary extension and continual pretraining on a 10M sentence Sinhala corpus. SinLlama significantly improves coverage and performance for Sinhala NLP tasks compared to base and instruct versions of Llama-3-8B.
*DISCLAIMER:*
This is a base model that has NOT been instruction-tuned, so task-specific fine-tuning is still required before downstream use.
---
## Model Details
### Model Description
SinLlama is a decoder-based large language model designed to improve NLP performance for Sinhala, a low-resource Indo-Aryan language spoken by ~20 million people in Sri Lanka. The model was developed by enhancing the Llama-3-8B tokenizer with Sinhala-specific vocabulary and performing continual pretraining on a cleaned and diverse 10.7M-sentence Sinhala corpus.
Subsequent fine-tuning on Sinhala classification datasets (news categorization, sentiment analysis, and writing style classification) shows significant improvements over baseline Llama-3-8B models.
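To see the effect of the vocabulary extension in practice, the sketch below compares how many tokens the base and extended tokenizers need for the same Sinhala sentence (the sentence is an illustrative example, and the base tokenizer lives in a gated repository that requires accepting the Meta Llama 3 license):

```python
from transformers import AutoTokenizer

# Illustrative Sinhala sentence (not taken from the training corpus)
text = "ශ්‍රී ලංකාව ඉන්දියන් සාගරයේ පිහිටි දූපතකි."

base_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")        # gated repo
ext_tok = AutoTokenizer.from_pretrained("polyglots/Extended-Sinhala-LLaMA")   # extended vocabulary

print("base tokenizer:    ", len(base_tok.tokenize(text)), "tokens")
print("extended tokenizer:", len(ext_tok.tokenize(text)), "tokens")
# The extended tokenizer should segment Sinhala text into noticeably fewer tokens.
```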
- **Developed by:** H.W.K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Rishemjit Kaur, Surangika Ranathunga
- **Funded by:** CSIR - Central Scientific Instruments Organization (India), Emojot (Pvt) Ltd
- **Shared by:** Polyglots team
- **Model type:** Decoder-only autoregressive transformer LLM
- **Language(s) (NLP):** Sinhala (සිංහල)
- **License:** Same as base model (Meta Llama 3 license)
- **Finetuned from model:** meta-llama/Meta-Llama-3-8B
### Model Sources
- **Repository:** [Hugging Face - SinLlama v01](https://huggingface.co/polyglots/SinLlama_v01)
- **Paper:** [SinLlama: A Large Language Model for Sinhala](https://arxiv.org/abs/2508.09115v2)
- **Dataset:** [MADLAD+CulturaX (cleaned Sinhala subset)](https://huggingface.co/datasets/polyglots/MADLAD_CulturaX_cleaned)
---
### SinLlama Model Creation
![SinLlama Logo](asserts/SinLlama.png)
## Uses
### Downstream Use
- Instruction tuning for Sinhala dialogue systems, text classification, and other downstream tasks
- Cross-lingual applications involving Sinhala
- Educational and research applications in low-resource NLP
### Out-of-Scope Use
- Applications requiring high accuracy in non-Sinhala languages (performance may degrade due to adaptation focus on Sinhala)
- Sensitive domains (e.g., healthcare, legal) without rigorous validation
- Malicious generation (hate speech, disinformation)
---
## Bias, Risks, and Limitations
- **Bias:** Sinhala corpora may reflect sociocultural biases (e.g., political, gender, religious biases).
- **Limitations:** Model may underperform in complex reasoning tasks or in languages other than Sinhala. Writing-style classification is observed as particularly challenging.
- **Risk:** Misuse in spreading misinformation or biased outputs in Sinhala.
### Recommendations
Users should carefully evaluate outputs before deployment, especially in sensitive or safety-critical applications. Fine-tuning with task/domain-specific Sinhala data is required for robustness.
---
## How to Get Started with the Model
### Install dependencies
```bash
pip install unsloth
pip install datasets==2.21.0
pip install pandas==2.1.4
```
### Import dependencies
```python
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import AutoTokenizer, TextStreamer, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset, DatasetDict, concatenate_datasets, Dataset
from collections import Counter, defaultdict
import os
import sys
import torch
import pandas as pd
```
### Set the loading configuration
```python
max_seq_length = 2048  # Choose any; Unsloth supports RoPE scaling internally
dtype = None           # None for auto detection; float16 for Tesla T4/V100, bfloat16 for Ampere+
load_in_4bit = False   # Set to True to use 4-bit quantization and reduce memory usage
model_name = "polyglots/SinLlama_v01"
```
### Load the model
```python
model, _ = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    resize_model_vocab=139336,  # size of the extended Sinhala vocabulary
)
```
### Load our extended tokenizer
```python
tokenizer = AutoTokenizer.from_pretrained("polyglots/Extended-Sinhala-LLaMA")
model.resize_token_embeddings(len(tokenizer))
```
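### Run inference (example)
With the model and extended tokenizer loaded, generation works as with any Unsloth model. A minimal sketch (the prompt and generation settings are illustrative):
```python
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference path

prompt = "ශ්‍රී ලංකාව"  # replace with your own Sinhala prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=64)
```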
## Training Details
### Training Data
- **Pretraining:** 10.7M Sinhala sentences (303.9M tokens) from MADLAD-400 and CulturaX, filtered for quality and cleaned.
- **Fine-tuning:**
- Sentiment Analysis (~12.5K samples)
- Writing Style Classification (~9K samples)
- Sinhala News Category Classification (~3.3K samples)
### Training Procedure
- **Tokenizer:** Extended Llama-3 tokenizer with Sinhala-specific tokens using `tiktoken`.
- **Continual Pretraining:** Based on the Chinese-Llama codebase, with the block size reduced from 1024 to 512 for GPU memory compatibility.
- **Fine-tuning:** LoRA-based parameter-efficient fine-tuning with Alpaca-style prompts (see the sketch after the hyperparameters below).
#### Training Hyperparameters
- Mixed precision (fp16/bf16) training
- LoRA adapters for efficient fine-tuning
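The sketch below illustrates this LoRA fine-tuning step with Unsloth and TRL, assuming the `model`, `tokenizer`, and `max_seq_length` from the getting-started section above and a `train_dataset` with `sentence` and `label` columns; the column names, prompt template, and hyperparameter values are illustrative rather than the authors' exact configuration:

```python
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import TrainingArguments
from trl import SFTTrainer

# Attach LoRA adapters (rank, alpha, and target modules are illustrative values)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Alpaca-style prompt for a Sinhala sentiment example (template is an assumption)
def to_prompt(example):
    return {
        "text": (
            "### Instruction:\nClassify the sentiment of the following Sinhala sentence.\n\n"
            f"### Input:\n{example['sentence']}\n\n"
            f"### Response:\n{example['label']}" + tokenizer.eos_token
        )
    }

train_dataset = train_dataset.map(to_prompt)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        output_dir="sinllama-lora",
        per_device_train_batch_size=8,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),  # mixed precision, matching the hyperparameters above
        bf16=is_bfloat16_supported(),
        logging_steps=10,
    ),
)
trainer.train()
```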
---
## Evaluation
### Testing Data
- Sinhala sentiment, writing style, and news categorization datasets.
- Splits: 80/10/10 with stratified sampling.
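A minimal sketch of such a split with the `datasets` library (the dataset ID and label column name are placeholders; `stratify_by_column` requires the column to be a `ClassLabel` feature):

```python
from datasets import load_dataset

# Placeholder dataset ID; substitute the actual Sinhala classification dataset
ds = load_dataset("path/to/sinhala-classification-dataset", split="train")
ds = ds.class_encode_column("label")  # make the label column a ClassLabel so we can stratify

# 80% train, then split the remaining 20% evenly into dev and test
split = ds.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)
dev_test = split["test"].train_test_split(test_size=0.5, stratify_by_column="label", seed=42)

train_ds, dev_ds, test_ds = split["train"], dev_test["train"], dev_test["test"]
```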
### Metrics
- Precision, Recall, F1-score
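These can be computed with scikit-learn; the sketch below uses macro averaging, which is an assumption rather than a detail stated in the paper:

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder gold labels and model predictions for a Sinhala sentiment task
y_true = ["positive", "negative", "neutral", "positive"]
y_pred = ["positive", "negative", "positive", "positive"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"precision={precision:.4f}  recall={recall:.4f}  f1={f1:.4f}")
```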
### Results
| Model | Writing Style F1 | News F1 | Sentiment F1 |
|-------------------------|-----------------|---------|--------------|
| Llama-3-8B base | 24.50 | 19.03 | 36.29 |
| Llama-3-8B base finetuned | 49.45 | 61.14 | 59.35 |
| Llama-3-8B instruct finetuned | 42.25 | 47.81 | 68.78 |
| **SinLlama finetuned** | **58.89** | **86.40** | **72.47** |
**Summary:** SinLlama outperforms both the base and instruct variants of Llama-3-8B when fine-tuned, especially on news categorization and sentiment tasks.
---
## Environmental Impact
- **Hardware Type:** GPUs (not specified, likely A100-class)
- **Hours used:** Not reported
- **Cloud Provider:** CSIR & Emojot infrastructure
- **Compute Region:** India & Sri Lanka
- **Carbon Emitted:** Not reported
---
## Technical Specifications
### Model Architecture and Objective
- Decoder-only transformer (Llama-3-8B backbone)
- Autoregressive pretraining objective
- Sinhala vocabulary-extended tokenizer
### Compute Infrastructure
- **Hardware:** GPUs provided by CSIR-CSIO and Emojot
- **Software:** Hugging Face `transformers`, PEFT, LoRA, `tiktoken`
---
## Citation
**BibTeX:**
```bibtex
@article{aravinda2025sinllama,
title={SinLlama -- A Large Language Model for Sinhala},
author={Aravinda, H W K and Sirajudeen, Rashad and Karunathilake, Samith and de Silva, Nisansa and Ranathunga, Surangika and Kaur, Rishemjit},
journal={arXiv preprint arXiv:2508.09115},
year={2025}
}
```
**APA:**
Aravinda, H. W. K., Sirajudeen, R., Karunathilake, S., de Silva, N., Kaur, R., & Ranathunga, S. (2025). *SinLlama -- A Large Language Model for Sinhala*. arXiv preprint arXiv:2508.09115.
---
## Model Card Authors
- Based on information provided by the SinLlama authors
## Model Card Contact
- [polyglots on Hugging Face](https://huggingface.co/polyglots)
### Framework versions
- PEFT 0.13.2
- Transformers (latest at time of release)