--- |
|
|
datasets: |
|
|
- polyglots/MADLAD_CulturaX_cleaned |
|
|
language: |
|
|
- si |
|
|
metrics: |
|
|
- precision |
|
|
- recall |
|
|
- f1 |
|
|
base_model: |
|
|
- meta-llama/Meta-Llama-3-8B |
|
|
library_name: peft |
|
|
--- |
|
|
|
|
|
|
|
# Model Card for SinLlama |
|
|
|
|
|
SinLlama is the first large language model specifically extended for Sinhala. It is based on Meta-Llama-3-8B and adapted through tokenizer vocabulary extension and continual pretraining on a 10.7M-sentence Sinhala corpus. SinLlama significantly improves coverage and performance on Sinhala NLP tasks compared with the base and instruct versions of Llama-3-8B.
|
|
|
|
|
**Disclaimer:** This is a base model that has NOT been instruct-tuned, so task-specific fine-tuning is still required before downstream use.
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
SinLlama is a decoder-only large language model designed to improve NLP performance for Sinhala, a low-resource Indo-Aryan language spoken by roughly 20 million people in Sri Lanka. The model was developed by extending the Llama-3-8B tokenizer with Sinhala-specific vocabulary and performing continual pretraining on a cleaned and diverse 10.7M-sentence Sinhala corpus.
|
|
|
|
|
Subsequent fine-tuning on Sinhala classification datasets (news categorization, sentiment analysis, and writing style classification) shows significant improvements over baseline Llama-3-8B models. |
|
|
|
|
|
- **Developed by:** H.W.K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Rishemjit Kaur, Surangika Ranathunga
|
|
- **Funded by:** CSIR - Central Scientific Instruments Organization (India), Emojot (Pvt) Ltd
|
|
- **Shared by:** Polyglots team |
|
|
- **Model type:** Decoder-only autoregressive transformer LLM |
|
|
- **Language(s) (NLP):** Sinhala (සිංහල) |
|
|
- **License:** Same as base model (Meta Llama 3 license) |
|
|
- **Finetuned from model:** meta-llama/Meta-Llama-3-8B |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [Hugging Face - SinLlama v01](https://huggingface.co/polyglots/SinLlama_v01) |
|
|
- **Paper:** [SinLlama: A Large Language Model for Sinhala](https://arxiv.org/abs/2508.09115v2) |
|
|
- **Dataset:** [MADLAD+CulturaX (cleaned Sinhala subset)](https://huggingface.co/datasets/polyglots/MADLAD_CulturaX_cleaned) |
|
|
|
|
|
--- |
|
|
|
|
|
### SinLlama Model Creation |
|
|
 |
|
|
|
|
|
## Uses |
|
|
|
|
|
|
|
|
### Downstream Use |
|
|
- Instruction tuning for Sinhala dialogue systems, text classification, and other downstream tasks
|
|
- Cross-lingual applications involving Sinhala |
|
|
- Educational and research applications in low-resource NLP |
|
|
|
|
|
### Out-of-Scope Use |
|
|
- Applications requiring high accuracy in non-Sinhala languages (performance may degrade due to adaptation focus on Sinhala) |
|
|
- Sensitive domains (e.g., healthcare, legal) without rigorous validation |
|
|
- Malicious generation (hate speech, disinformation) |
|
|
|
|
|
--- |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
- **Bias:** Sinhala corpora may reflect sociocultural biases (e.g., political, gender, religious biases). |
|
|
- **Limitations:** The model may underperform on complex reasoning tasks and in languages other than Sinhala; writing-style classification was observed to be particularly challenging.
|
|
- **Risk:** Misuse in spreading misinformation or biased outputs in Sinhala. |
|
|
|
|
|
### Recommendations |
|
|
Users should carefully evaluate outputs before deployment, especially in sensitive or safety-critical applications. Fine-tuning with task/domain-specific Sinhala data is required for robustness. |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
### Install dependencies |
|
|
```bash
pip install unsloth
pip install datasets==2.21.0
pip install pandas==2.1.4
```
|
|
|
|
|
### Import dependencies |
|
|
```python |
|
|
from unsloth import FastLanguageModel, is_bfloat16_supported |
|
|
from transformers import TextStreamer, AutoTokenizer |
|
|
import torch |
|
|
from datasets import load_dataset, DatasetDict, concatenate_datasets, Dataset |
|
|
from collections import Counter, defaultdict |
|
|
import os |
|
|
import sys |
|
|
|
|
|
from trl import SFTTrainer |
|
|
from transformers import TrainingArguments
|
|
import pandas as pd |
|
|
``` |
|
|
|
|
|
### Set the model loading configuration
|
|
```python |
|
|
max_seq_length = 2048  # choose any; unsloth supports RoPE scaling internally
dtype = None           # None for auto detection; float16 for Tesla T4/V100, bfloat16 for Ampere+
load_in_4bit = False   # set True to use 4-bit quantization and reduce memory usage
model_name = "polyglots/SinLlama_v01"
|
|
``` |
|
|
|
|
|
### Load the model |
|
|
```python |
|
|
model, _ = FastLanguageModel.from_pretrained( |
|
|
model_name = model_name, |
|
|
max_seq_length = max_seq_length, |
|
|
dtype = dtype, |
|
|
load_in_4bit = load_in_4bit, |
|
|
    resize_model_vocab = 139336  # size of the extended Sinhala vocabulary
|
|
) |
|
|
``` |
|
|
|
|
|
### Load our extended tokenizer |
|
|
```python |
|
|
tokenizer = AutoTokenizer.from_pretrained("polyglots/Extended-Sinhala-LLaMA") |
|
|
model.resize_token_embeddings(len(tokenizer)) |
|
|
``` |
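
### Generate text

Since SinLlama is a base model (not instruct-tuned), plain text completion is the simplest way to sanity-check the loaded weights. The snippet below is a minimal sketch; the Sinhala prompt and the generation settings are illustrative and not part of the original release.

```python
# Switch unsloth into its faster inference mode
FastLanguageModel.for_inference(model)

# "ශ්‍රී ලංකාව" ("Sri Lanka") is just an example prompt
inputs = tokenizer("ශ්‍රී ලංකාව", return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer)

_ = model.generate(**inputs, streamer=streamer, max_new_tokens=64)
```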
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **Pretraining:** 10.7M Sinhala sentences (303.9M tokens) from MADLAD-400 and CulturaX, filtered for quality and cleaned.
|
|
- **Fine-tuning:** |
|
|
- Sentiment Analysis (~12.5K samples) |
|
|
- Writing Style Classification (~9K samples) |
|
|
- Sinhala News Category Classification (~3.3K samples) |
|
|
|
|
|
### Training Procedure |
|
|
- **Tokenizer:** Extended Llama-3 tokenizer with Sinhala-specific tokens using `tiktoken`. |
|
|
- **Continual Pretraining:** Performed with the Chinese-LLaMA codebase, with the block size reduced from 1024 to 512 for GPU memory compatibility.
|
|
- **Fine-tuning:** LoRA-based parameter-efficient fine-tuning with Alpaca-style prompts (a minimal sketch follows).
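
As an illustration of the LoRA setup and Alpaca-style prompt formatting, the sketch below reuses the unsloth API from the Getting Started section; the rank, target modules, prompt wording, and column names are assumed defaults, not the exact values reported by the authors.

```python
# Attach LoRA adapters to the (continually pretrained) SinLlama model.
# All hyperparameters below are illustrative, not the authors' settings.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing=True,
)

# Alpaca-style template for the classification tasks (assumed column names)
alpaca_prompt = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
)

def format_example(example):
    text = alpaca_prompt.format(example["instruction"], example["input"], example["label"])
    return {"text": text + tokenizer.eos_token}
```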
|
|
|
|
|
#### Training Hyperparameters |
|
|
- Mixed precision (fp16/bf16) training, as shown in the sketch after this list
|
|
- LoRA adapters for efficient fine-tuning |
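
A hedged sketch of the mixed-precision training setup, reusing `SFTTrainer`, `TrainingArguments`, `Dataset`, and `is_bfloat16_supported` from the imports above (and assuming the trl version installed alongside unsloth at release time); batch size, learning rate, and epoch count are placeholders rather than reported values.

```python
# Tiny placeholder dataset; in practice this is the Alpaca-formatted task data
train_dataset = Dataset.from_dict(
    {"text": ["### Instruction:\nClassify the sentiment.\n\n### Input:\n...\n\n### Response:\nPositive"]}
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),  # fall back to fp16 on pre-Ampere GPUs
        bf16=is_bfloat16_supported(),      # prefer bf16 on Ampere or newer
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```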
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data |
|
|
- Sinhala sentiment, writing style, and news categorization datasets. |
|
|
- Splits: 80/10/10 with stratified sampling. |
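
As a hedged illustration of an 80/10/10 stratified split, the sketch below uses scikit-learn's `train_test_split`; the placeholder texts and labels, and the use of scikit-learn itself, are assumptions for demonstration only.

```python
from sklearn.model_selection import train_test_split

# Placeholder examples and labels standing in for one of the classification datasets
texts = ["උදාහරණ වාක්‍යය {}".format(i) for i in range(100)]
labels = ["positive" if i % 2 == 0 else "negative" for i in range(100)]

# 80% train, then split the remaining 20% evenly into dev and test
train_x, rest_x, train_y, rest_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
dev_x, test_x, dev_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42
)
```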
|
|
|
|
|
### Metrics |
|
|
- Precision, Recall, F1-score |
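
A minimal sketch of how these scores can be computed with scikit-learn; macro averaging and the placeholder labels are assumptions, since the exact averaging scheme is not stated here.

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder gold labels and predictions for one classification task
y_true = ["politics", "sports", "politics", "business", "sports"]
y_pred = ["politics", "sports", "business", "business", "sports"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Precision={precision:.2f}  Recall={recall:.2f}  F1={f1:.2f}")
```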
|
|
|
|
|
### Results |
|
|
|
|
|
| Model | Writing Style F1 | News F1 | Sentiment F1 | |
|
|
|-------------------------|-----------------|---------|--------------| |
|
|
| Llama-3-8B base | 24.50 | 19.03 | 36.29 | |
|
|
| Llama-3-8B base finetuned | 49.45 | 61.14 | 59.35 | |
|
|
| Llama-3-8B instruct finetuned | 42.25 | 47.81 | 68.78 | |
|
|
| **SinLlama finetuned** | **58.89** | **86.40** | **72.47** | |
|
|
|
|
|
**Summary:** SinLlama outperforms both base and instruct Llama-3-8B when fine-tuned, especially on news categorization and sentiment tasks.
|
|
|
|
|
--- |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
- **Hardware Type:** GPUs (not specified, likely A100-class) |
|
|
- **Hours used:** Not reported |
|
|
- **Cloud Provider:** CSIR & Emojot infrastructure
|
|
- **Compute Region:** India & Sri Lanka |
|
|
- **Carbon Emitted:** Not reported |
|
|
|
|
|
--- |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
- Decoder-only transformer (Llama-3-8B backbone) |
|
|
- Autoregressive pretraining objective |
|
|
- Sinhala vocabulary-extended tokenizer |
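
To illustrate the vocabulary-extended tokenizer, the sketch below loads the extended tokenizer referenced in the Getting Started section; the example sentence is illustrative only.

```python
from transformers import AutoTokenizer

# Extended tokenizer released alongside SinLlama
tokenizer = AutoTokenizer.from_pretrained("polyglots/Extended-Sinhala-LLaMA")

sentence = "මෙය සිංහල වාක්‍යයකි."  # "This is a Sinhala sentence."
print(tokenizer.tokenize(sentence))  # Sinhala-specific tokens from the extended vocabulary
print(len(tokenizer))                # total vocabulary size (139336 per this card)
```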
|
|
|
|
|
### Compute Infrastructure |
|
|
- **Hardware:** GPUs provided by CSIR-CSIO and Emojot
|
|
- **Software:** Hugging Face `transformers`, PEFT, LoRA, `tiktoken` |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
**BibTeX:** |
|
|
```bibtex |
|
|
@article{aravinda2025sinllama,
  title={SinLlama -- A Large Language Model for Sinhala},
  author={Aravinda, H W K and Sirajudeen, Rashad and Karunathilake, Samith and de Silva, Nisansa and Ranathunga, Surangika and Kaur, Rishemjit},
  journal={arXiv preprint arXiv:2508.09115},
  year={2025}
}
|
|
``` |
|
|
|
|
|
**APA:** |
|
|
Aravinda, H. W. K., Sirajudeen, R., Karunathilake, S., de Silva, N., Kaur, R., & Ranathunga, S. (2025). *SinLlama -- A Large Language Model for Sinhala*. arXiv preprint arXiv:2508.09115. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Card Authors |
|
|
- Based on information from the SinLlama authors
|
|
|
|
|
## Model Card Contact |
|
|
- [polyglots on Hugging Face](https://huggingface.co/polyglots) |
|
|
|
|
|
### Framework versions |
|
|
- PEFT 0.13.2 |
|
|
- Transformers (latest at time of release) |