PEFT
Safetensors
Sinhala
NisansaDdS committed · verified
Commit 0438d73 · 1 Parent(s): 6c741fd

Updated the card

Files changed (1)
README.md +96 -129
README.md CHANGED
@@ -1,202 +1,169 @@
- ---
  base_model: meta-llama/Meta-Llama-3-8B
  library_name: peft
  ---

- # Model Card for Model ID
- <!-- Provide a quick summary of what the model is/does. -->

  ## Model Details

  ### Model Description

- <!-- Provide a longer summary of what this model is. -->

- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

- ### Model Sources [optional]

- <!-- Provide the basic links for the model. -->

- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]

  ## Uses

- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

  ### Direct Use

- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
- [More Information Needed]

- ### Downstream Use [optional]
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
- [More Information Needed]

  ### Out-of-Scope Use

- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- [More Information Needed]

  ## Bias, Risks, and Limitations

- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
- [More Information Needed]

  ### Recommendations

- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

  ## How to Get Started with the Model

- Use the code below to get started with the model.
- [More Information Needed]

  ## Training Details

  ### Training Data

- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- [More Information Needed]

  ### Training Procedure

- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- #### Preprocessing [optional]
- [More Information Needed]

  #### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
- [More Information Needed]

  ## Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->

- ### Testing Data, Factors & Metrics

- #### Testing Data
- <!-- This should link to a Dataset Card if possible. -->
- [More Information Needed]

- #### Factors
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
- [More Information Needed]

- #### Metrics
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
- [More Information Needed]

  ### Results

- [More Information Needed]

- #### Summary

- ## Model Examination [optional]
- <!-- Relevant interpretability work for the model goes here -->
- [More Information Needed]

  ## Environmental Impact

- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]

- ## Technical Specifications [optional]

  ### Model Architecture and Objective

- [More Information Needed]

  ### Compute Infrastructure

- [More Information Needed]

- #### Hardware
- [More Information Needed]

- #### Software
- [More Information Needed]

- ## Citation [optional]

- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

  **BibTeX:**

- [More Information Needed]

  **APA:**

- [More Information Needed]

- ## Glossary [optional]
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
- [More Information Needed]

- ## More Information [optional]
- [More Information Needed]

- ## Model Card Authors [optional]
- [More Information Needed]

  ## Model Card Contact

- [More Information Needed]

  ### Framework versions

- - PEFT 0.13.2
 
 
  base_model: meta-llama/Meta-Llama-3-8B
  library_name: peft
  ---

+ # Model Card for SinLlama

+ SinLlama is the first large language model specifically extended for Sinhala. It is based on Meta-Llama-3-8B and adapted through tokenizer vocabulary extension and continual pretraining on a 10.7M-sentence Sinhala corpus. SinLlama significantly improves coverage and performance on Sinhala NLP tasks compared to the base and instruct versions of Llama-3-8B.

+ ---

  ## Model Details

  ### Model Description

+ SinLlama is a decoder-based large language model designed to improve NLP performance for Sinhala, a low-resource Indo-Aryan language spoken by ~20 million people in Sri Lanka. The model was developed by enhancing the Llama-3-8B tokenizer with Sinhala-specific vocabulary and performing continual pretraining on a cleaned and diverse 10.7M-sentence Sinhala corpus.

+ Subsequent fine-tuning on Sinhala classification datasets (news categorization, sentiment analysis, and writing style classification) shows significant improvements over baseline Llama-3-8B models.

+ - **Developed by:** H.W.K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Rishemjit Kaur, Surangika Ranathunga
+ - **Funded by:** CSIR - Central Scientific Instruments Organization (India), Emojot (Pvt) Ltd
+ - **Shared by:** Polyglots team
+ - **Model type:** Decoder-only autoregressive transformer LLM
+ - **Language(s) (NLP):** Sinhala (සිංහල)
+ - **License:** Same as base model (Meta Llama 3 license)
+ - **Finetuned from model:** meta-llama/Meta-Llama-3-8B

+ ### Model Sources

+ - **Repository:** [Hugging Face - SinLlama v01](https://huggingface.co/polyglots/SinLlama_v01)
+ - **Paper:** [SinLlama: A Large Language Model for Sinhala](https://arxiv.org/abs/2508.09115v2)
+ - **Dataset:** [MADLAD+CulturaX (cleaned Sinhala subset)](https://huggingface.co/datasets/polyglots/MADLAD_CulturaX_cleaned)

+ ---

  ## Uses

  ### Direct Use
+ - Sinhala text generation
+ - Sinhala text classification (an example prompt appears after the lists below)
+ - Sentiment analysis, news categorization, and writing style classification

+ ### Downstream Use
+ - Instruction tuning for Sinhala dialogue systems
+ - Cross-lingual applications involving Sinhala
+ - Educational and research applications in low-resource NLP
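+
+ Classification fine-tuning used Alpaca-style prompts (see Training Procedure), so classification use typically follows the same template. A minimal sketch; the exact instruction wording used during training is not documented in this card, so the prompt below is illustrative only:
+
+ ```python
+ # Illustrative Alpaca-style prompt for Sinhala news-category classification.
+ # The instruction text is an assumption, not the one used in training.
+ PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+ ### Instruction:
+ Classify the following Sinhala text into a news category.
+
+ ### Input:
+ {text}
+
+ ### Response:
+ """
+
+ print(PROMPT.format(text="සිංහල නවතම තාක්‍ෂණ විකාශනය පිළිබඳ පුවතක්"))
+ ```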
 
 
 
 
 
  ### Out-of-Scope Use
+ - Applications requiring high accuracy in languages other than Sinhala (performance may degrade because adaptation focused on Sinhala)
+ - Sensitive domains (e.g., healthcare, legal) without rigorous validation
+ - Malicious generation (hate speech, disinformation)

+ ---

  ## Bias, Risks, and Limitations

+ - **Bias:** Sinhala corpora may reflect sociocultural biases (e.g., political, gender, and religious biases).
+ - **Limitations:** The model may underperform on complex reasoning tasks and in languages other than Sinhala. Writing-style classification proved particularly challenging.
+ - **Risk:** Misuse for spreading misinformation or generating biased outputs in Sinhala.

  ### Recommendations
+ Users should carefully evaluate outputs before deployment, especially in sensitive or safety-critical applications. Fine-tuning with task- or domain-specific Sinhala data is recommended for robustness.

+ ---

  ## How to Get Started with the Model

+ ```python
68
+ from transformers import AutoModelForCausalLM, AutoTokenizer
69
 
70
+ model_name = "polyglots/SinLlama_v01"
71
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
72
+ model = AutoModelForCausalLM.from_pretrained(model_name)
73
+
74
+ text = "සිංහල නවතම තාක්‍ෂණ විකාශනය පිළිබඳ පුවතක්"
75
+ inputs = tokenizer(text, return_tensors="pt")
76
+ outputs = model.generate(**inputs, max_length=100)
77
+ print(tokenizer.decode(outputs[0]))
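+
+ Since this repository hosts a PEFT adapter on top of Meta-Llama-3-8B, the adapter can also be loaded explicitly through the `peft` API. A minimal sketch, assuming access to the gated base model:
+
+ ```python
+ from peft import AutoPeftModelForCausalLM
+
+ # Reads the adapter config, loads the referenced base model, and attaches the adapter.
+ model = AutoPeftModelForCausalLM.from_pretrained("polyglots/SinLlama_v01")
+ ```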

  ## Training Details

  ### Training Data
+ - **Pretraining:** 10.7M Sinhala sentences (303.9M tokens) from MADLAD-400 and CulturaX, filtered for quality and cleaned.
+ - **Fine-tuning:**
+   - Sentiment Analysis (~12.5K samples)
+   - Writing Style Classification (~9K samples)
+   - Sinhala News Category Classification (~3.3K samples)

  ### Training Procedure
+ - **Tokenizer:** Extended the Llama-3 tokenizer with Sinhala-specific tokens using `tiktoken` (a sketch of the vocabulary-extension step follows this list).
+ - **Continual Pretraining:** Used the Chinese-Llama codebase, with the block size reduced from 1024 to 512 for GPU compatibility.
+ - **Fine-tuning:** LoRA-based parameter-efficient fine-tuning with Alpaca-style prompts.
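+
+ A minimal sketch of the vocabulary-extension step using the standard `transformers` API (the actual work extended the tokenizer via `tiktoken`, and the token list here is purely illustrative):
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
+ model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
+
+ # Hypothetical Sinhala tokens; the real vocabulary was mined from the corpus.
+ num_added = tokenizer.add_tokens(["සිංහල", "ලංකා", "පුවත්"])
+
+ # Grow the embedding matrix so each new token gets a trainable row,
+ # which continual pretraining then learns.
+ model.resize_token_embeddings(len(tokenizer))
+ ```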
 
 
 
 
  #### Training Hyperparameters
+ - Mixed precision (fp16/bf16) training
+ - LoRA adapters for efficient fine-tuning (an illustrative configuration follows this list)
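+
+ The exact LoRA hyperparameters are not reported in this card; a hedged sketch of the setup with `peft`, where the rank, alpha, dropout, and target modules are illustrative assumptions:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM
+ from peft import LoraConfig, get_peft_model
+
+ base = AutoModelForCausalLM.from_pretrained(
+     "meta-llama/Meta-Llama-3-8B",
+     torch_dtype=torch.bfloat16,  # mixed-precision-friendly load
+ )
+
+ # Illustrative values, not the ones used for SinLlama.
+ config = LoraConfig(
+     r=16,
+     lora_alpha=32,
+     lora_dropout=0.05,
+     target_modules=["q_proj", "v_proj"],
+     task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(base, config)
+ model.print_trainable_parameters()
+ ```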
+ ---

  ## Evaluation

+ ### Testing Data
+ - Sinhala sentiment, writing style, and news categorization datasets.
+ - Splits: 80/10/10 with stratified sampling.

+ ### Metrics
+ - Precision, Recall, F1-score (a computation sketch follows this list)
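+
+ For reference, a minimal sketch of how these metrics can be computed with scikit-learn; the macro averaging mode is an assumption, since the card does not state it:
+
+ ```python
+ from sklearn.metrics import precision_recall_fscore_support
+
+ # Hypothetical gold and predicted labels for a news-category task.
+ y_true = ["sports", "politics", "tech", "sports"]
+ y_pred = ["sports", "tech", "tech", "sports"]
+
+ precision, recall, f1, _ = precision_recall_fscore_support(
+     y_true, y_pred, average="macro", zero_division=0
+ )
+ print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
+ ```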
 
 
 
 
 
 
 
 
 
 
 
  ### Results

+ | Model                         | Writing Style F1 | News F1   | Sentiment F1 |
+ |-------------------------------|------------------|-----------|--------------|
+ | Llama-3-8B base               | 24.50            | 19.03     | 36.29        |
+ | Llama-3-8B base finetuned     | 49.45            | 61.14     | 59.35        |
+ | Llama-3-8B instruct finetuned | 42.25            | 47.81     | 68.78        |
+ | **SinLlama finetuned**        | **58.89**        | **86.40** | **72.47**    |

+ **Summary:** SinLlama outperforms both the base and instruct versions of Llama-3-8B when fine-tuned, especially on news categorization and sentiment tasks.

+ ---

  ## Environmental Impact

+ - **Hardware Type:** GPUs (not specified, likely A100-class)
+ - **Hours used:** Not reported
+ - **Cloud Provider:** CSIR & Emojot infrastructure
+ - **Compute Region:** India & Sri Lanka
+ - **Carbon Emitted:** Not reported

+ ---

+ ## Technical Specifications

  ### Model Architecture and Objective
+ - Decoder-only transformer (Llama-3-8B backbone)
+ - Autoregressive pretraining objective
+ - Sinhala vocabulary-extended tokenizer

  ### Compute Infrastructure
+ - **Hardware:** GPUs provided by CSIR-CSIO and Emojot
+ - **Software:** Hugging Face `transformers`, PEFT, LoRA, `tiktoken`

+ ---

+ ## Citation

  **BibTeX:**
+ ```bibtex
+ @article{aravinda2025sinllama,
+   title={SinLlama -- A Large Language Model for Sinhala},
+   author={Aravinda, H.W.K. and Sirajudeen, Rashad and Karunathilake, Samith and de Silva, Nisansa and Kaur, Rishemjit and Ranathunga, Surangika},
+   journal={arXiv preprint arXiv:2508.09115},
+   year={2025}
+ }
+ ```

  **APA:**
+ Aravinda, H. W. K., Sirajudeen, R., Karunathilake, S., de Silva, N., Kaur, R., & Ranathunga, S. (2025). *SinLlama -- A Large Language Model for Sinhala*. arXiv preprint arXiv:2508.09115.

+ ---

+ ## Model Card Authors
+ - Compiled from information provided by the SinLlama authors.

  ## Model Card Contact
+ - [polyglots on Hugging Face](https://huggingface.co/polyglots)

  ### Framework versions
+ - PEFT 0.13.2
+ - Transformers (latest at time of release)