Sinhala LLaMA 1B (Continual Pretrained on Randomly Curated Corpus)
LLaMA 3.2 1B continually pretrained on a randomly sampled Sinhala corpus, using an extended Sinhala tokenizer. This is the random sampling baseline in a three-way corpus diversity ablation study.
Model variants in this series:
- Minuri/sinhala-llama-1b-corpus-news - Trained on news-only corpus (Model A) | Perplexity: 14.67
- Minuri/sinhala-llama-1b-corpus-random - Trained on random corpus (Model B) | Perplexity: 10.86 - this repo
- Minuri/sinhala-llama-1b-corpus-diverse - Trained on diversity-optimized corpus (Model C) | Perplexity: 10.49 ✅ Best
Model Description
This model is the result of two-stage continual pretraining (CPT) of meta-llama/Llama-3.2-1B on Minuri/sinhala-corpus-b-random-1m (1M randomly sampled Sinhala sentences), using an extended tokenizer (Minuri/sinhala-llama-3.2-1b-tokenizer) that adds 7,843 Sinhala-specific tokens to the base vocabulary. It serves as a random sampling baseline to compare against the domain-controlled (A) and diversity-optimized (C) corpora.
Training Details
| Parameter | Value |
|---|---|
| Base model | meta-llama/Llama-3.2-1B |
| Tokenizer | Minuri/sinhala-llama-3.2-1b-tokenizer |
| Training corpus | Minuri/sinhala-corpus-b-random-1m (1M sentences) |
| Training approach | Two-stage continual pretraining |
| Extended vocab size | 136,099 tokens |
| Token reduction on Sinhala | ~70.4% |
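
The vocabulary size and token reduction figures in the table can be sanity-checked with a little arithmetic. The sketch below assumes the base LLaMA 3.2 vocabulary holds 128,256 entries, and that "token reduction" means the relative drop in token count when the same text is encoded with the extended tokenizer instead of the base one. The card does not state either definition, and the token counts in the example call are hypothetical.

```python
# Assumption: vocabulary size of meta-llama/Llama-3.2-1B before extension.
BASE_VOCAB_SIZE = 128_256
# From the model description: Sinhala-specific tokens added to the vocabulary.
ADDED_SINHALA_TOKENS = 7_843

extended_vocab_size = BASE_VOCAB_SIZE + ADDED_SINHALA_TOKENS


def token_reduction_pct(base_token_count: int, extended_token_count: int) -> float:
    """Relative drop in token count (percent) for the same text.

    Assumed definition: (base - extended) / base * 100.
    """
    if base_token_count <= 0:
        raise ValueError("base_token_count must be positive")
    return 100.0 * (base_token_count - extended_token_count) / base_token_count


# Hypothetical counts for one Sinhala passage, chosen only to illustrate
# what a reduction of roughly 70.4% looks like.
example_reduction = token_reduction_pct(1000, 296)

print(f"Extended vocab size: {extended_vocab_size:,}")
print(f"Example token reduction: {example_reduction:.1f}%")
```

Under these assumptions the sum matches the extended vocabulary size listed in the table. Fewer tokens per sentence means more Sinhala text fits in the same context window.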
Evaluation Results
Evaluated on Minuri/sinhala-test-set-50k using token-level perplexity. Baseline perplexity is measured using the base LLaMA 3.2 1B model with the extended Sinhala tokenizer - the high baseline reflects the mismatch between the base model weights and the new vocabulary. When evaluated with the original base tokenizer, the base model perplexity is ~2.
| Model | Corpus | Baseline PPL | Trained PPL | Improvement | Loss |
|---|---|---|---|---|---|
| Model A (corpus-news) | News-only | 66,448.86 | 14.67 | 99.98% | 2.686 |
| Model B (this repo) | Random | 66,454.47 | 10.86 | 99.98% | 2.385 |
| Model C (corpus-diverse) | Diversity-optimized | 66,439.92 | 10.49 | 99.98% | 2.351 ✅ |
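
The columns of the results table are related to one another. As a rough check, the sketch below assumes that Loss is the mean per-token cross-entropy in nats, so that perplexity is `exp(loss)`, and that Improvement is the relative drop from baseline to trained perplexity. The card does not state these formulas, but the reported numbers are consistent with them up to rounding.

```python
import math

# Rows copied from the results table: (baseline PPL, trained PPL, loss).
results = {
    "Model A (news-only)": (66448.86, 14.67, 2.686),
    "Model B (random)": (66454.47, 10.86, 2.385),
    "Model C (diversity-optimized)": (66439.92, 10.49, 2.351),
}


def perplexity_from_loss(loss: float) -> float:
    """Perplexity as the exponential of mean cross-entropy (assumed in nats)."""
    return math.exp(loss)


def improvement_pct(baseline_ppl: float, trained_ppl: float) -> float:
    """Relative perplexity reduction in percent (assumed definition)."""
    return 100.0 * (baseline_ppl - trained_ppl) / baseline_ppl


for name, (baseline, trained, loss) in results.items():
    print(
        f"{name}: exp(loss)={perplexity_from_loss(loss):.2f}, "
        f"reported PPL={trained}, "
        f"improvement={improvement_pct(baseline, trained):.2f}%"
    )
```

Because the baseline perplexities are so large, the Improvement column is nearly identical for all three models. The Trained PPL and Loss columns are the more useful ones for comparing the corpora.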
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights from this repo
tokenizer = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-1b-corpus-random")
model = AutoModelForCausalLM.from_pretrained("Minuri/sinhala-llama-1b-corpus-random")

# Encode a Sinhala prompt ("Sri Lanka") and generate a continuation
inputs = tokenizer("ශ්‍රී ලංකාව", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Intended Uses
- Sinhala text generation and language modelling
- Random sampling baseline for corpus diversity ablation studies
- Low-resource NLP research and benchmarking
Limitations
- 1B parameter model with limited reasoning capability without SFT
- Not instruction-tuned - outputs are continuations, not responses
Related Repositories
| Repo | Description |
|---|---|
| Minuri/sinhala-llama-3.2-1b-tokenizer | Extended Sinhala tokenizer |
| Minuri/sinhala-corpus-b-random-1m | Training corpus |
| Minuri/sinhala-test-set-50k | Evaluation test set |
| Minuri/sinhala-corpus-a-news-1m | Corpus A - news-only |
| Minuri/sinhala-corpus-c-diverse-1m | Corpus C - diversity-optimized |
| Minuri/diverse_sinhala_dataset | Full parent corpus |
License
This model is derived from meta-llama/Llama-3.2-1B and is subject to the LLaMA 3.2 Community License.