Sinhala LLaMA 1B (Continual Pretrained on Randomly Curated Corpus)

A LLaMA 3.2 1B model continually pretrained on a randomly sampled Sinhala corpus, using an extended Sinhala tokenizer. This is the random sampling baseline in a three-way corpus diversity ablation study.

Model variants in this series:

Minuri/sinhala-llama-1b-corpus-news: trained on the news-only corpus (Model A) | Perplexity: 14.67

Minuri/sinhala-llama-1b-corpus-random: trained on the random corpus (Model B, this repo) | Perplexity: 10.86

Minuri/sinhala-llama-1b-corpus-diverse: trained on the diversity-optimized corpus (Model C) | Perplexity: 10.49 ✅ Best

Model Description

This model is the result of two-stage continual pretraining (CPT) of meta-llama/Llama-3.2-1B on Minuri/sinhala-corpus-b-random-1m (1M randomly sampled Sinhala sentences). It uses an extended tokenizer (Minuri/sinhala-llama-3.2-1b-tokenizer) that adds 7,843 Sinhala-specific tokens to the base vocabulary. The model serves as the random sampling baseline against which the domain-controlled, news-only corpus (A) and the diversity-optimized corpus (C) are compared.

Training Details

Parameter	Value
Base model	`meta-llama/Llama-3.2-1B`
Tokenizer	`Minuri/sinhala-llama-3.2-1b-tokenizer`
Training corpus	`Minuri/sinhala-corpus-b-random-1m` (1M sentences)
Training approach	Two-stage continual pretraining
Extended vocab size	136,099 tokens
Token reduction on Sinhala	~70.4%
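
The vocabulary and token-reduction figures above can be sanity-checked with a little arithmetic. The sketch below assumes that the base vocabulary size is 128,256 (the value in the public Llama 3.2 config) and that "token reduction" means the relative drop in token count when the same text is encoded with the extended tokenizer instead of the base one. The card does not spell out that definition, and the token counts in the example are hypothetical.

```python
BASE_VOCAB_SIZE = 128_256   # assumption: vocab_size from the public Llama 3.2 config
ADDED_TOKENS = 7_843        # Sinhala-specific tokens reported in this card


def token_reduction(base_tokens: int, extended_tokens: int) -> float:
    """Percentage fewer tokens needed with the extended tokenizer.

    Assumed definition: 100 * (base - extended) / base, computed on the same text.
    """
    if base_tokens <= 0:
        raise ValueError("base_tokens must be positive")
    return 100.0 * (base_tokens - extended_tokens) / base_tokens


# The extended vocabulary size in the table is the base size plus the added tokens.
extended_vocab_size = BASE_VOCAB_SIZE + ADDED_TOKENS
assert extended_vocab_size == 136_099

# Hypothetical counts: a text that needs 1,000 base tokens and 296 extended tokens.
example_reduction = token_reduction(1000, 296)
print(f"{example_reduction:.1f}% fewer tokens")
```

In practice the two counts would come from encoding the same Sinhala text with both tokenizers and comparing the lengths of the resulting ID lists.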

Evaluation Results

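Evaluated on Minuri/sinhala-test-set-50k using token-level perplexity. Baseline perplexity is measured with the base LLaMA 3.2 1B weights and the extended Sinhala tokenizer. The very high baseline reflects the mismatch between the base model weights and the new vocabulary, since the base model was never trained on the added tokens. When evaluated with the original base tokenizer, the base model perplexity is ~2. Token-level perplexity depends on how text is split into tokens, so that figure is not directly comparable with the values in the table below.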
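
As an illustration of what token-level perplexity means, the sketch below pools per-token log-probabilities over a whole test set and exponentiates the mean negative log-likelihood. This is one common convention, not necessarily the exact evaluation script used for this card. The function name `corpus_perplexity` and the toy log-probabilities are hypothetical.

```python
import math
from typing import Iterable, Sequence


def corpus_perplexity(token_logprobs: Iterable[Sequence[float]]) -> float:
    """Token-level perplexity over a corpus.

    Each inner sequence holds the natural-log probabilities the model assigned
    to the tokens of one sentence. All tokens are pooled, so long sentences
    weigh more than short ones.
    """
    total_nll = 0.0
    total_tokens = 0
    for sentence in token_logprobs:
        total_nll -= sum(sentence)
        total_tokens += len(sentence)
    if total_tokens == 0:
        raise ValueError("no tokens to evaluate")
    return math.exp(total_nll / total_tokens)


# Toy data: every token is predicted with probability 0.5, so the perplexity
# is 2 regardless of how the tokens are grouped into sentences.
toy = [[math.log(0.5)] * 3, [math.log(0.5)] * 5]
toy_ppl = corpus_perplexity(toy)
print(toy_ppl)
```

Because the average is taken per token, changing the tokenizer changes both the number of tokens and the meaning of each prediction step, which is why perplexities measured with different tokenizers should be compared with care.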

Model	Corpus	Baseline PPL	Trained PPL	Improvement	Loss
Model A (`corpus-news`)	News-only	66,448.86	14.67	99.98%	2.686
Model B (this repo)	Random	66,454.47	10.86	99.98%	2.385
Model C (`corpus-diverse`)	Diversity-optimized	66,439.92	10.49	99.98%	2.351 ✅
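
The columns in the table are related by simple formulas, which the following sketch checks. It assumes that Loss is the mean cross-entropy in nats, so that Trained PPL is approximately exp(Loss), and that Improvement is the relative perplexity reduction from the baseline. Both readings are inferred from the numbers rather than stated in the card, and the dictionary name `RESULTS` is only illustrative.

```python
import math

# (baseline PPL, trained PPL, loss) copied from the table above.
RESULTS = {
    "Model A (news-only)": (66448.86, 14.67, 2.686),
    "Model B (random)": (66454.47, 10.86, 2.385),
    "Model C (diversity-optimized)": (66439.92, 10.49, 2.351),
}


def improvement_pct(baseline_ppl: float, trained_ppl: float) -> float:
    """Relative perplexity reduction, in percent."""
    return 100.0 * (baseline_ppl - trained_ppl) / baseline_ppl


for name, (baseline, trained, loss) in RESULTS.items():
    ppl_from_loss = math.exp(loss)  # perplexity implied by the reported loss
    print(
        f"{name}: exp(loss)={ppl_from_loss:.2f}, "
        f"reported PPL={trained:.2f}, "
        f"improvement={improvement_pct(baseline, trained):.2f}%"
    )
```

Small differences between exp(loss) and the reported perplexity are expected, since the loss is rounded to three decimals. The improvement column is nearly identical across models because the baseline is so large, so the trained perplexity is the more informative figure when comparing the three corpora.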

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-1b-corpus-random")
model = AutoModelForCausalLM.from_pretrained("Minuri/sinhala-llama-1b-corpus-random")

inputs = tokenizer("ශ්‍රී ලංකාව", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
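
Since this is a base model that continues text rather than following instructions, sampling settings can noticeably change the output. The sketch below wraps the usage above in a helper with sampling enabled. The helper names (`sampling_kwargs`, `generate_text`) and the specific values for `temperature`, `top_p`, and `repetition_penalty` are illustrative assumptions, not settings recommended by the authors, and loading in bfloat16 is likewise an assumption.

```python
REPO_ID = "Minuri/sinhala-llama-1b-corpus-random"


def sampling_kwargs(max_new_tokens: int = 50) -> dict:
    """Illustrative generation settings; tune these for your own use case."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,          # sample instead of greedy decoding
        "temperature": 0.7,         # assumed value, not from the model card
        "top_p": 0.9,               # nucleus sampling cutoff (assumed value)
        "repetition_penalty": 1.1,  # mild penalty against repeated tokens (assumed)
    }


def generate_text(prompt: str, **overrides) -> str:
    """Load the model and return a sampled continuation of `prompt`."""
    # Imported lazily so the settings helper works without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID, torch_dtype=torch.bfloat16)
    inputs = tokenizer(prompt, return_tensors="pt")
    kwargs = {**sampling_kwargs(), **overrides}
    with torch.no_grad():
        outputs = model.generate(**inputs, **kwargs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Calling `generate_text(prompt)` with a Sinhala prompt downloads the weights on first use. Passing `do_sample=False` as an override falls back to greedy decoding, which is deterministic but tends to repeat itself more.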

Intended Uses

Sinhala text generation and language modelling
Random sampling baseline for corpus diversity ablation studies
Low-resource NLP research and benchmarking

Limitations

1B-parameter model; reasoning capability is limited without supervised fine-tuning (SFT)
Not instruction-tuned: outputs are continuations of the prompt, not responses to it

Related Repositories

Repo	Description
`Minuri/sinhala-llama-3.2-1b-tokenizer`	Extended Sinhala tokenizer
`Minuri/sinhala-corpus-b-random-1m`	Training corpus
`Minuri/sinhala-test-set-50k`	Evaluation test set
`Minuri/sinhala-corpus-a-news-1m`	Corpus A - news-only
`Minuri/sinhala-corpus-c-diverse-1m`	Corpus C - diversity-optimized
`Minuri/diverse_sinhala_dataset`	Full parent corpus

License

This model is derived from meta-llama/Llama-3.2-1B and is subject to the LLaMA 3.2 Community License.
