Sinhala LLaMA 1B (Continual Pretrained on Randomly Curated Corpus)

LLaMA 3.2 1B, continually pretrained on a randomly sampled Sinhala corpus with an extended Sinhala tokenizer. This is the random-sampling baseline in a three-way corpus diversity ablation study.

Model variants in this series:

  • Minuri/sinhala-llama-1b-corpus-news - Trained on news-only corpus (Model A) | Perplexity: 14.67
  • Minuri/sinhala-llama-1b-corpus-random - Trained on random corpus (Model B) | Perplexity: 10.86 - this repo
  • Minuri/sinhala-llama-1b-corpus-diverse - Trained on diversity-optimized corpus (Model C) | Perplexity: 10.49 ✅ Best

Model Description

This model was produced by two-stage continual pretraining (CPT) of meta-llama/Llama-3.2-1B on Minuri/sinhala-corpus-b-random-1m (1M randomly sampled Sinhala sentences). It uses an extended tokenizer (Minuri/sinhala-llama-3.2-1b-tokenizer) that adds 7,843 Sinhala-specific tokens to the base vocabulary. It serves as the random-sampling baseline against which the domain-controlled news corpus (A) and the diversity-optimized corpus (C) are compared.

Training Details

| Parameter | Value |
| --- | --- |
| Base model | meta-llama/Llama-3.2-1B |
| Tokenizer | Minuri/sinhala-llama-3.2-1b-tokenizer |
| Training corpus | Minuri/sinhala-corpus-b-random-1m (1M sentences) |
| Training approach | Two-stage continual pretraining |
| Extended vocab size | 136,099 tokens |
| Token reduction on Sinhala | ~70.4% |
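
The extended vocabulary size is consistent with the figures in this card: the Llama 3.2 base vocabulary has 128,256 entries, and adding 7,843 Sinhala tokens gives 136,099. The sketch below checks that arithmetic and shows one plausible definition of a token-reduction figure: the relative drop in token count when the same text is encoded with the extended tokenizer instead of the base one. The function name `token_reduction` and the example counts are hypothetical; the card does not state the exact formula or the text it was measured on.

```python
# Llama 3.2 base vocabulary size; the added-token count is the figure from this card.
BASE_VOCAB_SIZE = 128_256
ADDED_SINHALA_TOKENS = 7_843

extended_vocab_size = BASE_VOCAB_SIZE + ADDED_SINHALA_TOKENS


def token_reduction(base_token_count: int, extended_token_count: int) -> float:
    """Relative drop in token count (assumed definition, not confirmed by the card)."""
    if base_token_count <= 0:
        raise ValueError("base_token_count must be positive")
    return 1.0 - extended_token_count / base_token_count


# Hypothetical counts for illustration only: the same text encoded as
# 1,000 tokens by the base tokenizer and 296 tokens by the extended one.
example_reduction = token_reduction(1000, 296)

print(extended_vocab_size)
print(f"{example_reduction:.1%}")
```

In practice the two counts would come from encoding the same held-out Sinhala text with each tokenizer and summing the lengths of the resulting token ID lists.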

Evaluation Results

Evaluated on Minuri/sinhala-test-set-50k using token-level perplexity. Baseline perplexity is measured with the base LLaMA 3.2 1B weights and the extended Sinhala tokenizer, so the high baseline reflects the mismatch between the base model weights and the new vocabulary. When the base model is evaluated with its original tokenizer, its perplexity is ~2.

| Model | Corpus | Baseline PPL | Trained PPL | Improvement | Loss |
| --- | --- | --- | --- | --- | --- |
| Model A (corpus-news) | News-only | 66,448.86 | 14.67 | 99.98% | 2.686 |
| Model B (this repo) | Random | 66,454.47 | 10.86 | 99.98% | 2.385 |
| Model C (corpus-diverse) | Diversity-optimized | 66,439.92 | 10.49 | 99.98% | 2.351 ✅ |
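
Token-level perplexity is conventionally the exponential of the mean per-token cross-entropy loss (natural log), which is how the Loss and Trained PPL columns relate. The sketch below recomputes both the perplexity and the relative improvement from the values in the table. It assumes that convention and assumes Improvement is the relative reduction from baseline; the helper names are illustrative, and small differences from the reported figures are expected because the table values are rounded.

```python
import math

# (baseline PPL, trained PPL, loss) copied from the evaluation table.
results = {
    "Model A (news)": (66448.86, 14.67, 2.686),
    "Model B (random)": (66454.47, 10.86, 2.385),
    "Model C (diverse)": (66439.92, 10.49, 2.351),
}


def perplexity_from_loss(mean_nll: float) -> float:
    """Perplexity as exp of the mean negative log-likelihood per token."""
    return math.exp(mean_nll)


def relative_improvement(baseline_ppl: float, trained_ppl: float) -> float:
    """Fraction by which perplexity dropped relative to the baseline."""
    return (baseline_ppl - trained_ppl) / baseline_ppl


for name, (baseline, trained, loss) in results.items():
    recomputed = perplexity_from_loss(loss)
    gain = relative_improvement(baseline, trained)
    print(f"{name}: exp(loss)={recomputed:.2f}, reported={trained}, improvement={gain:.2%}")
```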

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-1b-corpus-random")
model = AutoModelForCausalLM.from_pretrained("Minuri/sinhala-llama-1b-corpus-random")

inputs = tokenizer("ශ්‍රී ලංකාව", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
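
Because the model is a base model that produces continuations, greedy decoding can become repetitive on short prompts. The following sketch wraps the snippet above with sampling enabled. The helper names `sampling_kwargs` and `generate_continuation` are hypothetical, and the parameter values are illustrative defaults that have not been tuned for this model.

```python
def sampling_kwargs(max_new_tokens: int = 50) -> dict:
    """Illustrative sampling settings (assumed values, not tuned for this model)."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,           # sample instead of greedy decoding
        "temperature": 0.7,          # soften the next-token distribution
        "top_p": 0.9,                # nucleus sampling cutoff
        "repetition_penalty": 1.1,   # mildly discourage repeated tokens
    }


def generate_continuation(
    prompt: str,
    repo_id: str = "Minuri/sinhala-llama-1b-corpus-random",
) -> str:
    """Load the model and return a sampled continuation of `prompt`."""
    # Imported lazily so sampling_kwargs() can be used without loading transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, **sampling_kwargs())
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Example call (downloads the model weights on first use):
# print(generate_continuation("ශ්‍රී ලංකාව"))
```

Prompts work best when phrased as the beginning of a text to be continued, not as a question or instruction.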

Intended Uses

  • Sinhala text generation and language modelling
  • Random sampling baseline for corpus diversity ablation studies
  • Low-resource NLP research and benchmarking

Limitations

  • 1B-parameter model; reasoning capability is limited without supervised fine-tuning (SFT)
  • Not instruction-tuned - outputs are continuations, not responses

Related Repositories

| Repo | Description |
| --- | --- |
| Minuri/sinhala-llama-3.2-1b-tokenizer | Extended Sinhala tokenizer |
| Minuri/sinhala-corpus-b-random-1m | Training corpus |
| Minuri/sinhala-test-set-50k | Evaluation test set |
| Minuri/sinhala-corpus-a-news-1m | Corpus A - news-only |
| Minuri/sinhala-corpus-c-diverse-1m | Corpus C - diversity-optimized |
| Minuri/diverse_sinhala_dataset | Full parent corpus |

License

This model is derived from meta-llama/Llama-3.2-1B and is subject to the LLaMA 3.2 Community License.
