How to use Chamaka8/SerendipLLM-news-classifier with PEFT:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("Chamaka8/Serendip-LLM-CPT-SFT-v2")
model = PeftModel.from_pretrained(base_model, "Chamaka8/SerendipLLM-news-classifier")
```

A Sinhala news article classifier built on a fine-tuned 8B LLaMA model. It outperforms SinLlama, the previous state of the art for Sinhala news classification.
| Model | Precision | Recall | F1 |
|---|---|---|---|
| SerendipLLM-news-classifier (ours) | 90.223 | 90.0 | 89.939 |
| SinLlama (baseline) | 89.033 | 86.787 | 86.402 |
✅ +3.537 F1 points above SinLlama
| Category | Sinhala | Precision | Recall | F1 |
|---|---|---|---|---|
| Business | ව්යාපාර | 90.0 | 90.0 | 90.0 |
| Politics | දේශපාලන | 94.3 | 82.5 | 88.0 |
| Entertainment | විනෝදාස්වාදය | 85.0 | 85.0 | 85.0 |
| Sports | ක්රීඩා | 87.0 | 100.0 | 93.0 |
| Technology | තාක්ෂණ | 94.9 | 92.5 | 93.7 |
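As a quick consistency check, the per-class rows can be averaged and compared with the overall scores. The sketch below assumes the overall figures are unweighted (macro) averages over the five classes, which the source does not state. The numbers are copied from the table above, and the helper names `f1` and `macro` are illustrative only.

```python
# Per-class (precision, recall, F1) in percent, copied from the table above.
per_class = {
    "Business": (90.0, 90.0, 90.0),
    "Politics": (94.3, 82.5, 88.0),
    "Entertainment": (85.0, 85.0, 85.0),
    "Sports": (87.0, 100.0, 93.0),
    "Technology": (94.9, 92.5, 93.7),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro(index):
    """Unweighted mean of one metric column across all classes."""
    return sum(row[index] for row in per_class.values()) / len(per_class)


# Recompute each class's F1 from its precision and recall.
for name, (p, r, reported_f1) in per_class.items():
    print(f"{name:<14} F1 from P/R = {f1(p, r):.1f} (reported {reported_f1})")

# Macro averages, to compare with the overall row of the results table.
print(f"macro precision = {macro(0):.2f}")
print(f"macro recall    = {macro(1):.2f}")
print(f"macro F1        = {macro(2):.2f}")
```

Under that assumption the macro averages come out at about 90.24, 90.00 and 89.94, which agrees with the reported 90.223, 90.0 and 89.939 to within the rounding of the per-class values.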
This model uses a classification head approach instead of text generation. The 8B LLaMA model acts as a deep Sinhala language encoder, and a small linear layer on top maps the last token hidden state directly to one of 5 categories.
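To make "last token hidden state" concrete: with right padding, the last real token of each sequence sits at index `sum(attention_mask) - 1`. The following is a minimal pure-Python sketch of that indexing on made-up masks and toy hidden vectors. The helper name `last_token_states` is hypothetical and is not part of the released code.

```python
def last_token_states(hidden, attention_mask):
    """Pick, per sequence, the hidden vector of the last non-padding token.

    hidden:         [batch][seq_len][dim] nested lists (toy stand-in for a tensor)
    attention_mask: [batch][seq_len], 1 for real tokens and 0 for right padding
    """
    selected = []
    for states, mask in zip(hidden, attention_mask):
        last_index = sum(mask) - 1  # valid only when padding is on the right
        selected.append(states[last_index])
    return selected


# Toy batch: 2 sequences, 4 positions, hidden size 2 (values are illustrative).
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]],  # 3 real tokens + 1 pad
    [[1.0, 1.1], [1.2, 1.3], [0.0, 0.0], [0.0, 0.0]],  # 2 real tokens + 2 pads
]
attention_mask = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]

pooled = last_token_states(hidden, attention_mask)
print(pooled)  # [[0.5, 0.6], [1.2, 1.3]]
```

With left padding the same formula would point at the wrong position, which is one reason to fix the padding side when pooling this way. The selected vector is then what the linear layer maps to the 5 category scores.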
```python
import torch
from torch import nn
from huggingface_hub import hf_hub_download
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, PreTrainedTokenizerFast

BASE = "Chamaka8/Serendip-LLM-CPT-SFT-v2"
LORA = "Chamaka8/SerendipLLM-news-classifier"
LABELS = ["ව්යාපාර", "දේශපාලන", "විනෝදාස්වාදය", "ක්රීඩා", "තාක්ෂණ"]
DEVICE = "cuda:0"
TOKEN = "your_hf_token"

# Load the base model in 4-bit and attach the LoRA adapter.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = PreTrainedTokenizerFast.from_pretrained(BASE, token=TOKEN)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # the last-token indexing below relies on right padding
base_model = AutoModelForCausalLM.from_pretrained(
    BASE, token=TOKEN, quantization_config=bnb, device_map={"": DEVICE}
)
model = PeftModel.from_pretrained(base_model, LORA, token=TOKEN)
model.eval()

# Load the linear classification head (hidden size 4096 -> 5 categories).
head_path = hf_hub_download(repo_id=LORA, filename="classifier_head.pt", token=TOKEN)
classifier = nn.Linear(4096, 5).to(DEVICE).float()
classifier.load_state_dict({k: v.to(DEVICE) for k, v in torch.load(head_path, map_location="cpu").items()})
classifier.eval()


def classify(text):
    # The input is cut to 400 characters, then padded or truncated to 256 tokens.
    enc = tokenizer(f"ප්රවෘත්ති: {text[:400]}", return_tensors="pt",
                    truncation=True, max_length=256, padding="max_length").to(DEVICE)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
        hidden = out.hidden_states[-1]
        # Index of the last non-padding token in each sequence.
        last_idx = enc["attention_mask"].sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last = hidden[batch_idx, last_idx].float().to(DEVICE)
        return LABELS[classifier(last).argmax(dim=-1).item()]


print(classify("ශ්රී ලංකා ක්රිකට් කණ්ඩායම අද ජය ගත්තේය."))
# Output: ක්රීඩා
```
```
SerendipLLM-news-classifier/
├── adapter_config.json          # LoRA configuration
├── adapter_model.safetensors    # LoRA weights (336MB)
├── classifier_head.pt           # Classification head weights (83KB)
├── tokenizer.json               # Tokenizer
├── tokenizer_config.json
├── special_tokens_map.json
└── scripts/
    ├── train_classifier.py      # Full training script
    └── eval_real_news.py        # Evaluation with 10 real news examples
```
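The size of `classifier_head.pt` is consistent with a single linear layer from 4096 hidden units to 5 classes. Here is a back-of-the-envelope check, assuming the head has a bias term and is stored as 32-bit floats. Under those assumptions, the remaining bytes in the file would be serialization overhead.

```python
HIDDEN_SIZE = 4096      # input width of the head, as in nn.Linear(4096, 5)
NUM_CLASSES = 5         # one output per news category
BYTES_PER_FLOAT32 = 4   # assumption: weights are saved as float32

weight_params = HIDDEN_SIZE * NUM_CLASSES
bias_params = NUM_CLASSES  # assumption: the layer keeps its bias
total_params = weight_params + bias_params
raw_bytes = total_params * BYTES_PER_FLOAT32

print(f"parameters: {total_params}")
print(f"raw tensor data: {raw_bytes} bytes (~{raw_bytes / 1000:.1f} kB)")
```

This gives 20,485 parameters and about 81.9 kB of raw tensor data, close to the listed 83KB.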
| Parameter | Value |
|---|---|
| Base model | Chamaka8/Serendip-LLM-CPT-SFT-v2 |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Epochs | 15 |
| Batch size | 8 |
| Classifier LR | 2e-4 |
| LoRA LR | 5e-5 |
| Max sequence length | 256 |
| Training samples | 4885 (balanced) |
| Eval samples | 200 (40 per class) |
| GPU | NVIDIA RTX A4500 (20GB) |
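Two quantities follow from the table by simple arithmetic. The sketch assumes the standard LoRA convention in which the low-rank update is scaled by alpha / rank. It also assumes no gradient accumulation and that the final partial batch is kept. The source states none of these explicitly.

```python
import math

LORA_RANK = 32
LORA_ALPHA = 64
TRAIN_SAMPLES = 4885
BATCH_SIZE = 8
EPOCHS = 15

# Standard LoRA scaling factor applied to the low-rank update (assumption).
lora_scaling = LORA_ALPHA / LORA_RANK

# Optimizer steps, assuming one step per batch and a kept final partial batch.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(f"LoRA scaling (alpha / rank): {lora_scaling}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```

Under these assumptions the adapter update is scaled by 2.0 and training runs for 611 steps per epoch, or 9,165 steps in total.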
Trained on NLPC-UOM/Sinhala-News-Category-classification
```bibtex
@misc{serendipllm-news-classifier-2025,
  author    = {Chamaka Amarasinghe},
  title     = {SerendipLLM News Classifier: Sinhala News Classification with Classification Head},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Chamaka8/SerendipLLM-news-classifier}
}
```
Base model: Chamaka8/serendib-llm-cpt-llama3-8b