# SerendipLLM News Classifier
A Sinhala news article classifier built on top of a fine-tuned 8B LLaMA model. It outperforms SinLlama, the previous state of the art for Sinhala news classification.
## Benchmark Results
| Model | Precision | Recall | F1 |
|---|---|---|---|
| SerendipLLM-news-classifier (ours) | 90.223 | 90.0 | 89.939 |
| SinLlama (baseline) | 89.033 | 86.787 | 86.402 |
**+3.537 F1 points** above the SinLlama baseline.
## Per-Class Results
| Category | Sinhala | Precision | Recall | F1 |
|---|---|---|---|---|
| Business | ව්‍යාපාර | 90.0 | 90.0 | 90.0 |
| Politics | දේශපාලන | 94.3 | 82.5 | 88.0 |
| Entertainment | විනෝදාස්වාදය | 85.0 | 85.0 | 85.0 |
| Sports | ක්‍රීඩා | 87.0 | 100.0 | 93.0 |
| Technology | තාක්ෂණ | 94.9 | 92.5 | 93.7 |
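The per-class numbers above follow the standard definitions of precision, recall, and F1 derived from a confusion matrix. A dependency-free sketch of that arithmetic, using a small hypothetical 2-class confusion matrix (not the model's actual evaluation data):

```python
def per_class_metrics(confusion, cls):
    """Precision, recall, and F1 for one class, given a confusion matrix
    where confusion[i][j] = count of true class i predicted as class j."""
    n = len(confusion)
    tp = confusion[cls][cls]
    fp = sum(confusion[i][cls] for i in range(n)) - tp  # predicted cls, wrong
    fn = sum(confusion[cls]) - tp                       # true cls, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts, just to illustrate the computation:
cm = [[9, 1],
      [2, 8]]
p, r, f1 = per_class_metrics(cm, 0)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.818 0.9 0.857
```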
## Architecture
This model uses a classification head instead of text generation. The 8B LLaMA model acts as a deep Sinhala language encoder, and a small linear layer on top maps the hidden state of the last token directly to one of five categories.
- Base model: Chamaka8/Serendip-LLM-CPT-SFT-v2 (8B LLaMA, Sinhala CPT+SFT)
- LoRA: r=32, alpha=64, all projection layers (q/k/v/o/gate/up/down)
- Classifier head: Linear(4096 β 5) saved as classifier_head.pt
- Training: 15 epochs, balanced classes via oversampling, cosine LR schedule
- Optimizer: AdamW (classifier LR=2e-4, LoRA LR=5e-5)
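The cosine LR schedule mentioned above decays each group's learning rate smoothly from its base value down toward zero over training. A minimal sketch of that decay curve (warmup and the exact per-group handling in the actual training script may differ):

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Cosine decay: base_lr at step 0, ~0 at total_steps."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

# The two base rates from the training config:
CLASSIFIER_LR, LORA_LR = 2e-4, 5e-5

print(cosine_lr(0, 100, CLASSIFIER_LR))    # 2e-4 at the start of training
print(cosine_lr(50, 100, CLASSIFIER_LR))   # half the base rate at the midpoint
print(cosine_lr(100, 100, CLASSIFIER_LR))  # ~0 at the end of training
```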
## Quick Start
```python
import torch
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
from torch import nn
from huggingface_hub import hf_hub_download

BASE = "Chamaka8/Serendip-LLM-CPT-SFT-v2"
LORA = "Chamaka8/SerendipLLM-news-classifier"
LABELS = ["ව්‍යාපාර", "දේශපාලන", "විනෝදාස්වාදය", "ක්‍රීඩා", "තාක්ෂණ"]
DEVICE = "cuda:0"
TOKEN = "your_hf_token"

# Load the 4-bit quantized base model and attach the LoRA adapter
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = PreTrainedTokenizerFast.from_pretrained(BASE, token=TOKEN)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
base_model = AutoModelForCausalLM.from_pretrained(
    BASE, token=TOKEN, quantization_config=bnb, device_map={"": DEVICE})
model = PeftModel.from_pretrained(base_model, LORA, token=TOKEN)
model.eval()

# Load the separate linear classification head (4096 -> 5)
head_path = hf_hub_download(repo_id=LORA, filename="classifier_head.pt", token=TOKEN)
classifier = nn.Linear(4096, 5).to(DEVICE).float()
classifier.load_state_dict(
    {k: v.to(DEVICE) for k, v in torch.load(head_path, map_location="cpu").items()})
classifier.eval()

def classify(text):
    enc = tokenizer(f"ප්‍රවෘත්ති: {text[:400]}", return_tensors="pt",
                    truncation=True, max_length=256, padding="max_length").to(DEVICE)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    hidden = out.hidden_states[-1]
    # Index of the last non-padding token in each sequence
    seq_len = enc["attention_mask"].sum(dim=1) - 1
    last = hidden[torch.arange(hidden.size(0)), seq_len].float()
    return LABELS[classifier(last).argmax(dim=-1).item()]

print(classify("ශ්‍රී ලංකා ක්‍රිකට් කණ්ඩායම අද ජය ගත්තේය."))
# Output: ක්‍රීඩා
```
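`classify` returns only the argmax label. If you also want a confidence score, apply a softmax to the classifier logits (with the model above, `torch.softmax(classifier(last), dim=-1)`). A dependency-free sketch of the idea, using hypothetical logits rather than real model output:

```python
import math

LABELS = ["ව්‍යාපාර", "දේශපාලන", "විනෝදාස්වාදය", "ක්‍රීඩා", "තාක්ෂණ"]

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from the classifier head for one article:
logits = [0.2, -1.1, 0.4, 3.8, 0.1]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(LABELS[best], round(probs[best], 3))  # highest-probability label + score
```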
## Repository Structure
```
SerendipLLM-news-classifier/
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # LoRA weights (336MB)
├── classifier_head.pt         # Classification head weights (83KB)
├── tokenizer.json             # Tokenizer
├── tokenizer_config.json
├── special_tokens_map.json
└── scripts/
    ├── train_classifier.py    # Full training script
    └── eval_real_news.py      # Evaluation with 10 real news examples
```
## Training Details
| Parameter | Value |
|---|---|
| Base model | Chamaka8/Serendip-LLM-CPT-SFT-v2 |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Epochs | 15 |
| Batch size | 8 |
| Classifier LR | 2e-4 |
| LoRA LR | 5e-5 |
| Max sequence length | 256 |
| Training samples | 4885 (balanced) |
| Eval samples | 200 (40 per class) |
| GPU | NVIDIA RTX A4500 (20GB) |
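The 4885 training samples were balanced by oversampling, i.e. duplicating examples of smaller classes until every class matches the largest one. A minimal sketch of that balancing step (the names and the tiny dataset here are illustrative, not from the actual training script):

```python
import random

def oversample_balance(examples, seed=0):
    """Balance classes by randomly duplicating minority-class examples
    until every class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced

# Toy imbalanced dataset: 5 sports articles, 2 tech articles
data = [("a", "sports")] * 5 + [("b", "tech")] * 2
balanced = oversample_balance(data)
print(len(balanced))  # 10 (5 per class)
```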
## Dataset
Trained on the NLPC-UOM/Sinhala-News-Category-classification dataset.
## Related Models
- Chamaka8/Serendip-LLM-CPT-SFT-v2: base model
- Chamaka8/SerendipLLM-news-final: text-generation v1 (F1=63.7)
- Chamaka8/SerendipLLM-news-final-v2: text-generation v2 (F1=58.6)
## Citation
```bibtex
@misc{serendipllm-news-classifier-2025,
  author = {Chamaka Amarasinghe},
  title = {SerendipLLM News Classifier: Sinhala News Classification with Classification Head},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Chamaka8/SerendipLLM-news-classifier}
}
```