SerendipLLM News Classifier 🏆

A Sinhala news article classifier built on a fine-tuned 8B LLaMA model. It outperforms SinLlama, the previous state of the art for Sinhala news classification, by about 3.5 F1 points.

📊 Benchmark Results

Model	Precision (%)	Recall (%)	F1 (%)
SerendipLLM-news-classifier (ours)	90.223	90.0	89.939
SinLlama (baseline)	89.033	86.787	86.402

✅ +3.537 F1 points above SinLlama

📰 Per-Class Results

Category	Sinhala	Precision (%)	Recall (%)	F1 (%)
Business	ව්‍යාපාර	90.0	90.0	90.0
Politics	දේශපාලන	94.3	82.5	88.0
Entertainment	විනෝදාස්වාදය	85.0	85.0	85.0
Sports	ක්‍රීඩා	87.0	100.0	93.0
Technology	තාක්ෂණ	94.9	92.5	93.7
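
The per-class F1 values follow from precision and recall via the harmonic mean, and the headline scores look close to the unweighted (macro) average of the per-class rows. That averaging scheme is an assumption inferred from the numbers, not something this card states, and small gaps are expected because the per-class values are rounded to one decimal. A minimal sketch of the arithmetic:

```python
# Per-class (precision, recall) copied from the Per-Class Results table, in percent.
PER_CLASS = {
    "Business": (90.0, 90.0),
    "Politics": (94.3, 82.5),
    "Entertainment": (85.0, 85.0),
    "Sports": (87.0, 100.0),
    "Technology": (94.9, 92.5),
}

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def macro(values):
    """Unweighted mean across classes."""
    values = list(values)
    return sum(values) / len(values)

per_class_f1 = {name: f1(p, r) for name, (p, r) in PER_CLASS.items()}
macro_f1 = macro(per_class_f1.values())
macro_recall = macro(r for _, r in PER_CLASS.values())

for name, score in per_class_f1.items():
    print(f"{name}: F1 = {score:.1f}")
print(f"macro F1 = {macro_f1:.2f}, macro recall = {macro_recall:.2f}")
```

With the rounded table values, the recomputed macro F1 lands within roughly 0.01 of the reported 89.939, and the macro recall matches the reported 90.0.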

🏗️ Architecture

This model uses a classification head instead of text generation. The 8B LLaMA model acts as a deep Sinhala-language encoder, and a single linear layer on top maps the hidden state of the last non-padding token directly to one of 5 categories.
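
The pooling step described above can be illustrated on toy tensors. This is a sketch, not the training code: the tensor sizes are shrunk, random values stand in for the LLaMA hidden states, and `last_token_pool` is a hypothetical helper name. It assumes right padding, so the last real token of each row sits at index `attention_mask.sum() - 1`.

```python
import torch
from torch import nn

def last_token_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Pick the hidden state of the last non-padding token in each row.

    hidden:         (batch, seq_len, hidden_size)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for right padding
    """
    last_idx = attention_mask.sum(dim=1) - 1                    # (batch,)
    rows = torch.arange(hidden.size(0), device=hidden.device)   # (batch,)
    return hidden[rows, last_idx]                               # (batch, hidden_size)

torch.manual_seed(0)
hidden_size, num_classes = 16, 5           # the real model maps 4096 -> 5
hidden = torch.randn(2, 6, hidden_size)    # stand-in for the last decoder layer
mask = torch.tensor([[1, 1, 1, 1, 0, 0],   # 4 real tokens, 2 padding
                     [1, 1, 1, 1, 1, 1]])  # 6 real tokens

head = nn.Linear(hidden_size, num_classes)  # toy classification head
pooled = last_token_pool(hidden, mask)
logits = head(pooled)                       # (batch, num_classes)
print(logits.argmax(dim=-1))                # predicted class index per row
```

Because LLaMA is a causal model, the last real token is the only position that has attended to the whole input, which is why it is a natural choice for pooling.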

Base model: Chamaka8/Serendip-LLM-CPT-SFT-v2 (8B LLaMA, Sinhala CPT+SFT)
LoRA: r=32, alpha=64, all projection layers (q/k/v/o/gate/up/down)
Classifier head: Linear(4096 → 5) saved as classifier_head.pt
Training: 15 epochs, balanced classes via oversampling, cosine LR schedule
Optimizer: AdamW — classifier LR=2e-4, LoRA LR=5e-5
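
The two learning rates map naturally onto AdamW parameter groups that share one cosine schedule. The sketch below is an assumption about how such a setup could look, not the contents of `train_classifier.py`: the small `nn.Linear` modules stand in for the LoRA weights and the head, the step count is illustrative, and PyTorch's built-in `CosineAnnealingLR` is used without warmup because the card does not say whether warmup was applied.

```python
import torch
from torch import nn

# Stand-ins: in the real setup these would be the LoRA adapter weights
# and the Linear(4096 -> 5) classification head.
lora_params = nn.Linear(8, 8)
classifier = nn.Linear(8, 5)

optimizer = torch.optim.AdamW([
    {"params": lora_params.parameters(), "lr": 5e-5},  # LoRA LR from the card
    {"params": classifier.parameters(), "lr": 2e-4},   # classifier LR from the card
])

total_steps = 100  # illustrative; in practice epochs * batches per epoch
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

lrs = []
for step in range(total_steps):
    x = torch.randn(4, 8)
    loss = classifier(lora_params(x)).pow(2).mean()  # dummy loss for the sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    # Record the current learning rate of each parameter group.
    lrs.append([group["lr"] for group in optimizer.param_groups])

print(lrs[0], lrs[-1])
```

Both groups decay along the same cosine curve, so the head keeps a learning rate four times that of the adapter throughout training.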

🚀 Quick Start

import torch
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
from torch import nn
from huggingface_hub import hf_hub_download

BASE   = "Chamaka8/Serendip-LLM-CPT-SFT-v2"
LORA   = "Chamaka8/SerendipLLM-news-classifier"
LABELS = ["ව්‍යාපාර", "දේශපාලන", "විනෝදාස්වාදය", "ක්‍රීඩා", "තාක්ෂණ"]
DEVICE = "cuda:0"
TOKEN  = "your_hf_token"

# Load the base model in 4-bit precision to reduce GPU memory use.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer  = PreTrainedTokenizerFast.from_pretrained(BASE, token=TOKEN)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # classify() finds the last real token assuming right padding
base_model = AutoModelForCausalLM.from_pretrained(BASE, token=TOKEN,
               quantization_config=bnb, device_map={"": DEVICE})
# Attach the LoRA adapter on top of the quantized base model.
model      = PeftModel.from_pretrained(base_model, LORA, token=TOKEN)
model.eval()

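# Download and load the linear classification head (4096 hidden units -> 5 categories).
head_path  = hf_hub_download(repo_id=LORA, filename="classifier_head.pt", token=TOKEN)
classifier = nn.Linear(4096, 5).to(DEVICE).float()
# load_state_dict copies the CPU tensors into the parameters already on DEVICE.
classifier.load_state_dict(torch.load(head_path, map_location="cpu"))
classifier.eval()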

def classify(text):
    """Return the predicted Sinhala category label for a single news text."""
    enc = tokenizer(f"ප්‍රවෘත්ති: {text[:400]}", return_tensors="pt",
                   truncation=True, max_length=256, padding="max_length").to(DEVICE)
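    with torch.no_grad():
        out      = model(**enc, output_hidden_states=True)
        hidden   = out.hidden_states[-1]                       # (1, seq_len, 4096)
        last_idx = enc["attention_mask"].sum(dim=1) - 1        # index of the last non-padding token
        rows     = torch.arange(hidden.size(0), device=hidden.device)
        last     = hidden[rows, last_idx].float()              # (1, 4096)
        logits   = classifier(last)                            # (1, 5)
    return LABELS[logits.argmax(dim=-1).item()]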

print(classify("ශ්‍රී ලංකා ක්‍රිකට් කණ්ඩායම අද ජය ගත්තේය."))
# Output: ක්‍රීඩා
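
`classify` returns only the top label. When a confidence estimate is useful, the head's logits can be passed through a softmax instead of `argmax`. The sketch below uses made-up logits in place of the head's output and English category names in the same order as `LABELS`; `rank_labels` is a hypothetical helper, and softmax scores from a fine-tuned head are not guaranteed to be well calibrated.

```python
import torch

# English names in the same order as LABELS in the Quick Start.
CATEGORIES = ["Business", "Politics", "Entertainment", "Sports", "Technology"]

def rank_labels(logits: torch.Tensor, labels=CATEGORIES):
    """Return (label, probability) pairs sorted from most to least likely.

    logits: tensor of shape (1, num_labels) or (num_labels,)
    """
    probs = torch.softmax(logits.float().flatten(), dim=-1)
    order = torch.argsort(probs, descending=True)
    return [(labels[i], probs[i].item()) for i in order.tolist()]

# Made-up logits standing in for the classification head's output.
example_logits = torch.tensor([[0.1, -1.2, 0.3, 4.0, 0.5]])
for label, prob in rank_labels(example_logits):
    print(f"{label}: {prob:.3f}")
```

Passing `labels=LABELS` from the Quick Start would return the Sinhala category names instead.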

📁 Repository Structure

SerendipLLM-news-classifier/
├── adapter_config.json          # LoRA configuration
├── adapter_model.safetensors    # LoRA weights (336MB)
├── classifier_head.pt           # Classification head weights (83KB)
├── tokenizer.json               # Tokenizer
├── tokenizer_config.json
├── special_tokens_map.json
└── scripts/
    ├── train_classifier.py      # Full training script
    └── eval_real_news.py        # Evaluation with 10 real news examples

📋 Training Details

Parameter	Value
Base model	Chamaka8/Serendip-LLM-CPT-SFT-v2
LoRA rank	32
LoRA alpha	64
Epochs	15
Batch size	8
Classifier LR	2e-4
LoRA LR	5e-5
Max sequence length	256
Training samples	4885 (balanced)
Eval samples	200 (40 per class)
GPU	NVIDIA RTX A4500 (20GB)
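
The training set is balanced by oversampling; 4885 samples across 5 classes works out to 977 per class. One common way to balance is to resample each smaller class with replacement until it matches the largest class. The sketch below shows that approach on toy data. It illustrates the general technique under that assumption, not the logic of `train_classifier.py`, and `oversample` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def oversample(samples, seed=0):
    """Balance (text, label) pairs by resampling smaller classes with replacement."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)  # keep every original sample
        # Pad the class up to the target size with randomly chosen duplicates.
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced

# Toy, imbalanced data: 5 sports, 3 politics, 1 business article.
toy = ([(f"sports {i}", "sports") for i in range(5)]
       + [(f"politics {i}", "politics") for i in range(3)]
       + [("business 0", "business")])

balanced = oversample(toy)
print(Counter(label for _, label in balanced))
```

Oversampling should be applied to the training split only, so that duplicated articles never leak into the evaluation set.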

📚 Dataset

Trained on NLPC-UOM/Sinhala-News-Category-classification

🔗 Related Models

Chamaka8/Serendip-LLM-CPT-SFT-v2 — Base model
Chamaka8/SerendipLLM-news-final — Text generation v1 (F1=63.7)
Chamaka8/SerendipLLM-news-final-v2 — Text generation v2 (F1=58.6)

📜 Citation

@misc{serendipllm-news-classifier-2025,
  author    = {Chamaka Amarasinghe},
  title     = {SerendipLLM News Classifier: Sinhala News Classification with Classification Head},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Chamaka8/SerendipLLM-news-classifier}
}
