SerendipLLM News Classifier 🏆

A Sinhala news article classifier built on a fine-tuned 8B LLaMA model. It outperforms SinLlama, the previous state of the art for Sinhala news classification, by 3.537 F1 points.

📊 Benchmark Results

| Model | Precision | Recall | F1 |
|---|---|---|---|
| SerendipLLM-news-classifier (ours) | 90.223 | 90.0 | 89.939 |
| SinLlama (baseline) | 89.033 | 86.787 | 86.402 |

+3.537 F1 points above SinLlama

📰 Per-Class Results

| Category | Sinhala | Precision | Recall | F1 |
|---|---|---|---|---|
| Business | ව්‍යාපාර | 90.0 | 90.0 | 90.0 |
| Politics | දේශපාලන | 94.3 | 82.5 | 88.0 |
| Entertainment | විනෝදාස්වාදය | 85.0 | 85.0 | 85.0 |
| Sports | ක්‍රීඩා | 87.0 | 100.0 | 93.0 |
| Technology | තාක්ෂණ | 94.9 | 92.5 | 93.7 |
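The overall precision, recall, and F1 in the benchmark results look like macro averages (unweighted means) of the per-class scores. The check below is a small sketch that re-types the rounded per-class values and recomputes them. Small differences from the reported 90.223 / 90.0 / 89.939 are expected, because the per-class numbers are rounded to one decimal. The helper name `macro_average` is illustrative only.

```python
# Per-class (precision, recall, F1), copied from the per-class results.
per_class = {
    "Business":      (90.0, 90.0, 90.0),
    "Politics":      (94.3, 82.5, 88.0),
    "Entertainment": (85.0, 85.0, 85.0),
    "Sports":        (87.0, 100.0, 93.0),
    "Technology":    (94.9, 92.5, 93.7),
}

def macro_average(scores):
    """Unweighted mean of each metric across classes."""
    n = len(scores)
    columns = zip(*scores.values())  # precisions, recalls, F1 scores
    return tuple(sum(col) / n for col in columns)

macro_p, macro_r, macro_f1 = macro_average(per_class)
print(f"macro P={macro_p:.2f} R={macro_r:.2f} F1={macro_f1:.2f}")

# Gap to the SinLlama baseline, using the reported overall F1 scores.
f1_gain = 89.939 - 86.402
print(f"F1 gain over SinLlama: {f1_gain:.3f}")
```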

🏗️ Architecture

This model uses a classification head rather than text generation. The 8B LLaMA model acts as a deep Sinhala language encoder, and a small linear layer on top maps the hidden state of the last non-padding token directly to one of 5 categories.
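To make the pooling step concrete, here is a minimal sketch of last-token pooling followed by a linear head. It assumes right-padded batches and uses random tensors in place of the LLaMA hidden states. The function name `last_token_pool` and the toy hidden size of 16 (instead of 4096) are assumptions for illustration, not part of the released code.

```python
import torch
from torch import nn

def last_token_pool(hidden, attention_mask):
    """Pick the hidden state of the last non-padding token in each row.

    Assumes right padding, so the last real token sits at index
    (number of real tokens - 1).
    """
    last_idx = attention_mask.sum(dim=1) - 1                   # (batch,)
    rows = torch.arange(hidden.size(0), device=hidden.device)  # (batch,)
    return hidden[rows, last_idx]                              # (batch, hidden_size)

# Toy stand-ins: the real model has hidden size 4096; 16 keeps the demo light.
torch.manual_seed(0)
batch, seq_len, hidden_size, num_classes = 2, 6, 16, 5
hidden = torch.randn(batch, seq_len, hidden_size)
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1, 1]])

head = nn.Linear(hidden_size, num_classes)   # plays the role of Linear(4096 -> 5)
pooled = last_token_pool(hidden, attention_mask)
logits = head(pooled)
predicted = logits.argmax(dim=-1)
print(pooled.shape, logits.shape, predicted.tolist())
```

Building the row index from `hidden.size(0)` on the same device as `hidden` keeps the function correct for any batch size.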

  • Base model: Chamaka8/Serendip-LLM-CPT-SFT-v2 (8B LLaMA, Sinhala CPT+SFT)
  • LoRA: r=32, alpha=64, all projection layers (q/k/v/o/gate/up/down)
  • Classifier head: Linear(4096 → 5) saved as classifier_head.pt
  • Training: 15 epochs, balanced classes via oversampling, cosine LR schedule
  • Optimizer: AdamW — classifier LR=2e-4, LoRA LR=5e-5
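The two learning rates can be expressed as AdamW parameter groups with a cosine schedule on top. The sketch below uses tiny linear layers as stand-ins for the LoRA weights and the classifier head. `CosineAnnealingLR` is one way to get cosine decay and is an assumption here: the actual training script may use a different scheduler (for example one with warmup), and `steps_per_epoch` is a made-up value.

```python
import torch
from torch import nn

# Tiny stand-ins for the two trainable parts (assumptions for illustration):
# in the real setup these would be the LoRA adapter weights and the 4096 -> 5 head.
lora_params = nn.Linear(16, 16)
classifier = nn.Linear(16, 5)

optimizer = torch.optim.AdamW([
    {"params": lora_params.parameters(), "lr": 5e-5},  # LoRA LR
    {"params": classifier.parameters(), "lr": 2e-4},   # classifier LR
])

# Cosine decay over the whole run; steps_per_epoch is hypothetical.
epochs, steps_per_epoch = 15, 10
total_steps = epochs * steps_per_epoch
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
initial_lrs = [group["lr"] for group in optimizer.param_groups]

# Random toy batch of 8 examples, matching the reported batch size.
x = torch.randn(8, 16)
y = torch.randint(0, 5, (8,))
loss_fn = nn.CrossEntropyLoss()

for step in range(total_steps):
    optimizer.zero_grad()
    loss = loss_fn(classifier(lora_params(x)), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # each group decays from its own starting LR

final_lrs = [group["lr"] for group in optimizer.param_groups]
print(initial_lrs, final_lrs)
```

Each parameter group keeps its own base learning rate, so the head starts at a higher rate than the adapter while both follow the same cosine curve.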

🚀 Quick Start

import torch
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
from torch import nn
from huggingface_hub import hf_hub_download

BASE   = "Chamaka8/Serendip-LLM-CPT-SFT-v2"
LORA   = "Chamaka8/SerendipLLM-news-classifier"
LABELS = ["ව්‍යාපාර", "දේශපාලන", "විනෝදාස්වාදය", "ක්‍රීඩා", "තාක්ෂණ"]
DEVICE = "cuda:0"
TOKEN  = "your_hf_token"

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer  = PreTrainedTokenizerFast.from_pretrained(BASE, token=TOKEN)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
base_model = AutoModelForCausalLM.from_pretrained(BASE, token=TOKEN,
               quantization_config=bnb, device_map={"":DEVICE})
model      = PeftModel.from_pretrained(base_model, LORA, token=TOKEN)
model.eval()

head_path  = hf_hub_download(repo_id=LORA, filename="classifier_head.pt", token=TOKEN)
classifier = nn.Linear(4096, 5).to(DEVICE).float()
classifier.load_state_dict({k: v.to(DEVICE) for k,v in torch.load(head_path, map_location="cpu").items()})
classifier.eval()

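def classify(text):
    # Prefix with "ප්‍රවෘත්ති:" ("News:") and cap the input at 400 characters.
    enc = tokenizer(f"ප්‍රවෘත්ති: {text[:400]}", return_tensors="pt",
                   truncation=True, max_length=256, padding="max_length").to(DEVICE)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    hidden   = out.hidden_states[-1]                     # (batch, seq_len, 4096)
    # Right padding: the last real token sits at (number of real tokens - 1).
    last_idx = enc["attention_mask"].sum(dim=1) - 1
    # Build the row index on the same device as the hidden states.
    rows     = torch.arange(hidden.size(0), device=hidden.device)
    last     = hidden[rows, last_idx].float().to(DEVICE)
    return LABELS[classifier(last).argmax(dim=-1).item()]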

print(classify("ශ්‍රී ලංකා ක්‍රිකට් කණ්ඩායම අද ජය ගත්තේය."))
# Output: ක්‍රීඩා
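`classify` returns only the top label. If a confidence score is also useful, the head's logits can be passed through a softmax. The helper below is a hypothetical addition (the name `top_label_with_confidence` and the example logits are made up for illustration); in practice it would receive `classifier(last)[0]` from the Quick Start code.

```python
import torch

LABELS = ["ව්‍යාපාර", "දේශපාලන", "විනෝදාස්වාදය", "ක්‍රීඩා", "තාක්ෂණ"]

def top_label_with_confidence(logits):
    """Return (label, probability) for a single example's logits."""
    probs = torch.softmax(logits.float(), dim=-1)
    idx = int(probs.argmax(dim=-1))
    return LABELS[idx], float(probs[idx])

# Made-up logits for illustration only.
example_logits = torch.tensor([0.1, -0.3, 0.2, 3.0, 0.4])
label, confidence = top_label_with_confidence(example_logits)
print(label, round(confidence, 3))
```

A softmax probability from a fine-tuned head is not necessarily well calibrated, so treat it as a rough ranking signal rather than a true likelihood.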

📁 Repository Structure

SerendipLLM-news-classifier/
├── adapter_config.json          # LoRA configuration
├── adapter_model.safetensors    # LoRA weights (336MB)
├── classifier_head.pt           # Classification head weights (83KB)
├── tokenizer.json               # Tokenizer
├── tokenizer_config.json
├── special_tokens_map.json
└── scripts/
    ├── train_classifier.py      # Full training script
    └── eval_real_news.py        # Evaluation with 10 real news examples

📋 Training Details

| Parameter | Value |
|---|---|
| Base model | Chamaka8/Serendip-LLM-CPT-SFT-v2 |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Epochs | 15 |
| Batch size | 8 |
| Classifier LR | 2e-4 |
| LoRA LR | 5e-5 |
| Max sequence length | 256 |
| Training samples | 4885 (balanced) |
| Eval samples | 200 (40 per class) |
| GPU | NVIDIA RTX A4500 (20GB) |
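The balanced training set comes from oversampling. One common way to do this, sketched below under the assumption of random duplication with replacement, is to top up every minority class to the size of the largest one. The function name `oversample_to_balance` and the toy data are hypothetical; the actual logic lives in `scripts/train_classifier.py` and may differ.

```python
import random
from collections import Counter, defaultdict

def oversample_to_balance(examples, seed=0):
    """Duplicate random minority-class examples until every class
    has as many examples as the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend(items)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: (text, label) pairs with an uneven class mix.
toy = (
    [(f"sports-{i}", "Sports") for i in range(5)]
    + [(f"politics-{i}", "Politics") for i in range(3)]
    + [(f"tech-{i}", "Technology") for i in range(1)]
)

balanced = oversample_to_balance(toy)
counts = Counter(label for _, label in balanced)
print(counts)
```

Oversampling should be applied to the training split only, so that duplicated articles never appear in the evaluation set.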

📚 Dataset

Trained on NLPC-UOM/Sinhala-News-Category-classification

📜 Citation

@misc{serendipllm-news-classifier-2025,
  author    = {Chamaka Amarasinghe},
  title     = {SerendipLLM News Classifier: Sinhala News Classification with Classification Head},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Chamaka8/SerendipLLM-news-classifier}
}