SerendipLLM News Classifier 🏆

A Sinhala news article classifier built on top of a fine-tuned 8B LLaMA model that beats SinLlama, the previous state-of-the-art for Sinhala news classification.

📊 Benchmark Results

| Model | Precision | Recall | F1 |
|---|---:|---:|---:|
| SerendipLLM-news-classifier (ours) | 90.223 | 90.0 | 89.939 |
| SinLlama (baseline) | 89.033 | 86.787 | 86.402 |

✅ +3.537 F1 points above SinLlama

📰 Per-Class Results

| Category | Sinhala | Precision | Recall | F1 |
|---|---|---:|---:|---:|
| Business | ව්‍යාපාර | 90.0 | 90.0 | 90.0 |
| Politics | දේශපාලන | 94.3 | 82.5 | 88.0 |
| Entertainment | විනෝදාස්වාදය | 85.0 | 85.0 | 85.0 |
| Sports | ක්‍රීඩා | 87.0 | 100.0 | 93.0 |
| Technology | තාක්ෂණ | 94.9 | 92.5 | 93.7 |
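The headline scores appear to be macro-averages of the per-class results. A quick check (numbers taken from the per-class table above) reproduces them:

```python
# Per-class scores from the table above, in order:
# Business, Politics, Entertainment, Sports, Technology
f1s     = [90.0, 88.0, 85.0, 93.0, 93.7]
recalls = [90.0, 82.5, 85.0, 100.0, 92.5]

macro_f1     = sum(f1s) / len(f1s)          # ~89.94, matches the headline 89.939
macro_recall = sum(recalls) / len(recalls)  # 90.0, matches the headline recall
print(macro_f1, macro_recall)
```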

πŸ—οΈ Architecture

This model uses a classification head approach instead of text generation. The 8B LLaMA model acts as a deep Sinhala language encoder, and a small linear layer on top maps the last token hidden state directly to one of 5 categories.

  • Base model: Chamaka8/Serendip-LLM-CPT-SFT-v2 (8B LLaMA, Sinhala CPT+SFT)
  • LoRA: r=32, alpha=64, all projection layers (q/k/v/o/gate/up/down)
  • Classifier head: Linear(4096 β†’ 5) saved as classifier_head.pt
  • Training: 15 epochs, balanced classes via oversampling, cosine LR schedule
  • Optimizer: AdamW β€” classifier LR=2e-4, LoRA LR=5e-5

🚀 Quick Start

```python
import torch
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
from torch import nn
from huggingface_hub import hf_hub_download

BASE   = "Chamaka8/Serendip-LLM-CPT-SFT-v2"
LORA   = "Chamaka8/SerendipLLM-news-classifier"
# Business, Politics, Entertainment, Sports, Technology
LABELS = ["ව්‍යාපාර", "දේශපාලන", "විනෝදාස්වාදය", "ක්‍රීඩා", "තාක්ෂණ"]
DEVICE = "cuda:0"
TOKEN  = "your_hf_token"

# Load the 4-bit quantized base model and attach the LoRA adapter
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = PreTrainedTokenizerFast.from_pretrained(BASE, token=TOKEN)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
base_model = AutoModelForCausalLM.from_pretrained(
    BASE, token=TOKEN, quantization_config=bnb, device_map={"": DEVICE})
model = PeftModel.from_pretrained(base_model, LORA, token=TOKEN)
model.eval()

# Download and load the classification head (Linear 4096 -> 5)
head_path  = hf_hub_download(repo_id=LORA, filename="classifier_head.pt", token=TOKEN)
classifier = nn.Linear(4096, 5).to(DEVICE).float()
classifier.load_state_dict({k: v.to(DEVICE) for k, v in torch.load(head_path, map_location="cpu").items()})
classifier.eval()

def classify(text):
    # "ප්‍රවෘත්ති:" = "News:"
    enc = tokenizer(f"ප්‍රවෘත්ති: {text[:400]}", return_tensors="pt",
                    truncation=True, max_length=256, padding="max_length").to(DEVICE)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    hidden  = out.hidden_states[-1]
    # Index of the last non-padding token in each sequence
    seq_len = enc["attention_mask"].sum(dim=1) - 1
    last    = hidden[torch.arange(hidden.size(0)), seq_len].float()
    return LABELS[classifier(last).argmax(dim=-1).item()]

# "Sri Lanka's cricket team won today."
print(classify("ශ්‍රී ලංකා ක්‍රිකට් කණ්ඩායම අද ජය ගත්තේය."))
# Output: ක්‍රීඩා (Sports)
```
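The `classify` helper returns only the top label; a softmax over the head's logits also gives a confidence score. A self-contained sketch (the dummy logits below stand in for `classifier(last)`):

```python
import torch

# Business, Politics, Entertainment, Sports, Technology
LABELS = ["ව්‍යාපාර", "දේශපාලන", "විනෝදාස්වාදය", "ක්‍රීඩා", "තාක්ෂණ"]

# Dummy logits standing in for classifier(last) in the Quick Start
logits = torch.tensor([[0.2, -1.3, 0.1, 4.1, 0.5]])

probs = torch.softmax(logits, dim=-1)   # normalize logits to probabilities
idx   = probs.argmax(dim=-1).item()     # index of the most likely class
print(LABELS[idx], round(probs[0, idx].item(), 3))
```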

πŸ“ Repository Structure

```
SerendipLLM-news-classifier/
├── adapter_config.json          # LoRA configuration
├── adapter_model.safetensors    # LoRA weights (336MB)
├── classifier_head.pt           # Classification head weights (83KB)
├── tokenizer.json               # Tokenizer
├── tokenizer_config.json
├── special_tokens_map.json
└── scripts/
    ├── train_classifier.py      # Full training script
    └── eval_real_news.py        # Evaluation with 10 real news examples
```

📋 Training Details

| Parameter | Value |
|---|---|
| Base model | Chamaka8/Serendip-LLM-CPT-SFT-v2 |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Epochs | 15 |
| Batch size | 8 |
| Classifier LR | 2e-4 |
| LoRA LR | 5e-5 |
| Max sequence length | 256 |
| Training samples | 4885 (balanced) |
| Eval samples | 200 (40 per class) |
| GPU | NVIDIA RTX A4500 (20GB) |
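The two learning rates in the table are applied via separate AdamW parameter groups, with a cosine schedule over training. A minimal sketch of that setup (the dummy modules and `steps_per_epoch` below are illustrative placeholders, not the actual training script):

```python
import torch
from torch import nn

# Dummy stand-ins for the trainable LoRA weights and the classification head
lora_params = nn.Linear(4096, 32)
classifier  = nn.Linear(4096, 5)

# Separate LRs per parameter group, as in the table above
optimizer = torch.optim.AdamW([
    {"params": classifier.parameters(),  "lr": 2e-4},  # classifier LR
    {"params": lora_params.parameters(), "lr": 5e-5},  # LoRA LR
])

# Cosine LR schedule across the 15 epochs (steps_per_epoch is assumed)
steps_per_epoch = 100
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=15 * steps_per_epoch)

print([g["lr"] for g in optimizer.param_groups])
```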

📚 Dataset

Trained on NLPC-UOM/Sinhala-News-Category-classification
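Class balancing during training was done via oversampling: minority classes are resampled with replacement until every class matches the largest one. A generic sketch of the idea (not the exact training script):

```python
import random
from collections import defaultdict

def oversample(samples):
    """Duplicate minority-class samples until all classes are equally sized."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend(items)
        # Resample with replacement to reach the majority-class count
        balanced.extend(random.choices(items, k=target - len(items)))
    random.shuffle(balanced)
    return balanced

data = [("a", "sports")] * 50 + [("b", "business")] * 10
balanced = oversample(data)
print(len(balanced))  # 100 (50 per class)
```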

📜 Citation

```bibtex
@misc{serendipllm-news-classifier-2025,
  author    = {Chamaka Amarasinghe},
  title     = {SerendipLLM News Classifier: Sinhala News Classification with Classification Head},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Chamaka8/SerendipLLM-news-classifier}
}
```