metadata
library_name: transformers
tags:
  - text-classification
  - distilbert
  - news-classification
  - sri-lanka
base_model:
  - distilbert/distilbert-base-uncased

Model Details

Model Name: Ginidu2003/Distilbert-Base-News-classifier
Model Type: Text Classification
Base Model: distilbert/distilbert-base-uncased
Language(s): English
Finetuned from model: distilbert/distilbert-base-uncased

Model Description

This is a fine-tuned DistilBERT model designed to classify English news articles into 5 categories:

  • Business
  • Opinion
  • Political gossip
  • Sports
  • World news

Uses

Direct Use

  • Classify news articles into one of the five predefined categories.
  • Suited to English-language news written in a style similar to the Daily Mirror (Sri Lanka).

Downstream Use

  • Can be integrated into web applications (Streamlit/Gradio) for automated news categorization.
  • Can be used for real-time news filtering and topic-based news recommendation systems.

Out-of-Scope Use

  • Not intended for other languages.
  • Not trained for sentiment analysis, fake news detection, or hate speech detection.
  • Not suitable for very short texts.

Bias, Risks, and Limitations

  • The model is trained only on Daily Mirror news data, so it may perform poorly on other news sources or different writing styles.
  • May be biased towards Sri Lankan context and the variety of English used in Sri Lankan media.
  • Performance may degrade on very long or very short articles.
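
One possible workaround for very long articles is to split the text into smaller chunks, classify each chunk, and take a majority vote. The sketch below is only an illustration and is not part of the original training or evaluation setup: the names `chunk_words` and `classify_long_article`, the `classify_chunk` callable, and the 200-word window are all assumptions chosen for the example.

```python
from collections import Counter
from typing import Callable, List


def chunk_words(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def classify_long_article(
    text: str,
    classify_chunk: Callable[[str], str],  # hypothetical: maps a chunk to a label
    max_words: int = 200,                  # assumed window size, not tuned
) -> str:
    """Label every chunk with classify_chunk and return the most common label."""
    chunks = chunk_words(text, max_words)
    if not chunks:
        raise ValueError("text is empty")
    votes = Counter(classify_chunk(chunk) for chunk in chunks)
    return votes.most_common(1)[0][0]
```

With the `classifier` pipeline from the getting-started snippet, `classify_chunk` could be something like `lambda t: classifier(t, truncation=True)[0]["label"]`. Whether voting actually helps on this model has not been measured here.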

How to Get Started with the Model

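from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="Ginidu2003/Distilbert-Base-News-classifier",
)

# truncation=True cuts inputs that exceed the model's maximum sequence length.
result = classifier("Your news article text here...", truncation=True)
print(result)  # a list with one dict containing "label" and "score"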

Training Details

Training Data

  • Dataset: Daily Mirror Sri Lankan English news (2024–2025)
  • Total samples: ~1,018 articles (after preprocessing and deduplication)
  • Classes: 5 balanced categories (Business, Opinion, Political gossip, Sports, World news)
  • Preprocessing: Lowercasing, punctuation removal, lemmatization
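
The preprocessing steps listed above could be approximated as follows. This is a minimal sketch using only the standard library, not the author's actual pipeline: the function name `preprocess` is hypothetical, punctuation removal covers ASCII punctuation only, and lemmatization is left as a pluggable callable because the card does not say which lemmatizer was used.

```python
import string
from typing import Callable, Optional

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str, lemmatize: Optional[Callable[[str], str]] = None) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace, optionally lemmatize."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    tokens = cleaned.split()
    if lemmatize is not None:
        # lemmatize is an assumed per-token function, e.g. a wrapper around an NLP library
        tokens = [lemmatize(tok) for tok in tokens]
    return " ".join(tokens)
```

If the model was trained on text cleaned this way, applying similar cleaning at inference time may keep inputs closer to the training distribution.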

Training Procedure

  • Framework: Hugging Face Transformers + Trainer API
  • Base Model: distilbert/distilbert-base-uncased
  • Epochs: 4
  • Batch Size: 8
  • Learning Rate: 2e-5
  • Validation Accuracy: 90.19%
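
A configuration along these lines would reproduce the listed hyperparameters with the Trainer API. Treat it as a sketch under assumptions rather than the original training script: `HYPERPARAMS`, `build_trainer`, and the output directory are hypothetical names, the batch size of 8 is assumed to be per device, and all unlisted settings fall back to library defaults.

```python
# Values taken from the list above; everything else is an assumption.
HYPERPARAMS = {
    "base_model": "distilbert/distilbert-base-uncased",
    "num_labels": 5,
    "num_train_epochs": 4,
    "per_device_train_batch_size": 8,
    "learning_rate": 2e-5,
}


def build_trainer(train_dataset, eval_dataset, output_dir="news-classifier-out"):
    """Return a Trainer configured with HYPERPARAMS (datasets must already be tokenized)."""
    # Imported lazily so the configuration can be inspected without transformers installed.
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        HYPERPARAMS["base_model"], num_labels=HYPERPARAMS["num_labels"]
    )
    args = TrainingArguments(
        output_dir=output_dir,  # hypothetical path
        num_train_epochs=HYPERPARAMS["num_train_epochs"],
        per_device_train_batch_size=HYPERPARAMS["per_device_train_batch_size"],
        learning_rate=HYPERPARAMS["learning_rate"],
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
```

Calling `build_trainer(train_ds, val_ds).train()` would then start fine-tuning, where `train_ds` and `val_ds` stand for tokenized datasets prepared separately.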

Evaluation

Validation Set Results (20% hold-out):

  • Accuracy: 91.18%
  • Model shows strong and consistent performance across all 5 classes.
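
To check per-class consistency on your own labelled data, overall accuracy can be broken down by class. The helper names `accuracy` and `per_class_recall` below are hypothetical, and the sketch does not reproduce the reported numbers; it only shows how such figures might be computed from true and predicted labels.

```python
from collections import Counter
from typing import Dict, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """For each true class, the fraction of its articles predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}
```

A large gap between the best and worst per-class recall would suggest that a single accuracy figure hides weak categories.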

Environmental Impact

  • Training ran on a single NVIDIA T4 GPU on Google Colab.
  • Estimated carbon emissions: very low (small model and small dataset).