library_name: transformers
tags:
  - text-classification
  - distilbert
  - news-classification
  - sri-lanka
base_model:
  - distilbert/distilbert-base-uncased

Model Details

Model Name: Ginidu2003/Distilbert-Base-News-classifier
Model Type: Text Classification
Base Model: distilbert/distilbert-base-uncased
Language(s): English
Finetuned from model: distilbert/distilbert-base-uncased

Model Description

This is a fine-tuned DistilBERT model designed to classify English news articles into 5 categories:

Business
Opinion
Political gossip
Sports
World news

Uses

Direct Use

Classify news articles into one of the five predefined categories.
Suitable for English-language news written in a style similar to the Daily Mirror (Sri Lanka).

Downstream Use

Can be integrated into web applications (Streamlit/Gradio) for automated news categorization.
Can be used for real-time news filtering and topic-based news recommendation systems.
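
As a rough illustration of the topic-based filtering mentioned above, the sketch below keeps only the articles whose top predicted category is in a wanted set. The helper name `filter_by_topic`, the `min_score` threshold, and the exact label strings are assumptions, not part of the released model; the real label names should be checked in the model's `id2label` config. Any callable that returns the pipeline's `[{"label": ..., "score": ...}]` format can be passed as `classifier`, so a stand-in is used here to keep the example runnable without downloading the model.

```python
from typing import Any, Callable, Dict, Iterable, List, Set


def filter_by_topic(
    articles: Iterable[str],
    classifier: Callable[[str], List[Dict[str, Any]]],
    wanted: Set[str],
    min_score: float = 0.5,  # hypothetical confidence threshold
) -> List[str]:
    """Return the articles whose top predicted label is in `wanted`."""
    kept = []
    for text in articles:
        top = classifier(text)[0]  # first (top) prediction for this text
        if top["label"] in wanted and top["score"] >= min_score:
            kept.append(text)
    return kept


# Stand-in classifier (assumption) that mimics the pipeline's output format.
def fake_classifier(text: str) -> List[Dict[str, Any]]:
    label = "Sports" if "cricket" in text.lower() else "Business"
    return [{"label": label, "score": 0.9}]


sports_only = filter_by_topic(
    ["Cricket team wins series", "Stock market closes higher"],
    fake_classifier,
    wanted={"Sports"},
)
print(sports_only)
```

In a real application, `fake_classifier` would be replaced by the loaded `pipeline` object.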

Out-of-Scope Use

Not intended for other languages.
Not trained for sentiment analysis, fake news detection, or hate speech detection.
Not suitable for very short texts.

Bias, Risks, and Limitations

The model is trained only on Daily Mirror news data, so it may perform poorly on other news sources or different writing styles.
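The model may be biased toward the Sri Lankan context and the variety of English used in Sri Lankan media.
Performance may degrade on very short articles and on very long ones: DistilBERT accepts at most 512 tokens per input, so text beyond that limit has to be truncated or split.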
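
One common way to handle articles longer than DistilBERT's 512-token input limit is to split them into overlapping windows, classify each window, and average the scores per label. The sketch below illustrates that idea and is not the author's method: it splits on whitespace-separated words rather than tokens, and the window size, overlap, and helper names are all assumptions. `score_all_labels` stands for any callable that returns a list of `{"label": ..., "score": ...}` dicts covering every label, for example a thin wrapper around the pipeline called with `top_k=None`.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


def split_into_windows(text: str, size: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping windows of at most `size` words."""
    words = text.split()
    if len(words) <= size:
        return [text]
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return windows


def classify_long(
    text: str,
    score_all_labels: Callable[[str], List[Dict[str, Any]]],
    size: int = 300,
    overlap: int = 50,
) -> Dict[str, float]:
    """Average the per-label scores over all windows of a long text."""
    windows = split_into_windows(text, size, overlap)
    totals: Dict[str, float] = defaultdict(float)
    for window in windows:
        for item in score_all_labels(window):
            totals[item["label"]] += item["score"]
    return {label: total / len(windows) for label, total in totals.items()}
```

Averaging treats every window equally; weighting by window length or taking the maximum score per label are alternative design choices.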

How to Get Started with the Model

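from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="Ginidu2003/Distilbert-Base-News-classifier",
)

# truncation=True cuts off inputs longer than the model's 512-token limit.
result = classifier("Your news article text here...", truncation=True)
print(result)  # a list with one {"label": ..., "score": ...} dict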

Training Details

Training Data

Dataset: Daily Mirror Sri Lankan English news (2024–2025)
Total samples: ~1,018 articles (after preprocessing and deduplication)
Classes: 5 balanced categories (Business, Opinion, Political gossip, Sports, World news)
Preprocessing: Lowercasing, punctuation removal, lemmatization
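
The preprocessing steps listed above could look roughly like the sketch below. It uses only the standard library for lowercasing and punctuation removal. The card does not say which lemmatizer was used, so lemmatization is left as a pluggable `lemmatize` callable (a hypothetical parameter) that by default leaves each word unchanged; in practice one might pass something like NLTK's `WordNetLemmatizer().lemmatize`.

```python
import string
from typing import Callable

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str, lemmatize: Callable[[str], str] = lambda w: w) -> str:
    """Lowercase, strip ASCII punctuation, then lemmatize word by word."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return " ".join(lemmatize(word) for word in no_punct.split())


print(preprocess("The Central Bank's rates rose, again!"))
# prints: the central banks rates rose again
```

If the model was trained on text cleaned this way, applying the same steps at inference time keeps the inputs consistent with the training data.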

Training Procedure

Framework: Hugging Face Transformers + Trainer API
Base Model: distilbert/distilbert-base-uncased
Epochs: 4
Batch Size: 8
Learning Rate: 2e-5
Validation Accuracy: 90.19%
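
For orientation, the hyperparameters above map onto the Trainer API roughly as follows. This is a sketch, not the author's training script: `output_dir`, the function name, and the lazy import are assumptions, the batch size is assumed to be the per-device training batch size, and every `TrainingArguments` option not listed in the card is left at its library default.

```python
# Hyperparameters taken from the list above.
HYPERPARAMS = {
    "num_train_epochs": 4,
    "per_device_train_batch_size": 8,  # assumes "Batch Size" means per device
    "learning_rate": 2e-5,
}


def build_training_args(output_dir: str = "news-classifier-out"):
    """Build TrainingArguments from the card's hyperparameters.

    `output_dir` is a hypothetical path. The import happens inside the
    function so the constants above can be inspected without transformers
    installed.
    """
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

The returned object would then be passed to `Trainer` together with the model and the tokenized train and validation splits.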

Evaluation

Validation Set Results (20% hold-out):

Accuracy: 91.18%
Model shows strong and consistent performance across all 5 classes.
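
A claim about per-class consistency is usually backed by reporting per-class recall alongside overall accuracy on the hold-out set. The sketch below computes both with the standard library; the function names and the toy labels are illustrative only and are not the model's actual predictions.

```python
from collections import Counter
from typing import Dict, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """For each true class, the fraction of its examples predicted correctly."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in support.items()}


# Toy data (not real model output).
y_true = ["Sports", "Sports", "Business", "Opinion"]
y_pred = ["Sports", "Business", "Business", "Opinion"]
print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```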

Environmental Impact

Training ran on a single NVIDIA T4 GPU on Google Colab.
Estimated carbon emissions: very low (small model and small dataset).