---
library_name: transformers
tags:
- text-classification
- distilbert
- news-classification
- sri-lanka
base_model:
- distilbert/distilbert-base-uncased
---
## Model Details
- **Model Name:** `Ginidu2003/Distilbert-Base-News-classifier`
- **Model Type:** Text Classification
- **Base Model:** `distilbert/distilbert-base-uncased`
- **Language(s):** English
- **Finetuned from model:** [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
### Model Description
This is a fine-tuned DistilBERT model designed to classify English news articles into **5 categories**:
- Business
- Opinion
- Political gossip
- Sports
- World news
## Uses
### Direct Use
- Classify news articles into one of the five predefined categories.
- Best suited to English-language news written in a style similar to the Daily Mirror.
### Downstream Use
- Can be integrated into web applications (Streamlit/Gradio) for automated news categorization.
- Can be used for real-time news filtering and topic-based news recommendation systems.
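
As one possible shape for the web integration mentioned above, here is a minimal Gradio sketch. It is an illustration rather than the author's app: the helper names `to_confidences` and `build_demo` are hypothetical, and it assumes `gradio` and `transformers` are installed and that the pipeline returns one `{"label", "score"}` dict per class when called with `top_k=None`.

```python
def to_confidences(scores):
    """Convert pipeline output (list of label/score dicts) to {label: score}."""
    return {item["label"]: float(item["score"]) for item in scores}


def build_demo():
    """Build a Gradio interface around the classifier (hypothetical helper)."""
    # Imported lazily so the helper above stays usable without these packages.
    import gradio as gr
    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="Ginidu2003/Distilbert-Base-News-classifier",
    )

    def classify(text):
        # top_k=None asks for scores for every class, not only the best one;
        # truncation=True clips inputs beyond the model's 512-token limit.
        scores = classifier(text, truncation=True, top_k=None)
        return to_confidences(scores)

    return gr.Interface(
        fn=classify,
        inputs=gr.Textbox(lines=10, label="News article"),
        outputs=gr.Label(num_top_classes=5),
        title="News category classifier",
    )
```

Calling `build_demo().launch()` would start the app locally.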
### Out-of-Scope Use
- Not intended for languages other than English.
- Not trained for sentiment analysis, fake news detection, or hate speech detection.
- Not suitable for very short texts.
## Bias, Risks, and Limitations
- The model is trained only on **Daily Mirror** news data, so it may perform poorly on other news sources or different writing styles.
- Potential bias toward the Sri Lankan context and the English used in Sri Lankan media.
- Performance may degrade on very long or very short articles.
## How to Get Started with the Model
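```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="Ginidu2003/Distilbert-Base-News-classifier",
)

# truncation=True clips inputs beyond DistilBERT's 512-token limit.
result = classifier("Your news article text here...", truncation=True)
print(result)  # a list with one {"label": ..., "score": ...} dict
```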
## Training Details
### Training Data
- **Dataset**: Daily Mirror Sri Lankan English news (2024–2025)
- **Total samples**: ~1,018 articles (after preprocessing and deduplication)
- **Classes**: 5 balanced categories (Business, Opinion, Political gossip, Sports, World news)
- **Preprocessing**: Lowercasing, punctuation removal, lemmatization
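
The preprocessing steps listed above can be sketched roughly as follows. This is an approximation, not the author's pipeline: the card does not say which lemmatizer was used, so `lemmatize` is a hypothetical hook that takes any word-to-lemma callable (for example `WordNetLemmatizer().lemmatize` from NLTK), and punctuation removal here covers ASCII punctuation only.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, lemmatize=None):
    """Lowercase, strip ASCII punctuation, and optionally lemmatize tokens."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    tokens = cleaned.split()  # whitespace tokenization (an assumption)
    if lemmatize is not None:
        tokens = [lemmatize(token) for token in tokens]
    return " ".join(tokens)
```

If the model was trained on text cleaned this way, applying similar steps at inference time may keep inputs closer to the training distribution.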
### Training Procedure
- **Framework**: Hugging Face Transformers + Trainer API
- **Base Model**: `distilbert/distilbert-base-uncased`
- **Epochs**: 4
- **Batch Size**: 8
- **Learning Rate**: 2e-5
- **Validation Accuracy**: **90.19%**
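
For readers who want to reproduce a similar run, the hyperparameters above map onto the Trainer API roughly as in this sketch. It is not the author's training script: `build_trainer` and the default `output_dir` are hypothetical, the batch size is assumed to be per device (equivalent on a single GPU), and all other `TrainingArguments` are left at library defaults. The datasets are expected to be already tokenized.

```python
BASE_MODEL = "distilbert/distilbert-base-uncased"
NUM_LABELS = 5  # Business, Opinion, Political gossip, Sports, World news

# Values taken from the list above; everything else uses library defaults.
HYPERPARAMS = {
    "num_train_epochs": 4,
    "per_device_train_batch_size": 8,
    "learning_rate": 2e-5,
}


def build_trainer(train_dataset, eval_dataset, output_dir="distilbert-news"):
    """Create a Trainer for 5-way classification (hypothetical helper)."""
    # Imported lazily so the constants above can be inspected on their own.
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        BASE_MODEL, num_labels=NUM_LABELS
    )
    args = TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
```

Calling `build_trainer(train_ds, val_ds).train()` would then start fine-tuning, where `train_ds` and `val_ds` are placeholder names for the tokenized splits.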
## Evaluation
**Validation Set Results (20% hold-out):**
- **Accuracy**: **91.18%**
- Model shows strong and consistent performance across all 5 classes.
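
To check accuracy and per-class behaviour on your own labelled hold-out set, a small dependency-free sketch such as the one below may help. The function names are illustrative, and per-class recall is used here as one reasonable way to look at class-level consistency; the card does not state which per-class metric was examined.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the share of its examples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}
```

Large gaps between classes in `per_class_recall` would suggest that overall accuracy is hiding a weak category.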
## Environmental Impact
- Training ran on a single NVIDIA T4 GPU on Google Colab.
- Estimated carbon emissions: very low, given the small model and small dataset.