---
library_name: transformers
tags:
- text-classification
- distilbert
- news-classification
- sri-lanka
base_model:
- distilbert/distilbert-base-uncased
---



## Model Details

**Model Name:** `Ginidu2003/Distilbert-Base-News-classifier`  
**Model Type:** Text Classification  
**Base Model:** `distilbert/distilbert-base-uncased`  
**Language(s):** English    
**Finetuned from model:** [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)

### Model Description
This is a fine-tuned DistilBERT model designed to classify English news articles into **5 categories**:

- Business  
- Opinion  
- Political gossip  
- Sports  
- World news  



## Uses

### Direct Use
- Classify news articles into one of the five predefined categories.
- Suitable for English-language news written in a style similar to the Daily Mirror (Sri Lanka).

### Downstream Use
- Can be integrated into web applications (Streamlit/Gradio) for automated news categorization.
- Can be used for real-time news filtering and topic-based news recommendation systems.
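
As an illustration of the web-app integration mentioned above, here is a minimal Gradio sketch. It is one possible wiring, not the author's app: `to_label_dict` and `build_demo` are hypothetical helper names, and the label strings returned depend on how the model's config names its classes. The imports are deferred so the helper can be used without Gradio installed.

```python
def to_label_dict(scores):
    """Convert text-classification pipeline output to {label: score}."""
    # Some transformers versions wrap a single input's scores in an outer list.
    if scores and isinstance(scores[0], list):
        scores = scores[0]
    return {item["label"]: float(item["score"]) for item in scores}


def build_demo():
    """Build a Gradio interface around the classifier (hypothetical helper)."""
    # Imported lazily so the pure helper above works without these packages.
    import gradio as gr
    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="Ginidu2003/Distilbert-Base-News-classifier",
    )

    def predict(article: str):
        # top_k=None asks for a score per class; truncation guards long inputs.
        return to_label_dict(classifier(article, top_k=None, truncation=True))

    return gr.Interface(
        fn=predict,
        inputs=gr.Textbox(lines=10, label="News article"),
        outputs=gr.Label(num_top_classes=5),
    )

# To serve the app locally: build_demo().launch()
```

Returning a label-to-score dictionary lets `gr.Label` show the confidence for all five categories rather than only the top prediction.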

### Out-of-Scope Use
- Not intended for other languages.
- Not trained for sentiment analysis, fake news detection, or hate speech detection.
- Not suitable for very short texts.

## Bias, Risks, and Limitations
- The model is trained only on **Daily Mirror** news data, so it may perform poorly on other news sources or different writing styles.
- Potential bias toward Sri Lankan contexts and the English used in Sri Lankan media.
- Performance may degrade on very short articles and on very long ones, since DistilBERT reads at most 512 tokens per input.

## How to Get Started with the Model

```python
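from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="Ginidu2003/Distilbert-Base-News-classifier",
)

# DistilBERT accepts at most 512 tokens, so truncate longer articles.
result = classifier("Your news article text here...", truncation=True)
print(result)  # a list with one dict holding the predicted "label" and "score"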
```

## Training Details

### Training Data
- **Dataset**: Daily Mirror Sri Lankan English news (2024–2025)
- **Total samples**: ~1,018 articles (after preprocessing and deduplication)
- **Classes**: 5 balanced categories (Business, Opinion, Political gossip, Sports, World news)
- **Preprocessing**: Lowercasing, punctuation removal, lemmatization
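
The preprocessing steps listed above can be approximated with the standard library. This is only a sketch of lowercasing and punctuation removal: `basic_clean` is a hypothetical name, the sample sentence is made up, and lemmatization is left out because it needs an external library (such as NLTK or spaCy) and the card does not say which one was used.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def basic_clean(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


# Made-up example sentence, not taken from the training data.
sample = "The Central Bank's rate cut SURPRISED markets, analysts said."
print(basic_clean(sample))
# the central banks rate cut surprised markets analysts said
```

If the training text was cleaned this way, applying similar cleaning at inference time may keep inputs closer to what the model saw during fine-tuning.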

### Training Procedure
- **Framework**: Hugging Face Transformers + Trainer API
- **Base Model**: `distilbert/distilbert-base-uncased`
- **Epochs**: 4
- **Batch Size**: 8
- **Learning Rate**: 2e-5
- **Validation Accuracy**: **90.19%**
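
For readers who want to reproduce a similar setup, the hyperparameters above map onto `TrainingArguments` fields roughly as follows. This is a sketch, not the original training script: `build_training_args` and the `output_dir` value are hypothetical, and treating the batch size of 8 as a per-device value is an assumption (on a single GPU the two are the same).

```python
# Hyperparameters stated in this card, keyed by TrainingArguments field names.
HPARAMS = {
    "num_train_epochs": 4,
    "per_device_train_batch_size": 8,  # assumption: card's batch size is per device
    "learning_rate": 2e-5,
}


def build_training_args(output_dir: str = "distilbert-news"):
    """Create TrainingArguments from HPARAMS; output_dir is a hypothetical path."""
    # Imported lazily so HPARAMS can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HPARAMS)
```

The resulting object can be passed to `Trainer` together with the model, tokenized datasets, and a metrics function.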

## Evaluation

**Validation Set Results (20% hold-out):**
- **Accuracy**: **91.18%**
- Model shows strong and consistent performance across all 5 classes.
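
To put the accuracy figure in context, the sketch below shows how accuracy is computed and how coarse it is on a hold-out of this size. The article count is the approximate figure from the Training Data section, and the toy labels are invented purely for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


total_articles = 1018                    # approximate, from the Training Data section
val_size = round(total_articles * 0.20)  # 20% hold-out, about 204 articles
points_per_article = 100 / val_size      # accuracy change from one article

# Invented toy labels (class ids 0-4), not real model output.
y_true = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
y_pred = [0, 1, 2, 3, 4, 0, 1, 2, 3, 0]

print(accuracy(y_true, y_pred))          # 0.9
print(val_size, round(points_per_article, 2))
```

With roughly 204 validation articles, a single misclassified article moves accuracy by about half a percentage point, so small differences between runs should be read with care.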

## Environmental Impact
- Training ran on a single NVIDIA T4 GPU on Google Colab.
- Estimated carbon emissions: very low (small model and small dataset); no measured figure is available.