---
library_name: transformers
tags:
- text-classification
- distilbert
- news-classification
- sri-lanka
base_model:
- distilbert/distilbert-base-uncased
---

## Model Details

- **Model Name:** `Ginidu2003/Distilbert-Base-News-classifier`
- **Model Type:** Text Classification
- **Base Model:** `distilbert/distilbert-base-uncased`
- **Language(s):** English
- **Finetuned from model:** [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)

### Model Description

This is a fine-tuned DistilBERT model that classifies English news articles into **5 categories**:

- Business
- Opinion
- Political gossip
- Sports
- World news

## Uses

### Direct Use

- Classify a news article into one of the five predefined categories.
- Best suited to English news written in a style similar to the Daily Mirror (Sri Lanka).

### Downstream Use

- Can be integrated into web applications (Streamlit/Gradio) for automated news categorization.
- Can be used for real-time news filtering and topic-based news recommendation systems.

### Out-of-Scope Use

- Not intended for languages other than English.
- Not trained for sentiment analysis, fake news detection, or hate speech detection.
- Not suitable for very short texts.

## Bias, Risks, and Limitations

- The model is trained only on **Daily Mirror** news data, so it may perform poorly on other news sources or different writing styles.
- Potential bias towards the Sri Lankan context and the English used in Sri Lankan media.
- Performance may degrade on very long or very short articles.

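The limitation on very long articles is partly structural: DistilBERT accepts at most 512 tokens per input, so anything beyond that is cut off when truncation is enabled. The following is a minimal sketch of one possible workaround, not the author's method: split the article into word windows, classify each window, and average the per-label scores. The helper names (`chunk_words`, `classify_long_article`, `make_pipeline_classifier`) and the 200-word window are hypothetical choices made for illustration, and the label strings returned depend on the model's configuration.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# A batch classifier maps a list of texts to, for each text,
# a list of {"label": str, "score": float} dicts (one per class).
BatchClassifier = Callable[[List[str]], List[List[Dict[str, Any]]]]


def chunk_words(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive windows of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def classify_long_article(
    text: str,
    classify_batch: BatchClassifier,
    max_words: int = 200,  # assumption: small enough to stay under 512 tokens
) -> Dict[str, float]:
    """Average per-label scores over all word windows of the article."""
    chunks = chunk_words(text, max_words)
    if not chunks:
        raise ValueError("text is empty")
    totals: Dict[str, float] = defaultdict(float)
    for scores in classify_batch(chunks):
        for item in scores:
            totals[item["label"]] += item["score"]
    return {label: total / len(chunks) for label, total in totals.items()}


def make_pipeline_classifier(
    model_id: str = "Ginidu2003/Distilbert-Base-News-classifier",
) -> BatchClassifier:
    """Wrap a transformers pipeline so it returns scores for every label."""
    from transformers import pipeline  # imported lazily; needs transformers installed

    clf = pipeline("text-classification", model=model_id)
    # top_k=None returns all class scores; truncation=True guards against
    # windows that still exceed the model's 512-token limit.
    return lambda texts: clf(texts, top_k=None, truncation=True)
```

With the model downloaded, `scores = classify_long_article(article, make_pipeline_classifier())` followed by `max(scores, key=scores.get)` would give the label with the highest average score. Averaging treats every window equally; whether that helps on this model has not been evaluated here.
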
## How to Get Started with the Model

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline(
    "text-classification",
    model="Ginidu2003/Distilbert-Base-News-classifier"
)

# Returns a list with one dict per input, holding the predicted label and its score
result = classifier("Your news article text here...")
print(result)
```

## Training Details

### Training Data

- **Dataset**: Daily Mirror Sri Lankan English news (2024–2025)
- **Total samples**: ~1,018 articles (after preprocessing and deduplication)
- **Classes**: 5 balanced categories (Business, Opinion, Political gossip, Sports, World news)
- **Preprocessing**: Lowercasing, punctuation removal, lemmatization

### Training Procedure

- **Framework**: Hugging Face Transformers + Trainer API
- **Base Model**: `distilbert/distilbert-base-uncased`
- **Epochs**: 4
- **Batch Size**: 8
- **Learning Rate**: 2e-5
- **Validation Accuracy**: **90.19%**

## Evaluation

**Validation Set Results (20% hold-out):**

- **Accuracy**: **91.18%**
- The model shows strong and consistent performance across all 5 classes.

## Environmental Impact

- Training was done on a single GPU (T4 on Google Colab).
- Estimated carbon emissions: very low (small model and small dataset).
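
To give a sense of scale for the figures under Training Details, the short sketch below works out the split sizes and optimizer step counts they imply. It assumes exactly 1,018 samples (the card says "~1,018"), a 20% hold-out rounded up, a single device, and no gradient accumulation; the function name `training_budget` is hypothetical and the actual training script may differ.

```python
import math
from typing import Dict


def training_budget(
    n_samples: int,
    val_fraction: float,
    batch_size: int,
    epochs: int,
) -> Dict[str, int]:
    """Derive split sizes and step counts from the reported hyperparameters."""
    # Assumption: the hold-out size is rounded up, as scikit-learn's
    # train_test_split does for a fractional test_size.
    n_val = math.ceil(n_samples * val_fraction)
    n_train = n_samples - n_val
    # The last, partially filled batch still counts as one step.
    steps_per_epoch = math.ceil(n_train / batch_size)
    return {
        "n_train": n_train,
        "n_val": n_val,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


budget = training_budget(n_samples=1018, val_fraction=0.20, batch_size=8, epochs=4)
print(budget)
```

Under these assumptions the run covers 814 training and 204 validation articles, with 102 steps per epoch and 408 steps in total. With a validation set of about 204 articles, each additional correct prediction moves accuracy by roughly 0.5 percentage points, which is worth keeping in mind when comparing reported accuracies.
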