---
library_name: transformers
tags: [text-classification, llm, huggingface, nlp, news, fine-tuning, gradio]
---
# 📰 NewsSense AI: LLM News Classifier with Web Scraping & Fine-Tuning
A fine-tuned transformer model that classifies news articles into five categories: Politics, Business, Health, Science, and Climate. The training data was scraped from NPR using Decodo and parsed with BeautifulSoup.
---
## Model Details
### Model Description
This model is fine-tuned using Hugging Face Transformers on a custom dataset of 5,000 news articles scraped directly from [NPR](https://www.npr.org/). The goal is to classify real-world news into practical categories for use in filtering, organizing, and summarizing large-scale news streams.
- **Developed by:** Manan Gulati
- **Model type:** Transformer (text classification)
- **Language(s):** English
- **License:** MIT
- **Fine-tuned from model:** distilbert-base-uncased
### Model Sources
- **Repository:** https://github.com/mgulati3/Fine-Tune
- **Demo:** https://huggingface.co/spaces/mgulati3/news-classifier-ui
- **Model Hub:** https://huggingface.co/mgulati3/news-classifier-model
---
## Uses
### Direct Use
This model can be used to classify any English-language news article or paragraph into one of five categories. It's useful for content filtering, feed curation, and auto-tagging of articles.
### Out-of-Scope Use
- Not suitable for multi-label classification.
- Not recommended for non-news or informal text.
- May not perform well on non-English content.
---
## Bias, Risks, and Limitations
- The model is trained only on NPR articles, which may carry source-specific bias.
- Categories are limited to five; nuanced topics may not be accurately captured.
- Misclassifications may occur for ambiguous or mixed-topic content.
### Recommendations
Use prediction confidence scores to interpret results. Consider human review for sensitive applications.
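To act on this recommendation, low-confidence predictions can be routed to human review with a simple threshold. A minimal sketch: the `sample` below only mimics the shape of `pipeline(..., top_k=None)` output for one input; the labels and scores are illustrative, not real model output, and the 0.70 threshold is an arbitrary example value.

```python
CONFIDENCE_THRESHOLD = 0.70  # example value; tune for your application

def route_prediction(scores, threshold=CONFIDENCE_THRESHOLD):
    """Return (label, needs_review) for a list of {label, score} dicts."""
    best = max(scores, key=lambda s: s["score"])
    return best["label"], best["score"] < threshold

# Illustrative per-label scores for a single article.
sample = [
    {"label": "Science", "score": 0.55},
    {"label": "Climate", "score": 0.30},
    {"label": "Politics", "score": 0.15},
]

label, needs_review = route_prediction(sample)
print(label, needs_review)  # Science True
```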
---
## How to Get Started
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")
print(classifier("NASA's new moon mission will use AI to optimize fuel consumption."))
```
---
## Training Details
### Training Data
Scraped 5,000 articles from NPR using Decodo (with proxy rotation and JS rendering). Articles were cleaned and labeled across five categories using Python and pandas.
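The scraping stage depends on Decodo credentials, but the cleaning step can be sketched independently. The project used BeautifulSoup; the stdlib version below mirrors the same operation (drop `script`/`style`, extract and normalize visible text) so it runs without extra dependencies. The HTML snippet is illustrative, not NPR's actual markup.

```python
from html.parser import HTMLParser

class ArticleText(HTMLParser):
    """Collect visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def clean_article(raw_html):
    """Return whitespace-normalized article text."""
    parser = ArticleText()
    parser.feed(raw_html)
    return " ".join(" ".join(parser.parts).split())

html = "<article><h1>Markets rally</h1><p>Stocks rose on Tuesday.</p><script>track();</script></article>"
print(clean_article(html))  # Markets rally Stocks rose on Tuesday.
```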
### Training Procedure
- Tokenizer: `distilbert-base-uncased` WordPiece tokenizer (matching the base model)
- Preprocessing: Lowercasing, truncation, padding
- Epochs: 4
- Optimizer: AdamW
- Batch size: 16
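With the hyperparameters listed above, the fine-tuning setup likely looked roughly like the configuration sketch below. This is not the exact training script: `output_dir` is a placeholder, and `train_ds`/`eval_ds` stand in for the tokenized train and test datasets.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=5,  # Politics, Business, Health, Science, Climate
)

args = TrainingArguments(
    output_dir="news-classifier",      # placeholder path
    num_train_epochs=4,                # as listed above
    per_device_train_batch_size=16,    # as listed above
    optim="adamw_torch",               # AdamW optimizer
)

# train_ds / eval_ds: placeholders for the tokenized datasets
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```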
---
## Evaluation
### Testing Data
20% of the dataset was held out for testing via a random stratified split, so each category keeps the same proportion in both sets.
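The card doesn't name the splitting tool (scikit-learn's `train_test_split(..., stratify=labels)` is the common shortcut); a minimal stdlib sketch of the same behavior:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=42):
    """Split indices so each label contributes ~test_frac of its examples to test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        cut = int(len(idxs) * test_frac)
        test_idx.extend(idxs[:cut])
        train_idx.extend(idxs[cut:])
    return train_idx, test_idx

# Toy example: 10 articles per label, 20% held out from each.
labels = ["Politics"] * 10 + ["Health"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 16 4
```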
### Metrics
- Accuracy (Train): 85%
- Accuracy (Test): 60%
- Metric: Accuracy (single-label, top-1)
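The reported metric is plain top-1 accuracy: the fraction of articles whose single predicted label matches the gold label. A one-function sketch (the labels below are illustrative, not real model output):

```python
def accuracy(preds, golds):
    """Top-1 accuracy for single-label classification."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

preds = ["Politics", "Health", "Science", "Business", "Climate"]
golds = ["Politics", "Health", "Climate", "Business", "Climate"]
print(accuracy(preds, golds))  # 0.8
```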
### Results
Training accuracy (85%) notably exceeds test accuracy (60%), indicating some overfitting to the training set. The model is most reliable on domain-specific news content with clearly distinguishable category patterns.
---
## Environmental Impact
- **Hardware Type:** Google Colab GPU (T4)
- **Hours used:** ~2.5
- **Cloud Provider:** Google
- **Compute Region:** US
- **Carbon Emitted:** Estimated ~0.2 kgCO2eq
---
## Technical Specifications
### Model Architecture
DistilBERT architecture fine-tuned for single-label text classification using a softmax output layer over 5 categories.
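The classification head maps DistilBERT's pooled representation to 5 logits, and softmax turns those into a probability distribution over the categories. A minimal sketch of that final step (the logit values are illustrative):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["Politics", "Business", "Health", "Science", "Climate"]
logits = [0.2, -1.1, 0.4, 2.8, 1.0]  # illustrative head output
probs = softmax(logits)
print(LABELS[probs.index(max(probs))])  # Science
```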
### Compute Infrastructure
- Google Colab Pro
- Python 3.10
- Hugging Face Transformers 4.x
- PyTorch backend
---
## Citation
**APA:**
Gulati, M. (2025). NewsSense AI: Fine-tuned LLM for News Classification. https://huggingface.co/mgulati3/news-classifier-model
**BibTeX:**
```bibtex
@misc{gulati2025newssense,
  author = {Gulati, Manan},
  title  = {NewsSense AI: Fine-tuned LLM for News Classification},
  year   = {2025},
  url    = {https://huggingface.co/mgulati3/news-classifier-model}
}
```
---
## Model Card Contact
For questions or collaborations: mgulati3@asu.edu