---
library_name: transformers
tags: [text-classification, llm, huggingface, nlp, news, fine-tuning, gradio]
---
# 📰 NewsSense AI: LLM News Classifier with Web Scraping & Fine-Tuning
A fine-tuned transformer-based model that classifies news articles into five categories: Politics, Business, Health, Science, and Climate. The dataset was scraped from NPR using Decodo and processed with BeautifulSoup.
---
## Model Details
### Model Description
This model is fine-tuned using Hugging Face Transformers on a custom dataset of 5,000 news articles scraped directly from [NPR](https://www.npr.org/). The goal is to classify real-world news into practical categories for use in filtering, organizing, and summarizing large-scale news streams.
- **Developed by:** Manan Gulati
- **Model type:** Transformer (text classification)
- **Language(s):** English
- **License:** MIT
- **Fine-tuned from model:** distilbert-base-uncased
### Model Sources
- **Repository:** https://github.com/mgulati3/Fine-Tune
- **Demo:** https://huggingface.co/spaces/mgulati3/news-classifier-ui
- **Model Hub:** https://huggingface.co/mgulati3/news-classifier-model
---
## Uses
### Direct Use
This model can be used to classify any English-language news article or paragraph into one of five categories. It's useful for content filtering, feed curation, and auto-tagging of articles.
### Out-of-Scope Use
- Not suitable for multi-label classification.
- Not recommended for non-news or informal text.
- May not perform well on non-English content.
---
## Bias, Risks, and Limitations
- The model is trained only on NPR articles, which may carry source-specific bias.
- Categories are limited to five; nuanced topics may not be accurately captured.
- Misclassifications may occur for ambiguous or mixed-topic content.
### Recommendations
Use prediction confidence scores to interpret results. Consider human review for sensitive applications.
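The recommendation above can be sketched as a simple triage step. This is an illustrative sketch, not part of the released code: it assumes the standard `transformers` pipeline output format (`[{"label": str, "score": float}]`), and the 0.7 threshold is a made-up example value to be tuned per application.

```python
# Sketch: route low-confidence predictions to human review.
# Assumes the standard transformers pipeline output format
# [{"label": str, "score": float}]; the 0.7 threshold is illustrative.

REVIEW_THRESHOLD = 0.7

def triage(prediction: dict, threshold: float = REVIEW_THRESHOLD) -> str:
    """Return the predicted label, or flag the item for human review."""
    if prediction["score"] < threshold:
        return "NEEDS_HUMAN_REVIEW"
    return prediction["label"]

# Example with mock pipeline outputs:
predictions = [
    {"label": "Science", "score": 0.94},
    {"label": "Politics", "score": 0.52},
]
print([triage(p) for p in predictions])  # ['Science', 'NEEDS_HUMAN_REVIEW']
```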
---
## How to Get Started
```python
from transformers import pipeline
classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")
classifier("NASA's new moon mission will use AI to optimize fuel consumption.")
```
---
## Training Details
### Training Data
Scraped 5,000 articles from NPR using Decodo (with proxy rotation and JS rendering). Articles were cleaned and labeled across five categories using Python and pandas.
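The cleaning step described above can be sketched as follows. The actual pipeline used BeautifulSoup; this standard-library version (using `html.parser`) is only an illustration of the idea: strip tags, collapse whitespace, and lowercase.

```python
# Illustrative cleaning step, using only the standard library.
# The real pipeline used BeautifulSoup; this sketch shows the same idea:
# strip HTML tags, collapse whitespace, lowercase.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def clean_article(raw_html: str) -> str:
    extractor = TextExtractor()
    extractor.feed(raw_html)
    text = " ".join(extractor.chunks)
    return " ".join(text.split()).lower()

print(clean_article("<p>NASA launches  a <b>new</b> probe.</p>"))
# nasa launches a new probe.
```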
### Training Procedure
- Tokenizer: `distilbert-base-uncased` tokenizer (matching the base model)
- Preprocessing: Lowercasing, truncation, padding
- Epochs: 4
- Optimizer: AdamW
- Batch size: 16
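The truncation and padding steps listed above are handled by the Hugging Face tokenizer during preprocessing; the toy sketch below only illustrates the mechanics. `MAX_LEN = 8` and pad id `0` are illustrative values, not the model's actual settings.

```python
# Toy illustration of the truncate-and-pad step the tokenizer performs.
# Real preprocessing is handled by the Hugging Face tokenizer; MAX_LEN = 8
# and the pad id 0 here are illustrative, not the model's actual settings.
MAX_LEN = 8
PAD_ID = 0

def pad_or_truncate(token_ids: list[int], max_len: int = MAX_LEN) -> list[int]:
    ids = token_ids[:max_len]                      # truncation
    return ids + [PAD_ID] * (max_len - len(ids))   # padding

print(pad_or_truncate([101, 2054, 2003, 102]))
# [101, 2054, 2003, 102, 0, 0, 0, 0]
```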
---
## Evaluation
### Testing Data
20% of the dataset was held out for testing using a stratified random split, preserving category proportions.
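A stratified split samples the test fraction within each category so that test-set label proportions match the full dataset. The sketch below is illustrative (the real split used library tooling); the seed is only for reproducibility of the example.

```python
# Sketch of a stratified 80/20 split: sample 20% within each category so
# test-set label proportions mirror the full dataset. Illustrative only;
# the real split used library tooling.
import random
from collections import defaultdict

def stratified_split(examples, test_frac=0.2, seed=42):
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        cut = int(len(items) * test_frac)
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test

data = [(f"article {i}", lab) for lab in ["Politics", "Health"] for i in range(10)]
train, test = stratified_split(data)
print(len(train), len(test))  # 16 4
```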
### Metrics
- Accuracy (Train): 85%
- Accuracy (Test): 60%
- Metric: Accuracy (single-label, top-1)
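Top-1 accuracy, as reported above, is the fraction of examples whose single predicted label matches the gold label. The labels in this sketch are illustrative.

```python
# Top-1 accuracy: fraction of examples whose predicted label matches
# the gold label. Example labels below are illustrative.
def top1_accuracy(predictions: list[str], gold: list[str]) -> float:
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["Politics", "Health", "Science", "Business", "Climate"]
gold  = ["Politics", "Health", "Climate", "Business", "Climate"]
print(top1_accuracy(preds, gold))  # 0.8
```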
### Results
The model fits the training data well (85% accuracy) but generalizes less strongly (60% test accuracy), a gap that suggests some overfitting. Performance is best on articles with clear, single-topic category signals.
---
## Environmental Impact
- **Hardware Type:** Google Colab GPU (T4)
- **Hours used:** ~2.5
- **Cloud Provider:** Google
- **Compute Region:** US
- **Carbon Emitted:** Estimated ~0.2 kgCO2eq
---
## Technical Specifications
### Model Architecture
DistilBERT architecture fine-tuned for single-label text classification using a softmax output layer over 5 categories.
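The classification head produces one logit per category, and the softmax layer turns those logits into a probability distribution over the five classes. A minimal sketch, with illustrative logit values:

```python
# The classification head emits one logit per category; softmax converts
# them into a probability distribution. Logit values below are illustrative.
import math

CATEGORIES = ["Politics", "Business", "Health", "Science", "Climate"]

def softmax(logits: list[float]) -> list[float]:
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.1, 0.3, -0.5, 4.0, 0.1])
predicted = CATEGORIES[probs.index(max(probs))]
print(predicted)  # Science
```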
### Compute Infrastructure
- Google Colab Pro
- Python 3.10
- Hugging Face Transformers 4.x
- PyTorch backend
---
## Citation
**APA:**
Gulati, M. (2025). NewsSense AI: Fine-tuned LLM for News Classification. https://huggingface.co/mgulati3/news-classifier-model
**BibTeX:**
```bibtex
@misc{gulati2025newssense,
  author = {Gulati, Manan},
  title  = {NewsSense AI: Fine-tuned LLM for News Classification},
  year   = {2025},
  url    = {https://huggingface.co/mgulati3/news-classifier-model}
}
```
---
## Model Card Contact
For questions or collaborations: mgulati3@asu.edu