---
library_name: transformers
tags: [text-classification, llm, huggingface, nlp, news, fine-tuning, gradio]
---

# 📰 NewsSense AI: LLM News Classifier with Web Scraping & Fine-Tuning

A fine-tuned transformer model that classifies news articles into five categories: Politics, Business, Health, Science, and Climate. The training data was scraped from NPR using Decodo and parsed with BeautifulSoup.

---

## Model Details

### Model Description

This model is fine-tuned using Hugging Face Transformers on a custom dataset of 5,000 news articles scraped directly from [NPR](https://www.npr.org/). The goal is to classify real-world news into practical categories for use in filtering, organizing, and summarizing large-scale news streams.

- **Developed by:** Manan Gulati
- **Model type:** Transformer (text classification)
- **Language(s):** English
- **License:** MIT
- **Fine-tuned from model:** distilbert-base-uncased

### Model Sources

- **Repository:** https://github.com/mgulati3/Fine-Tune
- **Demo:** https://huggingface.co/spaces/mgulati3/news-classifier-ui
- **Model Hub:** https://huggingface.co/mgulati3/news-classifier-model

---

## Uses

### Direct Use
This model can be used to classify any English-language news article or paragraph into one of five categories. It's useful for content filtering, feed curation, and auto-tagging of articles.

### Out-of-Scope Use
- Not suitable for multi-label classification.
- Not recommended for non-news or informal text.
- May not perform well on non-English content.

---

## Bias, Risks, and Limitations

- The model is trained only on NPR articles, which may carry source-specific bias.
- Categories are limited to five; nuanced topics may not be accurately captured.
- Misclassifications may occur for ambiguous or mixed-topic content.

### Recommendations
Use prediction confidence scores to interpret results. Consider human review for sensitive applications.
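The thresholding suggested above can be sketched with plain Python. The logits, labels-to-index order, and the 0.6 cutoff below are illustrative assumptions, not values from this model card; in practice the pipeline can return per-label scores directly (e.g. with `top_k=None`).

```python
import math

# Category order here is an assumption for illustration; the model's
# actual id-to-label mapping lives in its config on the Hub.
LABELS = ["Politics", "Business", "Health", "Science", "Climate"]

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_review(logits, threshold=0.6):
    """Return (label, confidence, needs_review) for one prediction.

    `threshold` is an illustrative cutoff, not a value from the card:
    low-confidence predictions are flagged for human review.
    """
    probs = softmax(logits)
    conf = max(probs)
    label = LABELS[probs.index(conf)]
    return label, conf, conf < threshold

# Hypothetical logits for one article: confident "Politics" prediction.
label, conf, review = classify_with_review([3.1, 0.2, -0.5, 0.4, -1.0])
print(label, round(conf, 3), review)
```

A near-uniform score distribution across the five labels would fall below the cutoff and be routed to a human reviewer.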

---

## How to Get Started

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")

# Returns the top predicted category and its confidence score
classifier("NASA's new moon mission will use AI to optimize fuel consumption.")
```

---

## Training Details

### Training Data
Scraped 5,000 articles from NPR using Decodo (with proxy rotation and JS rendering). Articles were cleaned and labeled across five categories using Python and pandas.
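The cleaning and labeling steps might look like the sketch below. These helpers are hypothetical: the actual pipeline used pandas and may have differed, and the section-slug-to-label mapping is an assumption for illustration.

```python
import html
import re

# Assumed mapping from NPR section slugs to the five categories.
SECTION_TO_LABEL = {
    "politics": "Politics",
    "business": "Business",
    "health": "Health",
    "science": "Science",
    "climate": "Climate",
}

def clean_text(raw):
    """Decode HTML entities, strip leftover tags, collapse whitespace."""
    text = html.unescape(raw)
    text = re.sub(r"<[^>]+>", " ", text)      # strip residual HTML tags
    return re.sub(r"\s+", " ", text).strip()  # collapse scraping whitespace

def label_from_url(url):
    """Derive a category label from an NPR section URL, if recognized."""
    match = re.search(r"npr\.org/sections/([a-z-]+)/", url)
    return SECTION_TO_LABEL.get(match.group(1)) if match else None
```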

### Training Procedure

- Tokenizer: `distilbert-base-uncased` tokenizer (matching the base model)
- Preprocessing: Lowercasing, truncation, padding
- Epochs: 4
- Optimizer: AdamW
- Batch size: 16
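A minimal sketch of training arguments matching the hyperparameters above; `output_dir` and `learning_rate` are assumptions not stated in this card, and the Hugging Face `Trainer` uses AdamW by default.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="news-classifier-model",  # assumed output path
    num_train_epochs=4,                  # Epochs: 4
    per_device_train_batch_size=16,      # Batch size: 16
    per_device_eval_batch_size=16,
    learning_rate=2e-5,                  # assumed; not stated in the card
)
```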

---

## Evaluation

### Testing Data
20% of the dataset was reserved for testing via a random stratified split, so each category appears in the same proportion in the train and test sets.
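The 80/20 stratified split described above can be sketched with the standard library alone; the original project may have used e.g. scikit-learn's `train_test_split` with `stratify=`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split example indices into train/test, preserving per-class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                          # randomize within class
        n_test = round(len(indices) * test_fraction)  # 20% of each class
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```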

### Metrics
- Accuracy (Train): 85%
- Accuracy (Test): 60%
- Metric: Accuracy (single-label, top-1)

### Results
Training accuracy (85%) substantially exceeds test accuracy (60%), indicating overfitting to the training set; the model is most reliable on articles with clearly distinguishable category patterns.

---

## Environmental Impact

- **Hardware Type:** Google Colab GPU (T4)
- **Hours used:** ~2.5
- **Cloud Provider:** Google
- **Compute Region:** US
- **Carbon Emitted:** Estimated ~0.2 kgCO2eq

---

## Technical Specifications

### Model Architecture
DistilBERT architecture fine-tuned for single-label text classification using a softmax output layer over 5 categories.

### Compute Infrastructure
- Google Colab Pro
- Python 3.10
- Hugging Face Transformers 4.x
- PyTorch backend

---

## Citation

**APA:**

Gulati, M. (2025). NewsSense AI: Fine-tuned LLM for News Classification. https://huggingface.co/mgulati3/news-classifier-model

**BibTeX:**

```bibtex
@misc{gulati2025newssense,
  author = {Gulati, Manan},
  title = {NewsSense AI: Fine-tuned LLM for News Classification},
  year = {2025},
  url = {https://huggingface.co/mgulati3/news-classifier-model}
}
```

---

## Model Card Contact

For questions or collaborations: mgulati3@asu.edu