	---
	library_name: transformers
	tags: [text-classification, llm, huggingface, nlp, news, fine-tuning, gradio]
	---

	# 📰 NewsSense AI: LLM News Classifier with Web Scraping & Fine-Tuning

	A fine-tuned transformer-based model that classifies news articles into five functional categories: Politics, Business, Health, Science, and Climate. The dataset was scraped from NPR using Decodo and processed with BeautifulSoup.

	---

	## Model Details

	### Model Description

	This model is fine-tuned using Hugging Face Transformers on a custom dataset of 5,000 news articles scraped directly from [NPR](https://www.npr.org/). The goal is to classify real-world news into practical categories for use in filtering, organizing, and summarizing large-scale news streams.

	- Developed by: Manan Gulati
	- Model type: Transformer (text classification)
	- Language(s): English
	- License: MIT
	- Fine-tuned from model: distilbert-base-uncased

	### Model Sources

	- Repository: https://github.com/mgulati3/Fine-Tune
	- Demo: https://huggingface.co/spaces/mgulati3/news-classifier-ui
	- Model Hub: https://huggingface.co/mgulati3/news-classifier-model

	---

	## Uses

	### Direct Use
	This model can be used to classify any English-language news article or paragraph into one of five categories. It's useful for content filtering, feed curation, and auto-tagging of articles.

	### Out-of-Scope Use
	- Not suitable for multi-label classification.
	- Not recommended for non-news or informal text.
	- May not perform well on non-English content.

	---

	## Bias, Risks, and Limitations

	- The model is trained only on NPR articles, which may carry source-specific bias.
	- Categories are limited to five; nuanced topics may not be accurately captured.
	- Misclassifications may occur for ambiguous or mixed-topic content.

	### Recommendations
	Use prediction confidence scores to interpret results. Consider human review for sensitive applications.
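
One way to act on this recommendation is to route low-confidence predictions to a human reviewer. The snippet below is a minimal sketch, not part of the released model: the `REVIEW_THRESHOLD` value and the helper names `needs_human_review` and `triage` are hypothetical, and the sample predictions are made-up values in the list-of-dicts shape (`label`, `score`) that the `text-classification` pipeline returns.

```python
from typing import List

# Hypothetical cutoff; tune it on held-out data for your application.
REVIEW_THRESHOLD = 0.70


def needs_human_review(prediction: dict, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Return True when the top-1 score falls below the review threshold."""
    return prediction["score"] < threshold


def triage(predictions: List[dict], threshold: float = REVIEW_THRESHOLD) -> List[dict]:
    """Attach a `review` flag to each pipeline-style prediction."""
    return [{**p, "review": needs_human_review(p, threshold)} for p in predictions]


# Made-up predictions in the pipeline's output shape (label names are illustrative).
sample = [
    {"label": "Science", "score": 0.93},
    {"label": "Business", "score": 0.41},
]
triaged = triage(sample)
for item in triaged:
    print(item)
```

A single global threshold is the simplest choice; per-category thresholds may work better if some categories are confused more often than others.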

	---

	## How to Get Started

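	```python
	from transformers import pipeline

	# Load the fine-tuned classifier from the Hugging Face Hub.
	classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")

	# Returns a list with one dict per input, holding the top label and its score.
	result = classifier("NASA's new moon mission will use AI to optimize fuel consumption.")
	print(result)
	```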

	---

	## Training Details

	### Training Data
	The dataset consists of 5,000 articles scraped from NPR using Decodo (with proxy rotation and JavaScript rendering). The articles were cleaned and labeled across the five categories using Python and pandas.

	### Training Procedure

	- Tokenizer: `distilbert-base-uncased` tokenizer (WordPiece), matching the base model
	- Preprocessing: Lowercasing, truncation, padding
	- Epochs: 4
	- Optimizer: AdamW
	- Batch size: 16
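
The hyperparameters above imply a rough training length, worked out in the short sketch below. The article count, test fraction, batch size, and epoch count come from this card. Two things are assumptions: that every article outside the test split is used for training, and that there is one optimizer step per batch (no gradient accumulation).

```python
import math

# Figures taken from this card.
NUM_ARTICLES = 5_000
TEST_FRACTION = 0.20
BATCH_SIZE = 16
EPOCHS = 4

# Assumption: everything outside the test split is used for training.
train_size = round(NUM_ARTICLES * (1 - TEST_FRACTION))

# Assumption: one optimizer step per batch (no gradient accumulation).
steps_per_epoch = math.ceil(train_size / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(train_size, steps_per_epoch, total_steps)  # 4000 250 1000
```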

	---

	## Evaluation

	### Testing Data
	20% of the dataset was reserved for testing, using a random stratified split.
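
A stratified split keeps each category's share the same in the train and test sets. The following is a dependency-free sketch of the idea, not the code used to build this model: the helper `stratified_split`, the seed, and the toy labels are hypothetical, and the real class balance of the dataset is not stated in this card.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)

    # Group example indices by their label.
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx: List[int] = []
    test_idx: List[int] = []
    for indices in by_class.values():
        # Shuffle within each class, then carve off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10 examples per category (an assumption for illustration only).
categories = ["Politics", "Business", "Health", "Science", "Climate"]
labels = [c for c in categories for _ in range(10)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 40 10
```

In practice, scikit-learn's `train_test_split` with the `stratify` argument does the same job; whether it was used here is not stated.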

	### Metrics
	- Accuracy (Train): 85%
	- Accuracy (Test): 60%
	- Metric: Accuracy (single-label, top-1)
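
Single-label top-1 accuracy is the fraction of examples whose highest-scoring predicted category equals the true category. A minimal sketch of the computation follows; the function name `top1_accuracy` is hypothetical and the labels are made up for illustration, not drawn from the test set.

```python
from typing import Sequence


def top1_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of examples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration: 4 of 5 predictions match.
y_true = ["Politics", "Health", "Science", "Climate", "Business"]
y_pred = ["Politics", "Health", "Climate", "Climate", "Business"]
accuracy = top1_accuracy(y_true, y_pred)
print(f"{accuracy:.0%}")  # 80%
```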

	### Results
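	The model reaches 60% top-1 accuracy on the held-out test set, compared with 85% on the training set. The 25-point gap suggests overfitting, so expect the best results on news content with clearly distinguishable category patterns.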

	---

	## Environmental Impact

	- Hardware Type: Google Colab GPU (T4)
	- Hours used: ~2.5
	- Cloud Provider: Google
	- Compute Region: US
	- Carbon Emitted: Estimated ~0.2 kgCO2eq

	---

	## Technical Specifications

	### Model Architecture
	The model uses the DistilBERT architecture, fine-tuned for single-label text classification with a softmax output layer over the five categories.
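
To illustrate what the softmax output layer does, the sketch below turns five raw scores (logits) into probabilities and picks the top category. It is a plain-Python illustration under stated assumptions: the logits are made up, and the order of `CATEGORIES` is assumed rather than read from the model's configuration.

```python
import math
from typing import List, Sequence

# Assumption: this ordering is illustrative, not the model's actual id-to-label map.
CATEGORIES = ["Politics", "Business", "Health", "Science", "Climate"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits standing in for the classification head's five outputs.
logits = [0.3, -1.2, 0.1, 2.4, 0.9]
probs = softmax(logits)

# Top-1 prediction: the index with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(CATEGORIES[best], round(probs[best], 3))
```

The probability of the winning category is the confidence score reported alongside the predicted label.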

	### Compute Infrastructure
	- Google Colab Pro
	- Python 3.10
	- Hugging Face Transformers 4.x
	- PyTorch backend

	---

	## Citation

	APA:

	Gulati, M. (2025). NewsSense AI: Fine-tuned LLM for News Classification. https://huggingface.co/mgulati3/news-classifier-model

	BibTeX:

	@misc{gulati2025newssense,
	author = {Gulati, Manan},
	title = {NewsSense AI: Fine-tuned LLM for News Classification},
	year = {2025},
	url = {https://huggingface.co/mgulati3/news-classifier-model}
	}

	---

	## Model Card Contact

	For questions or collaborations: mgulati3@asu.edu