---
library_name: transformers
tags: [text-classification, llm, huggingface, nlp, news, fine-tuning, gradio]
---

# 📰 NewsSense AI: LLM News Classifier with Web Scraping & Fine-Tuning

A fine-tuned transformer-based model that classifies news articles into five functional categories: Politics, Business, Health, Science, and Climate. The dataset was scraped from NPR using Decodo and processed with BeautifulSoup.

---

## Model Details

### Model Description

This model is fine-tuned using Hugging Face Transformers on a custom dataset of 5,000 news articles scraped directly from [NPR](https://www.npr.org/). The goal is to classify real-world news into practical categories for use in filtering, organizing, and summarizing large-scale news streams.

- **Developed by:** Manan Gulati
- **Model type:** Transformer (text classification)
- **Language(s):** English
- **License:** MIT
- **Fine-tuned from model:** distilbert-base-uncased

### Model Sources

- **Repository:** https://github.com/mgulati3/Fine-Tune
- **Demo:** https://huggingface.co/spaces/mgulati3/news-classifier-ui
- **Model Hub:** https://huggingface.co/mgulati3/news-classifier-model

---

## Uses

### Direct Use
This model can be used to classify any English-language news article or paragraph into one of five categories. It's useful for content filtering, feed curation, and auto-tagging of articles.

### Out-of-Scope Use
- Not suitable for multi-label classification.
- Not recommended for non-news or informal text.
- May not perform well on non-English content.

---

## Bias, Risks, and Limitations

- The model is trained only on NPR articles, which may carry source-specific bias.
- Categories are limited to five; nuanced topics may not be accurately captured.
- Misclassifications may occur for ambiguous or mixed-topic content.

### Recommendations
Use prediction confidence scores to interpret results. Consider human review for sensitive applications.

---

## How to Get Started

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")
# The pipeline returns a list with one dict holding the top label and its score.
print(classifier("NASA's new moon mission will use AI to optimize fuel consumption."))
```

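The Recommendations section suggests reading predictions through their confidence scores and sending sensitive cases to a human. A minimal sketch of one way to do that is shown here. It assumes the pipeline is called with `top_k=None` so that it returns a score for every label; the helpers `needs_review` and `classify_with_review` and the `0.6` threshold are hypothetical choices for illustration and are not shipped with the model.

```python
def needs_review(scores, threshold=0.6):
    """Return (top_label, top_score, flag) for one article's label scores.

    `scores` is a list of {"label": str, "score": float} dicts.
    `threshold` is a hypothetical cutoff; tune it on held-out data.
    """
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"], best["score"] < threshold


def classify_with_review(text, threshold=0.6):
    """Classify `text` and flag low-confidence predictions for human review."""
    # Imported lazily so needs_review can be used without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")
    # top_k=None asks for the scores of all labels, not only the best one.
    scores = classifier(text, top_k=None)
    # Defensive unwrap in case the result comes back as a nested list.
    if scores and isinstance(scores[0], list):
        scores = scores[0]
    return needs_review(scores, threshold)
```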
---

## Training Details

### Training Data
The dataset consists of 5,000 articles scraped from NPR using Decodo (with proxy rotation and JavaScript rendering). Articles were cleaned and labeled across the five categories using Python and pandas.

### Training Procedure

- Tokenizer: `distilbert-base-uncased` WordPiece tokenizer, matching the base model
- Preprocessing: lowercasing, truncation, padding
- Epochs: 4
- Optimizer: AdamW
- Batch size: 16

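The epoch count, batch size, and optimizer listed above could be wired into the Hugging Face `Trainer` roughly as follows. This is a sketch under stated assumptions, not the author's training script: the `build_trainer` helper, the output directory name, and the shape of the datasets are hypothetical, and all settings not named in this card are left at library defaults.

```python
# Values stated in this card; everything else below is assumed.
HYPERPARAMS = {
    "base_model": "distilbert-base-uncased",
    "num_labels": 5,
    "epochs": 4,
    "batch_size": 16,
}


def build_trainer(train_dataset, eval_dataset, output_dir="news-classifier-model"):
    """Build a Trainer for already tokenized datasets (hypothetical helper).

    Both datasets are assumed to be truncated and padded and to carry a
    `labels` column with integer class ids.
    """
    # Imported lazily so the constants above can be read without transformers.
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        HYPERPARAMS["base_model"], num_labels=HYPERPARAMS["num_labels"]
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=HYPERPARAMS["epochs"],
        per_device_train_batch_size=HYPERPARAMS["batch_size"],
        per_device_eval_batch_size=HYPERPARAMS["batch_size"],
    )
    # Trainer optimizes with AdamW by default, so no optimizer is passed here.
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
```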
---

## Evaluation

### Testing Data
20% of the dataset was held out for testing, using a random stratified split so that each category keeps the same share in the training and test sets.

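To illustrate what a stratified 80/20 split does, here is a small pure-Python sketch. The function name `stratified_split`, the seed, and the toy labels are assumptions made for the example; the actual split may well have been produced with a library routine such as scikit-learn's `train_test_split` with its `stratify` argument.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), splitting each label at the same ratio."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy example: 10 articles per category, so 2 of each land in the test set.
labels = ["Politics", "Business", "Health", "Science", "Climate"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 40 10
```

Splitting per category, rather than over the whole dataset at once, keeps a rare category from ending up missing in the test set.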
### Metrics
- Accuracy (Train): 85%
- Accuracy (Test): 60%
- Metric: Accuracy (single-label, top-1)

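Single-label, top-1 accuracy is the fraction of articles whose highest-scoring predicted category equals the true category. A minimal sketch follows; the function name `top1_accuracy` and the example labels are made up for illustration and are not taken from the evaluation run.

```python
def top1_accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up example: 3 of 5 predictions match the true label.
predicted = ["Science", "Health", "Politics", "Business", "Climate"]
actual = ["Science", "Health", "Business", "Business", "Health"]
print(top1_accuracy(predicted, actual))  # 0.6
```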
### Results
The model performs best on news content whose category is clearly distinguishable. The 25-point gap between training accuracy (85%) and test accuracy (60%) suggests overfitting, so expect weaker performance on unseen articles.

---

## Environmental Impact

- **Hardware Type:** Google Colab GPU (T4)
- **Hours used:** ~2.5
- **Cloud Provider:** Google
- **Compute Region:** US
- **Carbon Emitted:** ~0.2 kg CO2eq (estimated)

---

## Technical Specifications

### Model Architecture
DistilBERT architecture fine-tuned for single-label text classification using a softmax output layer over 5 categories.

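As a sketch of what the softmax output layer does at prediction time, the snippet below turns five raw scores (logits) into probabilities and picks the most likely category. The logits are invented, and the order of `CATEGORIES` is an assumption; the real index-to-label mapping is stored in the model's `config.id2label`.

```python
import math

# Assumed label order, for illustration only.
CATEGORIES = ["Politics", "Business", "Health", "Science", "Climate"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return the most probable category and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], probs[best]


label, prob = predict([0.2, -1.0, 0.5, 3.1, 0.0])  # made-up logits
print(label)  # Science
```

Because the five probabilities always sum to 1, the model commits to exactly one category per article, which is why it is not suited to multi-label use.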
### Compute Infrastructure
- Google Colab Pro
- Python 3.10
- Hugging Face Transformers 4.x
- PyTorch backend

---

## Citation

**APA:**

Gulati, M. (2025). NewsSense AI: Fine-tuned LLM for News Classification. https://huggingface.co/mgulati3/news-classifier-model

**BibTeX:**

@misc{gulati2025newssense,
  author = {Gulati, Manan},
  title = {NewsSense AI: Fine-tuned LLM for News Classification},
  year = {2025},
  url = {https://huggingface.co/mgulati3/news-classifier-model}
}

---

## Model Card Contact

For questions or collaborations: mgulati3@asu.edu