---
library_name: transformers
tags: [text-classification, llm, huggingface, nlp, news, fine-tuning, gradio]
---

# 📰 NewsSense AI: LLM News Classifier with Web Scraping & Fine-Tuning

A fine-tuned transformer-based model that classifies news articles into five categories: Politics, Business, Health, Science, and Climate. The dataset was scraped from NPR using Decodo and processed with BeautifulSoup.

---

## Model Details

### Model Description

This model is fine-tuned using Hugging Face Transformers on a custom dataset of 5,000 news articles scraped directly from [NPR](https://www.npr.org/). The goal is to classify real-world news into practical categories for filtering, organizing, and summarizing large-scale news streams.

- **Developed by:** Manan Gulati
- **Model type:** Transformer (text classification)
- **Language(s):** English
- **License:** MIT
- **Fine-tuned from model:** distilbert-base-uncased

### Model Sources

- **Repository:** https://github.com/mgulati3/Fine-Tune
- **Demo:** https://huggingface.co/spaces/mgulati3/news-classifier-ui
- **Model Hub:** https://huggingface.co/mgulati3/news-classifier-model

---

## Uses

### Direct Use

The model classifies any English-language news article or paragraph into one of five categories. It is useful for content filtering, feed curation, and auto-tagging of articles.

### Out-of-Scope Use

- Not suitable for multi-label classification.
- Not recommended for non-news or informal text.
- May not perform well on non-English content.

---

## Bias, Risks, and Limitations

- The model is trained only on NPR articles, which may carry source-specific bias.
- Categories are limited to five; nuanced topics may not be accurately captured.
- Misclassifications may occur for ambiguous or mixed-topic content.

### Recommendations

Use prediction confidence scores to interpret results, and consider human review for sensitive applications.
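The confidence-based review recommendation above can be sketched as a simple routing rule. This is a minimal illustration, not part of the released model: it assumes the standard `transformers` pipeline output format (a dict with `"label"` and `"score"` keys), and the threshold value is purely illustrative and should be tuned on held-out data.

```python
# Route predictions by confidence: auto-tag confident ones,
# flag low-confidence ones for human review.
# Assumes the usual pipeline output format: {"label": ..., "score": ...}.

CONFIDENCE_THRESHOLD = 0.80  # illustrative; tune on a validation set

def route_prediction(prediction: dict) -> dict:
    """Attach a review decision to a single classifier prediction."""
    confident = prediction["score"] >= CONFIDENCE_THRESHOLD
    return {
        "label": prediction["label"],
        "score": prediction["score"],
        "action": "auto-tag" if confident else "human-review",
    }

# Example with mock predictions (real ones come from the pipeline):
predictions = [
    {"label": "Science", "score": 0.97},
    {"label": "Politics", "score": 0.52},
]
routed = [route_prediction(p) for p in predictions]
print(routed[0]["action"])  # auto-tag
print(routed[1]["action"])  # human-review
```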
---

## How to Get Started

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="mgulati3/news-classifier-model")
classifier("NASA's new moon mission will use AI to optimize fuel consumption.")
```

---

## Training Details

### Training Data

5,000 articles were scraped from NPR using Decodo (with proxy rotation and JS rendering), then cleaned and labeled across five categories using Python and pandas.

### Training Procedure

- Tokenizer: distilbert-base-uncased tokenizer (WordPiece, uncased)
- Preprocessing: lowercasing, truncation, padding
- Epochs: 4
- Optimizer: AdamW
- Batch size: 16

---

## Evaluation

### Testing Data

20% of the dataset was reserved for testing, using a random stratified split.

### Metrics

- Accuracy (train): 85%
- Accuracy (test): 60%
- Metric: accuracy (single-label, top-1)

### Results

The model performs best on domain-specific news content with clearly distinguishable category patterns. The gap between training (85%) and test (60%) accuracy indicates overfitting, so the test-set figure should be treated as the realistic performance estimate.

---

## Environmental Impact

- **Hardware Type:** Google Colab GPU (T4)
- **Hours used:** ~2.5
- **Cloud Provider:** Google
- **Compute Region:** US
- **Carbon Emitted:** ~0.2 kgCO2eq (estimated)

---

## Technical Specifications

### Model Architecture

DistilBERT fine-tuned for single-label text classification with a softmax output layer over the 5 categories.

### Compute Infrastructure

- Google Colab Pro
- Python 3.10
- Hugging Face Transformers 4.x
- PyTorch backend

---

## Citation

**APA:**

Gulati, M. (2025). *NewsSense AI: Fine-tuned LLM for News Classification*. https://huggingface.co/mgulati3/news-classifier-model

**BibTeX:**

```bibtex
@misc{gulati2025newssense,
  author = {Gulati, Manan},
  title  = {NewsSense AI: Fine-tuned LLM for News Classification},
  year   = {2025},
  url    = {https://huggingface.co/mgulati3/news-classifier-model}
}
```

---

## Model Card Contact

For questions or collaborations: mgulati3@asu.edu