How to use BluSerK/news-classifier with spaCy:

```shell
pip install https://huggingface.co/BluSerK/news-classifier/resolve/main/news-classifier-any-py3-none-any.whl
```

```python
# Using spacy.load().
import spacy
nlp = spacy.load("news-classifier")

# Importing as a module.
# A hyphen is not valid in a Python module name; news_classifier is assumed here.
import news_classifier
nlp = news_classifier.load()
```
BBC News Classification Pipeline (Production-Ready)
This repository hosts a production-optimized NLP pipeline that classifies news articles into five distinct categories: Business, Entertainment, Politics, Tech, and Sports.
Many modeling workflows keep text preprocessing separate from inference. This model instead packages its custom preprocessing inside a single serialized pipeline, so the same cleaning steps run at training time and at inference time, which guards against training-serving skew.
Model Performance & Accuracy
The model was evaluated on a 20% holdout test set, drawn with a stratified split so that each of the five news categories keeps the same proportion in the test data as in the full dataset.
- Macro-F1 Score: 0.96
- Evaluation Metrics: Macro-Precision and Macro-Recall are closely balanced, which indicates that performance is consistent across classes rather than driven by a few of them.
- Efficiency: Reaches this score with n-gram count features and a Naive Bayes classifier, which require far less inference compute than transformer models such as BERT.
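The evaluation protocol described above can be sketched with scikit-learn's `train_test_split` and its metric functions. This is a minimal illustration rather than the author's evaluation script: the toy corpus, the `random_state`, and the hand-made predictions are all assumptions, and the numbers it prints are unrelated to the reported 0.96.

```python
from collections import Counter

from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 10 placeholder documents per category (50 total).
categories = ["Business", "Entertainment", "Politics", "Tech", "Sports"]
labels = [c for c in categories for _ in range(10)]
texts = [f"placeholder article {i}" for i in range(len(labels))]

# Hold out 20% of the data. stratify=labels keeps each category's share
# the same in the training and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
test_counts = Counter(y_test)

# Stand-in predictions (assumption): copy the true labels and flip one,
# so the metrics have something to measure without a trained model.
y_pred = list(y_test)
y_pred[0] = next(c for c in categories if c != y_test[0])

# average="macro" computes the metric per class and takes the unweighted
# mean, so every category counts equally regardless of its size.
macro_precision = precision_score(y_test, y_pred, average="macro")
macro_recall = recall_score(y_test, y_pred, average="macro")
macro_f1 = f1_score(y_test, y_pred, average="macro")

print(dict(test_counts))
print(f"precision={macro_precision:.3f} recall={macro_recall:.3f} f1={macro_f1:.3f}")
```

Because the macro average is unweighted, a single misclassified article lowers the score by the same amount whichever category it belongs to.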
Model Architecture & Design
The engineering design wraps all dependencies into a unified scikit-learn Pipeline:
- Custom NLP Transformer (TextCleaner): Inherits from BaseEstimator and TransformerMixin. It applies regex-based cleaning (removal of HTML tags, URLs, email addresses, and line breaks), followed by a deterministic spaCy (en_core_web_sm) tokenization pass that strips stop words and punctuation and reduces tokens to their base forms (lemmatization).
- Feature Extraction: A CountVectorizer configured to extract both unigrams and bigrams (ngram_range=(1,2)), which preserves local multi-word features.
- Classifier Layer: A MultinomialNB (Multinomial Naive Bayes) estimator, which is suited to discrete count-based term frequencies.
[Input Text] --> [TextCleaner (spaCy)] --> [CountVectorizer (1,2 Grams)] --> [MultinomialNB] --> [Output Class]
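The flow above could be assembled roughly as follows. This is a sketch under stated assumptions, not the repository's actual code: the regex patterns, the `spacy_model` parameter, the step names, and the toy training sentences are all hypothetical. The spaCy pass is optional here so that the example runs without a downloaded language model; pass `spacy_model="en_core_web_sm"` to enable the stop-word removal and lemmatization the model card describes.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class TextCleaner(BaseEstimator, TransformerMixin):
    """Regex cleaning plus an optional spaCy lemmatization pass (sketch)."""

    def __init__(self, spacy_model=None):
        # Hypothetical parameter: a spaCy model name, or None to skip spaCy.
        self.spacy_model = spacy_model

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def _regex_clean(self, text):
        text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
        text = re.sub(r"\S+@\S+", " ", text)                # email addresses
        return re.sub(r"\s+", " ", text).strip()            # line breaks, extra spaces

    def transform(self, X):
        cleaned = [self._regex_clean(doc) for doc in X]
        if self.spacy_model is None:
            return cleaned
        import spacy  # imported lazily so the regex-only path needs no spaCy

        # Loaded on every call for brevity; real code would cache the model.
        nlp = spacy.load(self.spacy_model)
        return [
            " ".join(
                tok.lemma_.lower()
                for tok in nlp(doc)
                if not (tok.is_stop or tok.is_punct or tok.is_space)
            )
            for doc in cleaned
        ]


pipeline = Pipeline([
    ("cleaner", TextCleaner()),
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ("classifier", MultinomialNB()),
])

# Hypothetical toy training data, one sentence per category.
train_texts = [
    "Shares rallied as the bank reported record profits",
    "The film won best picture at the awards ceremony",
    "The minister defended the new election policy in parliament",
    "The new smartphone ships with a faster processor and software update",
    "The striker scored twice as the team won the cup final",
]
train_labels = ["Business", "Entertainment", "Politics", "Tech", "Sports"]
pipeline.fit(train_texts, train_labels)

# Raw, uncleaned text goes straight in; cleaning happens inside the pipeline.
prediction = pipeline.predict(["<p>The team won the cup final</p>"])[0]
print(prediction)
```

Since the cleaner is a pipeline step, `fit` and `predict` both route text through the same cleaning code, which is the property the model card relies on to avoid training-serving skew.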