News Topic Classifier
Model Description
A news headline classifier that categorizes text into three topics: Technology, Sports, and Politics. Built using a linear probing approach with frozen BERT embeddings, achieving 95.24% test accuracy.
Model Details
- Base Model: bert-base-uncased
- Model Type: BERT with Linear Classification Head
- Training Approach: Linear Probing (frozen BERT encoder + trainable classifier)
- Language: English
- License: Apache 2.0
- Parameters: ~110M (frozen) + 2.3K (trainable)
Architecture
Input → Tokenizer → Frozen BERT Encoder → [CLS] Token → Linear Layer → 3 Classes
- Frozen Layers: All 12 BERT transformer layers
- Trainable Layers: Single linear classifier (768 → 3)
- Max Sequence Length: 128 tokens
- Device: CPU
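The pipeline above can be sketched as a small PyTorch module. This is a hypothetical reconstruction for illustration; the actual `BERTLinearClassifier` in `model.py` may differ in details.

```python
import torch.nn as nn
from transformers import AutoModel

class BERTLinearClassifier(nn.Module):
    """Frozen BERT encoder + trainable linear head (linear probing)."""

    def __init__(self, model_name='bert-base-uncased', num_labels=3):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        # Freeze all 12 transformer layers: only the head will receive gradients
        for param in self.bert.parameters():
            param.requires_grad = False
        # 768-dimensional [CLS] embedding → 3 class logits
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0]  # [CLS] token embedding
        return self.classifier(cls_embedding)
```

Only the `classifier` layer appears in the optimizer, so the 110M encoder parameters stay fixed throughout training.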
Performance
Best Results
- Test Accuracy: 95.24%
- Best Epoch: 5/10
Training Progress
| Epoch | Train Loss | Train Acc | Test Loss | Test Acc |
|---|---|---|---|---|
| 1 | 1.0584 | 45.24% | 0.8724 | 80.95% |
| 2 | 0.8449 | 71.43% | 0.7716 | 90.48% |
| 3 | 0.6893 | 90.48% | 0.6514 | 80.95% |
| 4 | 0.5194 | 92.86% | 0.5837 | 90.48% |
| 5 | 0.4683 | 88.10% | 0.5102 | 95.24% ✓ |
| 6 | 0.3987 | 95.24% | 0.4573 | 95.24% |
| 7 | 0.3671 | 95.24% | 0.4358 | 95.24% |
| 8 | 0.3386 | 97.62% | 0.3938 | 90.48% |
| 9 | 0.2763 | 97.62% | 0.3787 | 85.71% |
| 10 | 0.2724 | 94.05% | 0.3713 | 90.48% |
Classification Report (Final Evaluation - Epoch 10)
| | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Technology | 0.78 | 1.00 | 0.88 | 7 |
| Sports | 1.00 | 0.67 | 0.80 | 6 |
| Politics | 1.00 | 1.00 | 1.00 | 8 |
| Accuracy | | | 0.90 | 21 |
| Macro avg | 0.93 | 0.89 | 0.89 | 21 |
| Weighted avg | 0.93 | 0.90 | 0.90 | 21 |
Class-wise Performance:
- Technology: 78% precision, 100% recall, 88% F1-score
- Sports: 100% precision, 67% recall, 80% F1-score
- Politics: Perfect performance (100% across all metrics)
Training Data
- Dataset Size: 105 samples
- Split: 84 train / 21 test (80/20)
- Class Distribution: Balanced (35 samples per class)
- Classes:
- 0: Technology
- 1: Sports
- 2: Politics
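The 84/21 split above can be reproduced with a stratified split so that each class keeps its balance in both partitions. A sketch using scikit-learn; the texts here are placeholders standing in for the real headlines.

```python
from sklearn.model_selection import train_test_split

# 105 samples, 35 per class (labels 0=Technology, 1=Sports, 2=Politics)
texts = [f"headline {i}" for i in range(105)]   # placeholder headline texts
labels = [i // 35 for i in range(105)]          # 35 samples per class

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels,
    test_size=0.2,       # 80/20 split → 84 train / 21 test
    stratify=labels,     # keep 7 test samples per class
    random_state=42,
)
```

With `stratify=labels`, the 21-sample test set contains exactly 7 headlines from each class, matching the support column in the classification report.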
Training Details
Hyperparameters
- Learning Rate: 2e-3 (higher than a typical fine-tuning rate; safe because only the classifier head is trained)
- Optimizer: AdamW (classifier parameters only)
- Loss Function: CrossEntropyLoss
- Batch Size: 16
- Epochs: 10
- Max Length: 128 tokens
- Training Time: ~2.3 seconds per batch (CPU)
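The optimizer setup implied by the hyperparameters above, AdamW over the classifier parameters only with CrossEntropyLoss, might look like the following sketch. The standalone `nn.Linear(768, 3)` and the dummy batch are illustrative stand-ins for the model's head and real [CLS] embeddings.

```python
import torch
import torch.nn as nn

# Stand-in for the model's trainable head (768 → 3)
classifier = nn.Linear(768, 3)

# Only the head's parameters go into the optimizer — the frozen
# encoder contributes no gradients and is simply left out.
optimizer = torch.optim.AdamW(classifier.parameters(), lr=2e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on dummy data
features = torch.randn(16, 768)        # batch of 16 [CLS] embeddings
targets = torch.randint(0, 3, (16,))   # class ids 0..2
loss = criterion(classifier(features), targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```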
Training Strategy
Linear probing was chosen to:
- Leverage pre-trained BERT knowledge
- Reduce training time and compute requirements
- Prevent overfitting on small dataset (105 samples)
- Train only 2.3K parameters instead of 110M
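The 2.3K figure follows directly from the head's shape: a 768 → 3 linear layer has 768 × 3 weights plus 3 biases, i.e. 2,307 trainable parameters.

```python
import torch.nn as nn

head = nn.Linear(768, 3)
trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
# 768 * 3 weights + 3 biases = 2307 ≈ 2.3K
```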
Usage
Loading the Model
```python
import torch
from transformers import AutoTokenizer
from model import BERTLinearClassifier  # Your custom class

# Load model
model = BERTLinearClassifier(model_name='bert-base-uncased', num_labels=3)
checkpoint = torch.load('pytorch_model.pt', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
```
Making Predictions
```python
# Single prediction
headline = "Apple releases new MacBook Pro with M3 chip"

inputs = tokenizer(
    headline,
    return_tensors='pt',
    max_length=128,
    padding='max_length',
    truncation=True
)

with torch.no_grad():
    logits = model(inputs['input_ids'], inputs['attention_mask'])
    prediction = torch.argmax(logits, dim=1).item()
    probabilities = torch.softmax(logits, dim=1)

# Map to labels
id2label = {0: 'Technology', 1: 'Sports', 2: 'Politics'}
print(f"Predicted: {id2label[prediction]}")
# Index row 0 of the [1, 3] probability tensor, then the predicted class
print(f"Confidence: {probabilities[0, prediction].item():.2%}")
```

Output: `Predicted: Technology`, `Confidence: 94.5%`
Batch Predictions
```python
headlines = [
    "Google unveils new AI model",
    "Manchester United wins Premier League",
    "Senate passes infrastructure bill"
]

for headline in headlines:
    inputs = tokenizer(headline, return_tensors='pt', max_length=128,
                       padding='max_length', truncation=True)
    with torch.no_grad():
        logits = model(inputs['input_ids'], inputs['attention_mask'])
        pred = torch.argmax(logits, dim=1).item()
    print(f"{headline} → {id2label[pred]}")
```
Limitations
- Small Training Set: Only 105 samples; may not generalize to diverse news sources
- Short Text Only: Optimized for headlines, not full articles
- Three Categories: Limited domain coverage
- English Only: No multilingual support
- Sports Recall: Recall on sports headlines is only 67%, so some sports content may be misclassified
- CPU Training: Trained on CPU, so no GPU optimizations
Bias and Ethical Considerations
- Model may reflect biases in training data
- Limited to three broad categories; many news topics won't fit
- Should not be used for content moderation without human review
- Performance may vary on news from different time periods or regions
Intended Use
✅ Recommended Use Cases
- Educational demonstration of linear probing technique
- Portfolio project showcase
- Prototyping news classification pipelines
- Research on transfer learning with limited data
❌ Not Recommended For
- Production news classification systems (needs more data)
- Multi-label classification (politics + technology articles)
- Non-English content
- Long-form article classification
- High-stakes automated decision making
Future Improvements
- Expand dataset size (1000+ samples minimum)
- Add more categories (Business, Entertainment, Health, etc.)
- Fine-tune BERT layers for better performance
- Collect real-world news data from multiple sources
- Implement confidence thresholds for uncertain predictions
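The last improvement above could be implemented by abstaining whenever the top softmax probability falls below a cutoff. A minimal sketch; the 0.6 threshold is an arbitrary illustrative value, not a tuned one.

```python
import torch

id2label = {0: 'Technology', 1: 'Sports', 2: 'Politics'}

def classify_with_threshold(logits, threshold=0.6):
    """Return a label, or None when the model is not confident enough."""
    probs = torch.softmax(logits, dim=-1)
    confidence, pred = torch.max(probs, dim=-1)
    if confidence.item() < threshold:
        return None  # defer to a human reviewer or a fallback category
    return id2label[pred.item()]

# A peaked distribution passes; a near-uniform one is rejected
print(classify_with_threshold(torch.tensor([5.0, 0.1, 0.2])))  # Technology
print(classify_with_threshold(torch.tensor([1.0, 1.0, 1.1])))  # None
```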
Model Files
- `pytorch_model.pt`: Model checkpoint with state dict
- `config.json`: Model configuration and label mappings
- Tokenizer files: BERT tokenizer (from bert-base-uncased)
Acknowledgments
- Base Model: bert-base-uncased by Google Research
- Framework: PyTorch
- Transformers Library: Hugging Face Transformers
Contact
- Hugging Face: karthik-infobell25
Note: This model was trained as a learning project. For production use, consider models trained on larger, more diverse datasets.