---
license: mit
language:
- en
tags:
- text-classification
- distilbert
- advertisement-detection
- rss
- news
- binary-classification
pipeline_tag: text-classification
---

# DistilBERT RSS Advertisement Detection

A DistilBERT-based model for classifying RSS article titles as advertisements or legitimate news content.

## Model Description

This model is fine-tuned from `distilbert-base-uncased` for binary text classification. It distinguishes between two classes:

- **Advertisement**: Promotional content, deals, sales, sponsored content
- **News**: Legitimate news articles, editorial content, research findings

## Intended Use

- **Primary**: Filtering RSS feeds to separate advertisements from news
- **Secondary**: Content moderation, spam detection, content categorization
- **Research**: Text classification, advertisement detection studies

## Performance

- **Accuracy**: ~95%
- **F1 Score**: ~94%
- **Precision**: ~93%
- **Recall**: ~94%

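As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall. The sketch below plugs in the rounded values from this card; the helper name `f1_from` is illustrative and not part of the model's code.

```python
def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values reported in this card.
f1 = f1_from(0.93, 0.94)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.935"
```

Because the inputs are rounded, the result is only roughly comparable to the reported F1 score.
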
## Training Data

- **Feeds**: 75+ RSS feeds from major tech news outlets
- **Articles**: 1,600+ RSS articles
- **Labeled**: 1,000+ manually labeled examples
- **Outlets include**: TechCrunch, WIRED, The Verge, Ars Technica, OpenAI, Google AI, etc.

## Usage

```python
from transformers import pipeline

# Load the model
classifier = pipeline(
    "text-classification",
    model="SoroushXYZ/distilbert-rss-ad-detection",
)

# Classify examples
examples = [
    "Apple Announces New iPhone with Advanced AI Features",
    "50% OFF - Limited Time Offer on Premium Headphones!",
    "Scientists Discover New Method for Carbon Capture",
    "Buy Now! Get Free Shipping on All Electronics Today Only!",
]

for text in examples:
    # The pipeline returns a list with one dict per input: {"label", "score"}
    result = classifier(text)
    print(f"{text} -> {result[0]['label']} ({result[0]['score']:.3f})")
```

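For the primary use case of filtering a feed, the classifier output can be turned into a keep/drop decision. The following is a minimal sketch, not the author's implementation: `filter_news`, the `"advertisement"` label string, and the `0.5` threshold are assumptions, and `keyword_stub` is a stand-in so the example runs without downloading the model. Check the label strings the model actually returns before relying on the comparison.

```python
from typing import Callable, Iterable, List, Tuple

# Any callable that maps a title to a (label, score) pair.
Classifier = Callable[[str], Tuple[str, float]]


def filter_news(
    titles: Iterable[str],
    classify: Classifier,
    ad_label: str = "advertisement",  # assumed label name
    threshold: float = 0.5,           # assumed confidence cutoff
) -> List[str]:
    """Drop titles classified as ads with a score at or above the threshold."""
    kept = []
    for title in titles:
        label, score = classify(title)
        is_ad = label.lower() == ad_label and score >= threshold
        if not is_ad:
            kept.append(title)
    return kept


def keyword_stub(title: str) -> Tuple[str, float]:
    """Hypothetical stand-in for the real model, for demonstration only."""
    lowered = title.lower()
    if "% off" in lowered or "buy now" in lowered:
        return ("advertisement", 0.9)
    return ("news", 0.9)


titles = [
    "Apple Announces New iPhone with Advanced AI Features",
    "50% OFF - Limited Time Offer on Premium Headphones!",
    "Scientists Discover New Method for Carbon Capture",
    "Buy Now! Get Free Shipping on All Electronics Today Only!",
]
news_only = filter_news(titles, keyword_stub)
print(news_only)
```

To use the real model, pass an adapter around the pipeline instead of the stub, for example a function that calls `classifier(title)` and returns `(result[0]["label"], result[0]["score"])`.
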
## Model Architecture

- **Base Model**: distilbert-base-uncased
- **Task**: Binary text classification
- **Input**: Text (max 128 tokens)
- **Output**: Class probabilities (news, advertisement)

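Sequence classification heads typically emit one logit per class, and a softmax turns those logits into the class probabilities listed above. The sketch below illustrates that step in plain Python. The logit values are made up, and the index order `(news, advertisement)` is an assumption taken from the order in the list; the model's `config.id2label` is the authoritative mapping.

```python
import math

# Assumed index order; verify against the model's config.id2label.
LABELS = ("news", "advertisement")


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 2.3]  # made-up values for illustration
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(LABELS[best], round(probs[best], 3))
```
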
## Training Details

- **Epochs**: 3
- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Optimizer**: AdamW
- **Framework**: PyTorch + Transformers

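To give a sense of scale for these settings, the sketch below estimates the number of optimizer steps. It assumes 1,000 training examples, which is a guess: the card reports 1,000+ labeled examples but does not state the train/validation split. The `Hyperparameters` class is illustrative and not part of the training code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameters:
    # Values from this card. In Transformers these would correspond to the
    # TrainingArguments fields num_train_epochs,
    # per_device_train_batch_size, and learning_rate.
    epochs: int = 3
    batch_size: int = 16
    learning_rate: float = 5e-5


def total_steps(num_examples: int, hp: Hyperparameters) -> int:
    """Optimizer steps on one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / hp.batch_size)
    return steps_per_epoch * hp.epochs


hp = Hyperparameters()
assumed_train_size = 1000  # assumption: the actual split is not stated
print(total_steps(assumed_train_size, hp))  # prints 189
```
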
## Limitations

- Trained primarily on tech news content
- May not generalize well to other domains
- Performance depends on title quality and clarity
- Limited to English language content

## Citation

If you use this model, please cite:

```bibtex
@misc{distilbert-rss-ad-detection,
  title={DistilBERT RSS Advertisement Detection},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/SoroushXYZ/distilbert-rss-ad-detection}
}
```