---
license: mit
language:
- en
tags:
- text-classification
- distilbert
- advertisement-detection
- rss
- news
- binary-classification
pipeline_tag: text-classification
---
# DistilBERT RSS Advertisement Detection
A DistilBERT-based model for classifying RSS article titles as advertisements or legitimate news content.
## Model Description
This model is fine-tuned from `distilbert-base-uncased` for binary text classification. It can distinguish between:
- **Advertisement**: Promotional content, deals, sales, sponsored content
- **News**: Legitimate news articles, editorial content, research findings
## Intended Use
- **Primary**: Filtering RSS feeds to separate advertisements from news
- **Secondary**: Content moderation, spam detection, content categorization
- **Research**: Text classification, advertisement detection studies
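For the primary RSS-filtering use case, a thin wrapper around the classifier might look like the sketch below. The `filter_news` helper and the `news` label string are illustrative assumptions, not part of this model's published API; check the model's actual label names (e.g. via `classifier.model.config.id2label`) before relying on them.

```python
def filter_news(titles, classifier, threshold=0.5):
    """Return only the titles classified as news.

    `classifier` is any callable with the Transformers pipeline
    interface: it takes a string and returns a list of dicts with
    'label' and 'score' keys. The 'news' label name is an assumption;
    verify it against the model's id2label mapping.
    """
    kept = []
    for title in titles:
        result = classifier(title)[0]
        if result["label"] == "news" and result["score"] >= threshold:
            kept.append(title)
    return kept
```

With the pipeline from the Usage section below, `filter_news(feed_titles, classifier)` would keep only the titles the model does not flag as advertisements.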
## Performance
- **Accuracy**: ~95%
- **F1 Score**: ~94%
- **Precision**: ~93%
- **Recall**: ~94%
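As a quick sanity check, the reported figures are mutually consistent: F1 is the harmonic mean of precision and recall.

```python
precision, recall = 0.93, 0.94  # approximate figures from this card
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.935, consistent with the reported ~94% F1
```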
## Training Data
- **Source**: 75+ RSS feeds from major tech news outlets
- **Articles**: 1,600+ RSS articles
- **Labeled**: 1,000+ manually labeled examples
- **Sources**: TechCrunch, WIRED, The Verge, Ars Technica, OpenAI, Google AI, etc.
## Usage
```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hub
classifier = pipeline(
    "text-classification",
    model="SoroushXYZ/distilbert-rss-ad-detection",
)

# Classify some example titles
examples = [
    "Apple Announces New iPhone with Advanced AI Features",
    "50% OFF - Limited Time Offer on Premium Headphones!",
    "Scientists Discover New Method for Carbon Capture",
    "Buy Now! Get Free Shipping on All Electronics Today Only!",
]

for text in examples:
    result = classifier(text)
    print(f"{text} -> {result[0]['label']} ({result[0]['score']:.3f})")
```
## Model Architecture
- **Base Model**: distilbert-base-uncased
- **Task**: Binary text classification
- **Input**: Text (max 128 tokens)
- **Output**: Class probabilities (news, advertisement)
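The classification head emits two raw logits, which a softmax turns into the class probabilities reported above. A pure-Python sketch of that final step, using made-up logit values:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from the classification head: [news, advertisement]
logits = [2.1, -1.3]
probs = softmax(logits)
```

Here the first probability dominates, so the title would be labeled news.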
## Training Details
- **Epochs**: 3
- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Optimizer**: AdamW
- **Framework**: PyTorch + Transformers
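Assuming roughly 1,000 examples in the training split (the card's "1,000+ manually labeled" figure, before any validation hold-out), these hyperparameters imply on the order of 190 optimizer steps in total:

```python
import math

num_examples = 1000  # assumed training-split size, per the card
batch_size = 16
epochs = 3

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 63 189
```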
## Limitations
- Trained primarily on tech news content
- May not generalize well to other domains
- Performance depends on title quality and clarity
- Limited to English language content
## Citation
If you use this model, please cite:
```bibtex
@misc{distilbert-rss-ad-detection,
  title={DistilBERT RSS Advertisement Detection},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/SoroushXYZ/distilbert-rss-ad-detection}
}
```