---
license: mit
language:
- en
tags:
- text-classification
- distilbert
- advertisement-detection
- rss
- news
- binary-classification
pipeline_tag: text-classification
---
# DistilBERT RSS Advertisement Detection
A DistilBERT-based model for classifying RSS article titles as advertisements or legitimate news content.
## Model Description
This model is fine-tuned from `distilbert-base-uncased` for binary text classification. It can distinguish between:
- **Advertisement**: Promotional content, deals, sales, sponsored content
- **News**: Legitimate news articles, editorial content, research findings
## Intended Use
- **Primary**: Filtering RSS feeds to separate advertisements from news
- **Secondary**: Content moderation, spam detection, content categorization
- **Research**: Text classification, advertisement detection studies
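The primary use case can be sketched as a simple filtering helper. The `classify` argument stands in for the Hugging Face pipeline shown in the Usage section below; any callable returning `[{"label": ..., "score": ...}]` works, so a keyword-based stub is used here purely for illustration.

```python
# Sketch of the primary use case: dropping ad titles from a feed.
# `classify` is any callable with the pipeline's output shape.

def filter_news(titles, classify, ad_label="advertisement"):
    """Keep only titles the classifier does not flag as ads."""
    return [t for t in titles if classify(t)[0]["label"].lower() != ad_label]

# Stub classifier for demonstration only: flags obvious promo keywords.
# Replace with the real pipeline in practice.
def stub_classify(text):
    promo = ("% off", "buy now", "free shipping")
    is_ad = any(k in text.lower() for k in promo)
    return [{"label": "advertisement" if is_ad else "news", "score": 1.0}]

titles = [
    "Scientists Discover New Method for Carbon Capture",
    "50% OFF - Limited Time Offer on Premium Headphones!",
]
print(filter_news(titles, stub_classify))
```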
## Performance
- **Accuracy**: ~95%
- **F1 Score**: ~94%
- **Precision**: ~93%
- **Recall**: ~94%
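For reference, the four reported metrics relate to each other as follows in binary classification. This is a generic from-scratch sketch on dummy held-out labels (1 = advertisement, 0 = news), not the model's actual evaluation data.

```python
# Accuracy, precision, recall, and F1 computed from a confusion matrix.

def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)            # of predicted ads, how many are ads
    recall = tp / (tp + fn)               # of actual ads, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Dummy labels for illustration only.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]
acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```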
## Training Data
- **Source**: 75+ RSS feeds from major tech news outlets
- **Articles**: 1,600+ RSS articles
- **Labeled**: 1,000+ manually labeled examples
- **Sources**: TechCrunch, WIRED, The Verge, Ars Technica, OpenAI, Google AI, etc.
## Usage
```python
from transformers import pipeline

# Load the model
classifier = pipeline(
    "text-classification",
    model="SoroushXYZ/distilbert-rss-ad-detection",
)

# Classify examples
examples = [
    "Apple Announces New iPhone with Advanced AI Features",
    "50% OFF - Limited Time Offer on Premium Headphones!",
    "Scientists Discover New Method for Carbon Capture",
    "Buy Now! Get Free Shipping on All Electronics Today Only!",
]

for text in examples:
    result = classifier(text)
    print(f"{text} -> {result[0]['label']} ({result[0]['score']:.3f})")
```
## Model Architecture
- **Base Model**: distilbert-base-uncased
- **Task**: Binary text classification
- **Input**: Text (max 128 tokens)
- **Output**: Class probabilities (news, advertisement)
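Concretely, the classification head emits one logit per class, and a softmax converts the pair into probabilities. The pure-Python sketch below (with made-up logits) shows that step; in practice it is `torch.softmax` over the model output.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)                           # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Dummy logits for (news, advertisement):
probs = softmax([2.0, -1.5])
print(probs)  # most of the probability mass lands on "news"
```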
## Training Details
- **Epochs**: 3
- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Optimizer**: AdamW
- **Framework**: PyTorch + Transformers
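To make the optimizer choice concrete, here is a from-scratch sketch of a single AdamW update for one scalar weight, using the learning rate listed above. The betas, epsilon, and weight decay are assumptions (PyTorch's defaults and a typical decay of 0.01); real training updates all DistilBERT parameters via `torch.optim.AdamW`.

```python
import math

def adamw_step(w, grad, m, v, t, lr=5e-5, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: Adam moments plus decoupled weight decay."""
    m = b1 * m + (1 - b1) * grad             # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad      # second-moment estimate
    m_hat = m / (1 - b1 ** t)                # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# One step on a dummy weight/gradient:
w, m, v = 0.5, 0.0, 0.0
w, m, v = adamw_step(w, grad=0.2, m=m, v=v, t=1)
print(w)  # nudged slightly below 0.5
```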
## Limitations
- Trained primarily on tech news content
- May not generalize well to other domains
- Performance depends on title quality and clarity
- Limited to English language content
## Citation
If you use this model, please cite:
```bibtex
@misc{distilbert-rss-ad-detection,
  title={DistilBERT RSS Advertisement Detection},
  author={SoroushXYZ},
  year={2024},
  url={https://huggingface.co/SoroushXYZ/distilbert-rss-ad-detection}
}
```