---
license: mit
language:
- en
tags:
- text-classification
- distilbert
- advertisement-detection
- rss
- news
- binary-classification
pipeline_tag: text-classification
---
# DistilBERT RSS Advertisement Detection
A DistilBERT-based model for classifying RSS article titles as advertisements or legitimate news content.
## Model Description
This model is fine-tuned from `distilbert-base-uncased` for binary text classification. It can distinguish between:
- **Advertisement**: Promotional content, deals, sales, sponsored content
- **News**: Legitimate news articles, editorial content, research findings
## Intended Use
- **Primary**: Filtering RSS feeds to separate advertisements from news
- **Secondary**: Content moderation, spam detection, content categorization
- **Research**: Text classification, advertisement detection studies
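For the primary RSS-filtering use case, a thin wrapper around the classifier might look like the sketch below. The `filter_news` helper and the `news` label string are illustrative assumptions, not part of this model's published API; check the model's actual label names (e.g. via `classifier.model.config.id2label`) before relying on them.

```python
def filter_news(titles, classifier, threshold=0.5):
    """Return only the titles classified as news.

    `classifier` is any callable with the Transformers pipeline
    interface: it takes a string and returns a list of dicts with
    'label' and 'score' keys. The 'news' label name is an assumption;
    verify it against the model's id2label mapping.
    """
    kept = []
    for title in titles:
        result = classifier(title)[0]
        if result["label"] == "news" and result["score"] >= threshold:
            kept.append(title)
    return kept
```

With the pipeline from the Usage section below, `filter_news(feed_titles, classifier)` would keep only the titles the model does not flag as advertisements.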
## Performance
- **Accuracy**: ~95%
- **F1 Score**: ~94%
- **Precision**: ~93%
- **Recall**: ~94%
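As a quick sanity check, the reported figures are mutually consistent: F1 is the harmonic mean of precision and recall.

```python
precision, recall = 0.93, 0.94  # approximate figures from this card
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.935, consistent with the reported ~94% F1
```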
## Training Data
- **Source**: 75+ RSS feeds from major tech news outlets
- **Articles**: 1,600+ RSS articles
- **Labeled**: 1,000+ manually labeled examples
- **Sources**: TechCrunch, WIRED, The Verge, Ars Technica, OpenAI, Google AI, etc.
## Usage
```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hub
classifier = pipeline(
    "text-classification",
    model="SoroushXYZ/distilbert-rss-ad-detection",
)

# Classify some example titles
examples = [
    "Apple Announces New iPhone with Advanced AI Features",
    "50% OFF - Limited Time Offer on Premium Headphones!",
    "Scientists Discover New Method for Carbon Capture",
    "Buy Now! Get Free Shipping on All Electronics Today Only!",
]

for text in examples:
    result = classifier(text)
    print(f"{text} -> {result[0]['label']} ({result[0]['score']:.3f})")
```
## Model Architecture
- **Base Model**: distilbert-base-uncased
- **Task**: Binary text classification
- **Input**: Text (max 128 tokens)
- **Output**: Class probabilities (news, advertisement)
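The classification head emits two raw logits, which a softmax turns into the class probabilities reported above. A pure-Python sketch of that final step, using made-up logit values:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from the classification head: [news, advertisement]
logits = [2.1, -1.3]
probs = softmax(logits)
```

Here the first probability dominates, so the title would be labeled news.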
## Training Details
- **Epochs**: 3
- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Optimizer**: AdamW
- **Framework**: PyTorch + Transformers
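Assuming roughly 1,000 examples in the training split (the card's "1,000+ manually labeled" figure, before any validation hold-out), these hyperparameters imply on the order of 190 optimizer steps in total:

```python
import math

num_examples = 1000  # assumed training-split size, per the card
batch_size = 16
epochs = 3

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 63 189
```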
## Limitations
- Trained primarily on tech news content
- May not generalize well to other domains
- Performance depends on title quality and clarity
- Limited to English language content
## Citation
If you use this model, please cite:
```bibtex
@misc{distilbert-rss-ad-detection,
  title={DistilBERT RSS Advertisement Detection},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/SoroushXYZ/distilbert-rss-ad-detection}
}
```