	---
	license: mit
	language:
	- en
	tags:
	- text-classification
	- distilbert
	- advertisement-detection
	- rss
	- news
	- binary-classification
	pipeline_tag: text-classification
	---

	# DistilBERT RSS Advertisement Detection

	A DistilBERT-based model for classifying RSS article titles as advertisements or legitimate news content.

	## Model Description

	This model is fine-tuned from `distilbert-base-uncased` for binary text classification. It can distinguish between:
	- Advertisement: Promotional content, deals, sales, sponsored content
	- News: Legitimate news articles, editorial content, research findings

	## Intended Use

	- Primary: Filtering RSS feeds to separate advertisements from news
	- Secondary: Content moderation, spam detection, content categorization
	- Research: Text classification, advertisement detection studies
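
The primary use case, separating advertisements from news in an RSS feed, could be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: `extract_titles`, `split_feed`, the `min_score` threshold, and the `"advertisement"` label string are all assumptions introduced here. The real label names should be checked in the model's `config.id2label`.

```python
import xml.etree.ElementTree as ET

# Assumption: the model reports ads under this label name.
AD_LABEL = "advertisement"


def extract_titles(rss_xml):
    """Return the <title> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    titles = []
    for item in root.iter("item"):
        title = item.findtext("title")
        if title and title.strip():
            titles.append(title.strip())
    return titles


def split_feed(titles, classify, ad_label=AD_LABEL, min_score=0.5):
    """Split titles into (news, ads) using a pipeline-style classifier.

    `classify` takes a list of strings and returns one
    {"label": ..., "score": ...} dict per string, which is the shape a
    transformers text-classification pipeline returns for list input.
    """
    news, ads = [], []
    if not titles:
        return news, ads
    for title, pred in zip(titles, classify(titles)):
        is_ad = (
            pred["label"].lower() == ad_label.lower()
            and pred["score"] >= min_score
        )
        (ads if is_ad else news).append(title)
    return news, ads
```

With the real model, `classify` would be the object returned by `pipeline("text-classification", model="SoroushXYZ/distilbert-rss-ad-detection")`. Passing the classifier in as an argument keeps the feed logic testable without downloading the model.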

	## Performance

	- Accuracy: ~95%
	- F1 Score: ~94%
	- Precision: ~93%
	- Recall: ~94%

	## Training Data

	- Feeds: 75+ RSS feeds from major tech news outlets
	- Articles: 1,600+ RSS articles collected
	- Labeled: 1,000+ manually labeled examples
	- Outlets: TechCrunch, WIRED, The Verge, Ars Technica, OpenAI, Google AI, and others

	## Usage

	```python
	from transformers import pipeline

	# Load the fine-tuned model from the Hugging Face Hub
	classifier = pipeline(
	    "text-classification",
	    model="SoroushXYZ/distilbert-rss-ad-detection",
	)

	# Example titles: a mix of news and promotional content
	examples = [
	    "Apple Announces New iPhone with Advanced AI Features",
	    "50% OFF - Limited Time Offer on Premium Headphones!",
	    "Scientists Discover New Method for Carbon Capture",
	    "Buy Now! Get Free Shipping on All Electronics Today Only!",
	]

	# Each call returns a list with one {"label", "score"} dict per input
	for text in examples:
	    result = classifier(text)
	    print(f"{text} -> {result[0]['label']} ({result[0]['score']:.3f})")
	```

	## Model Architecture

	- Base Model: distilbert-base-uncased
	- Task: Binary text classification
	- Input: Text (max 128 tokens)
	- Output: Class probabilities (news, advertisement)
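
The input limit and the probability output can be made concrete without the `pipeline` wrapper. The sketch below is one plausible way to do it, assuming the checkpoint loads with `AutoTokenizer` and `AutoModelForSequenceClassification`. `softmax` and `score_title` are hypothetical helper names, and the label names come from whatever `model.config.id2label` contains.

```python
import math


def softmax(logits):
    """Convert raw class logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_title(title, tokenizer, model, max_length=128):
    """Return {label: probability} for a single title.

    Expects a Hugging Face tokenizer and sequence-classification model.
    Titles longer than `max_length` tokens are truncated.
    """
    import torch  # imported lazily so softmax() works without PyTorch

    inputs = tokenizer(
        title,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    probs = softmax(logits)
    return {model.config.id2label[i]: p for i, p in enumerate(probs)}
```

Here `tokenizer` and `model` would come from `AutoTokenizer.from_pretrained(...)` and `AutoModelForSequenceClassification.from_pretrained(...)` with the repository name used in the Usage section.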

	## Training Details

	- Epochs: 3
	- Batch Size: 16
	- Learning Rate: 5e-5
	- Optimizer: AdamW
	- Framework: PyTorch + Transformers
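
These hyperparameters map onto a `transformers` training configuration roughly as shown below. This is a sketch under assumptions: the card does not say whether the `Trainer` API was used, `output_dir` is a made-up path, and the step calculation assumes a single device with no gradient accumulation.

```python
import math

# Values taken from the list above
HYPERPARAMS = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
}


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer updates for a run on one device, no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def build_training_args(output_dir="./rss-ad-model"):
    """Build TrainingArguments from HYPERPARAMS (requires transformers)."""
    from transformers import TrainingArguments  # lazy import

    # Trainer defaults to an AdamW optimizer, so none is set explicitly.
    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

As a rough sense of scale, if exactly 1,000 labeled examples were used for training (the card only says "1,000+" and gives no split), `total_optimizer_steps(1000, 16, 3)` comes to 189 updates.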

	## Limitations

	- Trained primarily on tech news content
	- May not generalize well to other domains
	- Performance depends on title quality and clarity
	- Limited to English language content

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{distilbert-rss-ad-detection,
	  title={DistilBERT RSS Advertisement Detection},
	  author={Your Name},
	  year={2024},
	  url={https://huggingface.co/SoroushXYZ/distilbert-rss-ad-detection}
	}
	```