---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- roberta
- clickbait
- clickbait-detection
- moderation
- content-moderation
datasets:
- christinacdl/Clickbait_New
- marksverdhei/clickbait_title_classification
- contemmcm/clickbait
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: text-classification
---

# 🎯 RoBERTa Clickbait Classifier

A clickbait detection model built on **RoBERTa-base** (125M parameters), fine-tuned on multiple combined and deduplicated English datasets.

## 🚀 Quick Start

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="ENTUM-AI/roberta-clickbait-classifier")

# Clickbait
result = classifier("You Won't BELIEVE What This Celebrity Did Next!")
print(result)  # [{'label': 'Clickbait', 'score': 0.99...}]

# Non-Clickbait
result = classifier("Federal Reserve raises interest rates by 0.25 percentage points")
print(result)  # [{'label': 'Non-Clickbait', 'score': 0.99...}]
```

## Model Details

| | |
|---|---|
| **Architecture** | RoBERTa-base (125M parameters) |
| **Task** | Binary text classification |
| **Labels** | `Clickbait` (1), `Non-Clickbait` (0) |
| **Language** | English |
| **License** | Apache 2.0 |
| **Max input length** | 128 tokens |

## 📊 Training Data

Three public English clickbait datasets, combined and deduplicated:

| Dataset | Size and source |
|---------|-----------------|
| [christinacdl/Clickbait_New](https://huggingface.co/datasets/christinacdl/Clickbait_New) | 58.6K samples from multiple sources |
| [marksverdhei/clickbait_title_classification](https://huggingface.co/datasets/marksverdhei/clickbait_title_classification) | 32K samples (Chakraborty et al., ASONAM 2016) |
| [contemmcm/clickbait](https://huggingface.co/datasets/contemmcm/clickbait) | 26K samples |

After deduplication and balancing: **~48K samples** (train/val/test split 85/10/5).
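
The preprocessing script is not published with this card, so the following is only a minimal sketch of how a combine, deduplicate, balance, and 85/10/5 split step could look. The function names, the lowercase-and-whitespace normalization, the downsampling strategy, and the seed are all assumptions made for illustration, not the procedure actually used for this model.

```python
import random
from collections import defaultdict


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so near-identical titles compare equal."""
    return " ".join(title.lower().split())


def combine_and_split(sources, seed=42, fractions=(0.85, 0.10)):
    """Merge (text, label) rows from several sources into train/val/test lists.

    Hypothetical helper: the first occurrence of a normalized title wins,
    classes are balanced by downsampling, and the test set takes whatever
    remains after the train and validation fractions.
    """
    # 1. Combine all sources and drop duplicate titles.
    seen = set()
    by_label = defaultdict(list)
    for rows in sources:
        for text, label in rows:
            key = normalize(text)
            if key in seen:
                continue
            seen.add(key)
            by_label[label].append((text, label))

    # 2. Balance: downsample every class to the size of the smallest one.
    rng = random.Random(seed)
    n_per_class = min(len(rows) for rows in by_label.values())
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n_per_class))
    rng.shuffle(balanced)

    # 3. Split into train / validation / test.
    n_train = round(len(balanced) * fractions[0])
    n_val = round(len(balanced) * fractions[1])
    train = balanced[:n_train]
    val = balanced[n_train:n_train + n_val]
    test = balanced[n_train + n_val:]
    return train, val, test


# Toy data standing in for the real datasets (label 1 = Clickbait, 0 = Non-Clickbait).
toy_a = [(f"Shocking fact number {i}", 1) for i in range(25)]
toy_b = [(f"Council approves budget item {i}", 0) for i in range(20)]
train, val, test = combine_and_split([toy_a, toy_b])
print(len(train), len(val), len(test))
```

Downsampling the larger class is only one way to balance; it discards data but keeps every remaining example genuine.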
## ⚙️ Training

Fine-tuned with the Hugging Face `Trainer` using the AdamW optimizer, a linear learning-rate schedule with warmup, and early stopping on F1 score.

## 💡 Use Cases

- **News aggregators**: filter low-quality clickbait articles
- **Social media**: content moderation and feed quality scoring
- **Browser extensions**: warn users about clickbait headlines
- **Email filters**: detect clickbait-style subject lines
- **Content platforms**: automated content quality assessment

## ⚠️ Limitations

- English only
- Optimized for short texts (headlines, titles, tweets); longer texts are truncated to 128 tokens
- Reflects patterns and biases present in the training data sources
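
For filtering use cases such as news aggregators or email filters, one possible pattern is to flag a headline only when the model is confident. The sketch below assumes the classifier returns the `[{'label': ..., 'score': ...}]` structure shown in Quick Start. `split_by_clickbait`, the `0.9` threshold, and `stub_classifier` are hypothetical names and values chosen for illustration; the stub only stands in for the real pipeline so the example runs without downloading the model.

```python
def split_by_clickbait(headlines, classifier, threshold=0.9):
    """Return (kept, flagged) lists of headlines.

    `classifier` is any callable that maps one string to a list holding a
    single {'label': ..., 'score': ...} dict, like the Quick Start pipeline.
    A headline is flagged only if it is labeled 'Clickbait' with a score
    at or above `threshold` (an assumed value; tune it on your own data).
    """
    kept, flagged = [], []
    for headline in headlines:
        top = classifier(headline)[0]
        is_clickbait = top["label"] == "Clickbait" and top["score"] >= threshold
        (flagged if is_clickbait else kept).append(headline)
    return kept, flagged


def stub_classifier(text):
    """Hypothetical stand-in for the real pipeline: treats '!' as clickbait."""
    label = "Clickbait" if "!" in text else "Non-Clickbait"
    return [{"label": label, "score": 0.99}]


kept, flagged = split_by_clickbait(
    [
        "You Won't BELIEVE What This Celebrity Did Next!",
        "Federal Reserve raises interest rates by 0.25 percentage points",
    ],
    stub_classifier,
)
print("kept:", kept)
print("flagged:", flagged)
```

In real use, pass the `classifier` object created in Quick Start instead of `stub_classifier`. A higher threshold flags fewer headlines and lets more borderline clickbait through.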