| --- |
| language: |
| - en |
| license: apache-2.0 |
| library_name: transformers |
| tags: |
| - text-classification |
| - roberta |
| - clickbait |
| - clickbait-detection |
| - moderation |
| - content-moderation |
| datasets: |
| - christinacdl/Clickbait_New |
| - marksverdhei/clickbait_title_classification |
| - contemmcm/clickbait |
| metrics: |
| - accuracy |
| - f1 |
| - precision |
| - recall |
| pipeline_tag: text-classification |
| --- |
| |
| # RoBERTa Clickbait Classifier |
|
|
| A clickbait detection model built on **RoBERTa-base** (125M parameters) and fine-tuned on three public English datasets that were combined and deduplicated. |
|
|
| ## Quick Start |
|
|
| ```python |
| from transformers import pipeline |
| |
| classifier = pipeline("text-classification", model="ENTUM-AI/roberta-clickbait-classifier") |
| |
| # Clickbait |
| result = classifier("You Won't BELIEVE What This Celebrity Did Next!") |
| print(result) # [{'label': 'Clickbait', 'score': 0.99...}] |
| |
| # Non-Clickbait |
| result = classifier("Federal Reserve raises interest rates by 0.25 percentage points") |
| print(result) # [{'label': 'Non-Clickbait', 'score': 0.99...}] |
| ``` |
|
|
| ## Model Details |
|
|
| | | | |
| |---|---| |
| | **Architecture** | RoBERTa-base (125M parameters) | |
| | **Task** | Binary text classification | |
| | **Labels** | `Clickbait` (1), `Non-Clickbait` (0) | |
| | **Language** | English | |
| | **License** | Apache 2.0 | |
| | **Max input length** | 128 tokens | |
|
|
| ## Training Data |
|
|
| Three public English clickbait datasets, combined and deduplicated: |
|
|
| | Dataset | Size and origin | |
| |---------|--------| |
| | [christinacdl/Clickbait_New](https://huggingface.co/datasets/christinacdl/Clickbait_New) | 58.6K samples from multiple sources | |
| | [marksverdhei/clickbait_title_classification](https://huggingface.co/datasets/marksverdhei/clickbait_title_classification) | 32K samples (Chakraborty et al., ASONAM 2016) | |
| | [contemmcm/clickbait](https://huggingface.co/datasets/contemmcm/clickbait) | 26K samples | |
|
|
| After deduplication and balancing, about **48K samples** remain, split 85/10/5 into train, validation, and test sets. |
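
The card does not document the exact preprocessing, so the following is only a minimal sketch of how a combine, deduplicate, balance, and 85/10/5 split pipeline could look. It assumes exact matching on lowercased, whitespace-normalized titles, downsampling of the larger class, and a seeded random split; the function names (`dedupe`, `balance`, `split`, `prepare`) are hypothetical and not taken from the original training code.

```python
import random
from collections import defaultdict


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(title.lower().split())


def dedupe(rows):
    """Keep the first (text, label) pair seen for each normalized title."""
    seen, unique = set(), []
    for text, label in rows:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique


def balance(rows, rng):
    """Downsample every class to the size of the smallest one (assumed strategy)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


def split(rows, rng, percents=(85, 10, 5)):
    """Shuffle, then cut into train/val/test using integer percentages."""
    rows = list(rows)
    rng.shuffle(rows)
    n_train = len(rows) * percents[0] // 100
    n_val = len(rows) * percents[1] // 100
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def prepare(rows, seed=42):
    """Run dedupe -> balance -> split with one seeded random generator."""
    rng = random.Random(seed)
    return split(balance(dedupe(rows), rng), rng)
```

Here `rows` is a list of `(text, label)` pairs concatenated from the three datasets, with labels already mapped to `1` (Clickbait) and `0` (Non-Clickbait).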
|
|
| ## Training |
|
|
| Fine-tuned with the Hugging Face `Trainer` using the AdamW optimizer, a linear learning-rate schedule with warmup, and early stopping based on F1 score. |
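
Early stopping on F1 requires the `Trainer` to receive a metrics function that reports F1 at each evaluation. The sketch below shows one possible `compute_metrics` covering the four metrics listed in this card's metadata. It assumes F1 is computed for the positive class (`1` = Clickbait); the card does not say whether binary or macro F1 was used, and this is not the original training code.

```python
def compute_metrics(eval_pred):
    """Return accuracy, precision, recall and F1 for the positive class (1 = Clickbait)."""
    logits, labels = eval_pred
    # Predicted class is the index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(y) for y in labels]

    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "accuracy": correct / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With the `Trainer`, a function like this is passed as `compute_metrics=compute_metrics`; setting `metric_for_best_model="f1"` and `load_best_model_at_end=True` in `TrainingArguments` and adding an `EarlyStoppingCallback` then stops training when F1 stops improving. The learning rate, warmup length, and patience used for this model are not stated here.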
|
|
| ## Use Cases |
|
|
| - **News aggregators**: filter low-quality clickbait articles |
| - **Social media**: content moderation and feed quality scoring |
| - **Browser extensions**: warn users about clickbait headlines |
| - **Email filters**: detect clickbait-style subject lines |
| - **Content platforms**: automated content quality assessment |
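
Most of the use cases above come down to turning the pipeline output into a single clickbait score and applying a threshold. A minimal sketch follows, assuming the pipeline returns the top label with a softmax score over the two classes (so the other class has probability `1 - score`). The helper names and the `0.8` threshold are illustrative assumptions, not values recommended by the model authors.

```python
def clickbait_probability(prediction: dict) -> float:
    """Convert one pipeline result into P(clickbait), whichever label won."""
    score = float(prediction["score"])
    return score if prediction["label"] == "Clickbait" else 1.0 - score


def filter_headlines(headlines, predictions, threshold=0.8):
    """Split headlines into (kept, flagged) by a clickbait-probability threshold."""
    kept, flagged = [], []
    for headline, prediction in zip(headlines, predictions):
        if clickbait_probability(prediction) >= threshold:
            flagged.append(headline)
        else:
            kept.append(headline)
    return kept, flagged


def demo():
    """Run the filter against the real model (downloads weights on first use)."""
    from transformers import pipeline

    classifier = pipeline(
        "text-classification", model="ENTUM-AI/roberta-clickbait-classifier"
    )
    headlines = [
        "You Won't BELIEVE What This Celebrity Did Next!",
        "Federal Reserve raises interest rates by 0.25 percentage points",
    ]
    kept, flagged = filter_headlines(headlines, classifier(headlines))
    print("kept:", kept)
    print("flagged:", flagged)
```

A higher threshold flags fewer headlines and is the more cautious choice when false positives are costly, for example in email filtering.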
|
|
| ## Limitations |
|
|
| - English only |
| - Optimized for short texts (headlines, titles, tweets); longer texts will be truncated to 128 tokens |
| - Reflects patterns and biases present in the training data sources |
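
Because input beyond 128 tokens is truncated, it can help to detect over-long texts before classifying them. The sketch below counts tokens with any Hugging Face style tokenizer passed in by the caller; `exceeds_limit` and `demo` are hypothetical helper names, and loading the tokenizer from this repository with `AutoTokenizer` is an assumption based on the Quick Start pipeline call.

```python
MAX_TOKENS = 128  # max input length stated on this card


def exceeds_limit(text, tokenizer, max_tokens=MAX_TOKENS):
    """Return True if `text` tokenizes to more than `max_tokens` tokens.

    Special tokens are counted because they also occupy positions
    in the model input.
    """
    input_ids = tokenizer(text, add_special_tokens=True)["input_ids"]
    return len(input_ids) > max_tokens


def demo():
    """Check a long text against the real tokenizer (downloads files on first use)."""
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("ENTUM-AI/roberta-clickbait-classifier")
    article = "Federal Reserve raises interest rates. " * 50
    if exceeds_limit(article, tokenizer):
        print("Text is longer than 128 tokens; only the beginning will be scored.")
```

For long articles, scoring only the headline or title is usually closer to what the model was trained on than scoring a truncated body.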
|
|