---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- roberta
- clickbait
- clickbait-detection
- moderation
- content-moderation
datasets:
- christinacdl/Clickbait_New
- marksverdhei/clickbait_title_classification
- contemmcm/clickbait
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: text-classification
---

# 🎯 RoBERTa Clickbait Classifier

A clickbait detection model built on **RoBERTa-base** (125M parameters), fine-tuned on multiple combined and deduplicated English datasets.

## πŸš€ Quick Start

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="ENTUM-AI/roberta-clickbait-classifier")

# Clickbait
result = classifier("You Won't BELIEVE What This Celebrity Did Next!")
print(result)  # [{'label': 'Clickbait', 'score': 0.99...}]

# Non-Clickbait
result = classifier("Federal Reserve raises interest rates by 0.25 percentage points")
print(result)  # [{'label': 'Non-Clickbait', 'score': 0.99...}]
```

## Model Details

| | |
|---|---|
| **Architecture** | RoBERTa-base (125M parameters) |
| **Task** | Binary text classification |
| **Labels** | `Clickbait` (1), `Non-Clickbait` (0) |
| **Language** | English |
| **License** | Apache 2.0 |
| **Max input length** | 128 tokens |
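The label mapping in the table above can be sketched as a small decoding helper that mirrors the pipeline output. The logits below are illustrative only, not taken from the model:

```python
import math

# Label mapping from the table above.
ID2LABEL = {0: "Non-Clickbait", 1: "Clickbait"}

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Map a pair of logits to (label, score), like the pipeline does."""
    probs = softmax(logits)
    idx = probs.index(max(probs))
    return ID2LABEL[idx], probs[idx]

# Illustrative logits only -- not real model output.
label, score = decode([-2.0, 3.0])
print(label)  # Clickbait
```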

## πŸ“Š Training Data

Three public English clickbait datasets, combined and deduplicated:

| Dataset | Description |
|---------|-------------|
| [christinacdl/Clickbait_New](https://huggingface.co/datasets/christinacdl/Clickbait_New) | 58.6K samples from multiple sources |
| [marksverdhei/clickbait_title_classification](https://huggingface.co/datasets/marksverdhei/clickbait_title_classification) | 32K samples (Chakraborty et al., ASONAM 2016) |
| [contemmcm/clickbait](https://huggingface.co/datasets/contemmcm/clickbait) | 26K samples |

After deduplication and balancing: **~48K samples** (train/val/test split 85/10/5).
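The exact preprocessing pipeline is not published; a minimal sketch of what combining, deduplicating, and balancing might look like (normalization key and seed are assumptions):

```python
import random

def dedupe_and_balance(samples, seed=42):
    """samples: list of (text, label) pairs, where label is 0 or 1.

    Deduplicates on case/whitespace-normalized text, then downsamples
    the majority class so both labels are equally represented.
    """
    seen, unique = set(), []
    for text, label in samples:
        key = " ".join(text.lower().split())  # normalize case and whitespace
        if key not in seen:
            seen.add(key)
            unique.append((text, label))

    pos = [s for s in unique if s[1] == 1]
    neg = [s for s in unique if s[1] == 0]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced
```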

## βš™οΈ Training

Fine-tuned with the Hugging Face `Trainer` using the AdamW optimizer, a linear learning-rate schedule with warmup, and early stopping on validation F1.
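The linear schedule with warmup mentioned above can be written as a pure function of the step count; the hyperparameter values in the test are illustrative, not the ones used for this model:

```python
def linear_schedule_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step.

    Ramps linearly from 0 to base_lr over warmup_steps, then decays
    linearly back to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```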

## πŸ’‘ Use Cases

- **News aggregators** β€” filter low-quality clickbait articles
- **Social media** β€” content moderation and feed quality scoring
- **Browser extensions** β€” warn users about clickbait headlines
- **Email filters** β€” detect clickbait-style subject lines
- **Content platforms** β€” automated content quality assessment
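For the aggregator and moderation use cases above, a thin filtering wrapper over the pipeline might look like the sketch below. The `classifier` argument is any callable with the pipeline interface (list of strings in, list of `{'label', 'score'}` dicts out), and the 0.9 threshold is illustrative; tune it on your own data:

```python
def filter_clickbait(headlines, classifier, threshold=0.9):
    """Drop headlines the model flags as Clickbait with high confidence."""
    results = classifier(headlines)
    return [
        headline
        for headline, result in zip(headlines, results)
        if not (result["label"] == "Clickbait" and result["score"] >= threshold)
    ]
```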

## ⚠️ Limitations

- English only
- Optimized for short texts (headlines, titles, tweets); longer texts will be truncated to 128 tokens
- Reflects patterns and biases present in the training data sources