ENTUM-AI committed
Commit a1459dc · verified · Parent: a7f0341

Upload RoBERTa Clickbait Classifier

README.md ADDED
@@ -0,0 +1,84 @@
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- roberta
- clickbait
- clickbait-detection
- moderation
- content-moderation
datasets:
- christinacdl/Clickbait_New
- marksverdhei/clickbait_title_classification
- contemmcm/clickbait
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: text-classification
---

# 🎯 RoBERTa Clickbait Classifier

A clickbait detection model built on **RoBERTa-base** (125M parameters), fine-tuned on multiple combined and deduplicated English datasets.

## 🚀 Quick Start

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="ENTUM-AI/roberta-clickbait-classifier")

# Clickbait
result = classifier("You Won't BELIEVE What This Celebrity Did Next!")
print(result)  # [{'label': 'Clickbait', 'score': 0.99...}]

# Non-Clickbait
result = classifier("Federal Reserve raises interest rates by 0.25 percentage points")
print(result)  # [{'label': 'Non-Clickbait', 'score': 0.99...}]
```

## Model Details

| | |
|---|---|
| **Architecture** | RoBERTa-base (125M parameters) |
| **Task** | Binary text classification |
| **Labels** | `Clickbait` (1), `Non-Clickbait` (0) |
| **Language** | English |
| **License** | Apache 2.0 |
| **Max input length** | 128 tokens |
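The label mapping above mirrors the `id2label` entry in the model's `config.json`. As a minimal sketch of how the pipeline output is derived from the classifier head (the logits below are made-up illustrative values, not real model output), the two raw logits are softmaxed and the argmax index is mapped through `id2label`:

```python
import math

# Label mapping taken from the model's config.json
id2label = {0: "Non-Clickbait", 1: "Clickbait"}

def logits_to_prediction(logits):
    """Softmax the two logits and map the argmax index to its label."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[idx], "score": probs[idx]}

# Hypothetical logits for a strongly clickbait-looking headline
print(logits_to_prediction([-2.1, 3.4]))
```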

## 📊 Training Data

Three public English clickbait datasets, combined and deduplicated:

| Dataset | Size / Source |
|---------|---------------|
| [christinacdl/Clickbait_New](https://huggingface.co/datasets/christinacdl/Clickbait_New) | 58.6K samples from multiple sources |
| [marksverdhei/clickbait_title_classification](https://huggingface.co/datasets/marksverdhei/clickbait_title_classification) | 32K samples (Chakraborty et al., ASONAM 2016) |
| [contemmcm/clickbait](https://huggingface.co/datasets/contemmcm/clickbait) | 26K samples |

After deduplication and balancing: **~48K samples** (train/val/test split 85/10/5).
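The exact merge script is not published with the model, but the dedup-and-balance step described above can be sketched in plain Python. The sample pool below is hypothetical toy data, and the normalization (lowercase, collapsed whitespace) is an assumption about how near-duplicates were keyed:

```python
import random

# Hypothetical combined pool of (text, label) pairs from the three source
# datasets; this is a sketch of the described pipeline, not the original code.
pool = [
    ("You Won't BELIEVE What Happened", 1),
    ("you won't   believe what happened", 1),  # near-duplicate
    ("Fed raises rates by 0.25 points", 0),
    ("Senate passes budget bill", 0),
    ("10 Tricks Doctors Hate", 1),
]

def dedup_and_balance(samples, seed=42):
    # Deduplicate on a normalized form of the text.
    seen, unique = set(), []
    for text, label in samples:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    # Downsample the majority class so both labels are equally represented.
    pos = [s for s in unique if s[1] == 1]
    neg = [s for s in unique if s[1] == 0]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced

data = dedup_and_balance(pool)
```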

## ⚙️ Training

Fine-tuned with the Hugging Face `Trainer` using the AdamW optimizer, a linear learning-rate schedule with warmup, and early stopping on validation F1.
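The hyperparameters recorded in `training_results.json` can be expressed as a `TrainingArguments` configuration. This is a reconstruction from the logged values, not the original training script; the output directory name is a placeholder:

```python
from transformers import TrainingArguments

# Reconstructed from training_results.json; "./roberta-clickbait" is a
# placeholder output path, not the original one.
args = TrainingArguments(
    output_dir="./roberta-clickbait",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,   # effective batch size 64
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    num_train_epochs=5,
    seed=42,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # early stopping (patience 2) monitors F1
)
```

Early stopping itself would be attached via an `EarlyStoppingCallback(early_stopping_patience=2)` passed to the `Trainer`.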

## 💡 Use Cases

- **News aggregators** — filter low-quality clickbait articles
- **Social media** — content moderation and feed quality scoring
- **Browser extensions** — warn users about clickbait headlines
- **Email filters** — detect clickbait-style subject lines
- **Content platforms** — automated content quality assessment

## ⚠️ Limitations

- English only
- Optimized for short texts (headlines, titles, tweets); longer texts will be truncated to 128 tokens
- Reflects patterns and biases present in the training data sources
config.json ADDED
@@ -0,0 +1,37 @@
```json
{
  "add_cross_attention": false,
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "dtype": "float32",
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "Non-Clickbait",
    "1": "Clickbait"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "label2id": {
    "Clickbait": 1,
    "Non-Clickbait": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "problem_type": "single_label_classification",
  "tie_word_embeddings": true,
  "transformers_version": "5.1.0",
  "type_vocab_size": 1,
  "use_cache": false,
  "vocab_size": 50265
}
```
model.safetensors ADDED
@@ -0,0 +1,3 @@
```
version https://git-lfs.github.com/spec/v1
oid sha256:217e1e1259a57f18f9e5558f0a064550c55aac544a622e4990660b6d1f6bf91f
size 498612800
```
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,16 @@
```json
{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "errors": "replace",
  "is_local": false,
  "mask_token": "<mask>",
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "RobertaTokenizer",
  "trim_offsets": true,
  "unk_token": "<unk>"
}
```
training_results.json ADDED
@@ -0,0 +1,83 @@
```json
{
  "model_name": "roberta-base",
  "training_config": {
    "max_length": 128,
    "batch_size": 16,
    "grad_accum_steps": 4,
    "effective_batch_size": 64,
    "learning_rate": 2e-05,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "label_smoothing": 0.0,
    "epochs_trained": 5,
    "max_epochs": 5,
    "early_stopping_patience": 2,
    "seed": 42
  },
  "test_metrics": {
    "loss": 0.1989,
    "accuracy": 0.9215,
    "f1": 0.9197,
    "precision": 0.9431,
    "recall": 0.8975
  },
  "training_log": [
    {
      "epoch": 1.0,
      "eval_loss": 0.21930797398090363,
      "eval_accuracy": 0.9154668860551214,
      "eval_f1": 0.9150650960942344,
      "eval_precision": 0.9275240888144114,
      "eval_recall": 0.9029363784665579
    },
    {
      "epoch": 2.0,
      "eval_loss": 0.21582643687725067,
      "eval_accuracy": 0.9164952694364459,
      "eval_f1": 0.9156626506024096,
      "eval_precision": 0.9331075359864521,
      "eval_recall": 0.8988580750407831
    },
    {
      "epoch": 3.0,
      "eval_loss": 0.22042229771614075,
      "eval_accuracy": 0.9127930892636775,
      "eval_f1": 0.9140308191403081,
      "eval_precision": 0.9088709677419354,
      "eval_recall": 0.9192495921696574
    },
    {
      "epoch": 4.0,
      "eval_loss": 0.2514384686946869,
      "eval_accuracy": 0.9127930892636775,
      "eval_f1": 0.9135752140236445,
      "eval_precision": 0.91320293398533,
      "eval_recall": 0.9139477977161501
    },
    {
      "epoch": 4.0,
      "eval_loss": 0.1989288628101349,
      "eval_accuracy": 0.9214638157894737,
      "eval_f1": 0.919714165615805,
      "eval_precision": 0.9431034482758621,
      "eval_recall": 0.8974569319114027
    }
  ],
  "confusion_matrix": [
    [
      1147,
      66
    ],
    [
      125,
      1094
    ]
  ],
  "training_time_minutes": 15.3,
  "timestamp": "2026-03-26T11:49:50.790207",
  "data_sizes": {
    "train": 41332,
    "validation": 4862,
    "test": 2432
  }
}
```
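The reported `test_metrics` can be re-derived from the confusion matrix in `training_results.json`. A small sketch, assuming rows are true labels and columns are predicted labels, both ordered `[Non-Clickbait, Clickbait]`:

```python
# Confusion matrix from training_results.json; the row/column convention
# (true labels x predicted labels, Non-Clickbait first) is an assumption.
cm = [[1147, 66],
      [125, 1094]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# → 0.9215 0.9431 0.8975 0.9197
```

These match the reported `test_metrics` to four decimals, which supports that convention.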