PushkarKumar committed
Commit ef81e58 · verified · 1 Parent(s): ea4440c

Update README.md

Files changed (1)
  1. README.md +159 -3
README.md CHANGED
@@ -1,3 +1,159 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ base_model:
6
+ - allenai/longformer-base-4096
7
+ pipeline_tag: text-classification
8
+ tags:
9
+ - longformer
10
+ - fake-news-detection
11
+ - news
12
+ - misinformation
13
+ - multi-dataset
14
+ ---
15
+ # Veritas AI v2 — Multi-Dataset Fake News & Misinformation Classifier (Longformer)
16
+
17
+ > **Version:** 2.0 | **Previous version:** [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
18
+
19
+ A binary text-classification model that fine-tunes `allenai/longformer-base-4096` to classify long-form news articles as **REAL** or **FAKE**. It is an upgrade of `veritas_ai_new`, retrained on a larger, more diverse combination of datasets to improve generalization and robustness beyond a single news domain.
20
+
21
+ ---
22
+
23
+ ## Model
24
+
25
+ - **Base model:** `allenai/longformer-base-4096`
26
+ - **Task:** Binary text classification (REAL / FAKE)
27
+ - **Labels:** `0` = REAL, `1` = FAKE
28
+ - **Max sequence length used:** 1024 tokens
29
+ - **Parameters:** ~0.1B (same architecture as `longformer-base-4096` with a newly initialized 2-class classifier head)
30
+ - **Framework:** Hugging Face `transformers` (Trainer API)
31
+ - **Training platform:** Google Cloud Platform (Vertex AI)
32
+
33
+ ---
34
+
35
+ ## What's New in v2
36
+
37
+ - Trained on **multiple datasets** (multi-source) instead of only the ISOT Fake News Dataset used in v1
38
+ - Larger and more diverse training corpus for improved cross-domain generalization
39
+ - Additional preprocessing and dataset-balancing steps applied
40
+ - *(Further changelog details to be added)*
41
+
42
+ ---
43
+
44
+ ## Data
45
+
46
+ - **Datasets:** *(To be filled — list all datasets used)*
47
+ - **Languages:** English
48
+ - **Preprocessing:**
49
+   - Added `label` column: `0` for REAL, `1` for FAKE
50
+   - Concatenated `title` and `text` into `full_text`
51
+   - Shuffled combined data with `random_state=42`
52
+   - Multi-dataset merging and deduplication applied
53
+   - Train/test split: 80% / 20%, stratified by `label`
54
+ - **Dataset statistics:** *(To be filled — total examples, label distribution)*
55
+
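The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the function names `build_dataset` and `split_dataset`, the toy DataFrames, the choice to deduplicate on `full_text`, and the order of merging, deduplicating, and shuffling are all assumptions, since the card does not list the source datasets or the exact procedure.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def build_dataset(real_df: pd.DataFrame, fake_df: pd.DataFrame) -> pd.DataFrame:
    """Label, merge, deduplicate, and shuffle REAL and FAKE articles."""
    # Label convention from this card: 0 = REAL, 1 = FAKE.
    merged = pd.concat(
        [real_df.assign(label=0), fake_df.assign(label=1)],
        ignore_index=True,
    )
    # Join title and body into the single field fed to the tokenizer.
    merged["full_text"] = (
        merged["title"].fillna("") + " " + merged["text"].fillna("")
    ).str.strip()
    # Drop exact duplicates that appear when sources overlap (assumed criterion).
    merged = merged.drop_duplicates(subset="full_text")
    # Shuffle reproducibly with the seed stated in the card.
    return merged.sample(frac=1, random_state=42).reset_index(drop=True)


def split_dataset(df: pd.DataFrame):
    """80/20 train/test split that preserves the label ratio."""
    return train_test_split(
        df, test_size=0.2, stratify=df["label"], random_state=42
    )


# Toy data standing in for the real (unlisted) source datasets.
real = pd.DataFrame({
    "title": [f"Real headline {i}" for i in range(5)],
    "text": [f"Real body {i}" for i in range(5)],
})
fake = pd.DataFrame({
    # The last row duplicates the first one to exercise deduplication.
    "title": [f"Fake headline {i}" for i in range(5)] + ["Fake headline 0"],
    "text": [f"Fake body {i}" for i in range(5)] + ["Fake body 0"],
})
dataset = build_dataset(real, fake)
train_df, test_df = split_dataset(dataset)
```

Stratifying on `label` keeps the REAL/FAKE ratio the same in both splits, which matters when the merged sources are imbalanced.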
56
+ ---
57
+
58
+ ## Tokenization
59
+
60
+ - **Tokenizer:** `AutoTokenizer.from_pretrained("allenai/longformer-base-4096")`
61
+ - **Settings:**
62
+   - `padding="max_length"`
63
+   - `truncation=True`
64
+   - `max_length=1024`
65
+ - **Global attention mask:** First token (`<s>`, the Longformer/RoBERTa equivalent of `[CLS]`) set to 1, all others 0; applied during both training and inference
66
+
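For training, the tokenizer settings and the global attention mask can be applied together in one batch function. The sketch below assumes a `full_text` column and a hypothetical helper name `tokenize_batch`; it takes the tokenizer as an argument so it works with any callable that returns an `input_ids` list per example.

```python
def tokenize_batch(batch, tokenizer, max_length=1024):
    """Tokenize a batch of articles and attach a Longformer global attention mask."""
    encoded = tokenizer(
        batch["full_text"],
        padding="max_length",
        truncation=True,
        max_length=max_length,
    )
    # Global attention on position 0 only; every other token relies on
    # Longformer's local sliding-window attention.
    encoded["global_attention_mask"] = [
        [1] + [0] * (len(ids) - 1) for ids in encoded["input_ids"]
    ]
    return encoded

# Possible usage with the `datasets` library (assumed, not shown in the card):
# tokenized = dataset.map(lambda b: tokenize_batch(b, tokenizer), batched=True)
```

Keeping the mask as plain lists lets the default data collator turn it into a tensor alongside `input_ids` and `attention_mask`.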
67
+ ---
68
+
69
+ ## Training Setup
70
+
71
+ **Model init**
72
+
73
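+ ```python
+ from transformers import AutoModelForSequenceClassification
+
+ # Load the pretrained Longformer encoder with a newly initialized 2-class head.
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "allenai/longformer-base-4096",
+     num_labels=2,
+ )
+ ```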
79
+
80
+ **TrainingArguments**
81
+
82
+ - `evaluation_strategy` = `"epoch"` (renamed `eval_strategy` in recent `transformers` releases)
83
+ - `save_strategy` = `"epoch"`
84
+ - `learning_rate` = `2e-5`
85
+ - `per_device_train_batch_size` = `1`
86
+ - `per_device_eval_batch_size` = `1`
87
+ - `gradient_accumulation_steps` = `4`
88
+ - `num_train_epochs` = *(To be filled)*
89
+ - `weight_decay` = `0.01`
90
+ - `fp16` = `True`
91
+ - `gradient_checkpointing` = `True`
92
+ - `load_best_model_at_end` = `True`
93
+ - `push_to_hub` = `False`
94
+ - `report_to` = `"none"`
95
+
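As a rough sketch, the arguments listed above can be collected into a keyword dictionary before being passed to `TrainingArguments`. The helper names and the `output_dir` path are hypothetical; `num_train_epochs` is left out because the card does not state it. The `eval_key` parameter is an assumption added to cope with the `evaluation_strategy` / `eval_strategy` rename across `transformers` versions.

```python
def build_training_kwargs(output_dir: str, eval_key: str = "evaluation_strategy") -> dict:
    """Collect the hyperparameters listed in the card as TrainingArguments kwargs."""
    return {
        "output_dir": output_dir,  # hypothetical path, not taken from the card
        eval_key: "epoch",         # must match save_strategy for load_best_model_at_end
        "save_strategy": "epoch",
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 1,
        "per_device_eval_batch_size": 1,
        "gradient_accumulation_steps": 4,
        "weight_decay": 0.01,
        "fp16": True,
        "gradient_checkpointing": True,
        "load_best_model_at_end": True,
        "push_to_hub": False,
        "report_to": "none",
        # num_train_epochs is intentionally omitted: the card does not state it.
    }


def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


kwargs = build_training_kwargs("./veritas_ai_v2_checkpoints")
print(effective_batch_size(kwargs))  # 1 * 4 * 1 = 4 on a single device

# On a GPU machine with transformers installed, the dict would be used as:
# from transformers import TrainingArguments
# args = TrainingArguments(**kwargs)
```

With a per-device batch size of 1, gradient accumulation is what brings the effective batch to 4; together with `fp16` and gradient checkpointing it trades speed for a smaller memory footprint on 1024-token inputs.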
96
+ ---
97
+
98
+ ## Training and Evaluation
99
+
100
+ - **Epochs:** *(To be filled)*
101
+ - **Global steps:** *(To be filled)*
102
+ - **Training runtime:** *(To be filled)*
103
+ - **Losses:**
104
+   - Training loss: *(To be filled)*
105
+   - Validation loss: *(To be filled)*
106
+ - **Metrics:** *(To be filled — accuracy, F1, precision, recall if computed)*
107
+
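The card does not say whether or how metrics were computed. If they are added later, a `compute_metrics` function along these lines could be passed to `Trainer`. This is an assumed sketch, treating FAKE (label `1`) as the positive class, and it reports no results from this model.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy, precision, recall, and F1 with FAKE (label 1) as the positive class."""
    logits, labels = eval_pred
    preds = np.argmax(np.asarray(logits), axis=-1)
    labels = np.asarray(labels)

    # Confusion-matrix counts for the positive (FAKE) class.
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": float(np.mean(preds == labels)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Reporting precision and recall separately is useful here because the two error types differ in cost: a false FAKE flags a legitimate article, while a false REAL lets misinformation through.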
108
+ ---
109
+
110
+ ## Inference
111
+
112
+ Minimal example for using the model from the Hub:
113
+
114
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model_name = "PushkarKumar/veritas_ai_v2"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+ model.eval()
+
+ def classify(text: str):
+     # Tokenize with the same settings used during training.
+     inputs = tokenizer(
+         text,
+         padding="max_length",
+         truncation=True,
+         max_length=1024,
+         return_tensors="pt",
+     )
+     # Global attention on the first token only; all other positions stay local.
+     global_attention_mask = torch.zeros(
+         inputs["input_ids"].shape, dtype=torch.long
+     )
+     global_attention_mask[:, 0] = 1
+     inputs["global_attention_mask"] = global_attention_mask
+
+     with torch.no_grad():
+         outputs = model(**inputs)
+     # Convert logits to probabilities and return the top label with its score.
+     probs = torch.softmax(outputs.logits, dim=1)
+     label_id = int(torch.argmax(probs))
+     labels = {0: "REAL", 1: "FAKE"}
+     return labels[label_id], float(probs[0][label_id])
+ ```
144
+
145
+ ---
146
+
147
+ ## Limitations and Bias
148
+
149
+ - Trained primarily on English-language news datasets; performance on other languages is not guaranteed.
150
+ - Labels are based on data-source heuristics (e.g., credible outlets vs. unreliable sites), not article-level fact-checking, and may encode source or political bias.
151
+ - While trained on multiple datasets for broader coverage, the model may still underperform on highly specialized or domain-specific misinformation (e.g., scientific misinformation, satire).
152
+ - The model should **not** be used as an automated fact-checker or for high-stakes decisions without human oversight.
153
+
154
+ ---
155
+
156
+ ## Author
157
+
158
+ - **Author:** Pushkar Kumar
159
+ - **v1 (base):** [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)