PushkarKumar committed on
Commit 4f62dd1 · verified · 1 Parent(s): 610d82d

Update README.md

  1. README.md +228 -111
README.md CHANGED
@@ -1,159 +1,276 @@
1
  ---
2
  license: apache-2.0
3
  language:
4
- - en
5
- base_model:
6
- - allenai/longformer-base-4096
7
  pipeline_tag: text-classification
 
8
  tags:
9
- - longformer
10
- - fake-news-detection
11
- - news
12
- - misinformation
13
- - multi-dataset
 
 
 
 
14
  ---
15
- # Veritas AI v2 — Multi-Dataset Fake News & Misinformation Classifier (Longformer)
16
 
17
- > **Version:** 2.0 | **Previous version:** [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
18
 
19
- A binary text-classification model that fine-tunes `allenai/longformer-base-4096` to classify long-form news articles as **REAL** or **FAKE**. This is an upgraded version of `veritas_ai_new`, retrained on a significantly larger and more diverse multi-dataset combination to improve generalization and robustness beyond a single news domain.
 
 
 
 
20
 
21
  ---
22
 
23
- ## Model
 
 
 
 
 
 
 
 
 
 
24
 
25
- - **Base model:** `allenai/longformer-base-4096`
26
- - **Task:** Binary text classification (REAL / FAKE)
27
- - **Labels:** `0` = REAL, `1` = FAKE
28
- - **Max sequence length used:** 1024 tokens
29
- - **Parameters:** ~0.1B (same architecture as `longformer-base-4096` with a newly initialized 2-class classifier head)
30
- - **Framework:** Hugging Face `transformers` (Trainer API)
31
- - **Training platform:** Google Cloud Platform (Vertex AI)
 
 
 
 
 
 
 
 
 
32
 
33
  ---
34
 
35
- ## What's New in v2
36
 
37
- - Trained on **multiple datasets** (multi-source) instead of only the ISOT Fake News Dataset used in v1
38
- - Larger and more diverse training corpus for improved cross-domain generalization
39
- - Additional preprocessing and dataset-balancing steps applied
40
- - *(Further changelog details to be added)*
 
 
 
 
 
 
 
 
 
41
 
42
  ---
43
 
44
- ## Data
 
 
 
 
 
 
 
 
45
 
46
- - **Datasets:** *(To be filled — list all datasets used)*
47
- - **Languages:** English
48
- - **Preprocessing:**
49
- - Added `label` column: `0` for REAL, `1` for FAKE
50
- - Concatenated `title` and `text` into `full_text`
51
- - Shuffled combined data with `random_state=42`
52
- - Multi-dataset merging and deduplication applied
53
- - Train/test split: 80% / 20%, stratified by `label`
54
- - **Dataset statistics:** *(To be filled — total examples, label distribution)*
55
 
56
  ---
57
 
58
- ## Tokenization
 
 
 
 
 
 
 
 
 
59
 
60
- - **Tokenizer:** `AutoTokenizer.from_pretrained("allenai/longformer-base-4096")`
61
- - **Settings:**
62
- - `padding="max_length"`
63
- - `truncation=True`
64
- - `max_length=1024`
65
- - **Global attention mask:** First token (`[CLS]`) set to 1, rest 0 — applied during both training and inference
66
 
67
  ---
68
 
69
- ## Training Setup
70
-
71
- **Model init**
72
-
- ```python
- model = AutoModelForSequenceClassification.from_pretrained(
-     "allenai/longformer-base-4096",
-     num_labels=2,
- )
- ```
79
-
80
- **TrainingArguments**
81
-
82
- - `evaluation_strategy` = `"epoch"`
83
- - `save_strategy` = `"epoch"`
84
- - `learning_rate` = `2e-5`
85
- - `per_device_train_batch_size` = `1`
86
- - `per_device_eval_batch_size` = `1`
87
- - `gradient_accumulation_steps` = `4`
88
- - `num_train_epochs` = *(To be filled)*
89
- - `weight_decay` = `0.01`
90
- - `fp16` = `True`
91
- - `gradient_checkpointing` = `True`
92
- - `load_best_model_at_end` = `True`
93
- - `push_to_hub` = `False`
94
- - `report_to` = `"none"`
95
 
96
  ---
97
 
98
- ## Training and Evaluation
 
 
 
 
 
 
 
 
 
 
 
 
99
 
100
- - **Epochs:** *(To be filled)*
101
- - **Global steps:** *(To be filled)*
102
- - **Training runtime:** *(To be filled)*
103
- - **Losses:**
104
- - Training loss: *(To be filled)*
105
- - Validation loss: *(To be filled)*
106
- - **Metrics:** *(To be filled — accuracy, F1, precision, recall if computed)*
107
 
108
  ---
109
 
110
- ## Inference
111
 
112
- Minimal example for using the model from the Hub:
 
113
 
- ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
- import torch

- model_name = "PushkarKumar/veritas_ai_v2"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForSequenceClassification.from_pretrained(model_name)
- model.eval()

- def classify(text: str):
-     inputs = tokenizer(
-         text,
-         padding="max_length",
-         truncation=True,
-         max_length=1024,
-         return_tensors="pt",
-     )
-     global_attention_mask = torch.zeros(
-         inputs["input_ids"].shape, dtype=torch.long
-     )
-     global_attention_mask[:, 0] = 1
-     inputs["global_attention_mask"] = global_attention_mask

-     with torch.no_grad():
-         outputs = model(**inputs)
-     probs = torch.softmax(outputs.logits, dim=1)
-     label_id = int(torch.argmax(probs))
-     labels = {0: "REAL", 1: "FAKE"}
-     return labels[label_id], float(probs[0][label_id])
- ```
144
 
145
  ---
146
 
147
  ## Limitations and Bias
148
 
149
- - Trained primarily on English-language news datasets; performance on other languages is not guaranteed.
150
- - Labels are based on data-source heuristics (e.g., credible outlets vs. unreliable sites), not article-level fact-checking, and may encode source or political bias.
151
- - While trained on multiple datasets for broader coverage, the model may still underperform on highly specialized or domain-specific misinformation (e.g., scientific misinformation, satire).
152
- - The model should **not** be used as an automated fact-checker or for high-stakes decisions without human oversight.
 
 
 
 
 
 
 
 
 
153
 
154
  ---
155
 
156
- ## Author
157
 
158
- - **Author:** Pushkar Kumar
159
- - **v1 (base):** [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
 
 
1
  ---
2
  license: apache-2.0
3
  language:
4
+ - en
5
+ library_name: transformers
 
6
  pipeline_tag: text-classification
7
+ base_model: allenai/longformer-base-4096
8
  tags:
9
+ - text-classification
10
+ - longformer
11
+ - fake-news-detection
12
+ - misinformation-detection
13
+ - news-classification
14
+ - multi-dataset
15
+ - vertex-ai
16
+ - pytorch
17
+ - transformers
18
  ---
 
19
 
20
+ # Veritas AI v2: Multi-Dataset Fake News and Misinformation Classifier
21
 
22
+ Version: 2.0
23
+ Previous version: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
24
+
25
+ Veritas AI v2 is a long-context binary classifier fine-tuned from allenai/longformer-base-4096 to classify content as REAL or FAKE.
26
+ This version is a major upgrade over v1, moving from single-source training to multi-dataset training for stronger cross-domain robustness.
27
 
28
  ---
29
 
30
+ ## Why v2 Is a Major Upgrade
31
+
32
+ This release reflects a full production-style training effort:
33
+
34
+ - Multi-dataset training pipeline with unified label mapping
35
+ - Long-context architecture for article-length text
36
+ - Distributed training orchestration on Vertex AI
37
+ - Multi-destination artifact saving with upload retries and verification checks
38
+ - Metric-based checkpoint selection using weighted F1
39
+ - Early stopping for better generalization
40
+ - Rank-aware preprocessing and explicit process-group cleanup for long distributed cloud runs
41
 
42
+ ---
43
+
44
+ ## Model Overview
45
+
46
+ - Base model: allenai/longformer-base-4096
47
+ - Task: Binary text classification
48
+ - Labels:
49
+ - 0 = REAL
50
+ - 1 = FAKE
51
+ - Max sequence length: 1024
52
+ - Parameters: approximately 149M
53
+ - Framework stack:
54
+ - Hugging Face Transformers Trainer
55
+ - PyTorch
56
+ - Accelerate
57
+ - Training platform: Google Cloud Vertex AI
58
 
59
  ---
60
 
61
+ ## Training Data
62
+
63
+ This model was trained on a merged corpus from:
64
+
65
+ - ISOT Fake News Dataset
66
+ - True.csv
67
+ - Fake.csv
68
+ - LIAR
69
+ - train.tsv
70
+ - valid.tsv
71
+ - FEVER
72
+ - train.jsonl
73
+
74
+ Language: English
75
+
76
+ ### Label Harmonization
77
+
78
+ A consistent binary mapping was applied across all sources:
79
+
80
+ - ISOT:
81
+ - True.csv -> 0
82
+ - Fake.csv -> 1
83
+ - LIAR:
84
+ - false, barely-true, pants-fire -> 1
85
+ - all remaining LIAR labels -> 0
86
+ - FEVER:
87
+ - SUPPORTS -> 0
88
+ - REFUTES -> 1
89
+ - NOT ENOUGH INFO excluded
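
The mapping above can be sketched as a single function. This is a minimal illustration, not the project's actual code: the function name `harmonize_label`, the `source` keys, and the use of the ISOT file name as the raw label are assumptions made for the example.

```python
from typing import Optional

# LIAR ratings mapped to FAKE (1), as listed above; every other LIAR rating maps to REAL (0).
LIAR_FAKE = {"false", "barely-true", "pants-fire"}


def harmonize_label(source: str, raw_label: str) -> Optional[int]:
    """Return 0 (REAL), 1 (FAKE), or None when the row should be excluded.

    `source` and the ISOT file-name convention are hypothetical names for this sketch.
    """
    source = source.lower()
    if source == "isot":
        # ISOT rows are labelled by the file they came from.
        return {"True.csv": 0, "Fake.csv": 1}[raw_label]
    if source == "liar":
        return 1 if raw_label in LIAR_FAKE else 0
    if source == "fever":
        # NOT ENOUGH INFO is absent from the dict, so .get() returns None (row excluded).
        return {"SUPPORTS": 0, "REFUTES": 1}.get(raw_label)
    raise ValueError(f"unknown source: {source!r}")
```

Returning `None` for excluded rows keeps the exclusion rule in the same place as the mapping, so a caller can filter with a single `is not None` check.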
90
 
91
+ ### Text Construction
92
+
93
+ - ISOT input text: title + text
94
+ - LIAR input text: statement + speaker
95
+ - FEVER input text: claim
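
As a rough sketch of the per-source text construction, assuming each record is a dict keyed by the field names listed above and that fields are joined with a single space (the separator and the function name `build_fulltext` are assumptions, not confirmed by the source):

```python
def build_fulltext(source: str, record: dict) -> str:
    """Build the model input text for one record from the fields used per source."""
    source = source.lower()
    if source == "isot":
        parts = [record.get("title", ""), record.get("text", "")]
    elif source == "liar":
        parts = [record.get("statement", ""), record.get("speaker", "")]
    elif source == "fever":
        parts = [record.get("claim", "")]
    else:
        raise ValueError(f"unknown source: {source!r}")
    # Skip missing or whitespace-only fields, then join with a single space (assumed separator).
    return " ".join(p.strip() for p in parts if p and p.strip())
```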
96
+
97
+ ### Data Processing
98
+
99
+ - Unified all sources to a common two-field schema: fulltext and label
100
+ - Dropped empty and trivial text rows
101
+ - Merged all sources into one corpus
102
+ - Shuffled with seed 42
103
+ - Train/test split: 90/10 with seed 42
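
The processing steps above could look roughly like the following standard-library sketch. The function name `build_corpus`, the `min_chars` threshold used to define "trivial" text, and the use of Python's `random` module are all assumptions; the actual pipeline may use different tooling and will therefore produce a different (but equally seeded) shuffle.

```python
import random


def build_corpus(rows, min_chars=10, test_fraction=0.10, seed=42):
    """Clean, shuffle, and split (fulltext, label) pairs into train and test lists.

    `rows` is an iterable of pairs already in the unified schema.
    `min_chars` is a hypothetical cutoff for "trivial" text.
    """
    # Drop empty and trivial texts.
    cleaned = [(t.strip(), y) for t, y in rows if t and len(t.strip()) >= min_chars]
    # Shuffle deterministically with the documented seed.
    rng = random.Random(seed)
    rng.shuffle(cleaned)
    # Hold out the first 10% of the shuffled corpus as the test split.
    n_test = int(round(len(cleaned) * test_fraction))
    return cleaned[n_test:], cleaned[:n_test]
```

Because the shuffle is seeded, calling the function twice on the same input yields the same split, which is what makes the 90/10 partition reproducible.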
104
 
105
  ---
106
 
107
+ ## Tokenization and Longformer Attention
108
+
109
+ Tokenizer:
110
+ - AutoTokenizer from allenai/longformer-base-4096
111
+
112
+ Tokenization config:
113
+ - padding: max_length
114
+ - truncation: true
115
+ - max_length: 1024
116
 
117
+ Global attention mask:
118
+ - first token set to 1
119
+ - all remaining tokens set to 0
120
+
121
+ This global-attention setup is applied in both training and inference.
 
 
 
 
122
 
123
  ---
124
 
125
+ ## Training Configuration
126
+
127
+ Model initialization:
128
+
+ from transformers import AutoModelForSequenceClassification
+
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "allenai/longformer-base-4096",
+     num_labels=2,
+ )
135
 
136
+ Training arguments used for v2:
137
+
138
+ - evaluation_strategy: epoch
139
+ - save_strategy: epoch
140
+ - learning_rate: 2e-5
141
+ - per_device_train_batch_size: 8
142
+ - per_device_eval_batch_size: 8
143
+ - gradient_accumulation_steps: 2
144
+ - num_train_epochs: 3
145
+ - warmup_ratio: 0.06
146
+ - weight_decay: 0.01
147
+ - lr_scheduler_type: cosine
148
+ - label_smoothing_factor: 0.1
149
+ - fp16: true
150
+ - tf32: true
151
+ - gradient_checkpointing: false
152
+ - load_best_model_at_end: true
153
+ - metric_for_best_model: f1
154
+ - early_stopping_patience: 2 (in transformers this is an EarlyStoppingCallback argument, not a TrainingArguments field)
155
+ - save_total_limit: 2
156
+ - push_to_hub: false
157
+ - report_to: none
158
+ - logging_strategy: steps
159
+ - logging_steps: 10
160
+ - ddp_find_unused_parameters: false
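
To make the batch-related settings concrete, the sketch below derives the effective batch size and an approximate step count from the values listed above. The number of devices and the training-set size are not stated in this card, so both are hypothetical parameters; the step arithmetic only approximates the Trainer's own bookkeeping, which can differ slightly between library versions.

```python
import math

# Values taken from the training arguments listed above.
PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 2
NUM_EPOCHS = 3
WARMUP_RATIO = 0.06


def schedule_summary(num_train_examples: int, num_devices: int) -> dict:
    """Estimate the effective batch size and optimizer-step counts for a run."""
    # Examples consumed per optimizer step across all devices.
    effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * num_devices
    # Dataloader batches seen per epoch by each device.
    batches_per_epoch = math.ceil(num_train_examples / (PER_DEVICE_BATCH * num_devices))
    # One optimizer step every GRAD_ACCUM_STEPS batches.
    steps_per_epoch = max(1, batches_per_epoch // GRAD_ACCUM_STEPS)
    total_steps = steps_per_epoch * NUM_EPOCHS
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }
```

For example, `schedule_summary(16000, 4)` describes a hypothetical 4-device run over 16,000 training examples: each optimizer step would consume 8 x 2 x 4 = 64 examples.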
161
 
162
  ---
163
 
164
+ ## Evaluation
165
+
166
+ Metrics computed during validation:
167
+ - accuracy
168
+ - weighted F1
169
+
170
+ Best checkpoint selection:
171
+ - weighted F1
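
For reference, weighted F1 averages the per-class F1 scores, weighting each class by its share of the true labels. The pure-Python sketch below shows the computation; it is not the project's `compute_metrics` function, and the function names are illustrative. In practice scikit-learn's `f1_score(..., average="weighted")` is a common way to compute the same quantity.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total
    return score
```

Unlike accuracy, this score penalizes a model that does well only on the majority class, which matters when the merged sources are not perfectly balanced.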
172
+
173
+ Final run statistics are not reported in this card; they can be appended from the trainer logs:
174
+ - global steps
175
+ - training runtime
176
+ - final training loss
177
+ - final validation loss
178
+ - final accuracy
179
+ - final weighted F1
 
 
 
 
 
 
 
 
 
 
180
 
181
  ---
182
 
183
+ ## Reliability and Engineering Notes
184
+
185
+ This project includes reliability safeguards for long cloud runs:
186
+
187
+ - Distributed launch through Accelerate
188
+ - Rank-aware preprocessing to avoid cache write collisions
189
+ - Explicit distributed process-group cleanup to avoid NCCL warnings
190
+ - Multi-destination save strategy:
191
+ - Vertex model output path
192
+ - primary GCS path
193
+ - timestamped backup GCS path
194
+ - local backup copy
195
+ - Upload retry logic with verification checks
196
 
197
+ These controls were added to avoid silent artifact-loss failures after long training jobs.
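
The multi-destination save with retries and verification might be structured along these lines. This is a sketch under stated assumptions: `upload` and `verify` are hypothetical callables supplied by the caller (for example, wrappers around a GCS client), and the retry count and exponential backoff are illustrative choices rather than the project's actual settings.

```python
import time


def save_everywhere(upload, verify, destinations,
                    max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Upload to every destination, retrying until verification passes.

    upload(dest) may raise on failure; verify(dest) returns True when the
    artifact is confirmed present. Both are hypothetical caller-supplied hooks.
    Returns a dict mapping each destination to True (verified) or False.
    """
    results = {}
    for dest in destinations:
        ok = False
        for attempt in range(max_attempts):
            try:
                upload(dest)
                ok = verify(dest)
            except Exception:
                ok = False
            if ok:
                break
            if attempt < max_attempts - 1:
                # Exponential backoff before the next attempt.
                sleep(base_delay * 2 ** attempt)
        results[dest] = ok
    return results
```

A failure at one destination does not stop the others from being tried, so a single unreachable bucket cannot cause the whole artifact to be lost; the caller can inspect the returned dict and fail loudly if no destination was verified.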
 
 
 
 
 
 
198
 
199
  ---
200
 
201
+ ## Inference Example
202
 
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch

+ model_name = "PushkarKumar/veritas_ai_v2"

+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+ model.eval()

+ id2label = {0: "REAL", 1: "FAKE"}
+
+ def classify(text: str):
+     inputs = tokenizer(
+         text,
+         padding="max_length",
+         truncation=True,
+         max_length=1024,
+         return_tensors="pt",
+     )
+
+     global_attention_mask = torch.zeros_like(inputs["input_ids"])
+     global_attention_mask[:, 0] = 1
+     inputs["global_attention_mask"] = global_attention_mask
+
+     with torch.no_grad():
+         outputs = model(**inputs)

+     probs = torch.softmax(outputs.logits, dim=-1)
+     pred_id = int(torch.argmax(probs, dim=-1).item())
+
+     return {
+         "label": id2label[pred_id],
+         "score": float(probs[0, pred_id]),
+     }
237
+
238
+ ---
239
+
240
+ ## Intended Use
241
+
242
+ Recommended:
243
+ - misinformation research
244
+ - content triage with human review
245
+ - NLP prototyping and benchmarking
246
+
247
+ Not recommended:
248
+ - fully automated moderation without human oversight
249
+ - legal, medical, civic, or safety-critical decision-making
250
+ - standalone fact-checking without external evidence workflows
251
 
252
  ---
253
 
254
  ## Limitations and Bias
255
 
256
+ - English-focused training data; multilingual performance is not guaranteed
257
+ - Dataset-derived labels can carry source/style/political bias
258
+ - Mixed claim-style and article-style supervision can create domain-shift effects
259
+ - Performance may degrade on niche misinformation domains
260
+ - Confidence scores are not factual certainty
261
+ - Model outputs should support, not replace, human fact-checkers
262
+
263
+ ---
264
+
265
+ ## Ethical Use
266
+
267
+ This model should be used as an assistive signal, not an autonomous truth system.
268
+ Predictions should be reviewed with evidence retrieval, source validation, and human judgment.
269
 
270
  ---
271
 
272
+ ## Author and Versioning
273
 
274
+ - Author: Pushkar Kumar
275
+ - Previous release: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
276
+ - Current release: Veritas AI v2