---
license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: text-classification
base_model: allenai/longformer-base-4096
tags:
  - text-classification
  - longformer
  - fake-news-detection
  - misinformation-detection
  - news-classification
  - multi-dataset
  - vertex-ai
  - pytorch
  - transformers
---

# Veritas AI v2: Multi-Dataset Fake News and Misinformation Classifier

Version: 2.0  
Previous version: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)

Veritas AI v2 is a long-context binary classifier fine-tuned from `allenai/longformer-base-4096` to classify content as REAL or FAKE.  
It is a major upgrade over v1: training moves from a single source to a merged multi-dataset corpus, with the aim of better cross-domain robustness.

---

## Why v2 Is a Major Upgrade

This release was trained with a production-style pipeline:

- Multi-dataset training pipeline with unified label mapping
- Long-context architecture for article-length text
- Distributed training orchestration on Vertex AI
- Reliability-focused artifact save strategy
- Metric-based checkpoint selection using weighted F1
- Early stopping for better generalization
- Hardened cloud training flow for long runs

---

## Model Overview

- Base model: allenai/longformer-base-4096
- Task: Binary text classification
- Labels:
  - 0 = REAL
  - 1 = FAKE
- Max sequence length: 1024
- Parameter count: about 149M
- Framework stack:
  - Hugging Face Transformers Trainer
  - PyTorch
  - Accelerate
- Training platform: Google Cloud Vertex AI

---

## Training Data

This model was trained on a merged corpus from:

- ISOT Fake News Dataset
  - True.csv
  - Fake.csv
- LIAR
  - train.tsv
  - valid.tsv
- FEVER
  - train.jsonl

Language: English

### Label Harmonization

A consistent binary mapping was applied across all sources:

- ISOT:
  - True.csv -> 0
  - Fake.csv -> 1
- LIAR:
  - false, barely-true, pants-fire -> 1
  - all remaining LIAR labels (half-true, mostly-true, true) -> 0
- FEVER:
  - SUPPORTS -> 0
  - REFUTES -> 1
  - NOT ENOUGH INFO excluded
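
The mapping above can be sketched as a single lookup function. This is a minimal illustration, not the project's actual preprocessing code: the function name `harmonize_label`, the source keys (`"isot"`, `"liar"`, `"fever"`), and the use of the ISOT file name as the raw label are assumptions made for the example.

```python
from typing import Optional

# LIAR labels treated as FAKE (1); every other LIAR label maps to REAL (0).
LIAR_FAKE_LABELS = {"false", "barely-true", "pants-fire"}

# FEVER labels; NOT ENOUGH INFO is deliberately absent so those rows are dropped.
FEVER_LABELS = {"SUPPORTS": 0, "REFUTES": 1}

# ISOT has no label column; the label follows from the file a row came from.
ISOT_FILES = {"True.csv": 0, "Fake.csv": 1}


def harmonize_label(source: str, raw_label: str) -> Optional[int]:
    """Return 0 (REAL), 1 (FAKE), or None when the row should be excluded."""
    if source == "isot":
        return ISOT_FILES.get(raw_label)
    if source == "liar":
        return 1 if raw_label.strip().lower() in LIAR_FAKE_LABELS else 0
    if source == "fever":
        return FEVER_LABELS.get(raw_label.strip().upper())
    raise ValueError(f"unknown source: {source!r}")
```

Rows for which the function returns `None` (for example FEVER claims labelled NOT ENOUGH INFO) would be filtered out before merging.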

### Text Construction

- ISOT input text: title + text
- LIAR input text: statement + speaker
- FEVER input text: claim
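
A rough sketch of how the per-source input text could be assembled is shown below. The field names (`title`, `text`, `statement`, `speaker`, `claim`), the single-space separator, and the function name `build_fulltext` are assumptions for illustration; the card does not specify how the fields are joined.

```python
def build_fulltext(source: str, row: dict) -> str:
    """Join the fields used for each source into one input string."""
    if source == "isot":
        parts = [row.get("title", ""), row.get("text", "")]
    elif source == "liar":
        parts = [row.get("statement", ""), row.get("speaker", "")]
    elif source == "fever":
        parts = [row.get("claim", "")]
    else:
        raise ValueError(f"unknown source: {source!r}")
    # Skip missing or whitespace-only fields so no stray separators remain.
    return " ".join(p.strip() for p in parts if p and p.strip())
```

Under these assumptions, `build_fulltext("liar", {"statement": "Taxes went up.", "speaker": "jane-doe"})` yields `"Taxes went up. jane-doe"`.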

### Data Processing

- Unified every source into a two-column schema: `fulltext` and `label`
- Dropped empty and trivial text rows
- Merged all sources into one corpus
- Shuffled with seed 42
- Train/test split: 90/10 with seed 42
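
The cleaning, shuffling, and splitting steps can be illustrated with the standard library alone. This is a simplified sketch: it uses one seeded shuffle followed by a slice rather than a separate shuffle and split, the `min_chars` threshold for "trivial" rows is a hypothetical parameter, and the real pipeline may well use a dataset library instead.

```python
import random


def clean_shuffle_split(rows, seed=42, test_fraction=0.1, min_chars=1):
    """Drop near-empty rows, shuffle deterministically, and split 90/10."""
    # Keep only rows whose text has at least `min_chars` non-space characters.
    kept = [r for r in rows if len(r["fulltext"].strip()) >= min_chars]
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(kept)
    n_test = int(round(len(kept) * test_fraction))
    return kept[n_test:], kept[:n_test]  # (train, test)
```

Because the seed is fixed, calling the function twice on the same input returns the same train and test rows.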

---

## Tokenization and Longformer Attention

Tokenizer:
- AutoTokenizer from allenai/longformer-base-4096

Tokenization config:
- padding: max_length
- truncation: true
- max_length: 1024

Global attention mask:
- first token set to 1
- all remaining tokens set to 0

This global-attention setup is applied in both training and inference.
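
As an illustration, the global attention mask can be added to a tokenized batch with plain Python lists, which is the shape a batched preprocessing function typically receives. The function name `add_global_attention` is hypothetical, and the snippet assumes `input_ids` is a list of token-id lists.

```python
def add_global_attention(batch: dict) -> dict:
    """Add a mask with global attention on the first token only."""
    batch["global_attention_mask"] = [
        # 1 for the first position (the <s> token), 0 everywhere else.
        [1 if i == 0 else 0 for i in range(len(ids))]
        for ids in batch["input_ids"]
    ]
    return batch
```

All other positions rely on Longformer's local sliding-window attention.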

---

## Training Configuration

Model initialization:

    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        "allenai/longformer-base-4096",
        num_labels=2,
    )

Training arguments used for v2:

- evaluation_strategy: epoch
- save_strategy: epoch
- learning_rate: 2e-5
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- gradient_accumulation_steps: 2
- num_train_epochs: 3
- warmup_ratio: 0.06
- weight_decay: 0.01
- lr_scheduler_type: cosine
- label_smoothing_factor: 0.1
- fp16: true
- tf32: true
- gradient_checkpointing: false
- load_best_model_at_end: true
- metric_for_best_model: f1 (weighted F1)
- early_stopping_patience: 2 (set on EarlyStoppingCallback rather than TrainingArguments)
- save_total_limit: 2
- push_to_hub: false
- report_to: none
- logging_strategy: steps
- logging_steps: 10
- ddp_find_unused_parameters: false
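
To see what these settings imply, the sketch below computes the effective batch size and an approximate step budget. The number of devices and the number of training examples are not stated in this card, so both are hypothetical inputs; the step counts use simple ceiling arithmetic and may differ slightly from what the Trainer reports, and early stopping can end a run sooner.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices):
    """Examples consumed per optimizer step across all devices."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_schedule(num_train_examples, num_devices, epochs=3,
                    per_device_batch=8, grad_accum_steps=2, warmup_ratio=0.06):
    """Approximate optimizer-step and warmup-step counts for a full run."""
    eff = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    steps_per_epoch = math.ceil(num_train_examples / eff)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch_size": eff,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }
```

For example, with a hypothetical 4 devices the effective batch size would be 8 x 2 x 4 = 64 examples per optimizer step.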

---

## Evaluation

Metrics computed during validation:
- accuracy
- weighted F1

Best checkpoint selection:
- weighted F1
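
For reference, the two metrics can be computed as follows. This is a dependency-free sketch of accuracy and support-weighted F1; the training code most likely relied on a metrics library instead, and the function name and the `"f1"` key are assumptions chosen to match `metric_for_best_model`.

```python
from collections import Counter


def accuracy_and_weighted_f1(y_true, y_pred):
    """Return accuracy and F1 averaged over classes, weighted by support."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    weighted_f1 = 0.0
    for cls, support in Counter(y_true).items():
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = support - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each class contributes in proportion to its share of true labels.
        weighted_f1 += f1 * support / n
    return {"accuracy": accuracy, "f1": weighted_f1}
```

Weighting by support means the majority class dominates the score when the REAL and FAKE classes are imbalanced.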

Final run statistics are not reported in this card. The following values can be added from the trainer logs:
- global steps
- training runtime
- final training loss
- final validation loss
- final accuracy
- final weighted F1

---

## Reliability and Engineering Notes

This project includes reliability safeguards for long cloud runs:

- Distributed launch through Accelerate
- Rank-aware preprocessing to avoid cache write collisions
- Explicit distributed process-group cleanup to avoid NCCL warnings
- Multi-destination save strategy:
  - Vertex model output path
  - primary GCS path
  - timestamped backup GCS path
  - local backup copy
- Upload retry logic with verification checks

These controls were added to avoid silent artifact-loss failures after long training jobs.
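
The retry-and-verify idea can be sketched generically. The code below is an illustration under assumptions, not the project's implementation: `upload` and `verify` are caller-supplied callables (for instance, wrapping a storage client), and the attempt count and exponential backoff are hypothetical defaults.

```python
import time
from typing import Callable


def upload_with_retry(upload: Callable[[], None],
                      verify: Callable[[], bool],
                      max_attempts: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `upload` until `verify` passes; return the attempt that succeeded."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            upload()
            if verify():
                return attempt
            last_error = RuntimeError("upload finished but verification failed")
        except Exception as exc:  # treat any upload error as retryable
            last_error = exc
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"upload failed after {max_attempts} attempts") from last_error
```

Running the same helper once per destination (primary path, timestamped backup, local copy) keeps a failure in one location from silently losing the only copy of the artifacts.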

---

## Inference Example

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    model_name = "PushkarKumar/veritas_ai_v2"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    id2label = {0: "REAL", 1: "FAKE"}

    def classify(text: str):
        inputs = tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=1024,
            return_tensors="pt",
        )

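        # Longformer global attention: only the first token attends globally,
        # matching the setup used during training.
        global_attention_mask = torch.zeros_like(inputs["input_ids"])
        global_attention_mask[:, 0] = 1
        inputs["global_attention_mask"] = global_attention_mask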

        with torch.no_grad():
            outputs = model(**inputs)

        probs = torch.softmax(outputs.logits, dim=-1)
        pred_id = int(torch.argmax(probs, dim=-1).item())

        return {
            "label": id2label[pred_id],
            "score": float(probs[0, pred_id]),
        }

---

## Intended Use

Recommended:
- misinformation research
- content triage with human review
- NLP prototyping and benchmarking

Not recommended:
- fully automated moderation without human oversight
- legal, medical, civic, or safety-critical decision-making
- standalone fact-checking without external evidence workflows

---

## Limitations and Bias

- English-focused training data; multilingual performance is not guaranteed
- Dataset-derived labels can carry source/style/political bias
- Mixed claim-style and article-style supervision can create domain-shift effects
- Performance may degrade on niche misinformation domains
- Confidence scores are model probabilities, not measures of factual certainty
- Model outputs should support, not replace, human fact-checkers

---

## Ethical Use

This model should be used as an assistive signal, not an autonomous truth system.  
Predictions should be reviewed with evidence retrieval, source validation, and human judgment.

---

## Author and Versioning

- Author: Pushkar Kumar
- Previous release: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
- Current release: Veritas AI v2