---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
base_model: allenai/longformer-base-4096
tags:
- text-classification
- longformer
- fake-news-detection
- misinformation-detection
- news-classification
- multi-dataset
- vertex-ai
- pytorch
- transformers
---
# Veritas AI v2: Multi-Dataset Fake News and Misinformation Classifier
Version: 2.0
Previous version: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
Veritas AI v2 is a long-context binary classifier fine-tuned from allenai/longformer-base-4096 to classify content as REAL or FAKE.
This version moves from the single-source training of v1 to multi-dataset training, with the aim of improving cross-domain robustness.
---
## Why v2 Is a Major Upgrade
This release reflects a full production-style training effort:
- Multi-dataset training pipeline with unified label mapping
- Long-context architecture for article-length text
- Distributed training orchestration on Vertex AI
- Reliability-focused artifact save strategy
- Metric-based checkpoint selection using weighted F1
- Early stopping for better generalization
- Hardened cloud training flow for long runs
---
## Model Overview
- Base model: allenai/longformer-base-4096
- Task: Binary text classification
- Labels:
  - 0 = REAL
  - 1 = FAKE
- Max sequence length: 1024
- Parameter count: about 149M
- Framework stack:
  - Hugging Face Transformers Trainer
  - PyTorch
  - Accelerate
- Training platform: Google Cloud Vertex AI
---
## Training Data
This model was trained on a merged corpus from:
- ISOT Fake News Dataset
  - True.csv
  - Fake.csv
- LIAR
  - train.tsv
  - valid.tsv
- FEVER
  - train.jsonl
Language: English
### Label Harmonization
A consistent binary mapping was applied across all sources:
- ISOT:
  - True.csv -> 0
  - Fake.csv -> 1
- LIAR:
  - false, barely-true, pants-fire -> 1
  - all remaining LIAR labels (half-true, mostly-true, true) -> 0
- FEVER:
  - SUPPORTS -> 0
  - REFUTES -> 1
  - NOT ENOUGH INFO excluded
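
The mapping above can be sketched as a single function. This is a minimal illustration, not the training script itself: the function name `harmonize_label`, the source keys (`"isot"`, `"liar"`, `"fever"`), and the convention of passing the ISOT file name as the raw label are all assumptions made for the example.

```python
from typing import Optional

# LIAR labels treated as FAKE, per the mapping listed above.
LIAR_FAKE_LABELS = {"false", "barely-true", "pants-fire"}


def harmonize_label(source: str, raw_label: str) -> Optional[int]:
    """Return 0 (REAL), 1 (FAKE), or None when the row should be dropped."""
    if source == "isot":
        # Assumption: for ISOT, raw_label is the name of the originating file.
        return {"True.csv": 0, "Fake.csv": 1}[raw_label]
    if source == "liar":
        # Every LIAR label outside the FAKE set maps to REAL.
        return 1 if raw_label in LIAR_FAKE_LABELS else 0
    if source == "fever":
        # NOT ENOUGH INFO has no entry, so .get() returns None (row excluded).
        return {"SUPPORTS": 0, "REFUTES": 1}.get(raw_label)
    raise ValueError(f"unknown source: {source}")
```

Rows for which the function returns `None` would be filtered out before the sources are merged.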
### Text Construction
- ISOT input text: title + text
- LIAR input text: statement + speaker
- FEVER input text: claim
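
A possible way to assemble the input text per source is sketched below. The card does not state which separator joins the fields, so the single space used here is an assumption, as are the function name `build_fulltext` and the dictionary-shaped rows.

```python
def build_fulltext(source: str, row: dict) -> str:
    """Join the per-source fields listed above into one input string."""
    if source == "isot":
        parts = [row.get("title", ""), row.get("text", "")]
    elif source == "liar":
        parts = [row.get("statement", ""), row.get("speaker", "")]
    elif source == "fever":
        parts = [row.get("claim", "")]
    else:
        raise ValueError(f"unknown source: {source}")
    # Assumption: fields are joined with a single space; empty fields are skipped.
    return " ".join(p.strip() for p in parts if p and p.strip())
```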
### Data Processing
- Unified all sources to a two-column schema: fulltext and label
- Dropped empty and trivial text rows
- Merged all sources into one corpus
- Shuffled with seed 42
- Train/test split: 90/10 with seed 42
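
The processing steps above can be approximated with the standard library alone. The card does not say which library performed the shuffle and split, so this sketch uses `random.Random` as a stand-in and collapses the shuffle and the split into one seeded pass. The `min_chars` threshold for "trivial" text and the function name `prepare_corpus` are hypothetical.

```python
import random


def prepare_corpus(rows, seed=42, test_fraction=0.1, min_chars=5):
    """Filter, shuffle, and split rows shaped like {"fulltext": str, "label": int}."""
    # Drop empty and trivially short rows (min_chars is an assumed threshold).
    kept = [r for r in rows if len(r["fulltext"].strip()) >= min_chars]
    # A seeded shuffle keeps the split reproducible across runs.
    rng = random.Random(seed)
    rng.shuffle(kept)
    # Hold out the first 10% after shuffling as the test split.
    n_test = int(round(len(kept) * test_fraction))
    return kept[n_test:], kept[:n_test]
```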
---
## Tokenization and Longformer Attention
Tokenizer:
- AutoTokenizer from allenai/longformer-base-4096
Tokenization config:
- padding: max_length
- truncation: true
- max_length: 1024
Global attention mask:
- first token set to 1
- all remaining tokens set to 0
This global-attention setup is applied in both training and inference.
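
For training-time preprocessing, the same mask can be built on plain token-id lists, for example inside a batched mapping function. The sketch below assumes a batch dictionary with an `input_ids` list of lists, which are never empty because inputs are padded to a fixed length; the function name `add_global_attention` is hypothetical.

```python
def add_global_attention(batch: dict) -> dict:
    """Add a global_attention_mask: 1 for the first token, 0 for all others."""
    batch["global_attention_mask"] = [
        [1] + [0] * (len(ids) - 1) for ids in batch["input_ids"]
    ]
    return batch
```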
---
## Training Configuration
Model initialization:
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=2,
)
```
Training arguments used for v2:
- evaluation_strategy: epoch
- save_strategy: epoch
- learning_rate: 2e-5
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- gradient_accumulation_steps: 2
- num_train_epochs: 3
- warmup_ratio: 0.06
- weight_decay: 0.01
- lr_scheduler_type: cosine
- label_smoothing_factor: 0.1
- fp16: true
- tf32: true
- gradient_checkpointing: false
- load_best_model_at_end: true
- metric_for_best_model: f1
- early_stopping_patience: 2 (set through EarlyStoppingCallback rather than TrainingArguments)
- save_total_limit: 2
- push_to_hub: false
- report_to: none
- logging_strategy: steps
- logging_steps: 10
- ddp_find_unused_parameters: false
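
To see how these settings interact, the effective batch size and warmup length can be worked out by hand. The card does not report the GPU count or the corpus size, so `num_gpus` and `num_train_examples` below are placeholders, and the step counts are approximate because the exact rounding depends on the Trainer version.

```python
import math

# Values taken from the training arguments listed above.
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_train_epochs = 3
warmup_ratio = 0.06


def schedule_summary(num_train_examples: int, num_gpus: int) -> dict:
    """Approximate optimizer-step counts for a given corpus size and GPU count."""
    # Examples consumed per optimizer step across all devices.
    effective_batch = (
        per_device_train_batch_size * gradient_accumulation_steps * num_gpus
    )
    steps_per_epoch = math.ceil(num_train_examples / effective_batch)
    total_steps = steps_per_epoch * num_train_epochs
    # Warmup covers the first warmup_ratio fraction of all optimizer steps.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Placeholder inputs: 100,000 training examples on 4 GPUs.
print(schedule_summary(100_000, 4))
```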
---
## Evaluation
Metrics computed during validation:
- accuracy
- weighted F1
Best checkpoint selection:
- weighted F1
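
For reference, the two metrics can be computed as in the dependency-free sketch below. It is intended to match the usual definition of weighted F1 (per-class F1 averaged with class-support weights, as in scikit-learn's `f1_score(average="weighted")`); the function name is hypothetical, and the `"f1"` key is chosen to line up with `metric_for_best_model: f1`.

```python
from collections import Counter


def accuracy_and_weighted_f1(y_true, y_pred) -> dict:
    """Compute accuracy and support-weighted F1 over integer class labels."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    support = Counter(y_true)
    weighted_f1 = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class's F1 by its share of the true labels.
        weighted_f1 += (count / n) * f1
    return {"accuracy": accuracy, "f1": weighted_f1}
```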
Final run statistics are not included in this card. The following can be appended from the trainer logs:
- global steps
- training runtime
- final training loss
- final validation loss
- final accuracy
- final weighted F1
---
## Reliability and Engineering Notes
This project includes reliability safeguards for long cloud runs:
- Distributed launch through Accelerate
- Rank-aware preprocessing to avoid cache write collisions
- Explicit distributed process-group cleanup to avoid NCCL warnings
- Multi-destination save strategy:
  - Vertex model output path
  - primary GCS path
  - timestamped backup GCS path
  - local backup copy
- Upload retry logic with verification checks
These controls were added to avoid silent artifact-loss failures after long training jobs.
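
The retry-and-verify pattern can be sketched generically. This is not the project's upload code: `upload` and `verify` are caller-supplied callables standing in for whatever storage client is used, and the attempt count, the exponential backoff, and the function name are assumptions made for illustration.

```python
import time
from typing import Callable


def upload_with_retry(
    upload: Callable[[str], None],
    verify: Callable[[str], bool],
    destination: str,
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Upload to one destination, retrying until verification passes."""
    for attempt in range(1, max_attempts + 1):
        try:
            upload(destination)
            # Only report success once the artifact is confirmed present.
            if verify(destination):
                return True
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            print(f"attempt {attempt} to {destination} failed: {exc}")
        if attempt < max_attempts:
            # Exponential backoff between attempts: 1x, 2x, 4x the base delay.
            sleep(base_delay_s * 2 ** (attempt - 1))
    return False
```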
---
## Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "PushkarKumar/veritas_ai_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

id2label = {0: "REAL", 1: "FAKE"}


def classify(text: str):
    # Tokenize with the same settings used during training.
    inputs = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    )
    # Global attention on the first token only, matching training.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    inputs["global_attention_mask"] = global_attention_mask

    with torch.no_grad():
        outputs = model(**inputs)

    # Convert logits to probabilities and pick the most likely class.
    probs = torch.softmax(outputs.logits, dim=-1)
    pred_id = int(torch.argmax(probs, dim=-1).item())
    return {
        "label": id2label[pred_id],
        "score": float(probs[0, pred_id]),
    }
```
---
## Intended Use
Recommended:
- misinformation research
- content triage with human review
- NLP prototyping and benchmarking
Not recommended:
- fully automated moderation without human oversight
- legal, medical, civic, or safety-critical decision-making
- standalone fact-checking without external evidence workflows
---
## Limitations and Bias
- English-focused training data; multilingual performance is not guaranteed
- Dataset-derived labels can carry source/style/political bias
- Mixed claim-style and article-style supervision can create domain-shift effects
- Performance may degrade on niche misinformation domains
- Confidence scores are not factual certainty
- Model outputs should support, not replace, human fact-checkers
---
## Ethical Use
This model should be used as an assistive signal, not an autonomous truth system.
Predictions should be reviewed with evidence retrieval, source validation, and human judgment.
---
## Author and Versioning
- Author: Pushkar Kumar
- Previous release: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
- Current release: Veritas AI v2