Veritas AI v2: Multi-Dataset Fake News and Misinformation Classifier
Version: 2.0
Previous version: PushkarKumar/veritas_ai_new
Veritas AI v2 is a long-context binary classifier fine-tuned from allenai/longformer-base-4096 to classify content as REAL or FAKE.
This version is a major upgrade over v1, moving from single-source training to multi-dataset training for stronger cross-domain robustness.
Why v2 Is a Major Upgrade
This release reflects a full production-style training effort:
- Multi-dataset training pipeline with unified label mapping
- Long-context architecture for article-length text
- Distributed training orchestration on Vertex AI
- Reliability-focused artifact save strategy
- Metric-based checkpoint selection using weighted F1
- Early stopping for better generalization
- Hardened cloud training flow for long runs
Model Overview
- Base model: allenai/longformer-base-4096
- Task: Binary text classification
- Labels:
- 0 = REAL
- 1 = FAKE
- Max sequence length: 1024
- Parameter count: approximately 149M
- Framework stack:
- Hugging Face Transformers Trainer
- PyTorch
- Accelerate
- Training platform: Google Cloud Vertex AI
Training Data
This model was trained on a merged corpus from:
- ISOT Fake News Dataset
- True.csv
- Fake.csv
- LIAR
- train.tsv
- valid.tsv
- FEVER
- train.jsonl
Language: English
Label Harmonization
A consistent binary mapping was applied across all sources:
- ISOT:
- True.csv -> 0
- Fake.csv -> 1
- LIAR:
- false, barely-true, pants-fire -> 1
- all remaining LIAR labels -> 0
- FEVER:
- SUPPORTS -> 0
- REFUTES -> 1
- NOT ENOUGH INFO excluded
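The mapping above can be sketched as a single harmonization function. This is an illustrative reconstruction, not the project's actual preprocessing code; the function and set names are hypothetical.

```python
from typing import Optional

# LIAR labels mapped to FAKE; all other LIAR labels map to REAL
LIAR_FAKE = {"false", "barely-true", "pants-fire"}

def harmonize_label(source: str, raw_label: str) -> Optional[int]:
    """Map a source-specific label to 0 (REAL) or 1 (FAKE); None means drop the row."""
    if source == "isot":
        # Rows from True.csv carry label "true"; rows from Fake.csv carry "fake"
        return 0 if raw_label == "true" else 1
    if source == "liar":
        return 1 if raw_label in LIAR_FAKE else 0
    if source == "fever":
        if raw_label == "SUPPORTS":
            return 0
        if raw_label == "REFUTES":
            return 1
        return None  # NOT ENOUGH INFO is excluded
    raise ValueError(f"unknown source: {source}")
```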
Text Construction
- ISOT input text: title + text
- LIAR input text: statement + speaker
- FEVER input text: claim
Data Processing
- Unified schema to fulltext and label
- Dropped empty and trivial text rows
- Merged all sources into one corpus
- Shuffled with seed 42
- Train/test split: 90/10 with seed 42
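The merge, filter, shuffle, and split steps can be sketched as follows, assuming each source has already been reduced to a DataFrame with `fulltext` and `label` columns. This is a minimal sketch, not the project's actual pipeline.

```python
import pandas as pd

def build_corpus(sources, seed: int = 42, test_frac: float = 0.1):
    """Merge per-source DataFrames, drop empty rows, shuffle, and split 90/10."""
    corpus = pd.concat(sources, ignore_index=True)
    # Drop rows with empty or whitespace-only text
    corpus = corpus[corpus["fulltext"].str.strip().str.len() > 0]
    # Shuffle deterministically with the same seed used for the split
    corpus = corpus.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(corpus) * test_frac)
    return corpus.iloc[n_test:], corpus.iloc[:n_test]  # (train, test)
```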
Tokenization and Longformer Attention
Tokenizer:
- AutoTokenizer from allenai/longformer-base-4096
Tokenization config:
- padding: max_length
- truncation: true
- max_length: 1024
Global attention mask:
- first token set to 1
- all remaining tokens set to 0
This global-attention setup is applied in both training and inference.
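Independent of the tokenizer, the global-attention mask described above reduces to a simple rule over the padded `input_ids`. A minimal sketch (the function name is illustrative):

```python
def make_global_attention_mask(input_ids):
    """Longformer-style mask: 1 = global attention (first token only),
    0 = local sliding-window attention for all remaining positions."""
    return [[1] + [0] * (len(row) - 1) for row in input_ids]
```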
Training Configuration
Model initialization:
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=2,
)
```
Training arguments used for v2:
- evaluation_strategy: epoch
- save_strategy: epoch
- learning_rate: 2e-5
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- gradient_accumulation_steps: 2
- num_train_epochs: 3
- warmup_ratio: 0.06
- weight_decay: 0.01
- lr_scheduler_type: cosine
- label_smoothing_factor: 0.1
- fp16: true
- tf32: true
- gradient_checkpointing: false
- load_best_model_at_end: true
- metric_for_best_model: f1
- early_stopping_patience: 2 (applied via EarlyStoppingCallback, not a TrainingArguments field)
- save_total_limit: 2
- push_to_hub: false
- report_to: none
- logging_strategy: steps
- logging_steps: 10
- ddp_find_unused_parameters: false
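The list above corresponds approximately to the following TrainingArguments. This is a hedged reconstruction: `output_dir` is illustrative, and argument spellings can differ across transformers versions (e.g. `evaluation_strategy` was later renamed `eval_strategy`).

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="veritas_ai_v2",          # illustrative path
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    warmup_ratio=0.06,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    label_smoothing_factor=0.1,
    fp16=True,
    tf32=True,
    gradient_checkpointing=False,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    push_to_hub=False,
    report_to="none",
    logging_strategy="steps",
    logging_steps=10,
    ddp_find_unused_parameters=False,
)
# early_stopping_patience: 2 is passed to the Trainer as
# EarlyStoppingCallback(early_stopping_patience=2)
```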
Evaluation
Metrics computed during validation:
- accuracy
- weighted F1
Best checkpoint selection:
- weighted F1
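A minimal sketch of the metric computation (accuracy and weighted F1), using scikit-learn; the function signature follows the Trainer's `compute_metrics` convention, but this is not necessarily the project's exact code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Return accuracy and weighted F1 from (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }
```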
The following run statistics, recoverable from the trainer logs, are relevant for reporting:
- global steps
- training runtime
- final training loss
- final validation loss
- final accuracy
- final weighted F1
Reliability and Engineering Notes
This project includes reliability safeguards for long cloud runs:
- Distributed launch through Accelerate
- Rank-aware preprocessing to avoid cache write collisions
- Explicit distributed process-group cleanup to avoid NCCL warnings
- Multi-destination save strategy:
- Vertex model output path
- primary GCS path
- timestamped backup GCS path
- local backup copy
- Upload retry logic with verification checks
These controls were added to avoid silent artifact-loss failures after long training jobs.
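The upload-retry-with-verification idea can be sketched generically. This is a hypothetical illustration: `upload` and `verify` stand in for GCS client calls and are not the project's actual code.

```python
import time

def upload_with_retry(upload, verify, retries: int = 3, backoff: float = 2.0) -> bool:
    """Attempt an upload, then verify the artifact landed; retry with linear backoff."""
    for attempt in range(1, retries + 1):
        try:
            upload()
            if verify():
                return True  # artifact confirmed present at the destination
        except Exception:
            pass  # transient failure: fall through to retry
        if attempt < retries:
            time.sleep(backoff * attempt)
    return False  # caller should treat this as a hard failure, not silence it
```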
Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "PushkarKumar/veritas_ai_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

id2label = {0: "REAL", 1: "FAKE"}

def classify(text: str):
    inputs = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    )
    # Global attention on the first token only, matching the training setup
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    inputs["global_attention_mask"] = global_attention_mask
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    pred_id = int(torch.argmax(probs, dim=-1).item())
    return {
        "label": id2label[pred_id],
        "score": float(probs[0, pred_id]),
    }

# Example usage:
# result = classify("Article text to verify...")
# print(result["label"], result["score"])
```
Intended Use
Recommended:
- misinformation research
- content triage with human review
- NLP prototyping and benchmarking
Not recommended:
- fully automated moderation without human oversight
- legal, medical, civic, or safety-critical decision-making
- standalone fact-checking without external evidence workflows
Limitations and Bias
- English-focused training data; multilingual performance is not guaranteed
- Dataset-derived labels can carry source/style/political bias
- Mixed claim-style and article-style supervision can create domain-shift effects
- Performance may degrade on niche misinformation domains
- Confidence scores are not factual certainty
- Model outputs should support, not replace, human fact-checkers
Ethical Use
This model should be used as an assistive signal, not an autonomous truth system.
Predictions should be reviewed with evidence retrieval, source validation, and human judgment.
Author and Versioning
- Author: Pushkar Kumar
- Previous release: PushkarKumar/veritas_ai_new
- Current release: Veritas AI v2