Veritas AI v2: Multi-Dataset Fake News and Misinformation Classifier
Version: 2.0
Previous version: PushkarKumar/veritas_ai_new
Veritas AI v2 is a long-context binary classifier fine-tuned from allenai/longformer-base-4096 to classify content as REAL or FAKE.
This version is a major upgrade over v1, moving from single-source training to multi-dataset training for stronger cross-domain robustness.
Why v2 Is a Major Upgrade
This release reflects a full production-style training effort:
- Multi-dataset training pipeline with unified label mapping
- Long-context architecture for article-length text
- Distributed training orchestration on Vertex AI
- Reliability-focused artifact save strategy
- Metric-based checkpoint selection using weighted F1
- Early stopping for better generalization
- Hardened cloud training flow for long runs
Model Overview
- Base model: allenai/longformer-base-4096
- Task: Binary text classification
- Labels:
- 0 = REAL
- 1 = FAKE
- Max sequence length: 1024
- Parameter count: approximately 149M
- Framework stack:
- Hugging Face Transformers Trainer
- PyTorch
- Accelerate
- Training platform: Google Cloud Vertex AI
Training Data
This model was trained on a merged corpus from:
- ISOT Fake News Dataset
- True.csv
- Fake.csv
- LIAR
- train.tsv
- valid.tsv
- FEVER
- train.jsonl
Language: English
Label Harmonization
A consistent binary mapping was applied across all sources:
- ISOT:
- True.csv -> 0
- Fake.csv -> 1
- LIAR:
- false, barely-true, pants-fire -> 1
- all remaining LIAR labels -> 0
- FEVER:
- SUPPORTS -> 0
- REFUTES -> 1
- NOT ENOUGH INFO excluded
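The mapping above can be sketched as a single harmonization function. This is an illustrative reconstruction, not the project's actual preprocessing code; the function and set names are hypothetical.

```python
from typing import Optional

# LIAR labels mapped to FAKE; all other LIAR labels map to REAL
LIAR_FAKE = {"false", "barely-true", "pants-fire"}

def harmonize_label(source: str, raw_label: str) -> Optional[int]:
    """Map a source-specific label to 0 (REAL) or 1 (FAKE); None means drop the row."""
    if source == "isot":
        # Rows from True.csv carry label "true"; rows from Fake.csv carry "fake"
        return 0 if raw_label == "true" else 1
    if source == "liar":
        return 1 if raw_label in LIAR_FAKE else 0
    if source == "fever":
        if raw_label == "SUPPORTS":
            return 0
        if raw_label == "REFUTES":
            return 1
        return None  # NOT ENOUGH INFO is excluded
    raise ValueError(f"unknown source: {source}")
```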
Text Construction
- ISOT input text: title + text
- LIAR input text: statement + speaker
- FEVER input text: claim
Data Processing
- Unified schema to fulltext and label
- Dropped empty and trivial text rows
- Merged all sources into one corpus
- Shuffled with seed 42
- Train/test split: 90/10 with seed 42
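The merge, filter, shuffle, and split steps can be sketched as follows, assuming each source has already been reduced to a DataFrame with `fulltext` and `label` columns. This is a minimal sketch, not the project's actual pipeline.

```python
import pandas as pd

def build_corpus(sources, seed: int = 42, test_frac: float = 0.1):
    """Merge per-source DataFrames, drop empty rows, shuffle, and split 90/10."""
    corpus = pd.concat(sources, ignore_index=True)
    # Drop rows with empty or whitespace-only text
    corpus = corpus[corpus["fulltext"].str.strip().str.len() > 0]
    # Shuffle deterministically with the same seed used for the split
    corpus = corpus.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(corpus) * test_frac)
    return corpus.iloc[n_test:], corpus.iloc[:n_test]  # (train, test)
```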
Tokenization and Longformer Attention
Tokenizer:
- AutoTokenizer from allenai/longformer-base-4096
Tokenization config:
- padding: max_length
- truncation: true
- max_length: 1024
Global attention mask:
- first token set to 1
- all remaining tokens set to 0
This global-attention setup is applied in both training and inference.
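Independent of the tokenizer, the global-attention mask described above reduces to a simple rule over the padded `input_ids`. A minimal sketch (the function name is illustrative):

```python
def make_global_attention_mask(input_ids):
    """Longformer-style mask: 1 = global attention (first token only),
    0 = local sliding-window attention for all remaining positions."""
    return [[1] + [0] * (len(row) - 1) for row in input_ids]
```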
Training Configuration
Model initialization:
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=2,
)
```
Training arguments used for v2:
- evaluation_strategy: epoch
- save_strategy: epoch
- learning_rate: 2e-5
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- gradient_accumulation_steps: 2
- num_train_epochs: 3
- warmup_ratio: 0.06
- weight_decay: 0.01
- lr_scheduler_type: cosine
- label_smoothing_factor: 0.1
- fp16: true
- tf32: true
- gradient_checkpointing: false
- load_best_model_at_end: true
- metric_for_best_model: f1
- early_stopping_patience: 2 (applied via EarlyStoppingCallback, not a TrainingArguments field)
- save_total_limit: 2
- push_to_hub: false
- report_to: none
- logging_strategy: steps
- logging_steps: 10
- ddp_find_unused_parameters: false
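The list above corresponds approximately to the following TrainingArguments. This is a hedged reconstruction: `output_dir` is illustrative, and argument spellings can differ across transformers versions (e.g. `evaluation_strategy` was later renamed `eval_strategy`).

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="veritas_ai_v2",          # illustrative path
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    warmup_ratio=0.06,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    label_smoothing_factor=0.1,
    fp16=True,
    tf32=True,
    gradient_checkpointing=False,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    push_to_hub=False,
    report_to="none",
    logging_strategy="steps",
    logging_steps=10,
    ddp_find_unused_parameters=False,
)
# early_stopping_patience: 2 is passed to the Trainer as
# EarlyStoppingCallback(early_stopping_patience=2)
```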
Evaluation
Metrics computed during validation:
- accuracy
- weighted F1
Best checkpoint selection:
- weighted F1
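A minimal sketch of the metric computation (accuracy and weighted F1), using scikit-learn; the function signature follows the Trainer's `compute_metrics` convention, but this is not necessarily the project's exact code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Return accuracy and weighted F1 from (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }
```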
The following run statistics, recoverable from the trainer logs, are relevant for reporting:
- global steps
- training runtime
- final training loss
- final validation loss
- final accuracy
- final weighted F1
Reliability and Engineering Notes
This project includes reliability safeguards for long cloud runs:
- Distributed launch through Accelerate
- Rank-aware preprocessing to avoid cache write collisions
- Explicit distributed process-group cleanup to avoid NCCL warnings
- Multi-destination save strategy:
- Vertex model output path
- primary GCS path
- timestamped backup GCS path
- local backup copy
- Upload retry logic with verification checks
These controls were added to avoid silent artifact-loss failures after long training jobs.
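The upload-retry-with-verification idea can be sketched generically. This is a hypothetical illustration: `upload` and `verify` stand in for GCS client calls and are not the project's actual code.

```python
import time

def upload_with_retry(upload, verify, retries: int = 3, backoff: float = 2.0) -> bool:
    """Attempt an upload, then verify the artifact landed; retry with linear backoff."""
    for attempt in range(1, retries + 1):
        try:
            upload()
            if verify():
                return True  # artifact confirmed present at the destination
        except Exception:
            pass  # transient failure: fall through to retry
        if attempt < retries:
            time.sleep(backoff * attempt)
    return False  # caller should treat this as a hard failure, not silence it
```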
Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "PushkarKumar/veritas_ai_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

id2label = {0: "REAL", 1: "FAKE"}

def classify(text: str):
    inputs = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    )
    # Global attention on the first token only, matching the training setup
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    inputs["global_attention_mask"] = global_attention_mask
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    pred_id = int(torch.argmax(probs, dim=-1).item())
    return {
        "label": id2label[pred_id],
        "score": float(probs[0, pred_id]),
    }

# Example usage:
# result = classify("Article text to verify...")
# print(result["label"], result["score"])
```
Intended Use
Recommended:
- misinformation research
- content triage with human review
- NLP prototyping and benchmarking
Not recommended:
- fully automated moderation without human oversight
- legal, medical, civic, or safety-critical decision-making
- standalone fact-checking without external evidence workflows
Limitations and Bias
- English-focused training data; multilingual performance is not guaranteed
- Dataset-derived labels can carry source/style/political bias
- Mixed claim-style and article-style supervision can create domain-shift effects
- Performance may degrade on niche misinformation domains
- Confidence scores are not factual certainty
- Model outputs should support, not replace, human fact-checkers
Ethical Use
This model should be used as an assistive signal, not an autonomous truth system.
Predictions should be reviewed with evidence retrieval, source validation, and human judgment.
Author and Versioning
- Author: Pushkar Kumar
- Previous release: PushkarKumar/veritas_ai_new
- Current release: Veritas AI v2