Veritas AI v2: Multi-Dataset Fake News and Misinformation Classifier

Version: 2.0
Previous version: PushkarKumar/veritas_ai_new

Veritas AI v2 is a long-context binary classifier fine-tuned from allenai/longformer-base-4096 to classify content as REAL or FAKE.
This version is a major upgrade over v1, moving from single-source training to multi-dataset training for stronger cross-domain robustness.


Why v2 Is a Major Upgrade

This release reflects a full production-style training effort:

  • Multi-dataset training pipeline with unified label mapping
  • Long-context architecture for article-length text
  • Distributed training orchestration on Vertex AI
  • Reliability-focused artifact save strategy
  • Metric-based checkpoint selection using weighted F1
  • Early stopping for better generalization
  • Hardened cloud training flow for long runs

Model Overview

  • Base model: allenai/longformer-base-4096
  • Task: Binary text classification
  • Labels:
    • 0 = REAL
    • 1 = FAKE
  • Max sequence length: 1024 tokens (the base model supports up to 4096)
  • Parameter count: ~149M
  • Framework stack:
    • Hugging Face Transformers Trainer
    • PyTorch
    • Accelerate
  • Training platform: Google Cloud Vertex AI

Training Data

This model was trained on a merged corpus from:

  • ISOT Fake News Dataset
    • True.csv
    • Fake.csv
  • LIAR
    • train.tsv
    • valid.tsv
  • FEVER
    • train.jsonl

Language: English

Label Harmonization

A consistent binary mapping was applied across all sources (a combined loading sketch follows the Text Construction list below):

  • ISOT:
    • True.csv -> 0
    • Fake.csv -> 1
  • LIAR:
    • false, barely-true, pants-fire -> 1
    • all remaining LIAR labels -> 0
  • FEVER:
    • SUPPORTS -> 0
    • REFUTES -> 1
    • NOT ENOUGH INFO excluded

Text Construction

  • ISOT input text: title + text
  • LIAR input text: statement + speaker
  • FEVER input text: claim
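
The label mapping and text construction above amount to one small loader per source. A minimal sketch, assuming the file layouts and column order of the standard public releases of each dataset (the exact v2 pipeline may differ):

import json
import pandas as pd

LIAR_FAKE = {"false", "barely-true", "pants-fire"}

def load_isot(true_csv="True.csv", fake_csv="Fake.csv"):
    real = pd.read_csv(true_csv).assign(label=0)
    fake = pd.read_csv(fake_csv).assign(label=1)
    df = pd.concat([real, fake], ignore_index=True)
    # ISOT input text: title + text
    df["fulltext"] = df["title"].fillna("") + " " + df["text"].fillna("")
    return df[["fulltext", "label"]]

def load_liar(tsv_path):
    # Standard LIAR column order: 0=id, 1=label, 2=statement, 4=speaker
    df = pd.read_csv(tsv_path, sep="\t", header=None)
    out = pd.DataFrame()
    out["fulltext"] = df[2].fillna("") + " " + df[4].fillna("")
    out["label"] = df[1].isin(LIAR_FAKE).astype(int)
    return out

def load_fever(jsonl_path="train.jsonl"):
    fever_map = {"SUPPORTS": 0, "REFUTES": 1}  # NOT ENOUGH INFO excluded
    rows = []
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["label"] in fever_map:
                rows.append({"fulltext": rec["claim"], "label": fever_map[rec["label"]]})
    return pd.DataFrame(rows)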

Data Processing

The merged corpus was built with the steps below (sketched in code after the list):

  • Unified all sources to a two-column schema: fulltext and label
  • Dropped empty and trivial text rows
  • Merged all sources into one corpus
  • Shuffled with seed 42
  • Train/test split: 90/10 with seed 42
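
Continuing the loader sketch above, the merge, shuffle, and split steps map directly onto pandas and datasets calls (seeds as listed):

from datasets import Dataset

df = pd.concat(
    [load_isot(), load_liar("train.tsv"), load_liar("valid.tsv"), load_fever()],
    ignore_index=True,
)
df = df[df["fulltext"].str.strip().str.len() > 0]  # drop empty/trivial rows
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)  # shuffle, seed 42

# 90/10 train/test split, seed 42
ds = Dataset.from_pandas(df).train_test_split(test_size=0.1, seed=42)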

Tokenization and Longformer Attention

Tokenizer:

  • AutoTokenizer from allenai/longformer-base-4096

Tokenization config:

  • padding: max_length
  • truncation: true
  • max_length: 1024

Global attention mask:

  • first token set to 1
  • all remaining tokens set to 0

This global-attention setup is applied in both training and inference.
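
A minimal preprocessing function matching the configuration above (a sketch; assumes the merged ds dataset from the Data Processing section):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

def preprocess(batch):
    enc = tokenizer(
        batch["fulltext"],
        padding="max_length",
        truncation=True,
        max_length=1024,
    )
    # Global attention on the first token only; every other position
    # falls back to Longformer's local windowed attention.
    enc["global_attention_mask"] = [
        [1] + [0] * (len(ids) - 1) for ids in enc["input_ids"]
    ]
    return enc

tokenized = ds.map(preprocess, batched=True)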


Training Configuration

Model initialization:

from transformers import AutoModelForSequenceClassification

# Binary classification head: 0 = REAL, 1 = FAKE
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=2,
)

Training arguments used for v2 (sketched in code after this list):

  • evaluation_strategy: epoch
  • save_strategy: epoch
  • learning_rate: 2e-5
  • per_device_train_batch_size: 8
  • per_device_eval_batch_size: 8
  • gradient_accumulation_steps: 2
  • num_train_epochs: 3
  • warmup_ratio: 0.06
  • weight_decay: 0.01
  • lr_scheduler_type: cosine
  • label_smoothing_factor: 0.1
  • fp16: true
  • tf32: true
  • gradient_checkpointing: false
  • load_best_model_at_end: true
  • metric_for_best_model: f1
  • early_stopping_patience: 2
  • save_total_limit: 2
  • push_to_hub: false
  • report_to: none
  • logging_strategy: steps
  • logging_steps: 10
  • ddp_find_unused_parameters: false
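
These settings translate directly into a Trainer setup. A hedged sketch (output_dir is a placeholder; model comes from the initialization above, tokenized from the Tokenization section, and compute_metrics is defined in the Evaluation section below):

from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="veritas_v2",  # placeholder path
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    warmup_ratio=0.06,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    label_smoothing_factor=0.1,
    fp16=True,
    tf32=True,
    gradient_checkpointing=False,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    push_to_hub=False,
    report_to="none",
    logging_strategy="steps",
    logging_steps=10,
    ddp_find_unused_parameters=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()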

Evaluation

Metrics computed during validation (see the compute_metrics sketch below):

  • accuracy
  • weighted F1

Best checkpoint selection:

  • weighted F1
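
A minimal compute_metrics in the form the Trainer expects (a sketch using scikit-learn):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }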

Final run statistics can be appended from the trainer logs:

  • global steps
  • training runtime
  • final training loss
  • final validation loss
  • final accuracy
  • final weighted F1

Reliability and Engineering Notes

This project includes reliability safeguards for long cloud runs:

  • Distributed launch through Accelerate
  • Rank-aware preprocessing to avoid cache write collisions
  • Explicit distributed process-group cleanup to avoid NCCL warnings
  • Multi-destination save strategy:
    • Vertex model output path
    • primary GCS path
    • timestamped backup GCS path
    • local backup copy
  • Upload retry logic with verification checks

These controls were added to avoid silent artifact-loss failures after long training jobs.
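
The upload-with-verification pattern could look like the following sketch (bucket and object names are placeholders; assumes the google-cloud-storage client, not the exact v2 implementation):

import os
import time
from google.cloud import storage

def upload_with_retry(local_path, bucket_name, blob_name, retries=3):
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    for attempt in range(1, retries + 1):
        try:
            blob.upload_from_filename(local_path)
            blob.reload()  # refresh metadata to verify the upload
            if blob.size == os.path.getsize(local_path):
                return True
        except Exception as exc:
            print(f"upload attempt {attempt} failed: {exc}")
        time.sleep(2 ** attempt)  # exponential backoff between attempts
    return False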


Inference Example

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "PushkarKumar/veritas_ai_v2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

id2label = {0: "REAL", 1: "FAKE"}

def classify(text: str):
    # Tokenize with the same settings used during training.
    inputs = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    )

    # Longformer global attention: the first token attends globally;
    # all remaining tokens use local windowed attention.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    inputs["global_attention_mask"] = global_attention_mask

    with torch.no_grad():
        outputs = model(**inputs)

    probs = torch.softmax(outputs.logits, dim=-1)
    pred_id = int(torch.argmax(probs, dim=-1).item())

    return {
        "label": id2label[pred_id],
        "score": float(probs[0, pred_id]),
    }
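
Example call:

print(classify("Full article text or a short claim goes here."))
# returns {"label": "REAL" or "FAKE", "score": probability of the predicted class}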

Intended Use

Recommended:

  • misinformation research
  • content triage with human review
  • NLP prototyping and benchmarking

Not recommended:

  • fully automated moderation without human oversight
  • legal, medical, civic, or safety-critical decision-making
  • standalone fact-checking without external evidence workflows

Limitations and Bias

  • English-focused training data; multilingual performance is not guaranteed
  • Dataset-derived labels can carry source/style/political bias
  • Mixed claim-style and article-style supervision can create domain-shift effects
  • Performance may degrade on niche misinformation domains
  • Confidence scores are not factual certainty
  • Model outputs should support, not replace, human fact-checkers

Ethical Use

This model should be used as an assistive signal, not an autonomous truth system.
Predictions should be reviewed with evidence retrieval, source validation, and human judgment.


Author and Versioning

Author: PushkarKumar
Current release: PushkarKumar/veritas_ai_v2 (v2.0)
Previous release: PushkarKumar/veritas_ai_new (v1)