---
license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: text-classification
base_model: allenai/longformer-base-4096
tags:
  - text-classification
  - longformer
  - fake-news-detection
  - misinformation-detection
  - news-classification
  - multi-dataset
  - vertex-ai
  - pytorch
  - transformers
---

# Veritas AI v2: Multi-Dataset Fake News and Misinformation Classifier

Version: 2.0  
Previous version: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)

Veritas AI v2 is a long-context binary classifier fine-tuned from `allenai/longformer-base-4096` to classify content as REAL or FAKE.  
This version is a major upgrade over v1: it moves from single-source training to multi-dataset training, with the aim of improving cross-domain robustness.

---

## Why v2 Is a Major Upgrade

This release reflects a full production-style training effort:

- Multi-dataset training pipeline with unified label mapping
- Long-context architecture for article-length text
- Distributed training orchestration on Vertex AI
- Reliability-focused artifact save strategy
- Metric-based checkpoint selection using weighted F1
- Early stopping for better generalization
- Hardened cloud training flow for long runs

---

## Model Overview

- Base model: allenai/longformer-base-4096
- Task: Binary text classification
- Labels:
  - 0 = REAL
  - 1 = FAKE
- Max sequence length: 1024 tokens (the base model supports up to 4,096)
- Parameter count: about 149M
- Framework stack:
  - Hugging Face Transformers Trainer
  - PyTorch
  - Accelerate
- Training platform: Google Cloud Vertex AI

---

## Training Data

This model was trained on a merged corpus from:

- ISOT Fake News Dataset
  - True.csv
  - Fake.csv
- LIAR
  - train.tsv
  - valid.tsv
- FEVER
  - train.jsonl

Language: English

### Label Harmonization

A consistent binary mapping was applied across all sources:

- ISOT:
  - True.csv -> 0
  - Fake.csv -> 1
- LIAR:
  - false, barely-true, pants-fire -> 1
  - all remaining LIAR labels -> 0
- FEVER:
  - SUPPORTS -> 0
  - REFUTES -> 1
  - NOT ENOUGH INFO excluded
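
The mapping above can be sketched in plain Python. This is a minimal illustration, not the project's actual preprocessing code; the function names and the `REAL`/`FAKE` constants are hypothetical.

```python
from typing import Optional

REAL, FAKE = 0, 1

# LIAR labels treated as FAKE, per the mapping above.
LIAR_FAKE_LABELS = {"false", "barely-true", "pants-fire"}


def map_isot(source_file: str) -> int:
    """ISOT: the label is determined by which CSV the row came from."""
    return {"True.csv": REAL, "Fake.csv": FAKE}[source_file]


def map_liar(label: str) -> int:
    """LIAR: three labels map to FAKE; every remaining label maps to REAL."""
    return FAKE if label.strip().lower() in LIAR_FAKE_LABELS else REAL


def map_fever(label: str) -> Optional[int]:
    """FEVER: returns None for NOT ENOUGH INFO so the caller can drop the row."""
    return {"SUPPORTS": REAL, "REFUTES": FAKE}.get(label.strip().upper())
```

Returning `None` for excluded FEVER rows keeps the exclusion explicit, so a caller can filter those rows before merging.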

### Text Construction

- ISOT input text: title + text
- LIAR input text: statement + speaker
- FEVER input text: claim
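
A rough sketch of how these inputs might be assembled follows. The dictionary keys, the `build_fulltext` helper, and the single-space separator are assumptions for illustration; the card does not state how the fields were joined.

```python
# Fields concatenated per source, in the order listed above.
# The key names are assumed column names, not confirmed by the card.
SOURCE_FIELDS = {
    "isot": ["title", "text"],
    "liar": ["statement", "speaker"],
    "fever": ["claim"],
}


def build_fulltext(source: str, row: dict) -> str:
    """Join the configured fields for one row, skipping empty values."""
    if source not in SOURCE_FIELDS:
        raise ValueError(f"unknown source: {source!r}")
    parts = [str(row.get(key) or "").strip() for key in SOURCE_FIELDS[source]]
    # Assumption: fields are joined with a single space.
    return " ".join(part for part in parts if part)
```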

### Data Processing

- Unified all sources to a two-field schema: `fulltext` and `label`
- Dropped empty and trivial text rows
- Merged all sources into one corpus
- Shuffled with seed 42
- Train/test split: 90/10 with seed 42
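
The steps above can be approximated with the standard library alone. This is a sketch under stated assumptions: the `min_chars` threshold for "trivial" text is hypothetical, and a single seeded shuffle followed by a slice stands in for whatever shuffle and split utilities the actual pipeline used, so the resulting partition will not match the original one.

```python
import random


def merge_and_split(sources, min_chars=10, test_fraction=0.1, seed=42):
    """Merge row lists, drop near-empty text, shuffle, and split train/test."""
    # Merge all sources into one corpus.
    rows = [row for source in sources for row in source]
    # Drop empty and trivial rows (min_chars is an assumed threshold).
    rows = [row for row in rows if len(row["fulltext"].strip()) >= min_chars]
    # Deterministic shuffle with a fixed seed.
    random.Random(seed).shuffle(rows)
    # Hold out the first test_fraction of the shuffled rows.
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]
```

Because the random generator is seeded, repeated calls on the same input return the same split.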

---

## Tokenization and Longformer Attention

Tokenizer:
- AutoTokenizer from allenai/longformer-base-4096

Tokenization config:
- padding: max_length
- truncation: true
- max_length: 1024

Global attention mask:
- first token set to 1
- all remaining tokens set to 0

This global-attention setup is applied in both training and inference.

---

## Training Configuration

Model initialization:

    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        "allenai/longformer-base-4096",
        num_labels=2,
    )

Training arguments used for v2:

- evaluation_strategy: epoch
- save_strategy: epoch
- learning_rate: 2e-5
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- gradient_accumulation_steps: 2
- num_train_epochs: 3
- warmup_ratio: 0.06
- weight_decay: 0.01
- lr_scheduler_type: cosine
- label_smoothing_factor: 0.1
- fp16: true
- tf32: true
- gradient_checkpointing: false
- load_best_model_at_end: true
- metric_for_best_model: f1
- early_stopping_patience: 2 (an `EarlyStoppingCallback` parameter, not a `TrainingArguments` field)
- save_total_limit: 2
- push_to_hub: false
- report_to: none
- logging_strategy: steps
- logging_steps: 10
- ddp_find_unused_parameters: false
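
To make the batch-size settings concrete, the arithmetic can be sketched as below. The number of processes (GPUs) and the total step count are not stated in this card, so any values passed for them are hypothetical. The warmup calculation rounds up, which matches how the Trainer derives warmup steps from `warmup_ratio`.

```python
import math


def effective_batch_size(per_device: int, accumulation_steps: int, num_processes: int) -> int:
    """Examples consumed per optimizer step across all processes."""
    return per_device * accumulation_steps * num_processes


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps implied by a warmup ratio, rounded up."""
    return math.ceil(total_steps * warmup_ratio)


# With the settings above, each process contributes 8 * 2 = 16 examples
# per optimizer step; the process count is an assumption to fill in.
per_process = effective_batch_size(per_device=8, accumulation_steps=2, num_processes=1)
```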

---

## Evaluation

Metrics computed during validation:
- accuracy
- weighted F1

Best checkpoint selection:
- weighted F1
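
For reference, weighted F1 is the per-class F1 averaged with weights proportional to each class's support in the true labels. The following standard-library sketch shows the calculation; it is illustrative only, and the training run presumably relied on a metrics library rather than this code.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * count / total
    return score
```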

Final run statistics are not reported in this card. The following values can be read from the trainer logs:
- global steps
- training runtime
- final training loss
- final validation loss
- final accuracy
- final weighted F1

---

## Reliability and Engineering Notes

This project includes reliability safeguards for long cloud runs:

- Distributed launch through Accelerate
- Rank-aware preprocessing to avoid cache write collisions
- Explicit distributed process-group cleanup to avoid NCCL warnings
- Multi-destination save strategy:
  - Vertex model output path
  - primary GCS path
  - timestamped backup GCS path
  - local backup copy
- Upload retry logic with verification checks

These controls were added to avoid silent artifact-loss failures after long training jobs.
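
The retry-and-verify idea can be illustrated with a small standard-library sketch. Local filesystem paths stand in for the Vertex and GCS destinations here, and the helper names, attempt count, and checksum-based verification are assumptions rather than the project's actual implementation.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copy_with_retry(src: Path, dst: Path, attempts: int = 3, delay_s: float = 1.0) -> bool:
    """Copy src to dst, verify by checksum, and retry on failure."""
    for attempt in range(1, attempts + 1):
        try:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            # Verification check: the copy must match the source byte for byte.
            if sha256_of(dst) == sha256_of(src):
                return True
        except OSError:
            pass  # fall through to the retry below
        if attempt < attempts:
            time.sleep(delay_s * attempt)  # simple linear backoff
    return False


def save_everywhere(src: Path, destinations) -> dict:
    """Attempt every destination independently and report per-path success."""
    return {str(dst): copy_with_retry(src, dst) for dst in destinations}
```

Attempting each destination independently means one failed upload does not prevent the remaining copies from being written.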

---

## Inference Example

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    model_name = "PushkarKumar/veritas_ai_v2"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    id2label = {0: "REAL", 1: "FAKE"}

    def classify(text: str) -> dict:
        """Classify one text and return its label and softmax probability."""
        # Same tokenization settings as in training.
        inputs = tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=1024,
            return_tensors="pt",
        )

        # Global attention on the first token only, matching training.
        global_attention_mask = torch.zeros_like(inputs["input_ids"])
        global_attention_mask[:, 0] = 1
        inputs["global_attention_mask"] = global_attention_mask

        with torch.no_grad():
            outputs = model(**inputs)

        # Convert logits to class probabilities and pick the top class.
        probs = torch.softmax(outputs.logits, dim=-1)
        pred_id = int(torch.argmax(probs, dim=-1).item())

        return {
            "label": id2label[pred_id],
            "score": float(probs[0, pred_id]),
        }

---

## Intended Use

Recommended:
- misinformation research
- content triage with human review
- NLP prototyping and benchmarking

Not recommended:
- fully automated moderation without human oversight
- legal, medical, civic, or safety-critical decision-making
- standalone fact-checking without external evidence workflows

---

## Limitations and Bias

- English-focused training data; multilingual performance is not guaranteed
- Dataset-derived labels can carry source/style/political bias
- Mixed claim-style and article-style supervision can create domain-shift effects
- Performance may degrade on niche misinformation domains
- Confidence scores are not factual certainty
- Model outputs should support, not replace, human fact-checkers

---

## Ethical Use

This model should be used as an assistive signal, not an autonomous truth system.  
Predictions should be reviewed with evidence retrieval, source validation, and human judgment.

---

## Author and Versioning

- Author: Pushkar Kumar
- Previous release: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
- Current release: Veritas AI v2