---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
base_model: allenai/longformer-base-4096
tags:
- text-classification
- longformer
- fake-news-detection
- misinformation-detection
- news-classification
- multi-dataset
- vertex-ai
- pytorch
- transformers
---
# Veritas AI v2: Multi-Dataset Fake News and Misinformation Classifier
- Version: 2.0
- Previous version: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)

Veritas AI v2 is a long-context binary classifier fine-tuned from `allenai/longformer-base-4096` to classify text as REAL or FAKE.
It moves from the single-source training of v1 to multi-dataset training, with the aim of improving cross-domain robustness.

---
## Why v2 Is a Major Upgrade
This release reflects a full production-style training effort:
- Multi-dataset training pipeline with unified label mapping
- Long-context architecture for article-length text
- Distributed training orchestration on Vertex AI
- Reliability-focused artifact save strategy
- Metric-based checkpoint selection using weighted F1
- Early stopping for better generalization
- Hardened cloud training flow for long runs
---
## Model Overview
- Base model: allenai/longformer-base-4096
- Task: Binary text classification
- Labels:
  - 0 = REAL
  - 1 = FAKE
- Max sequence length: 1024 tokens (longer inputs are truncated; the base model supports up to 4096)
- Parameter count: about 149M
- Framework stack:
  - Hugging Face Transformers Trainer
  - PyTorch
  - Accelerate
- Training platform: Google Cloud Vertex AI
---
## Training Data
This model was trained on a merged corpus from:
- ISOT Fake News Dataset
  - True.csv
  - Fake.csv
- LIAR
  - train.tsv
  - valid.tsv
- FEVER
  - train.jsonl

Language: English
### Label Harmonization
A consistent binary mapping was applied across all sources:
- ISOT:
  - True.csv -> 0
  - Fake.csv -> 1
- LIAR:
  - false, barely-true, pants-fire -> 1
  - all remaining LIAR labels (half-true, mostly-true, true) -> 0
- FEVER:
  - SUPPORTS -> 0
  - REFUTES -> 1
  - NOT ENOUGH INFO excluded
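
The mapping above can be expressed as a small lookup function. This is a minimal sketch, not the project's training code: the function name `harmonize_label` and the source keys `"isot"`, `"liar"`, and `"fever"` are illustrative assumptions.

```python
from typing import Optional

# LIAR labels treated as FAKE; every other LIAR label maps to REAL.
LIAR_FAKE = {"false", "barely-true", "pants-fire"}
# FEVER labels that are kept; anything else (NOT ENOUGH INFO) is dropped.
FEVER_MAP = {"SUPPORTS": 0, "REFUTES": 1}
# For ISOT the label is implied by the file a row came from.
ISOT_MAP = {"True.csv": 0, "Fake.csv": 1}


def harmonize_label(source: str, raw: str) -> Optional[int]:
    """Return 0 (REAL), 1 (FAKE), or None when the row should be excluded."""
    if source == "isot":
        return ISOT_MAP[raw]
    if source == "liar":
        return 1 if raw.strip().lower() in LIAR_FAKE else 0
    if source == "fever":
        # dict.get returns None for labels missing from the map.
        return FEVER_MAP.get(raw.strip().upper())
    raise ValueError(f"unknown source: {source!r}")
```

Returning `None` for excluded rows lets the caller filter them out in a single pass before merging the sources.
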
### Text Construction
- ISOT input text: title + text
- LIAR input text: statement + speaker
- FEVER input text: claim
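
One way to build the input text for each source is sketched below. The card does not state the separator or the column names, so the single-space join, the dictionary keys (`title`, `text`, `statement`, `speaker`, `claim`), and the helper names are all assumptions.

```python
def join_fields(*parts, sep=" "):
    """Join the non-empty fields with `sep`, skipping None and blank strings."""
    cleaned = [str(p).strip() for p in parts if p is not None and str(p).strip()]
    return sep.join(cleaned)


def build_fulltext(source, row):
    """Build the model input string for one row (a dict) of a given source."""
    if source == "isot":
        return join_fields(row.get("title"), row.get("text"))
    if source == "liar":
        return join_fields(row.get("statement"), row.get("speaker"))
    if source == "fever":
        return join_fields(row.get("claim"))
    raise ValueError(f"unknown source: {source!r}")
```
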
### Data Processing
- Unified schema to fulltext and label
- Dropped empty and trivial text rows
- Merged all sources into one corpus
- Shuffled with seed 42
- Train/test split: 90/10 with seed 42
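
The processing steps above might look like this in plain Python. It is a sketch only: the helper name `prepare_corpus`, the `min_chars` threshold for "trivial" rows, and the single shuffle followed by a slice are assumptions, so the resulting split will not reproduce the one used for training. Libraries such as `datasets` provide `shuffle(seed=...)` and `train_test_split(test_size=..., seed=...)` for the same purpose.

```python
import random


def prepare_corpus(rows, seed=42, test_fraction=0.1, min_chars=10):
    """Clean, shuffle, and split rows of {"fulltext": str, "label": int}.

    Returns (train_rows, test_rows).
    """
    kept = []
    for row in rows:
        text = (row.get("fulltext") or "").strip()
        label = row.get("label")
        # Drop empty or very short text (hypothetical threshold) and unlabeled rows.
        if len(text) < min_chars or label is None:
            continue
        kept.append({"fulltext": text, "label": int(label)})

    # Deterministic shuffle, then hold out the first test_fraction of the rows.
    random.Random(seed).shuffle(kept)
    n_test = round(len(kept) * test_fraction)
    return kept[n_test:], kept[:n_test]
```
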
---
## Tokenization and Longformer Attention
Tokenizer:
- AutoTokenizer from allenai/longformer-base-4096
Tokenization config:
- padding: max_length
- truncation: true
- max_length: 1024
Global attention mask:
- first token set to 1
- all remaining tokens set to 0
This global-attention setup is applied in both training and inference.
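
For training, the same mask can be attached to tokenized batches before they reach the model. The sketch below assumes batches shaped as a dict of lists, which is the shape `datasets.Dataset.map(..., batched=True)` passes to its callback; the function name `add_global_attention` is hypothetical.

```python
def add_global_attention(batch):
    """Add a Longformer global_attention_mask to a batch of tokenized rows.

    The first position (the <s> token) gets global attention (1);
    every other position uses local sliding-window attention (0).
    """
    batch["global_attention_mask"] = [
        [1 if i == 0 else 0 for i in range(len(ids))]
        for ids in batch["input_ids"]
    ]
    return batch
```
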
---
## Training Configuration
Model initialization:

```python
from transformers import AutoModelForSequenceClassification

# Load the Longformer base checkpoint with a 2-class classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=2,
)
```

Training arguments used for v2:
- evaluation_strategy: epoch (named `eval_strategy` in newer Transformers releases)
- save_strategy: epoch
- learning_rate: 2e-5
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- gradient_accumulation_steps: 2
- num_train_epochs: 3
- warmup_ratio: 0.06
- weight_decay: 0.01
- lr_scheduler_type: cosine
- label_smoothing_factor: 0.1
- fp16: true
- tf32: true
- gradient_checkpointing: false
- load_best_model_at_end: true
- metric_for_best_model: f1
- early_stopping_patience: 2 (passed to `EarlyStoppingCallback` rather than `TrainingArguments`)
- save_total_limit: 2
- push_to_hub: false
- report_to: none
- logging_strategy: steps
- logging_steps: 10
- ddp_find_unused_parameters: false
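
The settings above could be collected as follows. This is a sketch under assumptions, not the original training script: `output_dir` is a placeholder path, the names `TRAINING_KWARGS`, `build_training_objects`, and `effective_batch_size` are hypothetical, and building the arguments with `fp16` and `tf32` enabled requires a suitable CUDA GPU.

```python
# Hyperparameters copied from the list above.
TRAINING_KWARGS = {
    "output_dir": "./veritas_ai_v2_out",  # placeholder, not from the card
    "evaluation_strategy": "epoch",  # eval_strategy in newer transformers releases
    "save_strategy": "epoch",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 3,
    "warmup_ratio": 0.06,
    "weight_decay": 0.01,
    "lr_scheduler_type": "cosine",
    "label_smoothing_factor": 0.1,
    "fp16": True,
    "tf32": True,
    "gradient_checkpointing": False,
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1",
    "save_total_limit": 2,
    "push_to_hub": False,
    "report_to": "none",
    "logging_strategy": "steps",
    "logging_steps": 10,
    "ddp_find_unused_parameters": False,
}

# Early stopping is configured on a callback, not on TrainingArguments.
EARLY_STOPPING_PATIENCE = 2


def build_training_objects():
    """Create TrainingArguments and the early-stopping callback."""
    # Imported lazily so the settings can be inspected without transformers.
    from transformers import EarlyStoppingCallback, TrainingArguments

    args = TrainingArguments(**TRAINING_KWARGS)
    callback = EarlyStoppingCallback(
        early_stopping_patience=EARLY_STOPPING_PATIENCE
    )
    return args, callback


def effective_batch_size(num_devices: int) -> int:
    """Samples consumed per optimizer step across all devices."""
    return (
        TRAINING_KWARGS["per_device_train_batch_size"]
        * TRAINING_KWARGS["gradient_accumulation_steps"]
        * num_devices
    )
```

The card does not state how many devices were used. As an illustration, `effective_batch_size(4)` gives 8 x 2 x 4 = 64 samples per optimizer step.
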
---
## Evaluation
Metrics computed during validation:
- accuracy
- weighted F1
Best checkpoint selection:
- weighted F1
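
To make the selection metric concrete, the sketch below computes accuracy and support-weighted F1 in plain Python. It is an illustration of what the metric measures, written under the assumption that the hook receives a `(logits, labels)` pair as the Trainer's `compute_metrics` callback does; the card does not say which metric library the training script used.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by each class's true count."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


def compute_metrics(eval_pred):
    """Trainer-style hook: takes (logits, labels), returns accuracy and f1."""
    logits, labels = eval_pred
    labels = [int(label) for label in labels]
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    accuracy = sum(int(t == p) for t, p in zip(labels, preds)) / len(labels)
    return {"accuracy": accuracy, "f1": weighted_f1(labels, preds)}
```

The returned key `f1` matches the `metric_for_best_model: f1` setting, which is how the Trainer knows which value to compare across checkpoints.
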

Final run statistics are not reported in this card. The following values can be appended from the Trainer logs:
- global steps
- training runtime
- final training loss
- final validation loss
- final accuracy
- final weighted F1
---
## Reliability and Engineering Notes
This project includes reliability safeguards for long cloud runs:
- Distributed launch through Accelerate
- Rank-aware preprocessing to avoid cache write collisions
- Explicit distributed process-group cleanup to avoid NCCL warnings
- Multi-destination save strategy:
  - Vertex model output path
  - primary GCS path
  - timestamped backup GCS path
  - local backup copy
- Upload retry logic with verification checks
These controls were added to avoid silent artifact-loss failures after long training jobs.
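
Upload retry with verification could be structured as below. This is a generic sketch, not the project's code: `upload` and `verify` are hypothetical callables supplied by the caller (for example, one that copies a directory to a GCS path and one that checks the expected files exist there), and the attempt count and backoff delays are illustrative.

```python
import time


def upload_with_retry(upload, verify, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call upload() until verify() returns True, with exponential backoff.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            upload()
            if verify():
                return attempt + 1
            last_error = RuntimeError("verification failed after upload")
        except Exception as exc:  # keep the error and retry
            last_error = exc
        if attempt < attempts - 1:
            # Wait 1x, 2x, 4x, ... base_delay between attempts.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"upload failed after {attempts} attempts") from last_error
```

Running the same helper once per destination keeps a failure on one path from blocking the others.
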
---
## Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "PushkarKumar/veritas_ai_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

id2label = {0: "REAL", 1: "FAKE"}


def classify(text: str):
    # Tokenize with the same settings used during training.
    inputs = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    )

    # Global attention on the first token only; all others stay local.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    inputs["global_attention_mask"] = global_attention_mask

    with torch.no_grad():
        outputs = model(**inputs)

    # Convert logits to probabilities and pick the most likely class.
    probs = torch.softmax(outputs.logits, dim=-1)
    pred_id = int(torch.argmax(probs, dim=-1).item())
    return {
        "label": id2label[pred_id],
        "score": float(probs[0, pred_id]),
    }
```
---
## Intended Use
Recommended:
- misinformation research
- content triage with human review
- NLP prototyping and benchmarking
Not recommended:
- fully automated moderation without human oversight
- legal, medical, civic, or safety-critical decision-making
- standalone fact-checking without external evidence workflows
---
## Limitations and Bias
- English-focused training data; multilingual performance is not guaranteed
- Dataset-derived labels can carry source/style/political bias
- Mixed claim-style and article-style supervision can create domain-shift effects
- Performance may degrade on niche misinformation domains
- Confidence scores are not factual certainty
- Model outputs should support, not replace, human fact-checkers
---
## Ethical Use
This model should be used as an assistive signal, not an autonomous truth system.
Predictions should be reviewed with evidence retrieval, source validation, and human judgment.
---
## Author and Versioning
- Author: Pushkar Kumar
- Previous release: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
- Current release: Veritas AI v2