---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
base_model: allenai/longformer-base-4096
tags:
- text-classification
- longformer
- fake-news-detection
- misinformation-detection
- news-classification
- multi-dataset
- vertex-ai
- pytorch
- transformers
---

# Veritas AI v2: Multi-Dataset Fake News and Misinformation Classifier

Version: 2.0

Previous version: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)

Veritas AI v2 is a long-context binary classifier fine-tuned from allenai/longformer-base-4096 to classify content as REAL or FAKE.

This version is a major upgrade over v1: it moves from single-source training to multi-dataset training for stronger cross-domain robustness.

---

## Why v2 Is a Major Upgrade

This release reflects a full production-style training effort:

- Multi-dataset training pipeline with unified label mapping
- Long-context architecture for article-length text
- Distributed training orchestration on Vertex AI
- Reliability-focused artifact save strategy
- Metric-based checkpoint selection using weighted F1
- Early stopping for better generalization
- Hardened cloud training flow for long runs

---

## Model Overview

- Base model: allenai/longformer-base-4096
- Task: Binary text classification
- Labels:
  - 0 = REAL
  - 1 = FAKE
- Max sequence length: 1024
- Parameter count: about 149M
- Framework stack:
  - Hugging Face Transformers Trainer
  - PyTorch
  - Accelerate
- Training platform: Google Cloud Vertex AI

---

## Training Data

This model was trained on a merged corpus from:

- ISOT Fake News Dataset
  - True.csv
  - Fake.csv
- LIAR
  - train.tsv
  - valid.tsv
- FEVER
  - train.jsonl

Language: English

### Label Harmonization

A consistent binary mapping was applied across all sources:

- ISOT:
  - True.csv -> 0
  - Fake.csv -> 1
- LIAR:
  - false, barely-true, pants-fire -> 1
  - all remaining LIAR labels -> 0
- FEVER:
  - SUPPORTS -> 0
  - REFUTES -> 1
  - NOT ENOUGH INFO excluded

### Text Construction

- ISOT input text: title + text
- LIAR input text: statement + speaker
- FEVER input text: claim

### Data Processing

- Unified schema to fulltext and label
- Dropped empty and trivial text rows
- Merged all sources into one corpus
- Shuffled with seed 42
- Train/test split: 90/10 with seed 42

---

## Tokenization and Longformer Attention

Tokenizer:

- AutoTokenizer from allenai/longformer-base-4096

Tokenization config:

- padding: max_length
- truncation: true
- max_length: 1024

Global attention mask:

- first token set to 1
- all remaining tokens set to 0

This global-attention setup is applied in both training and inference.

---

## Training Configuration

Model initialization:

```python
from transformers import AutoModelForSequenceClassification

# Load the base Longformer with a fresh two-class classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=2,
)
```

Training arguments used for v2:

- evaluation_strategy: epoch
- save_strategy: epoch
- learning_rate: 2e-5
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- gradient_accumulation_steps: 2
- num_train_epochs: 3
- warmup_ratio: 0.06
- weight_decay: 0.01
- lr_scheduler_type: cosine
- label_smoothing_factor: 0.1
- fp16: true
- tf32: true
- gradient_checkpointing: false
- load_best_model_at_end: true
- metric_for_best_model: f1
- early_stopping_patience: 2 (an EarlyStoppingCallback parameter rather than a TrainingArguments field)
- save_total_limit: 2
- push_to_hub: false
- report_to: none
- logging_strategy: steps
- logging_steps: 10
- ddp_find_unused_parameters: false

---

## Evaluation

Metrics computed during validation:

- accuracy
- weighted F1

Best checkpoint selection:

- weighted F1

Final run statistics are not listed in this card. They can be appended from the trainer logs:

- global steps
- training runtime
- final training loss
- final validation loss
- final accuracy
- final weighted F1

---

## Reliability and Engineering Notes

This project includes reliability safeguards for long cloud runs:

- Distributed launch through Accelerate
- Rank-aware preprocessing to avoid cache write collisions
- Explicit distributed process-group cleanup to avoid NCCL warnings
- Multi-destination save strategy:
  - Vertex model output path
  - primary GCS path
  - timestamped backup GCS path
  - local backup copy
- Upload retry logic with verification checks

These controls were added to avoid silent artifact-loss failures after long training jobs.

---

## Inference Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "PushkarKumar/veritas_ai_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

id2label = {0: "REAL", 1: "FAKE"}


def classify(text: str):
    # Tokenize with the same padding and length settings used in training.
    inputs = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    )

    # Global attention on the first token only; all other tokens stay local.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    inputs["global_attention_mask"] = global_attention_mask

    with torch.no_grad():
        outputs = model(**inputs)

    # Convert logits to class probabilities and pick the most likely class.
    probs = torch.softmax(outputs.logits, dim=-1)
    pred_id = int(torch.argmax(probs, dim=-1).item())

    return {
        "label": id2label[pred_id],
        "score": float(probs[0, pred_id]),
    }
```

---

## Intended Use

Recommended:

- misinformation research
- content triage with human review
- NLP prototyping and benchmarking

Not recommended:

- fully automated moderation without human oversight
- legal, medical, civic, or safety-critical decision-making
- standalone fact-checking without external evidence workflows

---

## Limitations and Bias

- English-focused training data; multilingual performance is not guaranteed
- Dataset-derived labels can carry source/style/political bias
- Mixed claim-style and article-style supervision can create domain-shift effects
- Performance may degrade on niche misinformation domains
- Confidence scores are not factual certainty
- Model outputs should support, not replace, human fact-checkers

---

## Ethical Use

This model should be used as an assistive signal, not an autonomous truth system.

Predictions should be reviewed with evidence retrieval, source validation, and human judgment.

---

## Author and Versioning

- Author: Pushkar Kumar
- Previous release: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
- Current release: Veritas AI v2
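## Appendix: Label Harmonization Sketch

The binary mapping described under Label Harmonization can be illustrated with a small helper. This is a minimal sketch and not the project's actual preprocessing code: the function name `harmonize_label`, the source keys `"isot"`, `"liar"` and `"fever"`, and the choice to pass the ISOT file name as the raw label are all assumptions made for illustration.

```python
from typing import Optional

# Label ids as stated in this card: 0 = REAL, 1 = FAKE.
REAL, FAKE = 0, 1

# LIAR ratings mapped to FAKE; every other LIAR rating maps to REAL.
LIAR_FAKE = {"false", "barely-true", "pants-fire"}

# FEVER verdicts; NOT ENOUGH INFO is deliberately absent because it is excluded.
FEVER_MAP = {"SUPPORTS": REAL, "REFUTES": FAKE}


def harmonize_label(source: str, raw_label: str) -> Optional[int]:
    """Return 0 (REAL), 1 (FAKE), or None when the row should be dropped.

    Hypothetical helper: `source` is one of "isot", "liar", "fever".
    For ISOT, `raw_label` is assumed to be the originating file name.
    """
    if source == "isot":
        return {"True.csv": REAL, "Fake.csv": FAKE}.get(raw_label)
    if source == "liar":
        return FAKE if raw_label.strip().lower() in LIAR_FAKE else REAL
    if source == "fever":
        return FEVER_MAP.get(raw_label.strip().upper())
    raise ValueError(f"unknown source: {source!r}")
```

Rows for which the helper returns `None` (for example FEVER claims labelled NOT ENOUGH INFO) would be filtered out before the sources are merged.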
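## Appendix: Weighted F1 Sketch

Weighted F1, the metric used above for checkpoint selection, is the mean of per-class F1 scores weighted by each class's share of the true labels. The training script is not shown in this card and may well rely on a library implementation; the dependency-free function below (the name `weighted_f1` is hypothetical) is only meant to make the definition concrete.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class's F1 by its share of the true labels.
        score += f1 * n / total
    return score
```

With true labels `[0, 0, 1, 1]` and predictions `[0, 1, 1, 1]`, the per-class F1 scores are 2/3 for REAL and 4/5 for FAKE, so the weighted F1 is 11/15, roughly 0.733.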