| license: apache-2.0 | |
| language: | |
| - en | |
| library_name: transformers | |
| pipeline_tag: text-classification | |
| base_model: allenai/longformer-base-4096 | |
| tags: | |
| - text-classification | |
| - longformer | |
| - fake-news-detection | |
| - misinformation-detection | |
| - news-classification | |
| - multi-dataset | |
| - vertex-ai | |
| - pytorch | |
| - transformers | |
| # Veritas AI v2: Multi-Dataset Fake News and Misinformation Classifier | |
| Version: 2.0 | |
| Previous version: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new) | |
| Veritas AI v2 is a long-context binary classifier fine-tuned from allenai/longformer-base-4096 to classify content as REAL or FAKE. | |
| This version is a major upgrade over v1: it moves from single-source training to multi-dataset training, with the aim of stronger cross-domain robustness. | |
| --- | |
| ## Why v2 Is a Major Upgrade | |
| This release was built with a production-style training pipeline: | |
| - Multi-dataset training pipeline with unified label mapping | |
| - Long-context architecture for article-length text | |
| - Distributed training orchestration on Vertex AI | |
| - Reliability-focused artifact save strategy | |
| - Metric-based checkpoint selection using weighted F1 | |
| - Early stopping for better generalization | |
| - Hardened cloud training flow for long runs | |
| --- | |
| ## Model Overview | |
| - Base model: allenai/longformer-base-4096 | |
| - Task: Binary text classification | |
| - Labels: | |
| - 0 = REAL | |
| - 1 = FAKE | |
| - Max sequence length: 1024 tokens (the base model supports up to 4096) | |
| - Parameter count: about 149M | |
| - Framework stack: | |
| - Hugging Face Transformers Trainer | |
| - PyTorch | |
| - Accelerate | |
| - Training platform: Google Cloud Vertex AI | |
| --- | |
| ## Training Data | |
| This model was trained on a merged corpus from: | |
| - ISOT Fake News Dataset | |
| - True.csv | |
| - Fake.csv | |
| - LIAR | |
| - train.tsv | |
| - valid.tsv | |
| - FEVER | |
| - train.jsonl | |
| Language: English | |
| ### Label Harmonization | |
| A consistent binary mapping was applied across all sources: | |
| - ISOT: | |
| - True.csv -> 0 | |
| - Fake.csv -> 1 | |
| - LIAR: | |
| - false, barely-true, pants-fire -> 1 | |
| - all remaining LIAR labels (half-true, mostly-true, true) -> 0 | |
| - FEVER: | |
| - SUPPORTS -> 0 | |
| - REFUTES -> 1 | |
| - NOT ENOUGH INFO excluded | |
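The mapping above can be sketched as three small functions. This is an illustration only, not the project's actual preprocessing code: the function names and the `LIAR_FAKE_RATINGS` constant are hypothetical, and the rating strings are assumed to match the spellings listed above.

```python
from typing import Optional

# Label ids used throughout this card: 0 = REAL, 1 = FAKE.
REAL, FAKE = 0, 1

# LIAR ratings that map to FAKE; every other rating maps to REAL.
LIAR_FAKE_RATINGS = {"false", "barely-true", "pants-fire"}

# ISOT labels come from the file a row was read from.
ISOT_FILE_LABELS = {"True.csv": REAL, "Fake.csv": FAKE}


def map_isot(source_file: str) -> int:
    """Return the label implied by the ISOT source file name."""
    return ISOT_FILE_LABELS[source_file]


def map_liar(rating: str) -> int:
    """Collapse LIAR's six-way rating into the binary scheme."""
    return FAKE if rating.strip().lower() in LIAR_FAKE_RATINGS else REAL


def map_fever(verdict: str) -> Optional[int]:
    """Return None for NOT ENOUGH INFO so the caller can drop the row."""
    return {"SUPPORTS": REAL, "REFUTES": FAKE}.get(verdict.strip().upper())
```

Returning `None` from `map_fever` keeps the exclusion rule explicit: rows without a usable verdict are filtered out by the caller instead of being assigned a default label.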
| ### Text Construction | |
| - ISOT input text: title + text | |
| - LIAR input text: statement + speaker | |
| - FEVER input text: claim | |
| ### Data Processing | |
| - Unified every source to a common schema with two fields: fulltext and label | |
| - Dropped empty and trivial text rows | |
| - Merged all sources into one corpus | |
| - Shuffled with seed 42 | |
| - Train/test split: 90/10 with seed 42 | |
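A minimal sketch of the text construction and processing steps, using only the standard library. Several details are assumptions not stated in this card: the single-space separator between fields, the `min_chars` threshold used to decide what counts as a trivial row, and the helper names themselves. The real pipeline may well use a dataset library's own shuffle and split, which would produce a different ordering for the same seed.

```python
import random

SEED = 42  # seed stated in this card


def build_fulltext(source: str, row: dict) -> str:
    """Join the per-source fields listed under Text Construction."""
    if source == "isot":
        parts = [row.get("title", ""), row.get("text", "")]
    elif source == "liar":
        parts = [row.get("statement", ""), row.get("speaker", "")]
    elif source == "fever":
        parts = [row.get("claim", "")]
    else:
        raise ValueError(f"unknown source: {source}")
    # Assumption: fields are joined with a single space.
    return " ".join(p.strip() for p in parts if p and p.strip())


def merge_and_split(examples, min_chars=10, test_fraction=0.1, seed=SEED):
    """Drop trivial rows, shuffle with a fixed seed, then split 90/10."""
    # Assumption: "trivial" means fewer than min_chars characters.
    kept = [ex for ex in examples if len(ex["fulltext"].strip()) >= min_chars]
    random.Random(seed).shuffle(kept)
    n_test = int(round(len(kept) * test_fraction))
    return kept[n_test:], kept[:n_test]  # (train, test)
```

Fixing the seed makes the split reproducible across runs, which matters when checkpoints from different runs are compared on the same held-out rows.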
| --- | |
| ## Tokenization and Longformer Attention | |
| Tokenizer: | |
| - AutoTokenizer from allenai/longformer-base-4096 | |
| Tokenization config: | |
| - padding: max_length | |
| - truncation: true | |
| - max_length: 1024 | |
| Global attention mask: | |
| - first token set to 1 | |
| - all remaining tokens set to 0 | |
| This global-attention setup is applied in both training and inference. | |
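To make the shapes concrete, here is a simplified, library-free sketch of what this configuration produces for one example. It is not the tokenizer's implementation: a real tokenizer also handles special tokens when truncating, and `pad_id` is passed in as a parameter here instead of being read from the tokenizer. The function name is hypothetical.

```python
from typing import Dict, List

MAX_LENGTH = 1024  # value stated in this card


def pad_and_mask(token_ids: List[int], pad_id: int,
                 max_length: int = MAX_LENGTH) -> Dict[str, List[int]]:
    """Truncate or pad to max_length and build both attention masks."""
    ids = token_ids[:max_length]
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # Local attention: 1 for real tokens, 0 for padding.
        "attention_mask": [1] * n_real + [0] * n_pad,
        # Global attention: only the first token attends to, and is
        # attended by, every position in the sequence.
        "global_attention_mask": [1] + [0] * (max_length - 1),
    }
```

Every example ends up with three lists of identical length, so a batch can be stacked into fixed-size tensors without further padding logic.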
| --- | |
| ## Training Configuration | |
| Model initialization: | |
| from transformers import AutoModelForSequenceClassification | |
| model = AutoModelForSequenceClassification.from_pretrained( | |
|     "allenai/longformer-base-4096", | |
|     num_labels=2, | |
| ) | |
| Training arguments used for v2: | |
| - evaluation_strategy: epoch | |
| - save_strategy: epoch | |
| - learning_rate: 2e-5 | |
| - per_device_train_batch_size: 8 | |
| - per_device_eval_batch_size: 8 | |
| - gradient_accumulation_steps: 2 | |
| - num_train_epochs: 3 | |
| - warmup_ratio: 0.06 | |
| - weight_decay: 0.01 | |
| - lr_scheduler_type: cosine | |
| - label_smoothing_factor: 0.1 | |
| - fp16: true | |
| - tf32: true | |
| - gradient_checkpointing: false | |
| - load_best_model_at_end: true | |
| - metric_for_best_model: f1 | |
| - early_stopping_patience: 2 (an EarlyStoppingCallback setting in Transformers, configured alongside the training arguments) | |
| - save_total_limit: 2 | |
| - push_to_hub: false | |
| - report_to: none | |
| - logging_strategy: steps | |
| - logging_steps: 10 | |
| - ddp_find_unused_parameters: false | |
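For reference, the settings above can be collected into a plain dictionary of keyword arguments. This is a sketch, not the project's training script: `output_dir` is a placeholder path, the helper function is hypothetical, and the number of devices is not stated in this card, so it is left as a parameter. Recent Transformers releases name the first field `eval_strategy`, so the key may need adjusting for the installed version.

```python
# Keyword arguments mirroring the list above. They are written so that
# they could be passed on as TrainingArguments(**training_kwargs).
training_kwargs = {
    "output_dir": "./veritas_v2_output",  # hypothetical placeholder
    "evaluation_strategy": "epoch",       # "eval_strategy" in newer releases
    "save_strategy": "epoch",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 3,
    "warmup_ratio": 0.06,
    "weight_decay": 0.01,
    "lr_scheduler_type": "cosine",
    "label_smoothing_factor": 0.1,
    "fp16": True,
    "tf32": True,
    "gradient_checkpointing": False,
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1",
    "save_total_limit": 2,
    "push_to_hub": False,
    "report_to": "none",
    "logging_strategy": "steps",
    "logging_steps": 10,
    "ddp_find_unused_parameters": False,
}

# Early stopping patience is kept separate because it configures a
# callback, not the training arguments themselves.
early_stopping_patience = 2


def effective_batch_size(kwargs: dict, num_devices: int) -> int:
    """Examples contributing to one optimizer step across all devices."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )
```

With these values each device contributes 8 x 2 = 16 examples per optimizer step, so the global batch size is 16 times the number of devices used in the distributed run.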
| --- | |
| ## Evaluation | |
| Metrics computed during validation: | |
| - accuracy | |
| - weighted F1 | |
| Best checkpoint selection: | |
| - weighted F1 | |
| Final run statistics are not included in this card; they can be appended from the trainer logs: | |
| - global steps | |
| - training runtime | |
| - final training loss | |
| - final validation loss | |
| - final accuracy | |
| - final weighted F1 | |
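The two validation metrics named above can be computed as follows. This is a dependency-free sketch for illustration; the training code most likely relies on a metrics library, and the function names here are hypothetical. Weighted F1 is the per-class F1 averaged with weights proportional to each class's number of true examples.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F1, averaged with weights equal to class support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)
```

For example, with references `[0, 0, 1, 1]` and predictions `[0, 1, 1, 1]`, accuracy is 0.75, the class F1 scores are 2/3 and 0.8, and weighted F1 is about 0.733.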
| --- | |
| ## Reliability and Engineering Notes | |
| This project includes reliability safeguards for long cloud runs: | |
| - Distributed launch through Accelerate | |
| - Rank-aware preprocessing to avoid cache write collisions | |
| - Explicit distributed process-group cleanup to avoid NCCL warnings | |
| - Multi-destination save strategy: | |
| - Vertex model output path | |
| - primary GCS path | |
| - timestamped backup GCS path | |
| - local backup copy | |
| - Upload retry logic with verification checks | |
| These controls were added to avoid silent artifact-loss failures after long training jobs. | |
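The multi-destination save with retries and verification might look roughly like the sketch below. It is an assumption-heavy illustration: local directories stand in for the Vertex and GCS destinations, verification is reduced to a file-size comparison, and the function names, retry count, and delay are all hypothetical.

```python
import shutil
import time
from pathlib import Path


def copy_with_retry(src: Path, dst: Path, attempts: int = 3,
                    delay_s: float = 0.0) -> bool:
    """Copy src to dst, verify the copy by size, and retry on failure."""
    for attempt in range(1, attempts + 1):
        try:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            # Verification check: the copy must exist and match in size.
            if dst.stat().st_size == src.stat().st_size:
                return True
        except OSError:
            pass  # fall through to the retry below
        if attempt < attempts:
            time.sleep(delay_s * attempt)  # simple linear backoff
    return False


def save_everywhere(src: Path, destinations: list) -> dict:
    """Try every destination independently and report each outcome."""
    return {
        str(dest): copy_with_retry(src, Path(dest) / src.name)
        for dest in destinations
    }
```

Treating each destination independently means a failure on one path does not prevent the others from receiving a copy, and the returned dictionary makes any failure visible instead of silent.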
| --- | |
| ## Inference Example | |
| from transformers import AutoTokenizer, AutoModelForSequenceClassification | |
| import torch | |
| model_name = "PushkarKumar/veritas_ai_v2" | |
| tokenizer = AutoTokenizer.from_pretrained(model_name) | |
| model = AutoModelForSequenceClassification.from_pretrained(model_name) | |
| model.eval()  # disable dropout for inference | |
| id2label = {0: "REAL", 1: "FAKE"} | |
| def classify(text: str) -> dict: | |
|     # Tokenize with the same settings used during training. | |
|     inputs = tokenizer( | |
|         text, | |
|         padding="max_length", | |
|         truncation=True, | |
|         max_length=1024, | |
|         return_tensors="pt", | |
|     ) | |
|     # Global attention on the first token only. | |
|     global_attention_mask = torch.zeros_like(inputs["input_ids"]) | |
|     global_attention_mask[:, 0] = 1 | |
|     inputs["global_attention_mask"] = global_attention_mask | |
|     with torch.no_grad(): | |
|         outputs = model(**inputs) | |
|     # Convert logits to probabilities and pick the most likely class. | |
|     probs = torch.softmax(outputs.logits, dim=-1) | |
|     pred_id = int(torch.argmax(probs, dim=-1).item()) | |
|     return { | |
|         "label": id2label[pred_id], | |
|         "score": float(probs[0, pred_id]), | |
|     } | |
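The last step of the example, turning two logits into a label and a score, can be illustrated without any model or tensor library. This is a sketch with a hypothetical helper name; it mirrors the softmax and argmax above in plain Python.

```python
import math
from typing import Dict, Sequence

ID2LABEL = {0: "REAL", 1: "FAKE"}


def logits_to_prediction(logits: Sequence[float]) -> Dict[str, object]:
    """Apply softmax to the logits and return the top label and score."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    pred_id = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[pred_id], "score": probs[pred_id]}
```

The score is the softmax probability of the predicted class, so it depends only on the gap between the two logits. It describes the model's relative preference between REAL and FAKE, not how likely the content is to be factually true.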
| --- | |
| ## Intended Use | |
| Recommended: | |
| - misinformation research | |
| - content triage with human review | |
| - NLP prototyping and benchmarking | |
| Not recommended: | |
| - fully automated moderation without human oversight | |
| - legal, medical, civic, or safety-critical decision-making | |
| - standalone fact-checking without external evidence workflows | |
| --- | |
| ## Limitations and Bias | |
| - English-focused training data; multilingual performance is not guaranteed | |
| - Dataset-derived labels can carry source/style/political bias | |
| - Mixed claim-style and article-style supervision can create domain-shift effects | |
| - Performance may degrade on niche misinformation domains | |
| - Confidence scores are not factual certainty | |
| - Model outputs should support, not replace, human fact-checkers | |
| --- | |
| ## Ethical Use | |
| This model should be used as an assistive signal, not an autonomous truth system. | |
| Predictions should be reviewed with evidence retrieval, source validation, and human judgment. | |
| --- | |
| ## Author and Versioning | |
| - Author: Pushkar Kumar | |
| - Previous release: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new) | |
| - Current release: Veritas AI v2 |