PushkarKumar/veritas_ai_v2

@@ -1,159 +1,276 @@
 ---
 license: apache-2.0
 language:
-- en
-base_model:
-- allenai/longformer-base-4096
 pipeline_tag: text-classification
 tags:
-- longformer
-- fake-news-detection
-- news
-- misinformation
-- multi-dataset
 ---
-# Veritas AI v2 — Multi-Dataset Fake News & Misinformation Classifier (Longformer)
-> **Version:** 2.0  |  **Previous version:** [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
-A binary text-classification model that fine-tunes `allenai/longformer-base-4096` to classify long-form news articles as **REAL** or **FAKE**. This is an upgraded version of `veritas_ai_new`, retrained on a significantly larger and more diverse multi-dataset combination to improve generalization and robustness beyond a single news domain.
 ---
-## Model
-- **Base model:** `allenai/longformer-base-4096`
-- **Task:** Binary text classification (REAL / FAKE)
-- **Labels:** `0` = REAL, `1` = FAKE
-- **Max sequence length used:** 1024 tokens
-- **Parameters:** ~0.1B (same architecture as `longformer-base-4096` with a newly initialized 2-class classifier head)
-- **Framework:** Hugging Face `transformers` (Trainer API)
-- **Training platform:** Google Cloud Platform (Vertex AI)
 ---
-## What's New in v2
-- Trained on **multiple datasets** (multi-source) instead of only the ISOT Fake News Dataset used in v1
-- Larger and more diverse training corpus for improved cross-domain generalization
-- Additional preprocessing and dataset-balancing steps applied
-- *(Further changelog details to be added)*
 ---
-## Data
-- **Datasets:** *(To be filled — list all datasets used)*
-- **Languages:** English
-- **Preprocessing:**
-  - Added `label` column: `0` for REAL, `1` for FAKE
-  - Concatenated `title` and `text` into `full_text`
-  - Shuffled combined data with `random_state=42`
-  - Multi-dataset merging and deduplication applied
-  - Train/test split: 80% / 20%, stratified by `label`
-- **Dataset statistics:** *(To be filled — total examples, label distribution)*
 ---
-## Tokenization
-- **Tokenizer:** `AutoTokenizer.from_pretrained("allenai/longformer-base-4096")`
-- **Settings:**
-  - `padding="max_length"`
-  - `truncation=True`
-  - `max_length=1024`
-- **Global attention mask:** First token (`[CLS]`) set to 1, rest 0 — applied during both training and inference
 ---
-## Training Setup
-**Model init**
-```python
-model = AutoModelForSequenceClassification.from_pretrained(
-    "allenai/longformer-base-4096",
-    num_labels=2,
-)
-```
-**TrainingArguments**
-- `evaluation_strategy` = `"epoch"`
-- `save_strategy` = `"epoch"`
-- `learning_rate` = `2e-5`
-- `per_device_train_batch_size` = `1`
-- `per_device_eval_batch_size` = `1`
-- `gradient_accumulation_steps` = `4`
-- `num_train_epochs` = *(To be filled)*
-- `weight_decay` = `0.01`
-- `fp16` = `True`
-- `gradient_checkpointing` = `True`
-- `load_best_model_at_end` = `True`
-- `push_to_hub` = `False`
-- `report_to` = `"none"`
 ---
-## Training and Evaluation
-- **Epochs:** *(To be filled)*
-- **Global steps:** *(To be filled)*
-- **Training runtime:** *(To be filled)*
-- **Losses:**
-  - Training loss: *(To be filled)*
-  - Validation loss: *(To be filled)*
-- **Metrics:** *(To be filled — accuracy, F1, precision, recall if computed)*
 ---
-## Inference
-Minimal example for using the model from the Hub:
-```python
-from transformers import AutoTokenizer, AutoModelForSequenceClassification
-import torch
-model_name = "PushkarKumar/veritas_ai_v2"
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForSequenceClassification.from_pretrained(model_name)
-model.eval()
-def classify(text: str):
-    inputs = tokenizer(
-        text,
-        padding="max_length",
-        truncation=True,
-        max_length=1024,
-        return_tensors="pt",
-    )
-    global_attention_mask = torch.zeros(
-        inputs["input_ids"].shape, dtype=torch.long
-    )
-    global_attention_mask[:, 0] = 1
-    inputs["global_attention_mask"] = global_attention_mask
-    with torch.no_grad():
-        outputs = model(**inputs)
-    probs = torch.softmax(outputs.logits, dim=1)
-    label_id = int(torch.argmax(probs))
-    labels = {0: "REAL", 1: "FAKE"}
-    return labels[label_id], float(probs[0][label_id])
-```
 ---
 ## Limitations and Bias
-- Trained primarily on English-language news datasets; performance on other languages is not guaranteed.
-- Labels are based on data-source heuristics (e.g., credible outlets vs. unreliable sites), not article-level fact-checking, and may encode source or political bias.
-- While trained on multiple datasets for broader coverage, the model may still underperform on highly specialized or domain-specific misinformation (e.g., scientific misinformation, satire).
-- The model should **not** be used as an automated fact-checker or for high-stakes decisions without human oversight.
 ---
-## Author
-- **Author:** Pushkar Kumar
-- **v1 (base):** [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)

 ---
 license: apache-2.0
 language:
+  - en
+library_name: transformers
 pipeline_tag: text-classification
+base_model: allenai/longformer-base-4096
 tags:
+  - text-classification
+  - longformer
+  - fake-news-detection
+  - misinformation-detection
+  - news-classification
+  - multi-dataset
+  - vertex-ai
+  - pytorch
+  - transformers
 ---
+# Veritas AI v2: Multi-Dataset Fake News and Misinformation Classifier
+Version: 2.0
+Previous version: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
+Veritas AI v2 is a long-context binary classifier fine-tuned from allenai/longformer-base-4096 to classify content as REAL or FAKE.
+This version is a major upgrade over v1, moving from single-source training to multi-dataset training for stronger cross-domain robustness.
 ---
+## Why v2 Is a Major Upgrade
+v2 was trained with a production-style pipeline that includes:
+- Multi-dataset training pipeline with unified label mapping
+- Long-context architecture for article-length text
+- Distributed training orchestration on Vertex AI
+- Multi-destination artifact saving to guard against artifact loss
+- Metric-based checkpoint selection using weighted F1
+- Early stopping for better generalization
+- Retry and cleanup safeguards for long cloud training runs
+---
+## Model Overview
+- Base model: allenai/longformer-base-4096
+- Task: Binary text classification
+- Labels:
+  - 0 = REAL
+  - 1 = FAKE
+- Max sequence length: 1024
+- Approximate parameter count: 149M
+- Framework stack:
+  - Hugging Face Transformers Trainer
+  - PyTorch
+  - Accelerate
+- Training platform: Google Cloud Vertex AI
 ---
+## Training Data
+This model was trained on a merged corpus from:
+- ISOT Fake News Dataset
+  - True.csv
+  - Fake.csv
+- LIAR
+  - train.tsv
+  - valid.tsv
+- FEVER
+  - train.jsonl
+Language: English
+### Label Harmonization
+A consistent binary mapping was applied across all sources:
+- ISOT:
+  - True.csv -> 0
+  - Fake.csv -> 1
+- LIAR:
+  - false, barely-true, pants-fire -> 1
+  - half-true, mostly-true, true (all remaining LIAR labels) -> 0
+- FEVER:
+  - SUPPORTS -> 0
+  - REFUTES -> 1
+  - NOT ENOUGH INFO excluded
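The mapping above can be sketched as a single function. This is a minimal illustration, not the author's training code: the name `harmonize_label`, the lowercase source keys, and the convention of returning `None` for rows to drop are all assumptions made for the example.

```python
from typing import Optional

# Label sets copied from the mapping above; every name here is hypothetical.
LIAR_FAKE_LABELS = {"false", "barely-true", "pants-fire"}
ISOT_FILE_LABELS = {"True.csv": 0, "Fake.csv": 1}
FEVER_LABELS = {"SUPPORTS": 0, "REFUTES": 1}


def harmonize_label(source: str, raw_label: str) -> Optional[int]:
    """Return 0 (REAL), 1 (FAKE), or None when the row should be excluded."""
    if source == "isot":
        # ISOT has no label column; the originating file name decides the class.
        return ISOT_FILE_LABELS[raw_label]
    if source == "liar":
        # Three LIAR labels count as FAKE, every other LIAR label as REAL.
        return 1 if raw_label in LIAR_FAKE_LABELS else 0
    if source == "fever":
        # NOT ENOUGH INFO is not in the map, so .get() returns None (excluded).
        return FEVER_LABELS.get(raw_label)
    raise ValueError(f"unknown source: {source}")
```

Rows for which the function returns `None` would be filtered out before merging.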
+### Text Construction
+- ISOT input text: title + text
+- LIAR input text: statement + speaker
+- FEVER input text: claim
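A rough sketch of the per-source text construction follows. The card does not say how the fields are joined, so the single-space separator, the function name `build_text`, and the dict-shaped rows are assumptions.

```python
# Field names per source, taken from the list above.
TEXT_FIELDS = {
    "isot": ("title", "text"),
    "liar": ("statement", "speaker"),
    "fever": ("claim",),
}


def build_text(source: str, row: dict) -> str:
    """Join the configured fields of one row into a single input string."""
    if source not in TEXT_FIELDS:
        raise ValueError(f"unknown source: {source}")
    parts = [(row.get(field) or "").strip() for field in TEXT_FIELDS[source]]
    # Assumed separator: one space. Missing or empty fields are skipped.
    return " ".join(part for part in parts if part)
```

For example, `build_text("liar", {"statement": "Taxes fell.", "speaker": "jane-doe"})` yields `"Taxes fell. jane-doe"`.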
+### Data Processing
+- Mapped every source to a unified schema with two fields: fulltext and label
+- Dropped empty and trivial text rows
+- Merged all sources into one corpus
+- Shuffled with seed 42
+- Train/test split: 90/10 with seed 42
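The processing steps above could look roughly like this in plain Python. It is a sketch under stated assumptions: the card does not define "trivial" text, so the `min_chars` threshold is hypothetical, and the standard-library shuffle used here will not reproduce the exact row order of whatever library the original pipeline used, even with the same seed.

```python
import random


def prepare_corpus(sources, min_chars=10, test_fraction=0.1, seed=42):
    """Merge per-source rows, drop near-empty text, shuffle, and split.

    `sources` is a list of row lists; each row is a dict with the unified
    `fulltext` and `label` fields. `min_chars` is an assumed cutoff.
    """
    merged = []
    for rows in sources:
        for row in rows:
            text = (row.get("fulltext") or "").strip()
            if len(text) >= min_chars:  # drop empty and trivial rows
                merged.append({"fulltext": text, "label": row["label"]})

    # Seeded shuffle so the split is reproducible across runs.
    random.Random(seed).shuffle(merged)

    n_test = int(round(len(merged) * test_fraction))
    return merged[n_test:], merged[:n_test]  # (train, test)
```

Calling the function twice on the same input returns the same split, which is the property the fixed seed is meant to guarantee.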
 ---
+## Tokenization and Longformer Attention
+Tokenizer:
+- AutoTokenizer from allenai/longformer-base-4096
+Tokenization config:
+- padding: max_length
+- truncation: true
+- max_length: 1024
+Global attention mask:
+- first token set to 1
+- all remaining tokens set to 0
+This global-attention setup is applied in both training and inference.
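For the training side, the mask can be attached to already tokenized batches with a small function like the one below. This is an assumed shape, not the author's code: the name `add_global_attention` is hypothetical, and the batch is a plain dict of lists such as the one a batched `datasets.Dataset.map` call passes in.

```python
def add_global_attention(batch: dict) -> dict:
    """Add a Longformer-style global attention mask to a tokenized batch.

    Only the first token of each sequence gets global attention (1);
    all remaining positions use local attention (0).
    """
    batch["global_attention_mask"] = [
        [1 if position == 0 else 0 for position in range(len(input_ids))]
        for input_ids in batch["input_ids"]
    ]
    return batch
```

Because the mask has the same shape as `input_ids`, it pads and collates the same way as the other tokenizer outputs.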
 ---
+## Training Configuration
+Model initialization:
+    from transformers import AutoModelForSequenceClassification
+    model = AutoModelForSequenceClassification.from_pretrained(
+        "allenai/longformer-base-4096",
+        num_labels=2,
+    )
+Training arguments used for v2:
+- evaluation_strategy: epoch
+- save_strategy: epoch
+- learning_rate: 2e-5
+- per_device_train_batch_size: 8
+- per_device_eval_batch_size: 8
+- gradient_accumulation_steps: 2
+- num_train_epochs: 3
+- warmup_ratio: 0.06
+- weight_decay: 0.01
+- lr_scheduler_type: cosine
+- label_smoothing_factor: 0.1
+- fp16: true
+- tf32: true
+- gradient_checkpointing: false
+- load_best_model_at_end: true
+- metric_for_best_model: f1
+- early_stopping_patience: 2 (an EarlyStoppingCallback argument, not a TrainingArguments field)
+- save_total_limit: 2
+- push_to_hub: false
+- report_to: none
+- logging_strategy: steps
+- logging_steps: 10
+- ddp_find_unused_parameters: false
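As a sketch of how these settings are usually wired up, the values can be split into keyword arguments for `transformers.TrainingArguments` and for `transformers.EarlyStoppingCallback`, which is where `early_stopping_patience` belongs. The dict names and the helper below are hypothetical, only a subset of the listed settings is shown, and the device count is an assumption because the card does not state how many GPUs were used. Recent `transformers` releases also spell `evaluation_strategy` as `eval_strategy`.

```python
# Values copied from the list above; the dict names are illustrative only.
TRAINING_KWARGS = {
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 3,
    "warmup_ratio": 0.06,
    "weight_decay": 0.01,
    "lr_scheduler_type": "cosine",
    "label_smoothing_factor": 0.1,
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1",
    "save_total_limit": 2,
}

# Early stopping is configured on the callback, not on TrainingArguments.
EARLY_STOPPING_KWARGS = {"early_stopping_patience": 2}


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int) -> int:
    """Number of examples contributing to one optimizer step."""
    return per_device * grad_accum * num_devices


# num_devices=4 is a hypothetical value; the card does not give the GPU count.
print(effective_batch_size(
    TRAINING_KWARGS["per_device_train_batch_size"],
    TRAINING_KWARGS["gradient_accumulation_steps"],
    num_devices=4,
))  # 64
```

With `metric_for_best_model` set to `f1`, the metrics function passed to the Trainer has to return a key named `f1`.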
 ---
+## Evaluation
+Metrics computed during validation:
+- accuracy
+- weighted F1
+Best checkpoint selection:
+- weighted F1
+Final run statistics are not reported here; they can be appended from the trainer logs:
+- global steps
+- training runtime
+- final training loss
+- final validation loss
+- final accuracy
+- final weighted F1
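To make the two validation metrics concrete, here is a dependency-free sketch of accuracy and weighted F1. It is not the author's metrics code: the function names are hypothetical and it takes plain lists, whereas a Trainer `compute_metrics` callback would receive an `EvalPrediction` object holding arrays.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    total = len(y_true)
    score = 0.0
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = (2 * tp / denom) if denom else 0.0
        score += f1 * (tp + fn) / total  # weight by class support
    return score


def compute_metrics(logits, labels):
    """Turn per-example logits into predictions and score them."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(1 for p, t in zip(preds, labels) if p == t)
    return {"accuracy": correct / len(labels), "f1": weighted_f1(labels, preds)}
```

The `f1` key matches the `metric_for_best_model: f1` setting, so best-checkpoint selection would pick up this value.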
 ---
+## Reliability and Engineering Notes
+This project includes reliability safeguards for long cloud runs:
+- Distributed launch through Accelerate
+- Rank-aware preprocessing to avoid cache write collisions
+- Explicit distributed process-group cleanup to avoid NCCL warnings
+- Multi-destination save strategy:
+  - Vertex model output path
+  - primary GCS path
+  - timestamped backup GCS path
+  - local backup copy
+- Upload retry logic with verification checks
+These controls were added to avoid silent artifact-loss failures after long training jobs.
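The retry-and-verify idea can be sketched generically as below. Everything in it is an assumption made for illustration: the function name, the `upload` and `verify` callables, the exponential backoff, and the destination names. The card does not describe the actual implementation.

```python
import time


def upload_with_retry(upload, verify, destinations,
                      max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try each destination until the upload is verified or attempts run out.

    `upload(dest)` performs the copy and may raise; `verify(dest)` returns
    True once the artifact is confirmed present. Returns {dest: succeeded}.
    """
    results = {}
    for dest in destinations:
        succeeded = False
        for attempt in range(1, max_attempts + 1):
            try:
                upload(dest)
                if verify(dest):  # treat an unverified upload as a failure
                    succeeded = True
                    break
            except Exception:
                pass  # transient error: fall through to the retry delay
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # assumed backoff
        results[dest] = succeeded
    return results
```

A caller would pass the list of save targets (for example a primary path and a timestamped backup path) and fail the job only if every entry in the result is `False`.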
 ---
+## Inference Example
+    from transformers import AutoTokenizer, AutoModelForSequenceClassification
+    import torch
+    model_name = "PushkarKumar/veritas_ai_v2"
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    model = AutoModelForSequenceClassification.from_pretrained(model_name)
+    model.eval()
+    id2label = {0: "REAL", 1: "FAKE"}
+    def classify(text: str):
+        inputs = tokenizer(
+            text,
+            padding="max_length",
+            truncation=True,
+            max_length=1024,
+            return_tensors="pt",
+        )
+        global_attention_mask = torch.zeros_like(inputs["input_ids"])
+        global_attention_mask[:, 0] = 1
+        inputs["global_attention_mask"] = global_attention_mask
+        with torch.no_grad():
+            outputs = model(**inputs)
+        probs = torch.softmax(outputs.logits, dim=-1)
+        pred_id = int(torch.argmax(probs, dim=-1).item())
+        return {
+            "label": id2label[pred_id],
+            "score": float(probs[0, pred_id]),
+        }
+---
+## Intended Use
+Recommended:
+- misinformation research
+- content triage with human review
+- NLP prototyping and benchmarking
+Not recommended:
+- fully automated moderation without human oversight
+- legal, medical, civic, or safety-critical decision-making
+- standalone fact-checking without external evidence workflows
 ---
 ## Limitations and Bias
+- English-focused training data; multilingual performance is not guaranteed
+- Dataset-derived labels can carry source/style/political bias
+- Mixed claim-style and article-style supervision can create domain-shift effects
+- Performance may degrade on niche misinformation domains
+- Confidence scores are not factual certainty
+- Model outputs should support, not replace, human fact-checkers
+---
+## Ethical Use
+This model should be used as an assistive signal, not an autonomous truth system.
+Predictions should be reviewed with evidence retrieval, source validation, and human judgment.
 ---
+## Author and Versioning
+- Author: Pushkar Kumar
+- Previous release: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
+- Current release: Veritas AI v2