+---
+license: apache-2.0
+language:
+- en
+base_model:
+- allenai/longformer-base-4096
+pipeline_tag: text-classification
+tags:
+- longformer
+- fake-news-detection
+- news
+- misinformation
+- multi-dataset
+---
+# Veritas AI v2 — Multi-Dataset Fake News & Misinformation Classifier (Longformer)
+> **Version:** 2.0 | **Previous version:** [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
+
+A binary text-classification model that fine-tunes `allenai/longformer-base-4096` to classify long-form news articles as **REAL** or **FAKE**. It is an upgraded version of `veritas_ai_new`, retrained on a larger, more diverse combination of datasets to improve generalization and robustness beyond a single news domain.
+
+---
+## Model
+- **Base model:** `allenai/longformer-base-4096`
+- **Task:** Binary text classification (REAL / FAKE)
+- **Labels:** `0` = REAL, `1` = FAKE
+- **Max sequence length used:** 1024 tokens (the base model supports up to 4096; longer inputs are truncated)
+- **Parameters:** ~0.15B (the `longformer-base-4096` architecture, about 149M parameters, plus a newly initialized 2-class classifier head)
+- **Framework:** Hugging Face `transformers` (Trainer API)
+- **Training platform:** Google Cloud Platform (Vertex AI)
+---
+## What's New in v2
+- Trained on **multiple datasets** (multi-source) instead of only the ISOT Fake News Dataset used in v1
+- Larger and more diverse training corpus for improved cross-domain generalization
+- Additional preprocessing and dataset-balancing steps applied
+- *(Further changelog details to be added)*
+---
+## Data
+- **Datasets:** *(To be filled — list all datasets used)*
+- **Languages:** English
+- **Preprocessing:**
+  - Added `label` column: `0` for REAL, `1` for FAKE
+  - Concatenated `title` and `text` into `full_text`
+  - Merged the source datasets and removed duplicate articles
+  - Shuffled the combined data with `random_state=42`
+  - Train/test split: 80% / 20%, stratified by `label`
+- **Dataset statistics:** *(To be filled — total examples, label distribution)*
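The preprocessing steps above could be sketched roughly as follows with pandas. This is an illustration under assumptions, not the author's actual pipeline: the `build_dataset` and `stratified_split` helpers and the choice to deduplicate on `full_text` are hypothetical. Only the column names, `random_state=42`, and the stratified 80/20 split come from this card.

```python
import pandas as pd


def build_dataset(real_frames, fake_frames, seed=42):
    """Label, merge, deduplicate, and shuffle source frames (hypothetical helper)."""
    parts = []
    for label, frames in ((0, real_frames), (1, fake_frames)):
        for frame in frames:
            part = frame[["title", "text"]].copy()
            part["label"] = label  # 0 = REAL, 1 = FAKE
            parts.append(part)
    df = pd.concat(parts, ignore_index=True)
    # Concatenate title and body into a single model input.
    df["full_text"] = df["title"].fillna("") + " " + df["text"].fillna("")
    # Assumption: duplicates are detected on the combined text.
    df = df.drop_duplicates(subset="full_text")
    # Shuffle with the seed stated in the card.
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)


def stratified_split(df, test_frac=0.2, seed=42):
    """Split 80/20 while keeping the label ratio similar in both parts."""
    test = df.groupby("label").sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train.reset_index(drop=True), test.reset_index(drop=True)
```

Sampling the test rows per label group is one simple way to stratify; `sklearn.model_selection.train_test_split` with `stratify=` would be an equally reasonable choice.
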
+---
+## Tokenization
+- **Tokenizer:** `AutoTokenizer.from_pretrained("allenai/longformer-base-4096")`
+- **Settings:**
+  - `padding="max_length"`
+  - `truncation=True`
+  - `max_length=1024`
+- **Global attention mask:** first token (`<s>`, the RoBERTa-style equivalent of `[CLS]` used by the Longformer tokenizer) set to 1, all other positions 0; applied during both training and inference
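To make the global attention setting concrete, here is a small dependency-free sketch of how the mask could be built for a batch of tokenized examples. The helper name `add_global_attention` is hypothetical, and it assumes the batch is a dict of lists, like the one a batched `datasets.map` call passes to its function. The card does not say how the author implemented this step.

```python
def add_global_attention(batch):
    """Add a mask with global attention on the first token of every sequence."""
    batch["global_attention_mask"] = [
        [1] + [0] * (len(ids) - 1) for ids in batch["input_ids"]
    ]
    return batch


# Toy ids for illustration only, not real tokenizer output.
batch = {"input_ids": [[0, 713, 16, 2, 1, 1]]}
print(add_global_attention(batch)["global_attention_mask"])  # [[1, 0, 0, 0, 0, 0]]
```
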
+---
+## Training Setup
+**Model init**
+```python
+model = AutoModelForSequenceClassification.from_pretrained(
+    "allenai/longformer-base-4096",
+    num_labels=2,
+)
+```
+**TrainingArguments**
+- `evaluation_strategy` = `"epoch"` (renamed `eval_strategy` in newer `transformers` releases)
+- `save_strategy` = `"epoch"`
+- `learning_rate` = `2e-5`
+- `per_device_train_batch_size` = `1`
+- `per_device_eval_batch_size` = `1`
+- `gradient_accumulation_steps` = `4`
+- `num_train_epochs` = *(To be filled)*
+- `weight_decay` = `0.01`
+- `fp16` = `True`
+- `gradient_checkpointing` = `True`
+- `load_best_model_at_end` = `True`
+- `push_to_hub` = `False`
+- `report_to` = `"none"`
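With a per-device batch size of 1 and 4 gradient-accumulation steps, each optimizer update sees 4 examples per device. The sketch below works through that arithmetic. The device count and the training-set size are placeholder assumptions, since the card does not state them.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1              # assumption: a single accelerator
num_train_examples = 10_000  # hypothetical; the real dataset size is not given

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Approximate number of optimizer updates in one epoch.
updates_per_epoch = math.ceil(num_train_examples / effective_batch_size)

print(effective_batch_size)  # 4
print(updates_per_epoch)     # 2500
```
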
+---
+## Training and Evaluation
+- **Epochs:** *(To be filled)*
+- **Global steps:** *(To be filled)*
+- **Training runtime:** *(To be filled)*
+- **Losses:**
+  - Training loss: *(To be filled)*
+  - Validation loss: *(To be filled)*
+- **Metrics:** *(To be filled — accuracy, F1, precision, recall if computed)*
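Since the metrics are not yet reported, here is one way accuracy, precision, recall, and F1 could be computed from model outputs. It is a dependency-free sketch, not the author's evaluation code: the function name `binary_metrics` is hypothetical, and treating FAKE (`1`) as the positive class is an assumption.

```python
def binary_metrics(logits, labels, positive=1):
    """Compute accuracy, precision, recall, and F1 from per-example logits."""
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    pairs = list(zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    correct = sum(p == y for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

When used with the `Trainer`, the `(logits, labels)` pair that a `compute_metrics` callback receives could be unpacked and passed straight to this function.
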
+---
+## Inference
+Minimal example for using the model from the Hub:
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+
+model_name = "PushkarKumar/veritas_ai_v2"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+model.eval()  # disable dropout so predictions are deterministic
+
+LABELS = {0: "REAL", 1: "FAKE"}
+
+
+def classify(text: str) -> tuple[str, float]:
+    """Return the predicted label and its softmax probability for one article."""
+    # Same tokenization settings as training: pad/truncate to 1024 tokens.
+    inputs = tokenizer(
+        text,
+        padding="max_length",
+        truncation=True,
+        max_length=1024,
+        return_tensors="pt",
+    )
+    # Global attention on the first token only, as in training.
+    global_attention_mask = torch.zeros_like(inputs["input_ids"])
+    global_attention_mask[:, 0] = 1
+    inputs["global_attention_mask"] = global_attention_mask
+    with torch.no_grad():
+        logits = model(**inputs).logits
+    probs = torch.softmax(logits, dim=-1)[0]  # shape: (2,)
+    label_id = int(torch.argmax(probs))
+    return LABELS[label_id], float(probs[label_id])
+```
+---
+## Limitations and Bias
+- Trained primarily on English-language news datasets; performance on other languages is not guaranteed.
+- Labels are based on data-source heuristics (e.g., credible outlets vs. unreliable sites), not article-level fact-checking, and may encode source or political bias.
+- While trained on multiple datasets for broader coverage, the model may still underperform on highly specialized or domain-specific misinformation (e.g., scientific misinformation, satire).
+- The model should **not** be used as an automated fact-checker or for high-stakes decisions without human oversight.
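One way to keep a human in the loop is to act only on confident predictions and send the rest to a reviewer. The sketch below assumes the `(label, confidence)` pair returned by the `classify` function in the Inference section. The `route_prediction` name and the `0.9` threshold are hypothetical, and the threshold would need tuning on held-out data.

```python
def route_prediction(label, confidence, threshold=0.9):
    """Return the model label if confident enough, else flag for human review."""
    if confidence >= threshold:
        return label
    return "NEEDS_REVIEW"


print(route_prediction("FAKE", 0.97))  # FAKE
print(route_prediction("REAL", 0.62))  # NEEDS_REVIEW
```

Softmax probabilities are not calibrated confidence estimates, so a rule like this reduces but does not replace the need for human oversight.
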
+---
+## Author
+- **Author:** Pushkar Kumar
+- **v1 (base):** [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)