	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	pipeline_tag: text-classification
	base_model: allenai/longformer-base-4096
	tags:
	- text-classification
	- longformer
	- fake-news-detection
	- misinformation-detection
	- news-classification
	- multi-dataset
	- vertex-ai
	- pytorch
	- transformers
	---

	# Veritas AI v2: Multi-Dataset Fake News and Misinformation Classifier

	- Version: 2.0
	- Previous version: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)

	Veritas AI v2 is a long-context binary classifier, fine-tuned from allenai/longformer-base-4096, that labels English text as REAL or FAKE.
	Where v1 was trained on a single source, v2 is trained on a merged multi-dataset corpus, with the goal of stronger cross-domain robustness.

	---

	## Why v2 Is a Major Upgrade

	This release reflects a full production-style training effort:

	- Multi-dataset training pipeline with unified label mapping
	- Long-context architecture for article-length text
	- Distributed training orchestration on Vertex AI
	- Reliability-focused artifact save strategy
	- Metric-based checkpoint selection using weighted F1
	- Early stopping for better generalization
	- Hardened cloud training flow for long runs

	---

	## Model Overview

	- Base model: allenai/longformer-base-4096
	- Task: Binary text classification
	- Labels:
	  - 0 = REAL
	  - 1 = FAKE
	- Max sequence length: 1024 tokens (the base model supports up to 4096)
	- Parameter count: about 149M
	- Framework stack:
	  - Hugging Face Transformers Trainer
	  - PyTorch
	  - Accelerate
	- Training platform: Google Cloud Vertex AI

	---

	## Training Data

	This model was trained on a merged corpus from:

	- ISOT Fake News Dataset
	  - True.csv
	  - Fake.csv
	- LIAR
	  - train.tsv
	  - valid.tsv
	- FEVER
	  - train.jsonl

	Language: English

	### Label Harmonization

	A consistent binary mapping was applied across all sources:

	- ISOT:
	  - True.csv -> 0
	  - Fake.csv -> 1
	- LIAR:
	  - false, barely-true, pants-fire -> 1
	  - all remaining LIAR labels -> 0
	- FEVER:
	  - SUPPORTS -> 0
	  - REFUTES -> 1
	  - NOT ENOUGH INFO excluded
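
The mapping above can be sketched as three small functions. This is a minimal illustration, not the project's actual preprocessing code: the function names are hypothetical, and the lowercasing/uppercasing of labels is an assumption added for robustness.

```python
from typing import Optional

# LIAR labels that the card maps to FAKE (1).
LIAR_FAKE_LABELS = {"false", "barely-true", "pants-fire"}


def map_isot(source_file: str) -> int:
    """ISOT: rows from True.csv are REAL (0), rows from Fake.csv are FAKE (1)."""
    return {"True.csv": 0, "Fake.csv": 1}[source_file]


def map_liar(label: str) -> int:
    """LIAR: three labels map to FAKE (1); every remaining label maps to REAL (0)."""
    return 1 if label.strip().lower() in LIAR_FAKE_LABELS else 0


def map_fever(label: str) -> Optional[int]:
    """FEVER: SUPPORTS -> 0, REFUTES -> 1, anything else -> None (row is excluded)."""
    return {"SUPPORTS": 0, "REFUTES": 1}.get(label.strip().upper())
```

Returning `None` for NOT ENOUGH INFO lets the caller filter those rows out before merging.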

	### Text Construction

	- ISOT input text: title + text
	- LIAR input text: statement + speaker
	- FEVER input text: claim
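
A minimal sketch of the text construction, assuming each row is a dict keyed by field name. The single-space separator, the helper names, and the row keys are assumptions; the card only states which fields are combined.

```python
def join_fields(*fields: str, sep: str = " ") -> str:
    """Join the non-empty fields with `sep` (the separator is an assumption)."""
    return sep.join(f.strip() for f in fields if f and f.strip())


def isot_text(row: dict) -> str:
    # ISOT: title + text
    return join_fields(row["title"], row["text"])


def liar_text(row: dict) -> str:
    # LIAR: statement + speaker
    return join_fields(row["statement"], row["speaker"])


def fever_text(row: dict) -> str:
    # FEVER: claim only
    return join_fields(row["claim"])
```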

	### Data Processing

	- Mapped every source to a unified schema with two fields: fulltext and label
	- Dropped empty and trivial text rows
	- Merged all sources into one corpus
	- Shuffled with seed 42
	- Train/test split: 90/10 with seed 42
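
The processing steps can be sketched in plain Python as follows. This is an illustration under assumptions: `prepare_corpus` and `min_chars` are hypothetical names, the threshold for "trivial" text is not stated in this card, and a seeded shuffle followed by a slice will not reproduce the exact split produced by the original pipeline.

```python
import random


def prepare_corpus(sources, seed=42, test_fraction=0.1, min_chars=1):
    """Merge per-source rows, drop empty/trivial text, shuffle, and split.

    `sources` is a list of row lists; each row is {"fulltext": str, "label": int}.
    `min_chars` is a hypothetical cutoff for what counts as trivial text.
    """
    rows = [
        {"fulltext": r["fulltext"].strip(), "label": int(r["label"])}
        for source in sources
        for r in source
        if r.get("fulltext") and len(r["fulltext"].strip()) >= min_chars
    ]
    # Seeded shuffle so the merged order is reproducible.
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]  # (train, test)
```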

	---

	## Tokenization and Longformer Attention

	Tokenizer:
	- AutoTokenizer from allenai/longformer-base-4096

	Tokenization config:
	- padding: max_length
	- truncation: true
	- max_length: 1024

	Global attention mask:
	- first token set to 1
	- all remaining tokens set to 0

	This global-attention setup is applied in both training and inference.
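
For the training side, the same mask can be added to an already-tokenized batch. The sketch below assumes the batch is a dict of Python lists, as a batched preprocessing function would receive; `add_global_attention` is a hypothetical name, not a function from the project.

```python
def add_global_attention(batch: dict) -> dict:
    """Add a Longformer global attention mask to a tokenized batch.

    Expects batch["input_ids"] to be a list of token-id lists. The first
    token of each sequence gets global attention (1); all others get 0.
    """
    batch["global_attention_mask"] = [
        [1 if i == 0 else 0 for i in range(len(ids))]
        for ids in batch["input_ids"]
    ]
    return batch
```

Because the Longformer tokenizer places the `<s>` token first, this gives the classification token a view of the whole sequence while every other token keeps local windowed attention.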

	---

	## Training Configuration

	Model initialization:

	```python
	from transformers import AutoModelForSequenceClassification

	model = AutoModelForSequenceClassification.from_pretrained(
	    "allenai/longformer-base-4096",
	    num_labels=2,  # 0 = REAL, 1 = FAKE
	)
	```

	Training arguments used for v2:

	- evaluation_strategy: epoch
	- save_strategy: epoch
	- learning_rate: 2e-5
	- per_device_train_batch_size: 8
	- per_device_eval_batch_size: 8
	- gradient_accumulation_steps: 2
	- num_train_epochs: 3
	- warmup_ratio: 0.06
	- weight_decay: 0.01
	- lr_scheduler_type: cosine
	- label_smoothing_factor: 0.1
	- fp16: true
	- tf32: true
	- gradient_checkpointing: false
	- load_best_model_at_end: true
	- metric_for_best_model: f1
	- early_stopping_patience: 2 (an EarlyStoppingCallback parameter, not a TrainingArguments field)
	- save_total_limit: 2
	- push_to_hub: false
	- report_to: none
	- logging_strategy: steps
	- logging_steps: 10
	- ddp_find_unused_parameters: false
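
The batch-size and warmup settings interact with the number of devices and the dataset size, neither of which is stated in this card. The sketch below shows the arithmetic approximately; the function names are hypothetical, and the Trainer's own step accounting can differ slightly at epoch boundaries.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int) -> int:
    """Examples consumed per optimizer step across all devices."""
    return per_device * grad_accum * num_devices


def schedule_steps(num_examples, per_device, grad_accum, num_devices,
                   epochs, warmup_ratio):
    """Approximate total optimizer steps and warmup steps for a run."""
    eff = effective_batch_size(per_device, grad_accum, num_devices)
    steps_per_epoch = math.ceil(num_examples / eff)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps
```

With four devices (a hypothetical count), the effective batch size is 8 × 2 × 4 = 64 examples per optimizer step.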

	---

	## Evaluation

	Metrics computed during validation:
	- accuracy
	- weighted F1

	Best checkpoint selection:
	- weighted F1
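
Weighted F1 averages the per-class F1 scores, weighting each class by its share of the true labels. The card does not say which library computed the metrics, so the following is a pure-Python stand-in rather than the project's code; the function name is hypothetical, and the `"f1"` key is chosen to line up with `metric_for_best_model: f1`.

```python
from collections import Counter


def accuracy_and_weighted_f1(y_true, y_pred):
    """Return accuracy and support-weighted F1 for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    weighted_f1 = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = support - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by its share of the true labels.
        weighted_f1 += f1 * support / n
    return {"accuracy": accuracy, "f1": weighted_f1}
```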

	Final run statistics are not reported in this card. They can be appended from the trainer logs:
	- global steps
	- training runtime
	- final training loss
	- final validation loss
	- final accuracy
	- final weighted F1

	---

	## Reliability and Engineering Notes

	This project includes reliability safeguards for long cloud runs:

	- Distributed launch through Accelerate
	- Rank-aware preprocessing to avoid cache write collisions
	- Explicit distributed process-group cleanup to avoid NCCL warnings
	- Multi-destination save strategy:
	  - Vertex model output path
	  - primary GCS path
	  - timestamped backup GCS path
	  - local backup copy
	- Upload retry logic with verification checks

	These controls were added to avoid silent artifact-loss failures after long training jobs.
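
The save-retry-verify pattern can be illustrated with a local-filesystem stand-in. This sketch copies directories with `shutil` instead of uploading to GCS, so it is not the project's upload code; all function names, the retry count, and the size-based verification are assumptions.

```python
import shutil
import time
from pathlib import Path


def dir_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its size in bytes."""
    return {
        str(p.relative_to(root)): p.stat().st_size
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def save_with_retry(src: Path, dest: Path, retries: int = 3,
                    delay_s: float = 1.0) -> bool:
    """Copy src to dest, verify every file arrived, and retry on failure."""
    expected = dir_manifest(src)
    for attempt in range(1, retries + 1):
        try:
            shutil.copytree(src, dest, dirs_exist_ok=True)
            copied = dir_manifest(dest)
            # Verification: every source file exists at dest with the same size.
            if all(copied.get(name) == size for name, size in expected.items()):
                return True
        except OSError:
            pass  # treat I/O errors as a failed attempt and retry
        if attempt < retries:
            time.sleep(delay_s)
    return False


def save_everywhere(src: Path, destinations) -> dict:
    """Save to every destination independently and report which ones succeeded."""
    return {str(d): save_with_retry(src, Path(d)) for d in destinations}
```

Saving to each destination independently means one failed target does not prevent the others from receiving a verified copy.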

	---

	## Inference Example

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	model_name = "PushkarKumar/veritas_ai_v2"

	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)
	model.eval()  # inference mode: disables dropout

	id2label = {0: "REAL", 1: "FAKE"}

	def classify(text: str):
	    # Tokenize with the same settings used in training.
	    inputs = tokenizer(
	        text,
	        padding="max_length",
	        truncation=True,
	        max_length=1024,
	        return_tensors="pt",
	    )

	    # Global attention on the first token only, as in training.
	    global_attention_mask = torch.zeros_like(inputs["input_ids"])
	    global_attention_mask[:, 0] = 1
	    inputs["global_attention_mask"] = global_attention_mask

	    with torch.no_grad():
	        outputs = model(**inputs)

	    # Convert logits to class probabilities and pick the most likely class.
	    probs = torch.softmax(outputs.logits, dim=-1)
	    pred_id = int(torch.argmax(probs, dim=-1).item())

	    return {
	        "label": id2label[pred_id],
	        "score": float(probs[0, pred_id]),
	    }
	```

	---

	## Intended Use

	Recommended:
	- misinformation research
	- content triage with human review
	- NLP prototyping and benchmarking

	Not recommended:
	- fully automated moderation without human oversight
	- legal, medical, civic, or safety-critical decision-making
	- standalone fact-checking without external evidence workflows

	---

	## Limitations and Bias

	- English-focused training data; multilingual performance is not guaranteed
	- Dataset-derived labels can carry source/style/political bias
	- Mixed claim-style and article-style supervision can create domain-shift effects
	- Performance may degrade on niche misinformation domains
	- Confidence scores are not factual certainty
	- Model outputs should support, not replace, human fact-checkers

	---

	## Ethical Use

	This model should be used as an assistive signal, not an autonomous truth system.
	Predictions should be reviewed with evidence retrieval, source validation, and human judgment.

	---

	## Author and Versioning

	- Author: Pushkar Kumar
	- Previous release: [PushkarKumar/veritas_ai_new](https://huggingface.co/PushkarKumar/veritas_ai_new)
	- Current release: Veritas AI v2