General Fake News Detector (5M)
Short description: A RoBERTa-base model fine-tuned on a large curated corpus (~5M article-level samples) for binary FAKE vs REAL classification. Built for long-form news verification and intended as the domain-adaptive foundation for downstream models (e.g., the LIAR political fact-checker).
Model repository: Arko007/fake-news-roberta-5M
Model snapshot / overview
- Base model: RoBERTa-base (125M parameters)
- Task: Binary classification — FAKE vs REAL (article-level)
- Domain: News articles, long-form content, broad topical coverage (news, science, health)
- Primary use-cases: news verification, content moderation, fact-checking pipelines
Key performance (reported / validation)
- Validation accuracy (checkpoint-14000): 99.28%
- Validation F1: 99.30%
- Expected final validation accuracy: ~99% (projection for the completed run)
Notes:
- The validation distribution is highly imbalanced (FAKE ~2.8%, REAL ~97.2%), so accuracy alone can overstate performance.
- The high scores are consistent with prior models trained on ISOT-style datasets; confirm generalization on stronger out-of-distribution (OOD) tests.
Training & fine-tuning pipeline
Two-stage approach:
- Stage 1: standard RoBERTa pretraining (the official roberta-base checkpoint, used as released)
- Stage 2: fine-tuning on ~5M curated news samples (a mixture of ISOT, Kaggle corpora, and other public sources); a minimal initialization sketch follows this list
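As a hedged illustration of how Stage 2 would be initialized with the transformers API (the id2label/label2id mapping is an assumption, not taken from the repository):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stage 2 starts from the official pretrained roberta-base checkpoint and
# attaches a fresh 2-way classification head for FAKE vs REAL.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=2,
    id2label={0: "REAL", 1: "FAKE"},  # assumed mapping, for illustration only
    label2id={"REAL": 0, "FAKE": 1},
)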
Training samples & splits:
- Training: ~4.5M samples (ongoing training run reported)
- Validation: held-out set (class distribution: FAKE 2.8%, REAL 97.2%)
- Checkpoint example: step 14,000 (validation acc 99.28%)
Training hyperparameters (example run):
- Optimizer: AdamW
- Learning rate: 2e-5
- Batch size: 8
- Total steps: ~94,466 (1 epoch)
- Training precision: BF16 mixed precision
- Hardware: NVIDIA L4 (24 GB VRAM)
- Time: ~24 hours for one full epoch
Notes:
- Gradient accumulation was used to control the effective batch size.
- A cosine learning-rate schedule and a class-weighted loss were used to address the class imbalance.
- Checkpoints are auto-uploaded to the Hugging Face Hub as part of the pipeline; a hedged configuration sketch follows.
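The repository's scripts are authoritative; as a hedged illustration only, the reported settings map onto the Hugging Face Trainer roughly as follows. The gradient-accumulation value, save interval, and class weights are assumptions (the card does not publish them); the weights shown are derived from the reported 2.8%/97.2% split purely for illustration.

import torch
from torch.nn import CrossEntropyLoss
from transformers import Trainer, TrainingArguments

# Illustrative class weights for the FAKE/REAL imbalance (assumption, not the trained values).
CLASS_WEIGHTS = torch.tensor([1.0 / 0.972, 1.0 / 0.028])  # index 0 = REAL, index 1 = FAKE

class WeightedTrainer(Trainer):
    """Trainer variant with a class-weighted cross-entropy loss (sketch)."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = CrossEntropyLoss(weight=CLASS_WEIGHTS.to(outputs.logits.device))
        loss = loss_fct(outputs.logits.view(-1, 2), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="fake-news-roberta-5M",
    learning_rate=2e-5,                # reported; Trainer's default optimizer is AdamW
    per_device_train_batch_size=8,     # reported
    gradient_accumulation_steps=4,     # assumption: accumulation was used, value unpublished
    num_train_epochs=1,                # reported: one full epoch
    lr_scheduler_type="cosine",        # reported
    bf16=True,                         # reported BF16 mixed precision
    save_steps=2000,                   # assumption
    push_to_hub=True,                  # auto-upload of checkpoints, as described above
)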
Data sources & provenance
Primary sources include (non-exhaustive):
- ISOT Fake News Dataset (major component)
- Multiple Kaggle fake-news datasets
- Other public news/fake-news corpora
- Curated scraping / aggregation to reach ~5M samples
Data caveats:
- Large class imbalance (dominant REAL class).
- ISOT-style datasets are known to be comparatively easy, which may inflate metrics.
- Obtain legal review before redistributing scraped or third-party content.
Evaluation & generalization
- The model transfers to the downstream LIAR task: fine-tuning from this checkpoint yields ~71% final test performance on LIAR.
- Validation metrics are high on in-domain news; evaluate thoroughly on OOD data, short-statement tasks, and tougher fact-checking benchmarks before deployment (a minimal evaluation sketch follows).
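A minimal sketch of such an evaluation, assuming you have an OOD benchmark with gold labels. The texts and labels below are placeholders, and the label strings the model emits (e.g. FAKE/REAL vs LABEL_0/LABEL_1) depend on the uploaded config:

from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline

clf = pipeline("text-classification", model="Arko007/fake-news-roberta-5M", device=-1)

# Placeholders: swap in a real out-of-distribution benchmark.
ood_texts = ["First held-out article ...", "Second held-out article ..."]
ood_labels = ["REAL", "FAKE"]  # gold labels, in whatever label space the model config defines

preds = [out["label"] for out in clf(ood_texts, truncation=True, max_length=512, batch_size=32)]
print("accuracy:", accuracy_score(ood_labels, preds))
print("macro F1:", f1_score(ood_labels, preds, average="macro"))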
Advantages & limitations
Advantages:
- Works on full news articles (long context)
- Strong validation accuracy on in-domain distribution
- Good foundation for domain-adaptive transfer learning
Limitations:
- Class imbalance: FAKE is under-represented
- Primary sources include ISOT, which can be "easier" than real-world adversarial fake news
- May perform poorly on short, context-free claims (use LIAR model for short statements)
- English-only
Intended uses & cautions
Appropriate uses:
- Research and large-scale content triage for news sites
- A feature in human-in-the-loop fact-checking pipelines
- Starting checkpoint for specialized fine-tuning (political, medical, scientific)
Cautions:
- Avoid using as the only signal for content takedown or high-stakes decisions
- Evaluate and mitigate bias (topic / publisher skew)
- Respect dataset licenses and web-source policies
Usage example
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_id = "Arko007/fake-news-roberta-5M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# device=0 selects the first GPU; use device=-1 for CPU
clf = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

article = """<paste long news article text here>"""
# RoBERTa-base supports at most 512 positions, so truncate long inputs to 512 tokens
result = clf(article, truncation=True, max_length=512)
print(result)
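Because of the 512-token limit, the snippet above silently truncates longer articles. One hedged workaround, not part of the released pipeline, is to score overlapping token windows and average the probabilities; window and stride values here are arbitrary:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Arko007/fake-news-roberta-5M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def classify_long(text, window=400, stride=50):
    # Slide a token window over the article, classify each chunk,
    # then average the per-chunk probabilities.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    probs = []
    for start in range(0, max(len(ids) - stride, 1), window - stride):
        chunk = tokenizer.decode(ids[start:start + window])
        inputs = tokenizer(chunk, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            probs.append(torch.softmax(model(**inputs).logits, dim=-1))
    mean = torch.cat(probs).mean(dim=0)
    label_id = int(mean.argmax())
    return model.config.id2label[label_id], float(mean[label_id])

print(classify_long(article))  # reuses `article` from the snippet above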
Reproducibility & checkpoints
The training scripts in the repository include:
- preprocessing steps (tokenization, article truncation)
- class-weighting code
- checkpointing & auto-upload logic
- hyperparameter and schedule definitions
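Those scripts are not reproduced here; as a minimal sketch of the preprocessing they describe, assuming CSV files with text and label columns (file and column names are hypothetical):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    # Articles beyond RoBERTa's 512-token window are truncated, as described above.
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Hypothetical local files; the ~5M corpus itself is not redistributed with the model.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})
dataset = dataset.map(tokenize, batched=True)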
Citation
Suggested model citation:
@misc{fake-news-5m-2025,
  title        = {General Fake News Detector (5M)},
  author       = {Arko007},
  year         = {2025},
  howpublished = {Hugging Face model hub: Arko007/fake-news-roberta-5M},
  note         = {RoBERTa-base fine-tuned on a 5M-sample curated news corpus}
}
Also cite ISOT dataset and other upstream sources per their instructions.
Contact & maintainer
Maintainer: Arko007 (https://huggingface.co/Arko007)
Repository: https://github.com/Arko007/fake-news-roberta-5M
If you find licensing issues, provenance problems, or unsafe outputs, please open an issue in the repo.