---
title: TruthLens
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.31.0
python_version: 3.10.13
app_file: app.py
pinned: false
---
# TruthLens: Advanced Fake News Detection Pipeline

TruthLens is an end-to-end fake news detection system that goes beyond reporting a single model probability. It applies a **5-signal weighted scoring framework** built on journalistic standards, combining deep learning models (DistilBERT, RoBERTa), a sequence model (LSTM), a statistical model (Logistic Regression), and heuristic analysis to deliver explainable verdicts.

## 🌟 Key Features

*   **5-Signal Scoring Framework:**
    *   **Source Credibility (30%):** Evaluates outlet reputation, author presence, and source corroboration, including typosquatting checks.
    *   **Claim Verification (30%):** Combines AI probability with spaCy-based Named Entity Recognition (NER) and quote attribution analysis.
    *   **Linguistic Quality (20%):** Detects sensationalism, superlatives, and passive voice, and uses DistilBERT to check whether the headline contradicts the body.
    *   **Freshness (10%):** Contextual and date-based temporal scoring to detect outdated information.
    *   **AI Model Consensus (10%):** Ensemble voting from Logistic Regression, LSTM, DistilBERT, and RoBERTa.
*   **Adversarial Guardrails:** Hard caps and overrides for highly suspicious patterns (Triple Anonymity, Uncited Statistics, Headline Contradictions).
*   **Live Web Corroboration:** RAG (Retrieval-Augmented Generation) pipeline using live search to verify unambiguous claims.
*   **TruthLens UI:** A Streamlit dashboard that adapts to dark and light mode and explains each verdict down to the individual signals and deductions.
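
The weighting described above can be illustrated with a minimal sketch. The weights come from the feature list; everything else is an assumption made for illustration: the 0-100 scale for each signal, the function name `weighted_score`, and the `cap` parameter, which stands in for a guardrail hard cap. The actual logic in `stage4_inference.py` may differ.

```python
# Weights taken from the feature list above; they sum to 1.0.
WEIGHTS = {
    "source_credibility": 0.30,
    "claim_verification": 0.30,
    "linguistic_quality": 0.20,
    "freshness": 0.10,
    "model_consensus": 0.10,
}


def weighted_score(signals, cap=None):
    """Combine per-signal scores (assumed 0-100) into one 0-100 score.

    `cap` is a hypothetical guardrail ceiling: when an adversarial flag
    fires, the final score is not allowed to exceed it.
    """
    missing = set(WEIGHTS) - set(signals)
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    score = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    if cap is not None:
        score = min(score, cap)
    return round(score, 2)


# Illustrative inputs only, not real model output.
example = {
    "source_credibility": 80,
    "claim_verification": 70,
    "linguistic_quality": 60,
    "freshness": 90,
    "model_consensus": 50,
}
print(weighted_score(example))
print(weighted_score(example, cap=40))
```

Applying the cap after the weighted sum means a single strong red flag can override otherwise high signal scores, which is the behavior a hard cap implies.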

---

## 📁 Project Structure

```text
fake_news_detection/
├── app.py                     # Streamlit frontend (TruthLens UI)
├── run_pipeline.py            # Main script to run pipeline stages
├── requirements.txt           # Python dependencies
├── src/
│   ├── stage1_ingestion.py    # Downloads and prepares datasets
│   ├── stage2_preprocessing.py  # Cleans text, tokenizes, and saves artifacts
│   ├── stage3_training.py     # Trains models (LR, LSTM, DistilBERT, RoBERTa)
│   ├── stage4_inference.py    # The 5-signal scoring engine and prediction logic
│   └── utils/
│       └── rag_retrieval.py   # Live web search corroboration functions
├── data/                      # Raw and processed datasets (created during execution)
└── models/                    # Trained models and vectorizers (created during execution)
```

---

## 🚀 Getting Started

### 1. Installation

Ensure you have Python 3.8+ installed. Install the required dependencies:

```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

### 2. Running the Pipeline

The project is divided into stages. You can run the entire pipeline end-to-end, or run specific stages individually using `run_pipeline.py`.

**To run the complete training pipeline (Stages 1 to 3):**
*Note: This will download datasets, preprocess them, and train all models. It may take a significant amount of time depending on your hardware.*

```bash
python run_pipeline.py --stage 1 2 3
```

**To run individual stages:**

*   **Stage 1: Data Ingestion**
    Downloads and formats the necessary datasets (e.g., LIAR, ISOT).
    ```bash
    python run_pipeline.py --stage 1
    ```

*   **Stage 2: Preprocessing**
    Cleans the text, maps verdicts to binary labels, and prepares DataFrames for training.
    ```bash
    python run_pipeline.py --stage 2
    ```

*   **Stage 3: Training**
    Trains the ensemble: Logistic Regression, LSTM, DistilBERT, and RoBERTa. Saves the models to the `models/` directory.
    ```bash
    python run_pipeline.py --stage 3
    ```

*   **Stage 4: Evaluation**
    Evaluates the trained pipeline on the holdout test set using the 5-signal inference framework.
    ```bash
    python run_pipeline.py --eval
    ```
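
For readers curious how a command line like `--stage 1 2 3` or `--eval` can be parsed, here is a minimal sketch using `argparse` from the standard library. It is not the actual contents of `run_pipeline.py`; the helper names `parse_args` and `plan` are hypothetical, and the sketch only reports which steps would run.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Run pipeline stages")
    # nargs="+" lets one flag accept several stage numbers: --stage 1 2 3
    parser.add_argument(
        "--stage",
        type=int,
        nargs="+",
        choices=[1, 2, 3],
        default=[],
        help="training stages to run, in ascending order",
    )
    parser.add_argument(
        "--eval",
        action="store_true",
        help="evaluate the trained pipeline on the holdout test set",
    )
    return parser.parse_args(argv)


def plan(args):
    """Return the ordered list of steps the given arguments select."""
    steps = [f"stage{n}" for n in sorted(set(args.stage))]
    if args.eval:
        steps.append("evaluation")
    return steps


print(plan(parse_args(["--stage", "1", "2", "3"])))
print(plan(parse_args(["--eval"])))
```

Sorting and de-duplicating the stage numbers keeps the run order stable even if they are passed out of order.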

---

## 🖥️ Running the Application

Once the models are trained (or if you already have the pre-trained weights in the `models/` directory), you can launch the TruthLens UI.

```bash
python -m streamlit run app.py
```

This will start a local web server (usually at `http://localhost:8501`). 

### Using the App
1.  **Paste text or provide a URL:** You can paste the raw text of an article (with or without a headline) or simply provide a URL for the app to parse automatically.
2.  **Select depth:** Choose Quick, Standard, or Deep analysis.
3.  **View Results:** Explore the four-tier verdict (True, Uncertain, Likely False, False), signal breakdown, adversarial flags, and live web corroboration results.
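
To make the four-tier verdict concrete, the sketch below maps a 0-100 score onto the tier names listed above. The cut-offs (25, 50, 75), the score scale, and the function name `verdict` are assumptions chosen for illustration; the thresholds TruthLens actually uses are not stated here.

```python
from bisect import bisect_right

# Hypothetical cut-offs on an assumed 0-100 credibility score.
THRESHOLDS = [25, 50, 75]
# Tier names from the list above, ordered from least to most credible.
TIERS = ["False", "Likely False", "Uncertain", "True"]


def verdict(score):
    """Return the tier whose score band contains `score`."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # bisect_right counts how many thresholds are <= score,
    # which is exactly the index of the matching tier.
    return TIERS[bisect_right(THRESHOLDS, score)]


for s in (10, 30, 60, 90):
    print(s, verdict(s))
```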