Committed by Ryan Christian D. Deniega
Commit 9724119 · 1 Parent(s): 6c9b8f1

docs: add README

Files changed (1)
  1. README.md +174 -0
README.md ADDED
@@ -0,0 +1,174 @@
# PhilVerify 🇵🇭🔍

**Multimodal fake news detection for Philippine social media.**

PhilVerify combines ML-based text classification with evidence retrieval to detect misinformation in Tagalog, English, and Taglish content. It supports text, URL, image (OCR), and video (ASR) inputs.

---

## Features

- **4 Input Types**: raw text, news URL, image (Tesseract OCR), video/audio (Whisper ASR)
- **Language-Aware**: detects Tagalog / English / Taglish automatically
- **NLP Pipeline**: NER, sentiment, emotion, clickbait detection, claim extraction
- **Two-Layer Scoring**
  - Layer 1: TF-IDF + Logistic Regression classifier (planned upgrade: fine-tuned XLM-RoBERTa)
  - Layer 2: NewsAPI evidence retrieval + cosine similarity + stance detection
  - **Final Score** = `(ML × 0.40) + (Evidence × 0.60)` → Credible / Unverified / Likely Fake
- **Philippine Domain Credibility DB**: 4-tier system (Rappler Tier 1 → known fake sites Tier 4)

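The weighting in the Two-Layer Scoring bullet can be sketched in a few lines of Python. This is a minimal illustration, not the code in `scoring/engine.py`: it assumes both layer scores sit on a 0-100 credibility scale, and the function names and verdict cut-offs (70 and 40) are hypothetical, since the README only states the weights and the three verdict labels.

```python
# Weights taken from the README formula.
ML_WEIGHT = 0.40
EVIDENCE_WEIGHT = 0.60

# Assumed cut-offs on a 0-100 credibility scale (not stated in the README).
CREDIBLE_MIN = 70.0
LIKELY_FAKE_MAX = 40.0


def final_score(ml_score: float, evidence_score: float) -> float:
    """Weighted sum of the Layer 1 (ML) and Layer 2 (evidence) scores."""
    return ML_WEIGHT * ml_score + EVIDENCE_WEIGHT * evidence_score


def verdict(score: float) -> str:
    """Map a final score to one of the three verdict labels."""
    if score >= CREDIBLE_MIN:
        return "Credible"
    if score < LIKELY_FAKE_MAX:
        return "Likely Fake"
    return "Unverified"


# Hypothetical inputs: a low ML credibility score and a neutral evidence score.
score = final_score(ml_score=10.5, evidence_score=50.0)
print(round(score, 1), verdict(score))
```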
---

## Tech Stack

| Layer | Tech |
|---|---|
| Backend | FastAPI, Python 3.12, Pydantic v2 |
| NLP | spaCy, HuggingFace Transformers, langdetect |
| ML Classifier | scikit-learn (TF-IDF + LogReg → XLM-RoBERTa) |
| OCR | Tesseract (`fil+eng`) |
| ASR | OpenAI Whisper |
| Evidence | NewsAPI, sentence-transformers |
| Frontend *(planned)* | React, TailwindCSS, Chart.js |
| Extension *(planned)* | Chrome Manifest V3 |

---

## Project Structure

```
PhilVerify/
├── main.py                     # FastAPI app entry point
├── config.py                   # Settings (pydantic-settings)
├── requirements.txt
├── .env.example
├── domain_credibility.json     # PH domain tier database
│
├── api/
│   ├── schemas.py              # Pydantic request/response models
│   └── routes/
│       ├── verify.py           # POST /verify/text|url|image|video
│       ├── history.py          # GET /history
│       └── trends.py           # GET /trends
│
├── nlp/                        # NLP preprocessing pipeline
│   ├── preprocessor.py         # Clean, tokenize, remove stopwords (EN+TL)
│   ├── language_detector.py    # Tagalog / English / Taglish detection
│   ├── ner.py                  # Named entity recognition + PH entity hints
│   ├── sentiment.py            # Sentiment + emotion analysis
│   ├── clickbait.py            # Clickbait pattern detection
│   └── claim_extractor.py      # Extract falsifiable claim for evidence search
│
├── ml/
│   └── tfidf_classifier.py     # Layer 1: TF-IDF baseline classifier
│
├── evidence/
│   └── news_fetcher.py         # Layer 2: NewsAPI + cosine similarity
│
├── scoring/
│   └── engine.py               # Orchestrates full pipeline + final score
│
├── inputs/
│   ├── url_scraper.py          # BeautifulSoup article extractor
│   ├── ocr.py                  # Tesseract OCR
│   └── asr.py                  # Whisper ASR
│
└── tests/
    └── test_philverify.py      # 23 unit + integration tests
```

---

## Getting Started

### 1. Clone & set up environment

```bash
git clone https://github.com/SemiAutomat1c/philverify.git
cd philverify
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### 2. Configure environment variables

```bash
cp .env.example .env
# Edit .env and add your NEWS_API_KEY (optional but recommended)
```

### 3. Run the API

```bash
uvicorn main:app --reload --port 8000
```

### 4. Explore the docs

Open **http://localhost:8000/docs** for the interactive Swagger UI.

---

## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/verify/text` | Verify raw text |
| `POST` | `/verify/url` | Verify a news URL |
| `POST` | `/verify/image` | Verify an image (OCR) |
| `POST` | `/verify/video` | Verify audio/video (Whisper ASR) |
| `GET` | `/history` | Verification history (paginated) |
| `GET` | `/trends` | Trending fake-news entities & topics |

### Example request

```bash
curl -X POST http://localhost:8000/verify/text \
  -H "Content-Type: application/json" \
  -d '{"text": "GRABE! Namatay daw ang tatlong tao sa bagong sakit na kumakalat sa Pilipinas!"}'
```

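For Python callers, the same request can be sketched with the standard library alone. The helper names `build_verify_request` and `verify_text` are made up for this example, and the base URL assumes the local server from Getting Started.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed: local dev server started with uvicorn


def build_verify_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /verify/text request with a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/verify/text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def verify_text(text: str, base_url: str = BASE_URL, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response; needs the API to be running."""
    request = build_verify_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running, `verify_text("...")` returns the parsed response body as a dict.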
### Example response

```json
{
  "verdict": "Likely Fake",
  "confidence": 82.4,
  "final_score": 34.2,
  "layer1": { "verdict": "Likely Fake", "confidence": 82.4, "triggered_features": ["namatay", "sakit", "kumakalat"] },
  "layer2": { "verdict": "Unverified", "evidence_score": 50.0, "sources": [] },
  "entities": { "persons": [], "organizations": [], "locations": ["Pilipinas"], "dates": [] },
  "sentiment": "high negative",
  "emotion": "fear",
  "language": "Tagalog"
}
```

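A client will usually read only a few of these fields. The sketch below parses a trimmed copy of the sample response with the standard library; the `VerificationSummary` dataclass and `parse_response` helper are hypothetical names, and the field set is taken from this one example rather than from `api/schemas.py`.

```python
import json
from dataclasses import dataclass


@dataclass
class VerificationSummary:
    """The handful of response fields a simple client might display."""
    verdict: str
    confidence: float
    final_score: float
    language: str
    locations: list[str]


def parse_response(raw: str) -> VerificationSummary:
    """Decode a /verify response body and pick out the summary fields."""
    data = json.loads(raw)
    entities = data.get("entities", {})  # tolerate a missing entities block
    return VerificationSummary(
        verdict=data["verdict"],
        confidence=float(data["confidence"]),
        final_score=float(data["final_score"]),
        language=data["language"],
        locations=list(entities.get("locations", [])),
    )


# Trimmed copy of the example response above.
SAMPLE = """
{
  "verdict": "Likely Fake",
  "confidence": 82.4,
  "final_score": 34.2,
  "entities": { "persons": [], "organizations": [], "locations": ["Pilipinas"], "dates": [] },
  "language": "Tagalog"
}
"""

summary = parse_response(SAMPLE)
print(summary.verdict, summary.final_score, summary.locations)
```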
---

## Running Tests

```bash
pytest tests/ -v
# 23 passed in ~1s
```

---

## Roadmap

- [x] Phase 1: FastAPI backend skeleton
- [x] Phase 2: NLP preprocessing pipeline
- [x] Phase 3: TF-IDF baseline classifier
- [ ] Phase 4: NewsAPI evidence retrieval
- [ ] Phase 5: Scoring engine refinement (stance detection)
- [ ] Phase 6: React web dashboard
- [ ] Phase 7: Chrome Extension (Manifest V3)
- [ ] Phase 8: Fine-tune XLM-RoBERTa / TLUnified-RoBERTa

---

## License

MIT