# ML Pipeline — Research Document Archive
## Overview
A machine learning pipeline for analyzing 234K declassified government documents across 7 collections. It extracts dates, correlates documents with historical crises, classifies topics, detects redactions, and builds entity networks.
## Prerequisites
- Python venv: `/opt/epstein_env/` (torch, spacy, transformers, sentence-transformers)
- PostgreSQL: `epstein_research` database
- RunPod: GPU-heavy tasks (zero-shot classification, BERTopic)
## Pipeline Scripts
| Script | Purpose | Runs On | Status |
|--------|---------|---------|--------|
| `01_extract_dates.py` | Extract dates from filenames, congress sessions, NER | Hetzner CPU | Done |
| `02_seed_events.py` | Seed 20 historical events | Hetzner CPU | Done |
| `03_correlate_crises.py` | Multi-signal crisis correlation | Hetzner CPU | Done |
| `04_export_for_topics.py` | Export JSONL for GPU classification | Hetzner CPU | Done |
| `05_import_topics.py` | Import topic results from RunPod | Hetzner CPU | Ready |
| `06_extract_keywords.py` | TF-IDF keyword extraction | Hetzner CPU | TODO |
| `07_detect_redactions.py` | OpenCV redaction detection | Hetzner CPU | TODO |
| `08_find_duplicates.py` | Embedding-based dedup | Hetzner SQL | TODO |
| `09_entity_networks.py` | Co-occurrence & graph analysis | Hetzner CPU | TODO |
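
As an illustration of the first stage, filename-based date extraction could look roughly like the sketch below. This is not the code in `01_extract_dates.py`: the function name `date_from_filename`, the accepted patterns (`YYYY-MM-DD`, `YYYY_MM_DD`, `YYYYMMDD`), and the 1900-2100 year window are all assumptions made for the example.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical pattern: four-digit year, two-digit month, two-digit day,
# with an optional "-", "_" or "." between the parts. The lookarounds
# stop the pattern from matching inside longer digit runs.
_DATE_RE = re.compile(r"(?<!\d)(\d{4})[-_.]?(\d{2})[-_.]?(\d{2})(?!\d)")


def date_from_filename(name: str) -> Optional[date]:
    """Return the first plausible calendar date found in a filename, or None."""
    for match in _DATE_RE.finditer(name):
        year, month, day = (int(part) for part in match.groups())
        if not 1900 <= year <= 2100:  # assumed plausibility window
            continue
        try:
            return date(year, month, day)
        except ValueError:
            continue  # impossible date, e.g. month 13 or day 32
    return None
```

Filenames with no usable date return `None`, which would leave the document to the later signals listed in the table (congress sessions, NER).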
## Running
### Full pipeline (stages 1-3)
```bash
tmux new-session -s ml-pipeline "/var/www/research/ml/run_pipeline.sh"
```
### Individual scripts
```bash
cd /var/www/research/ml
/opt/epstein_env/bin/python3 01_extract_dates.py
```
### GPU tasks (RunPod)
Export the data, copy it to RunPod, run classification there, then copy the results back and import them:
```bash
/opt/epstein_env/bin/python3 04_export_for_topics.py
# scp topic_export.jsonl to RunPod
# Run classify_fast.py on RunPod
# scp topic_results.jsonl back
/opt/epstein_env/bin/python3 05_import_topics.py
```
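
The export and import steps exchange JSONL files (one JSON object per line). A minimal sketch of that round trip is shown below, assuming nothing about the real scripts: the helper names `write_jsonl` and `read_jsonl` are hypothetical, and the record fields used by `04_export_for_topics.py` and `05_import_topics.py` are not documented here.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def write_jsonl(path: Path, records: Iterable[dict]) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)
```

Streaming line by line keeps memory use flat, which matters when the export covers hundreds of thousands of documents.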
## Configuration
- `config.py` — DB credentials, topic labels, historical events, congress dates
- `db.py` — Database connection helpers
- `schema.sql` — Table definitions for ML pipeline
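
To show how congress dates can give a document an approximate date, here is a hedged sketch that maps a Congress number to its term. It is not taken from `config.py`: the function `congress_term` is hypothetical, and it relies only on the facts that the 1st Congress convened in 1789, each Congress lasts two years, and terms have begun on January 3 since the 74th Congress (1935).

```python
from datetime import date
from typing import Tuple

FIRST_CONGRESS_START_YEAR = 1789  # the 1st Congress convened in 1789


def congress_term(number: int) -> Tuple[date, date]:
    """Return (start, end) dates for the given U.S. Congress.

    Uses the January 3 boundary set by the 20th Amendment, so this
    sketch only covers the 74th Congress (1935) onward.
    """
    if number < 74:
        raise ValueError("sketch only covers the 74th Congress onward")
    start_year = FIRST_CONGRESS_START_YEAR + 2 * (number - 1)
    return date(start_year, 1, 3), date(start_year + 2, 1, 3)
```

A document tagged only with a Congress number can then be placed inside a two-year window, which is coarser than a filename or NER date but still usable for correlation against dated events.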
## Results (as of 2026-03-06)
- 234K documents dated (181K from congress, 36K from folders, 9K from NER)
- 20 historical events seeded
- 296K document-event correlations
- Topic classification: in progress on RunPod