| # ML Pipeline — Research Document Archive |
|
|
| ## Overview |
| Machine learning pipeline for analyzing 234K declassified government documents across 7 collections. Extracts dates, correlates documents with historical crises, classifies topics, detects redactions, and builds entity networks. |
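The crisis-correlation idea can be sketched with the date signal alone: a document is linked to an event when its extracted date falls inside that event's window. The real `03_correlate_crises.py` combines multiple signals; the event data below is a placeholder, not one of the project's seeded events.

```python
from datetime import date

# Placeholder event windows; the real pipeline seeds 20 historical events.
events = [
    ("2008 financial crisis", date(2008, 9, 1), date(2009, 6, 30)),
]

def correlate(doc_date: date) -> list[str]:
    """Return the names of all events whose window contains doc_date."""
    return [name for name, start, end in events if start <= doc_date <= end]

print(correlate(date(2008, 10, 2)))  # ['2008 financial crisis']
```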
|
|
| ## Prerequisites |
| - Python venv: `/opt/epstein_env/` (torch, spacy, transformers, sentence-transformers) |
| - PostgreSQL: `epstein_research` database |
| - RunPod: GPU-heavy tasks (zero-shot classification, BERTopic) |
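A quick sanity check that the listed packages are importable from the current interpreter can save a failed pipeline run. The package names come from the prerequisites above; the helper itself is a sketch, not part of the repo.

```python
import importlib.util
import shutil

# Packages the venv at /opt/epstein_env/ is expected to provide.
REQUIRED = ["torch", "spacy", "transformers", "sentence_transformers"]

def missing_packages(names: list[str]) -> list[str]:
    """Return the subset of names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

print("missing:", missing_packages(REQUIRED))
print("psql on PATH:", shutil.which("psql") is not None)
```

Run it with `/opt/epstein_env/bin/python3` so it inspects the venv rather than the system interpreter.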
|
|
| ## Pipeline Scripts |
|
|
| | Script | Purpose | Runs On | Status | |
| |--------|---------|---------|--------| |
| | `01_extract_dates.py` | Extract dates from filenames, congress sessions, NER | Hetzner CPU | Done | |
| | `02_seed_events.py` | Seed 20 historical events | Hetzner CPU | Done | |
| | `03_correlate_crises.py` | Multi-signal crisis correlation | Hetzner CPU | Done | |
| | `04_export_for_topics.py` | Export JSONL for GPU classification | Hetzner CPU | Done | |
| | `05_import_topics.py` | Import topic results from RunPod | Hetzner CPU | Ready | |
| | `06_extract_keywords.py` | TF-IDF keyword extraction | Hetzner CPU | TODO | |
| | `07_detect_redactions.py` | OpenCV redaction detection | Hetzner CPU | TODO | |
| | `08_find_duplicates.py` | Embedding-based dedup | Hetzner SQL | TODO | |
| | `09_entity_networks.py` | Co-occurrence & graph analysis | Hetzner CPU | TODO | |
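Of the three date signals used by `01_extract_dates.py` (filenames, congress sessions, NER), the filename signal is the cheapest. A minimal sketch, assuming common `YYYY-MM-DD` / `YYYYMMDD` filename tokens; the real script's patterns may differ.

```python
import re
from datetime import date
from typing import Optional

# Try dashed dates first so "2019-07-06" is not misread as bare digits.
DATE_PATTERNS = [
    re.compile(r"(\d{4})-(\d{2})-(\d{2})"),  # e.g. 2019-07-06
    re.compile(r"(\d{4})(\d{2})(\d{2})"),    # e.g. 20190706
]

def date_from_filename(name: str) -> Optional[date]:
    """Return the first valid calendar date found in a filename, else None."""
    for pat in DATE_PATTERNS:
        m = pat.search(name)
        if m:
            y, mo, d = map(int, m.groups())
            try:
                return date(y, mo, d)
            except ValueError:
                continue  # digit run matched but is not a real date
    return None

print(date_from_filename("DOJ-OGR-2019-07-06_scan.pdf"))  # 2019-07-06
```

The `try/except` matters: an eight-digit Bates number like `00012345` matches the bare pattern but fails date validation and is skipped.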
|
|
| ## Running |
|
|
| ### Full pipeline (stages 1-3) |
| ```bash |
| tmux new-session -s ml-pipeline "/var/www/research/ml/run_pipeline.sh" |
| ``` |
|
|
| ### Individual scripts |
| ```bash |
| cd /var/www/research/ml |
| /opt/epstein_env/bin/python3 01_extract_dates.py |
| ``` |
|
|
| ### GPU tasks (RunPod) |
Export the data, copy it to RunPod, run the classifier there, then copy the results back and import them:
| ```bash |
| /opt/epstein_env/bin/python3 04_export_for_topics.py |
| # scp topic_export.jsonl to RunPod |
| # Run classify_fast.py on RunPod |
| # scp topic_results.jsonl back |
| /opt/epstein_env/bin/python3 05_import_topics.py |
| ``` |
|
|
| ## Configuration |
| - `config.py` — DB credentials, topic labels, historical events, congress dates |
| - `db.py` — Database connection helpers |
| - `schema.sql` — Table definitions for ML pipeline |
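The rough shape of `config.py` might look like the sketch below. Every value is a placeholder, not the project's real labels, events, or credentials; the 116th Congress dates are the one verifiable fact (Jan 3, 2019 – Jan 3, 2021).

```python
from datetime import date

DB_DSN = "dbname=epstein_research user=ml host=localhost"  # real creds live in config.py

TOPIC_LABELS = [  # candidate labels for zero-shot classification
    "financial records",
    "travel logs",
    "correspondence",
]

HISTORICAL_EVENTS = [  # the real file seeds 20 of these
    {"name": "example event", "start": date(2008, 9, 1), "end": date(2009, 6, 30)},
]

CONGRESS_SESSIONS = {  # session number -> date range, used by 01_extract_dates.py
    116: (date(2019, 1, 3), date(2021, 1, 3)),
}
```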
|
|
| ## Results (as of 2026-03-06) |
| - 234K documents dated (181K from congress, 36K from folders, 9K from NER) |
| - 20 historical events seeded |
| - 296K document-event correlations |
- Topic classification: in progress on RunPod
|
|