# ML Pipeline — Research Document Archive

## Overview

Machine learning pipeline for analyzing 234K declassified government documents across 7 collections. Extracts dates, correlates documents with historical crises, classifies topics, detects redactions, and builds entity networks.

## Prerequisites

- Python venv: `/opt/epstein_env/` (torch, spacy, transformers, sentence-transformers)
- PostgreSQL: `epstein_research` database
- RunPod: GPU-heavy tasks (zero-shot classification, BERTopic)

## Pipeline Scripts

| Script | Purpose | Runs On | Status |
|--------|---------|---------|--------|
| `01_extract_dates.py` | Extract dates from filenames, congress sessions, NER | Hetzner CPU | Done |
| `02_seed_events.py` | Seed 20 historical events | Hetzner CPU | Done |
| `03_correlate_crises.py` | Multi-signal crisis correlation | Hetzner CPU | Done |
| `04_export_for_topics.py` | Export JSONL for GPU classification | Hetzner CPU | Done |
| `05_import_topics.py` | Import topic results from RunPod | Hetzner CPU | Ready |
| `06_extract_keywords.py` | TF-IDF keyword extraction | Hetzner CPU | TODO |
| `07_detect_redactions.py` | OpenCV redaction detection | Hetzner CPU | TODO |
| `08_find_duplicates.py` | Embedding-based dedup | Hetzner SQL | TODO |
| `09_entity_networks.py` | Co-occurrence & graph analysis | Hetzner CPU | TODO |

## Running

### Full pipeline (stages 1-3)

```bash
tmux new-session -s ml-pipeline "/var/www/research/ml/run_pipeline.sh"
```

### Individual scripts

```bash
cd /var/www/research/ml
/opt/epstein_env/bin/python3 01_extract_dates.py
```

### GPU tasks (RunPod)

Export data, transfer to RunPod, run, transfer results back:

```bash
/opt/epstein_env/bin/python3 04_export_for_topics.py
# scp topic_export.jsonl to RunPod
# Run classify_fast.py on RunPod
# scp topic_results.jsonl back
/opt/epstein_env/bin/python3 05_import_topics.py
```

## Configuration

- `config.py` — DB credentials, topic labels, historical events, congress dates
- `db.py` — Database connection helpers
- `schema.sql` — Table definitions for ML pipeline

## Results (as of 2026-03-06)

- 234K documents dated (181K from congress, 36K from folders, 9K from NER)
- 20 historical events seeded
- 296K document-event correlations
- Topic classification: processing on RunPod
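
To illustrate the JSONL hand-off described under GPU tasks, here is a minimal sketch of how an export/import pair might serialize documents and read classification results back. The field names (`doc_id`, `text`, `topic`, `score`) and the function names `export_jsonl` and `import_jsonl` are assumptions made for this example. The actual `04_export_for_topics.py` and `05_import_topics.py` scripts, and the format `classify_fast.py` emits, may differ.

```python
import io
import json


def export_jsonl(rows, fh):
    """Write one JSON object per line. `rows` yields (doc_id, text) pairs.

    The record layout here is hypothetical, not the pipeline's real schema.
    """
    count = 0
    for doc_id, text in rows:
        record = {"doc_id": doc_id, "text": text}
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def import_jsonl(fh):
    """Parse result lines into (doc_id, topic, score) tuples.

    Blank lines are skipped; a malformed record raises ValueError with
    its line number so a bad transfer is caught before any DB write.
    """
    results = []
    for line_no, line in enumerate(fh, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
            results.append((rec["doc_id"], rec["topic"], float(rec["score"])))
        except (KeyError, TypeError, ValueError) as exc:
            raise ValueError(f"bad record on line {line_no}: {exc}") from exc
    return results


if __name__ == "__main__":
    # In-memory round trip; real runs would use files such as topic_export.jsonl.
    out = io.StringIO()
    export_jsonl([(1, "sample memo text")], out)
    print(out.getvalue(), end="")
    print(import_jsonl(io.StringIO('{"doc_id": 1, "topic": "example", "score": 0.9}\n')))
```

Validating each line before importing keeps a truncated `scp` transfer from silently loading partial results into the database.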