ML Pipeline — Research Document Archive
Overview
Machine learning pipeline for analyzing 234K declassified government documents across 7 collections. Extracts dates, correlates documents with historical crises, classifies topics, detects redactions, and builds entity networks.
Prerequisites
- Python venv: /opt/epstein_env/ (torch, spacy, transformers, sentence-transformers)
- PostgreSQL: epstein_research database
- RunPod: GPU-heavy tasks (zero-shot classification, BERTopic)
Pipeline Scripts
| Script | Purpose | Runs On | Status |
|---|---|---|---|
| 01_extract_dates.py | Extract dates from filenames, congress sessions, NER | Hetzner CPU | Done |
| 02_seed_events.py | Seed 20 historical events | Hetzner CPU | Done |
| 03_correlate_crises.py | Multi-signal crisis correlation | Hetzner CPU | Done |
| 04_export_for_topics.py | Export JSONL for GPU classification | Hetzner CPU | Done |
| 05_import_topics.py | Import topic results from RunPod | Hetzner CPU | Ready |
| 06_extract_keywords.py | TF-IDF keyword extraction | Hetzner CPU | TODO |
| 07_detect_redactions.py | OpenCV redaction detection | Hetzner CPU | TODO |
| 08_find_duplicates.py | Embedding-based dedup | Hetzner SQL | TODO |
| 09_entity_networks.py | Co-occurrence & graph analysis | Hetzner CPU | TODO |
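As an illustration of the first stage, here is a minimal sketch of pulling a date out of a filename. The regex, the function name `date_from_filename`, and the assumed `YYYY-MM-DD` / `YYYYMMDD` naming are hypothetical; the patterns actually handled by 01_extract_dates.py are not documented here.

```python
import re
from datetime import date
from typing import Optional

# Assumed pattern: a 4-digit year, 2-digit month and 2-digit day, optionally
# separated by "-" or "_", and not embedded in a longer run of digits.
_DATE_IN_NAME = re.compile(r"(?<!\d)(\d{4})[-_]?(\d{2})[-_]?(\d{2})(?!\d)")


def date_from_filename(filename: str) -> Optional[date]:
    """Return the first calendar-valid date found in a filename, or None."""
    for match in _DATE_IN_NAME.finditer(filename):
        year, month, day = (int(part) for part in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            # Digits matched the shape but not a real date (e.g. month 13).
            continue
    return None
```

For example, `date_from_filename("memo_1962-10-22_cuba.pdf")` yields `date(1962, 10, 22)`, while a name with no date-shaped digits yields `None`, which lets a later signal (congress session or NER) take over.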
Running
Full pipeline (stages 1-3)
tmux new-session -s ml-pipeline "/var/www/research/ml/run_pipeline.sh"
Individual scripts
cd /var/www/research/ml
/opt/epstein_env/bin/python3 01_extract_dates.py
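To run several scripts in order from Python rather than one at a time, something like the following sketch could work. It assumes that "stages 1-3" map to the scripts 01 through 03 in the table; the contents of run_pipeline.sh are not shown in this README, so the helper names and the stage list are illustrative only.

```python
import subprocess

# Interpreter and working directory taken from the commands in this README.
PYTHON = "/opt/epstein_env/bin/python3"
ML_DIR = "/var/www/research/ml"

# Assumption: stages 1-3 correspond to the first three scripts in the table.
STAGES = ["01_extract_dates.py", "02_seed_events.py", "03_correlate_crises.py"]


def build_commands(stages=STAGES):
    """Return one argv list per stage, in execution order."""
    return [[PYTHON, script] for script in stages]


def run_stages(stages=STAGES):
    """Run each stage from ML_DIR; stop at the first non-zero exit code."""
    for cmd in build_commands(stages):
        # check=True raises CalledProcessError, so later stages never run
        # against the output of a failed earlier stage.
        subprocess.run(cmd, cwd=ML_DIR, check=True)
```

Calling `run_stages()` executes the three scripts in sequence; passing a shorter list, such as `run_stages(["03_correlate_crises.py"])`, reruns a single stage.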
GPU tasks (RunPod)
Export data, transfer to RunPod, run, transfer results back:
/opt/epstein_env/bin/python3 04_export_for_topics.py
# scp topic_export.jsonl to RunPod
# Run classify_fast.py on RunPod
# scp topic_results.jsonl back
/opt/epstein_env/bin/python3 05_import_topics.py
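The export/import round trip relies on JSONL, that is, one JSON object per line. The sketch below shows how such files might be written, read back, and joined; the field names `doc_id`, `text`, and `topic` are assumptions, since the actual schema of topic_export.jsonl and topic_results.jsonl is not described here.

```python
import json


def write_jsonl(path, records):
    """Write an iterable of dicts as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def merge_topics(exported, results):
    """Attach each result's topic to its exported document by doc_id.

    Documents with no matching result get topic=None, which makes a
    partially completed GPU run easy to spot.
    """
    topics = {row["doc_id"]: row["topic"] for row in results}
    return [{**doc, "topic": topics.get(doc["doc_id"])} for doc in exported]
```

Because each line is independent, a long classification run can append results as it goes, and an interrupted transfer loses at most the final line.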
Configuration
- config.py: DB credentials, topic labels, historical events, congress dates
- db.py: Database connection helpers
- schema.sql: Table definitions for ML pipeline
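As a rough idea of how the database settings could be structured, the sketch below reads them from the standard libpq environment variables (`PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`). This is not the actual content of config.py; the class `DbConfig`, the function `load_db_config`, and the default values are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DbConfig:
    host: str
    port: int
    dbname: str
    user: str
    password: str

    def as_kwargs(self) -> dict:
        """Return libpq-style keyword arguments for a PostgreSQL driver."""
        return {
            "host": self.host,
            "port": self.port,
            "dbname": self.dbname,
            "user": self.user,
            "password": self.password,
        }


def load_db_config(env=os.environ) -> DbConfig:
    """Build a DbConfig from environment variables, with assumed defaults."""
    return DbConfig(
        host=env.get("PGHOST", "localhost"),
        port=int(env.get("PGPORT", "5432")),
        # Database name from the Prerequisites section.
        dbname=env.get("PGDATABASE", "epstein_research"),
        user=env.get("PGUSER", "postgres"),  # assumed default user
        password=env.get("PGPASSWORD", ""),
    )
```

A connection helper could then call, for instance, `psycopg2.connect(**load_db_config().as_kwargs())`, assuming psycopg2 is the driver in use. Reading credentials from the environment keeps them out of version control.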
Results (as of 2026-03-06)
- 234K documents dated (181K from congress, 36K from folders, 9K from NER)
- 20 historical events seeded
- 296K document-event correlations
- Topic classification: processing on RunPod
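The correlation count is higher than the document count, which is consistent with a single document being linked to more than one event. The sketch below illustrates this with one signal only, date proximity. The events, the 30-day window, and the function name `correlate` are hypothetical; the actual signals combined by 03_correlate_crises.py are not described here.

```python
from datetime import date, timedelta

# Placeholder events for illustration; the 20 seeded events are not listed
# in this README.
EVENTS = [
    {"name": "Event A", "start": date(1962, 10, 16), "end": date(1962, 10, 28)},
    {"name": "Event B", "start": date(1962, 10, 20), "end": date(1962, 11, 21)},
]


def correlate(doc_date, events, window_days=30):
    """Return names of events whose padded date range contains doc_date.

    The range is [start - window, end + window], so documents written
    shortly before or after an event are still linked to it.
    """
    window = timedelta(days=window_days)
    return [
        event["name"]
        for event in events
        if event["start"] - window <= doc_date <= event["end"] + window
    ]
```

With overlapping events, one document dated 1962-10-22 matches both placeholder events and therefore contributes two correlation rows.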