
ML Pipeline — Research Document Archive

Overview

Machine learning pipeline for analyzing 234K declassified government documents across 7 collections. Extracts dates, correlates documents with historical crises, classifies topics, detects redactions, and builds entity networks.

Prerequisites

  • Python venv: /opt/epstein_env/ (torch, spacy, transformers, sentence-transformers)
  • PostgreSQL: epstein_research database
  • RunPod: GPU-heavy tasks (zero-shot classification, BERTopic)

Pipeline Scripts

| Script | Purpose | Runs On | Status |
|---|---|---|---|
| 01_extract_dates.py | Extract dates from filenames, congress sessions, NER | Hetzner CPU | Done |
| 02_seed_events.py | Seed 20 historical events | Hetzner CPU | Done |
| 03_correlate_crises.py | Multi-signal crisis correlation | Hetzner CPU | Done |
| 04_export_for_topics.py | Export JSONL for GPU classification | Hetzner CPU | Done |
| 05_import_topics.py | Import topic results from RunPod | Hetzner CPU | Ready |
| 06_extract_keywords.py | TF-IDF keyword extraction | Hetzner CPU | TODO |
| 07_detect_redactions.py | OpenCV redaction detection | Hetzner CPU | TODO |
| 08_find_duplicates.py | Embedding-based dedup | Hetzner SQL | TODO |
| 09_entity_networks.py | Co-occurrence & graph analysis | Hetzner CPU | TODO |
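
As an illustration of the filename signal used by 01_extract_dates.py, here is a minimal sketch of pulling a date out of a filename. The filename patterns (`YYYY-MM-DD` and `YYYYMMDD`) and the function name `date_from_filename` are assumptions for this example; the actual script may use different patterns and combines this with congress-session and NER signals.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical patterns: the real filename conventions are not documented here.
# Lookarounds keep the match from starting or ending inside a longer digit run.
_SEPARATED = re.compile(r"(?<!\d)(\d{4})[-_.](\d{2})[-_.](\d{2})(?!\d)")
_COMPACT = re.compile(r"(?<!\d)(\d{4})(\d{2})(\d{2})(?!\d)")


def date_from_filename(name: str) -> Optional[date]:
    """Return the first valid calendar date found in a filename, or None."""
    for pattern in (_SEPARATED, _COMPACT):
        for match in pattern.finditer(name):
            year, month, day = (int(group) for group in match.groups())
            try:
                return date(year, month, day)
            except ValueError:
                # Digits matched the shape of a date but not a real one
                # (for example month 13); try the next candidate.
                continue
    return None
```

Returning `None` instead of guessing lets a later signal (congress session or NER) supply the date for files whose names carry no usable date.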

Running

Full pipeline (stages 1-3)

tmux new-session -s ml-pipeline "/var/www/research/ml/run_pipeline.sh"

Individual scripts

cd /var/www/research/ml
/opt/epstein_env/bin/python3 01_extract_dates.py

GPU tasks (RunPod)

Export the data, transfer it to RunPod, run classification there, then transfer the results back:

/opt/epstein_env/bin/python3 04_export_for_topics.py
# scp topic_export.jsonl to RunPod
# Run classify_fast.py on RunPod
# scp topic_results.jsonl back
/opt/epstein_env/bin/python3 05_import_topics.py
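
Because the export and the results travel as separate JSONL files, it can help to check that every exported document came back before importing. The following is a minimal sketch, not the code in 04_export_for_topics.py or 05_import_topics.py; the `doc_id` key and the helper names are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, Set


def write_jsonl(path: Path, records: Iterable[dict]) -> int:
    """Write one JSON object per line and return the number of records."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def missing_ids(export_path: Path, results_path: Path, key: str = "doc_id") -> Set:
    """Return IDs present in the export but absent from the results.

    `key` is a hypothetical field name; use whatever identifier the
    real export actually writes.
    """
    exported = {record[key] for record in read_jsonl(export_path)}
    returned = {record[key] for record in read_jsonl(results_path)}
    return exported - returned
```

A non-empty result from `missing_ids(Path("topic_export.jsonl"), Path("topic_results.jsonl"))` would suggest the RunPod job stopped early or a transfer was truncated.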

Configuration

  • config.py — DB credentials, topic labels, historical events, congress dates
  • db.py — Database connection helpers
  • schema.sql — Table definitions for ML pipeline
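
The congress dates in config.py map a congress number to a date range. As a sketch of one way such a mapping can be derived (config.py itself may simply hold a hand-maintained table), the function below relies on the fact that since the 74th Congress (1935) each two-year term begins on January 3 of an odd-numbered year. The name `congress_term` is hypothetical.

```python
from datetime import date
from typing import Tuple

# Terms have started on January 3 since the 74th Congress (1935), following
# the 20th Amendment. Earlier congresses started on March 4 and are not
# handled by this sketch.
FIRST_SUPPORTED_CONGRESS = 74


def congress_term(number: int) -> Tuple[date, date]:
    """Return (start, end) of a congress term; the end date is exclusive."""
    if number < FIRST_SUPPORTED_CONGRESS:
        raise ValueError(
            f"congress {number} predates the January 3 start date; "
            "use a lookup table for earlier terms"
        )
    # The 1st Congress began in 1789 and each term lasts two years.
    start_year = 1789 + 2 * (number - 1)
    return date(start_year, 1, 3), date(start_year + 2, 1, 3)
```

For example, `congress_term(100)` gives January 3, 1987 through January 3, 1989, so a document known only by its congress session can be assigned a two-year window rather than a single day.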

Results (as of 2026-03-06)

  • 234K documents dated (181K from congress, 36K from folders, 9K from NER)
  • 20 historical events seeded
  • 296K document-event correlations
  • Topic classification: in progress on RunPod