	# ML Pipeline — Research Document Archive

	## Overview
	Machine learning pipeline for analyzing 234K declassified government documents across 7 collections. Extracts dates, correlates documents with historical crises, classifies topics, detects redactions, and builds entity networks.

	## Prerequisites
	- Python venv: `/opt/epstein_env/` (torch, spacy, transformers, sentence-transformers)
	- PostgreSQL: `epstein_research` database
	- RunPod: GPU-heavy tasks (zero-shot classification, BERTopic)

	## Pipeline Scripts

	| Script | Purpose | Runs On | Status |
	|--------|---------|---------|--------|
	| `01_extract_dates.py` | Extract dates from filenames, congress sessions, NER | Hetzner CPU | Done |
	| `02_seed_events.py` | Seed 20 historical events | Hetzner CPU | Done |
	| `03_correlate_crises.py` | Multi-signal crisis correlation | Hetzner CPU | Done |
	| `04_export_for_topics.py` | Export JSONL for GPU classification | Hetzner CPU | Done |
	| `05_import_topics.py` | Import topic results from RunPod | Hetzner CPU | Ready |
	| `06_extract_keywords.py` | TF-IDF keyword extraction | Hetzner CPU | TODO |
	| `07_detect_redactions.py` | OpenCV redaction detection | Hetzner CPU | TODO |
	| `08_find_duplicates.py` | Embedding-based dedup | Hetzner SQL | TODO |
	| `09_entity_networks.py` | Co-occurrence & graph analysis | Hetzner CPU | TODO |
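As an illustration of the first stage, here is a minimal sketch of two of the date signals listed for `01_extract_dates.py`: dates embedded in filenames and dates implied by a congress number. The regex, the plausible-year bounds, and the function names `date_from_filename` and `congress_start_year` are assumptions made for this example; the actual script may work differently.

```python
import re
from datetime import date
from typing import Optional

# Assumed filename convention: YYYY-MM-DD, YYYY_MM_DD, YYYY.MM.DD or YYYYMMDD,
# not embedded in a longer run of digits.
_DATE_RE = re.compile(r"(?<!\d)(\d{4})[-_.]?(\d{2})[-_.]?(\d{2})(?!\d)")


def date_from_filename(name: str) -> Optional[date]:
    """Return the first plausible calendar date found in a filename, else None."""
    for match in _DATE_RE.finditer(name):
        year, month, day = (int(group) for group in match.groups())
        if not 1800 <= year <= 2100:  # assumed plausibility bounds
            continue
        try:
            return date(year, month, day)
        except ValueError:
            continue  # e.g. month 13 or day 32: not a real date
    return None


def congress_start_year(congress: int) -> int:
    """Nominal first year of a numbered U.S. Congress (each term spans two years)."""
    return 1787 + 2 * congress
```

A congress number only narrows a document to a two-year term, so a date derived this way is far coarser than one parsed from a filename.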

	## Running

	### Full pipeline (stages 1-3)
	```bash
	tmux new-session -s ml-pipeline "/var/www/research/ml/run_pipeline.sh"
	```
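The contents of `run_pipeline.sh` are not shown in this README. As a rough sketch of what a sequential runner might do, the hypothetical `run_stages` helper below runs scripts in order and stops at the first one that exits with a non-zero status, so later stages never run on incomplete data.

```python
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def run_stages(scripts: Sequence[str], workdir: Path,
               python: str = sys.executable) -> list[str]:
    """Run each script in order from workdir; stop at the first failure.

    Returns the names of the scripts that completed successfully.
    """
    completed = []
    for script in scripts:
        result = subprocess.run([python, script], cwd=workdir)
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}; stopping")
            break
        completed.append(script)
    return completed


# Example call (assumes the paths from this README; not executed here):
# run_stages(
#     ["01_extract_dates.py", "02_seed_events.py", "03_correlate_crises.py"],
#     Path("/var/www/research/ml"),
#     python="/opt/epstein_env/bin/python3",
# )
```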

	### Individual scripts
	```bash
	cd /var/www/research/ml
	/opt/epstein_env/bin/python3 01_extract_dates.py
	```

	### GPU tasks (RunPod)
	Export the data, copy it to RunPod, run the classification there, then copy the results back and import them:
	```bash
	cd /var/www/research/ml
	/opt/epstein_env/bin/python3 04_export_for_topics.py
	# Copy topic_export.jsonl to RunPod (e.g. with scp)
	# On RunPod: run classify_fast.py
	# Copy topic_results.jsonl back to this host (e.g. with scp)
	/opt/epstein_env/bin/python3 05_import_topics.py
	```
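The export and import steps exchange JSONL files, meaning one JSON object per line. The sketch below shows one way to write and read such files; the helper names and the record fields (`doc_id`, `text`, `topic`, `score`) are hypothetical and may not match what `04_export_for_topics.py` and `05_import_topics.py` actually use.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def write_jsonl(path: Path, records: Iterable[dict]) -> int:
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line, without loading the whole file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical record shapes:
#   export: {"doc_id": 1, "text": "..."}
#   result: {"doc_id": 1, "topic": "...", "score": 0.87}
```

Reading line by line keeps memory use flat, which matters when an export covers hundreds of thousands of documents.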

	## Configuration
	- `config.py` — DB credentials, topic labels, historical events, congress dates
	- `db.py` — Database connection helpers
	- `schema.sql` — Table definitions for ML pipeline
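This README does not show the contents of `config.py` or `db.py`. As a sketch of one common arrangement, the example below reads connection settings from the standard libpq environment variables (`PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`) and builds a keyword/value connection string. The `DbSettings` class, the `load_db_settings` function, and every default except the `epstein_research` database name are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _quote(value: str) -> str:
    """Quote a value for a libpq keyword/value connection string."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


@dataclass(frozen=True)
class DbSettings:
    host: str
    port: int
    name: str
    user: str
    password: str

    def dsn(self) -> str:
        """Return a connection string such as host='localhost' port='5432' ..."""
        parts = {
            "host": self.host,
            "port": str(self.port),
            "dbname": self.name,
            "user": self.user,
            "password": self.password,
        }
        return " ".join(f"{key}={_quote(value)}" for key, value in parts.items())


def load_db_settings(env: Mapping[str, str] = os.environ) -> DbSettings:
    """Build settings from environment variables, with assumed defaults."""
    return DbSettings(
        host=env.get("PGHOST", "localhost"),
        port=int(env.get("PGPORT", "5432")),
        name=env.get("PGDATABASE", "epstein_research"),
        user=env.get("PGUSER", "postgres"),  # assumed default user
        password=env.get("PGPASSWORD", ""),
    )
```

Reading credentials from the environment keeps them out of version control, which is worth considering for a repository that is published publicly.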

	## Results (as of 2026-03-06)
	- 234K documents dated (181K from congress, 36K from folders, 9K from NER)
	- 20 historical events seeded
	- 296K document-event correlations
	- Topic classification: processing on RunPod
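
The document-event correlations counted above come from `03_correlate_crises.py`, which this README describes only as multi-signal. The sketch below assumes two signals, date proximity and keyword overlap, combined with equal weights. The signals, the weights, the 30-day window, and the names `Event` and `correlation_score` are all assumptions for illustration, not the script's actual method.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Event:
    name: str
    start: date
    end: date
    keywords: frozenset


def correlation_score(doc_date: Optional[date], doc_text: str, event: Event,
                      window_days: int = 30) -> float:
    """Combine a date signal and a keyword signal into a score in [0, 1]."""
    # Date signal: 1.0 inside the event range, decaying linearly to 0.0
    # at window_days before the start or after the end.
    if doc_date is None:
        date_signal = 0.0
    else:
        if doc_date < event.start:
            gap = (event.start - doc_date).days
        elif doc_date > event.end:
            gap = (doc_date - event.end).days
        else:
            gap = 0
        date_signal = max(0.0, 1.0 - gap / window_days)

    # Keyword signal: fraction of the event's keywords present in the text.
    words = set(re.findall(r"[a-z]+", doc_text.lower()))
    if event.keywords:
        keyword_signal = len(event.keywords & words) / len(event.keywords)
    else:
        keyword_signal = 0.0

    return 0.5 * date_signal + 0.5 * keyword_signal


# Illustrative event only; not taken from the pipeline's seeded list.
example = Event("Cuban Missile Crisis", date(1962, 10, 16), date(1962, 10, 28),
                frozenset({"cuba", "missile"}))
```

A document can correlate with several events under a scheme like this, which would be consistent with the correlation count exceeding the document count.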