datamatters24/research-document-archive


datamatters24 committed on Apr 9

Commit ac616aa (verified) · Parent: 4183309

Create README.md

Files changed (1)

README.md (added) +106 -0

	@@ -0,0 +1,106 @@

+---
+license: mit
+task_categories:
+- text-classification
+- token-classification
+language:
+- en
+tags:
+- text-mining
+- government-documents
+- nlp
+- named-entity-recognition
+- declassified
+- jfk
+- cia
+- ocr
+- document-analysis
+size_categories:
+- 100K<n<1M
+---
+# Research Document Archive
+234,630 declassified U.S. government documents processed through a 13-step ML pipeline. 3.2 million pages OCR'd, 31 million named entities extracted and linked, 288 topic clusters identified.
+**Live platform:** [tanglewoodapp.com](https://tanglewoodapp.com)
+## Collections
+| Collection | Documents | Pages | Size |
+|---|---|---|---|
+| House Resolutions | 181,092 | 2,719,832 | 34.2 GB |
+| JFK Assassination Records | 35,979 | 241,860 | 22.5 GB |
+| CIA Stargate Program | 13,937 | 100,056 | 5.4 GB |
+| CIA MKUltra | 1,936 | 64,244 | 3.4 GB |
+| CIA Declassified | 1,605 | 29,744 | 2.4 GB |
+| Lincoln Archives | 21 | 9,330 | 962.9 MB |
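As a quick sanity check on the table above, the per-collection rows can be totalled. This is a minimal sketch: the figures are copied from the table and the variable names are illustrative. The rows sum to 234,570 documents and 3,165,066 pages, so the 234,630 headline figure is 60 documents higher than the per-collection counts.

```python
# (documents, pages) per collection, copied from the Collections table.
collections = {
    "House Resolutions": (181_092, 2_719_832),
    "JFK Assassination Records": (35_979, 241_860),
    "CIA Stargate Program": (13_937, 100_056),
    "CIA MKUltra": (1_936, 64_244),
    "CIA Declassified": (1_605, 29_744),
    "Lincoln Archives": (21, 9_330),
}

total_docs = sum(docs for docs, _ in collections.values())
total_pages = sum(pages for _, pages in collections.values())

print(f"{total_docs:,} documents, {total_pages:,} pages")
```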
+## ML Pipeline (13 Steps)
+1. Document ingestion and format normalization
+2. OCR with Tesseract + post-correction
+3. Classification stamp detection (SECRET, CONFIDENTIAL, UNCLASSIFIED, etc.)
+4. Redaction detection and boundary mapping
+5. Named entity recognition (people, organizations, locations, dates)
+6. Entity disambiguation and cross-document linking
+7. Relationship extraction
+8. Topic modeling (LDA + BERTopic)
+9. Timeline event extraction
+10. Network graph construction
+11. Sentiment and tone analysis
+12. Document similarity clustering
+13. Index building for search and retrieval
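Step 6 (cross-document linking) can be illustrated with an inverted index from entity to documents. This is a minimal sketch, not the archive's actual implementation: it assumes entities have already been normalized to identical strings, and the function names and toy document IDs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_entity_index(doc_entities):
    """Map each entity string to the set of document IDs that mention it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            index[entity].add(doc_id)
    return index


def cross_document_links(index):
    """Yield (entity, doc_a, doc_b) for each pair of documents sharing an entity."""
    for entity, doc_ids in sorted(index.items()):
        for doc_a, doc_b in combinations(sorted(doc_ids), 2):
            yield entity, doc_a, doc_b


# Hypothetical toy input: document ID -> entities found by the NER step.
doc_entities = {
    "doc-001": ["CIA", "Lee Harvey Oswald"],
    "doc-002": ["CIA", "Dallas"],
    "doc-003": ["Dallas", "Lee Harvey Oswald"],
}

index = build_entity_index(doc_entities)
links = list(cross_document_links(index))
for link in links:
    print(link)
```

Exact string matching is the simplest possible linking rule. A real disambiguation step would also need to merge name variants and separate distinct people who share a name.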
+## Classification Stamps Detected
+| Stamp | Count |
+|---|---|
+| UNCLASSIFIED | 16,501 |
+| SECRET | 13,736 |
+| CLASSIFIED | 10,730 |
+| EXEMPT | 6,739 |
+| CONFIDENTIAL | 5,554 |
+| RESTRICTED | 4,722 |
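One way stamp counts like those above could be produced from OCR text is a word-boundary regex. This is a sketch under assumptions: the README does not say whether counts are per document, per page, or per occurrence, so this version counts each stamp at most once per text, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Stamp labels taken from the table above.
STAMPS = ["UNCLASSIFIED", "SECRET", "CLASSIFIED", "EXEMPT", "CONFIDENTIAL", "RESTRICTED"]

# \b keeps CLASSIFIED from matching inside UNCLASSIFIED or DECLASSIFIED.
STAMP_RE = re.compile(r"\b(" + "|".join(STAMPS) + r")\b")


def stamps_in(text):
    """Return the set of stamp labels that appear as whole words in text."""
    return set(STAMP_RE.findall(text))


def count_stamps(texts):
    """Count, for each stamp, how many texts contain it at least once."""
    counts = Counter()
    for text in texts:
        counts.update(stamps_in(text))
    return counts


# Hypothetical OCR snippets for illustration.
texts = [
    "TOP SECRET\nMemo for the record",
    "UNCLASSIFIED cable, formerly CONFIDENTIAL",
    "DECLASSIFIED per review; EXEMPT from release",
]
counts = count_stamps(texts)
print(counts)
```

The word boundaries matter because CLASSIFIED is a substring of UNCLASSIFIED. Without them, every UNCLASSIFIED stamp could also inflate the CLASSIFIED count.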
+## Key Statistics
+- **31M** named entities extracted
+- **2.9M** entity cross-document links
+- **59,830** redactions detected and mapped
+- **288** topic clusters identified
+- **6** document collections spanning 1860s–2000s
+## Usage
+```python
+from datasets import load_dataset
+
+# Returns a DatasetDict; filter() below is applied to every split.
+ds = load_dataset("datamatters24/research-document-archive")
+
+# Filter by collection
+jfk = ds.filter(lambda x: x["collection"] == "jfk_assassination")
+
+# Search by entity (assumes "entities" is a list of entity strings)
+cia_docs = ds.filter(lambda x: "CIA" in x["entities"])
+```
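The two filters in the usage example can be factored into small reusable predicates. This is a sketch that assumes the `collection` and `entities` fields shown in the example, with `entities` holding a list of strings that may be empty or missing a value. The helper names and toy rows are hypothetical.

```python
def in_collection(name):
    """Build a predicate selecting rows whose "collection" field equals name."""
    return lambda row: row["collection"] == name


def mentions(entity):
    """Build a predicate selecting rows whose "entities" list contains entity."""
    # `or []` guards against a null entities value on documents with no hits.
    return lambda row: entity in (row["entities"] or [])


# Hypothetical rows shaped the way the usage example implies.
rows = [
    {"collection": "jfk_assassination", "entities": ["CIA", "Dallas"]},
    {"collection": "cia_mkultra", "entities": ["CIA"]},
    {"collection": "lincoln_archives", "entities": None},
]

jfk_rows = [row for row in rows if in_collection("jfk_assassination")(row)]
cia_rows = [row for row in rows if mentions("CIA")(row)]
print(len(jfk_rows), len(cia_rows))
```

With the loaded dataset, the same predicates can be passed directly, for example `ds.filter(mentions("CIA"))`.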
+## Data Sources
+All documents are public records obtained from:
+- National Archives (NARA)
+- CIA FOIA Reading Room
+- Congress.gov
+- Library of Congress
+## Citation
+```bibtex
+@misc{rubin2026researcharchive,
+  author = {Rubin, Theodore},
+  title = {Research Document Archive: ML Pipeline for Declassified U.S. Government Documents},
+  year = {2026},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/datasets/datamatters24/research-document-archive}
+}
+```