---
license: mit
task_categories:
- text-classification
- token-classification
- text-mining
language:
- en
tags:
- government-documents
- nlp
- named-entity-recognition
- declassified
- jfk
- cia
- ocr
- document-analysis
size_categories:
- 100K<n<1M
---

# Research Document Archive

This archive contains 234,630 declassified U.S. government documents processed through a 13-step ML pipeline. The pipeline OCR'd 3.2 million pages, extracted and linked 31 million named entities, and identified 288 topic clusters.

**Live platform:** [tanglewoodapp.com](https://tanglewoodapp.com)

## Collections

| Collection | Documents | Pages | Size |
|---|---|---|---|
| House Resolutions | 181,092 | 2,719,832 | 34.2 GB |
| JFK Assassination Records | 35,979 | 241,860 | 22.5 GB |
| CIA Stargate Program | 13,937 | 100,056 | 5.4 GB |
| CIA MKUltra | 1,936 | 64,244 | 3.4 GB |
| CIA Declassified | 1,605 | 29,744 | 2.4 GB |
| Lincoln Archives | 21 | 9,330 | 962.9 MB |

## ML Pipeline (13 Steps)

1. Document ingestion and format normalization
2. OCR with Tesseract + post-correction
3. Classification stamp detection (SECRET, CONFIDENTIAL, UNCLASSIFIED, etc.)
4. Redaction detection and boundary mapping
5. Named entity recognition (people, organizations, locations, dates)
6. Entity disambiguation and cross-document linking
7. Relationship extraction
8. Topic modeling (LDA + BERTopic)
9. Timeline event extraction
10. Network graph construction
11. Sentiment and tone analysis
12. Document similarity clustering
13. Index building for search and retrieval
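
As an illustration of how step 10 could work, the sketch below builds a weighted co-occurrence graph from per-document entity lists. It is not the pipeline's actual implementation: the function name `cooccurrence_edges` and the input shape (one list of entity strings per document) are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(doc_entities):
    """Count how many documents each unordered entity pair shares.

    `doc_entities` is an iterable of entity-name lists, one per document
    (a hypothetical input shape, not the archive's confirmed schema).
    """
    edges = Counter()
    for entities in doc_entities:
        # Deduplicate within a document and sort so (a, b) == (b, a).
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


docs = [
    ["CIA", "FBI", "Oswald"],
    ["CIA", "FBI", "CIA"],
    ["CIA"],
]
edges = cooccurrence_edges(docs)
print(edges[("CIA", "FBI")])
```

Each key is an edge and each value is its weight, so the resulting counter can be handed to a graph library or thresholded to drop weak links.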

## Classification Stamps Detected

| Stamp | Count |
|---|---|
| UNCLASSIFIED | 16,501 |
| SECRET | 13,736 |
| CLASSIFIED | 10,730 |
| EXEMPT | 6,739 |
| CONFIDENTIAL | 5,554 |
| RESTRICTED | 4,722 |
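
A minimal sketch of how stamps like these might be counted in OCR text, assuming a plain case-sensitive regular expression over uppercase stamp words. The archive's real detector is not documented here and may rely on layout or image features; `detect_stamps` is a hypothetical helper.

```python
import re
from collections import Counter

# Stamp labels taken from the table above.
STAMPS = (
    "UNCLASSIFIED",
    "SECRET",
    "CLASSIFIED",
    "EXEMPT",
    "CONFIDENTIAL",
    "RESTRICTED",
)

# Word boundaries keep "CLASSIFIED" from matching inside "UNCLASSIFIED".
STAMP_RE = re.compile(r"\b(" + "|".join(STAMPS) + r")\b")


def detect_stamps(page_text: str) -> Counter:
    """Return a count of each uppercase stamp word found in the text."""
    return Counter(STAMP_RE.findall(page_text))


page = "TOP SECRET\nThis memo is UNCLASSIFIED. The secret stays lowercase."
print(detect_stamps(page))
```

Matching only uppercase words is a deliberate simplification: it skips ordinary prose such as "secret", at the cost of missing stamps that OCR rendered in mixed case.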

## Key Statistics

- **31M** named entities extracted
- **2.9M** entity cross-document links
- **59,830** redactions detected and mapped
- **288** topic clusters identified
- **6** document collections spanning 1860s–2000s

## Usage

```python
from datasets import load_dataset

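# Without a `split` argument, load_dataset returns a DatasetDict (one entry per split)
ds = load_dataset("datamatters24/research-document-archive")

# Filter by collection (DatasetDict.filter applies the predicate to every split)
jfk = ds.filter(lambda x: x["collection"] == "jfk_assassination")

# Search by entity: keep records whose `entities` field contains "CIA"
cia_docs = ds.filter(lambda x: "CIA" in x["entities"])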
```
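
The same two fields can be used to summarize records. The sketch below works on any iterable of row dictionaries, so it can be tried without downloading the archive. It assumes each row has a `collection` string and an `entities` list of strings, as the snippet above suggests. The helper names and the sample values other than `jfk_assassination` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_collection(rows: Iterable[Mapping]) -> Counter:
    """Count records per value of the `collection` field."""
    return Counter(row["collection"] for row in rows)


def mentions_entity(row: Mapping, name: str) -> bool:
    """True when `name` is in the row's `entities` field.

    Assumes `entities` is a list of strings (exact-match membership).
    """
    return name in row["entities"]


# Tiny in-memory stand-in for dataset rows (illustrative values only).
sample_rows = [
    {"collection": "jfk_assassination", "entities": ["CIA", "Dallas"]},
    {"collection": "jfk_assassination", "entities": ["FBI"]},
    {"collection": "example_collection", "entities": ["CIA"]},
]

counts = count_by_collection(sample_rows)
cia_rows = [row for row in sample_rows if mentions_entity(row, "CIA")]
print(counts)
print(len(cia_rows))
```

When working with the loaded dataset, pass a single split rather than the whole `DatasetDict`, since iterating a `DatasetDict` yields split names instead of rows.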

## Data Sources

All documents are public records obtained from:
- National Archives (NARA)
- CIA FOIA Reading Room
- Congress.gov
- Library of Congress

## Citation

```bibtex
@misc{rubin2026researcharchive,
  author = {Rubin, Theodore},
  title = {Research Document Archive: ML Pipeline for Declassified U.S. Government Documents},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/datamatters24/research-document-archive}
}
```