datamatters24 committed on
Commit
ac616aa
·
verified ·
1 Parent(s): 4183309

Create README.md

  1. README.md +106 -0
README.md ADDED
@@ -0,0 +1,106 @@
+ ---
+ license: mit
+ task_categories:
+ - text-classification
+ - token-classification
+ - text-mining
+ language:
+ - en
+ tags:
+ - government-documents
+ - nlp
+ - named-entity-recognition
+ - declassified
+ - jfk
+ - cia
+ - ocr
+ - document-analysis
+ size_categories:
+ - 100K<n<1M
+ ---
+
+ # Research Document Archive
+
+ 234,630 declassified U.S. government documents processed through a 13-step ML pipeline: 3.2 million pages OCR'd, 31 million named entities extracted and linked, and 288 topic clusters identified.
+
+ **Live platform:** [tanglewoodapp.com](https://tanglewoodapp.com)
+
+ ## Collections
+
+ | Collection | Documents | Pages | Size |
+ |---|---|---|---|
+ | House Resolutions | 181,092 | 2,719,832 | 34.2 GB |
+ | JFK Assassination Records | 35,979 | 241,860 | 22.5 GB |
+ | CIA Stargate Program | 13,937 | 100,056 | 5.4 GB |
+ | CIA MKUltra | 1,936 | 64,244 | 3.4 GB |
+ | CIA Declassified | 1,605 | 29,744 | 2.4 GB |
+ | Lincoln Archives | 21 | 9,330 | 962.9 MB |
+
+ ## ML Pipeline (13 Steps)
+
+ 1. Document ingestion and format normalization
+ 2. OCR with Tesseract + post-correction
+ 3. Classification stamp detection (SECRET, CONFIDENTIAL, UNCLASSIFIED, etc.)
+ 4. Redaction detection and boundary mapping
+ 5. Named entity recognition (people, organizations, locations, dates)
+ 6. Entity disambiguation and cross-document linking
+ 7. Relationship extraction
+ 8. Topic modeling (LDA + BERTopic)
+ 9. Timeline event extraction
+ 10. Network graph construction
+ 11. Sentiment and tone analysis
+ 12. Document similarity clustering
+ 13. Index building for search and retrieval
+
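The cross-document linking in step 6 can be pictured as an inverted index from entity names to the documents that mention them. The sketch below is only an illustration under stated assumptions, not the pipeline's actual implementation: the `normalize` and `link_entities` helpers, the document IDs, and the sample mentions are all hypothetical, and real disambiguation would need far more than case and whitespace folding.

```python
from __future__ import annotations

from collections import defaultdict


def normalize(mention: str) -> str:
    """Fold case and collapse whitespace so trivially different spellings match."""
    return " ".join(mention.lower().split())


def link_entities(doc_entities: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each normalized entity to the sorted IDs of documents mentioning it,
    keeping only entities that appear in two or more documents."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, mentions in doc_entities.items():
        for mention in mentions:
            index[normalize(mention)].add(doc_id)
    return {ent: sorted(ids) for ent, ids in index.items() if len(ids) > 1}


# Hypothetical per-document NER output (step 5); IDs and names are made up.
docs = {
    "doc-001": ["Central Intelligence Agency", "Dallas"],
    "doc-002": ["central  intelligence agency", "New Orleans"],
    "doc-003": ["DALLAS"],
}
links = link_entities(docs)
```

Here `links` keeps only the two entities shared across documents; "New Orleans" appears in a single document and is dropped.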
+ ## Classification Stamps Detected
+
+ | Stamp | Count |
+ |---|---|
+ | UNCLASSIFIED | 16,501 |
+ | SECRET | 13,736 |
+ | CLASSIFIED | 10,730 |
+ | EXEMPT | 6,739 |
+ | CONFIDENTIAL | 5,554 |
+ | RESTRICTED | 4,722 |
+
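One simple way to produce counts like those in the table is to scan OCR text for the stamp words with a whole-word regular expression. This is a minimal sketch, assuming plain uppercase stamps in clean text; the README does not describe the actual detector, and `STAMPS`, `STAMP_RE`, `count_stamps`, and the sample page are hypothetical names and data. Word boundaries keep `CLASSIFIED` from matching inside `UNCLASSIFIED`.

```python
import re
from collections import Counter

# Stamp vocabulary taken from the table above.
STAMPS = ["UNCLASSIFIED", "SECRET", "CLASSIFIED", "EXEMPT", "CONFIDENTIAL", "RESTRICTED"]

# \b anchors each alternative to whole words only (case-sensitive on purpose).
STAMP_RE = re.compile(r"\b(" + "|".join(STAMPS) + r")\b")


def count_stamps(page_text: str) -> Counter:
    """Count whole-word occurrences of each stamp in one page of OCR text."""
    return Counter(STAMP_RE.findall(page_text))


# Hypothetical OCR output for a single page.
sample = "SECRET\nMemo text here.\nUNCLASSIFIED when separated. TOP SECRET"
counts = count_stamps(sample)
```

OCR noise (for example `SECRFT` or `S E C R E T`) would slip past this pattern, so a production detector would likely need fuzzy matching or layout cues.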
+ ## Key Statistics
+
+ - **31M** named entities extracted
+ - **2.9M** cross-document entity links
+ - **59,830** redactions detected and mapped
+ - **288** topic clusters identified
+ - **6** document collections spanning 1860s–2000s
+
+ ## Usage
+
+ ```python
+ from datasets import load_dataset
+
+ # Without a `split` argument this returns a DatasetDict keyed by split name
+ ds = load_dataset("datamatters24/research-document-archive")
+
+ # Filter by collection (applied to every split in the DatasetDict)
+ jfk = ds.filter(lambda x: x["collection"] == "jfk_assassination")
+
+ # Search by entity
+ cia_docs = ds.filter(lambda x: "CIA" in x["entities"])
+ ```
+
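The README does not document the type of the `entities` column, and `"CIA" in x["entities"]` behaves differently for a list (exact membership) than for a string (substring match). The helper below is a hedged sketch that handles both cases; `mentions_entity`, the comma-separated string format, and the sample rows (including the `cia_mkultra` and `cia_stargate` collection values) are assumptions for illustration only.

```python
def mentions_entity(row: dict, name: str) -> bool:
    """Return True if `name` matches one whole entity in the row, ignoring case."""
    entities = row.get("entities") or []
    if isinstance(entities, str):
        # Assumption: a string column holds comma-separated entity names.
        entities = entities.split(",")
    target = name.lower()
    return any(entity.strip().lower() == target for entity in entities)


# Hypothetical rows standing in for dataset records.
rows = [
    {"collection": "jfk_assassination", "entities": ["CIA", "Dallas"]},
    {"collection": "cia_mkultra", "entities": "FBI, cia"},
    {"collection": "cia_stargate", "entities": ["SPECIAL AGENT"]},
]
matches = [r["collection"] for r in rows if mentions_entity(r, "CIA")]

# With the dataset loaded as `ds`, the same predicate could be passed to filter:
#   cia_docs = ds.filter(lambda x: mentions_entity(x, "CIA"))
```

Matching whole entities avoids false positives such as "SPECIAL AGENT", which contains the letters "CIA" as a substring.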
+ ## Data Sources
+
+ All documents are public records obtained from:
+ - National Archives (NARA)
+ - CIA FOIA Reading Room
+ - Congress.gov
+ - Library of Congress
+
+ ## Citation
+
+ ```bibtex
+ @misc{rubin2026researcharchive,
+ author = {Rubin, Theodore},
+ title = {Research Document Archive: ML Pipeline for Declassified U.S. Government Documents},
+ year = {2026},
+ publisher = {Hugging Face},
+ url = {https://huggingface.co/datasets/datamatters24/research-document-archive}
+ }
+ ```