jcbowyer committed on
Commit
bf3e681
·
verified ·
1 Parent(s): 58dc1f4

Deploy: Consolidated gold tables, fixed nginx docs routing

CITATIONS.md CHANGED
@@ -598,6 +598,57 @@ The NCCS Unified BMF is a longitudinal nonprofit dataset specifically designed f
598
 
599
  ---
600
 
601
  ### **Charity Navigator** ⭐
602
 
603
  **Powered by Charity Navigator**
@@ -725,6 +776,62 @@ This project complies with Charity Navigator's API Terms of Use, including:
725
 
726
  ---
727
 
728
  ### **IRS Exempt Organizations Business Master File (EO-BMF)**
729
 
730
  Basic nonprofit registration data (name, EIN, address, NTEE code).
 
598
 
599
  ---
600
 
601
+ ### **Fundraising Effectiveness Project (FEP)** ⭐
602
+
603
+ The Fundraising Effectiveness Project provides comprehensive research and benchmarking data on nonprofit fundraising trends and donor behavior.
604
+
605
+ **Organization:** Association of Fundraising Professionals (AFP) & Growth in Giving Initiative
606
+ **Website:** https://afpglobal.org/fundraising-effectiveness-project
607
+ **Publications:** https://publications.fepreports.org/
608
+ **Data Portal:** https://data.givingtuesday.org/fep/
609
+ **License:** Research data available for nonprofit sector analysis
610
+ **Coverage:** Quarterly fundraising data from 5,000+ U.S. nonprofits (2005-present)
611
+
612
+ **What we use:**
613
+ - **Donor retention metrics**: Longitudinal donor behavior and retention rates
614
+ - **Fundraising benchmarks**: Revenue trends across nonprofit sectors and organization sizes
615
+ - **Giving trends**: Analysis of individual giving, major donors, and fundraising effectiveness
616
+ - **Sector analysis**: NTEE-based comparisons across health, education, human services, etc.
617
+ - **Economic indicators**: Correlation between fundraising and economic conditions
618
+
619
+ **Key Metrics:**
620
+ - ✅ **Donor retention rates**: Track donor loyalty and lapsed donor trends
621
+ - ✅ **Dollar retention**: Revenue retention from recurring vs. new donors
622
+ - ✅ **Average gift sizes**: Trends in donation amounts by donor segment
623
+ - ✅ **New donor acquisition**: Cost and effectiveness of donor recruitment
624
+ - ✅ **Quarterly benchmarks**: Real-time fundraising performance indicators
625
+
626
+ **Use Cases:**
627
+ - Benchmarking nonprofit fundraising performance
628
+ - Predicting revenue trends for budget planning
629
+ - Analyzing donor engagement patterns
630
+ - Policy research on charitable giving
631
+ - Advocacy for nonprofit sector sustainability
632
+
633
+ **BibTeX:**
634
+ ```bibtex
635
+ @misc{fundraising_effectiveness_project,
636
+ title = {Fundraising Effectiveness Project},
637
+ author = {{Association of Fundraising Professionals} and {Growth in Giving Initiative}},
638
+ year = {2024},
639
+ url = {https://publications.fepreports.org/},
640
+ note = {Comprehensive research and benchmarking data on nonprofit fundraising trends and donor behavior}
641
+ }
642
+ ```
643
+
644
+ **Attribution:** When using FEP data, cite:
645
+ 1. Fundraising Effectiveness Project (FEP)
646
+ 2. Association of Fundraising Professionals (AFP)
647
+ 3. Growth in Giving Initiative
648
+ 4. Specify the quarter/year of data used
649
+
650
+ ---
651
+
652
  ### **Charity Navigator** ⭐
653
 
654
  **Powered by Charity Navigator**
 
776
 
777
  ---
778
 
779
+ ### **fecfile - Python FEC Filing Parser** ⭐
780
+
781
+ **Python library for parsing Federal Election Commission (FEC) electronic filings**
782
+
783
+ **Repository:** https://github.com/esonderegger/fecfile
784
+ **Author:** Evan Sonderegger
785
+ **License:** MIT License (open source)
786
+ **Language:** Python
787
+ **Purpose:** Parse FEC electronic filing formats (ASCII, CSV, JSON)
788
+
789
+ **What it does:**
790
+ - **Parse .fec files**: Converts FEC electronic filing format to structured data
791
+ - **Multiple output formats**: CSV, JSON, and Python dictionaries
792
+ - **Version support**: Handles multiple FEC filing format versions
793
+ - **Data validation**: Validates filing structure and data types
794
+ - **Command-line tool**: Easy conversion of .fec files without writing code
795
+ - **Python API**: Programmatic access for custom ETL pipelines
796
+
797
+ **Use Cases:**
798
+ - Converting FEC bulk data downloads to CSV/JSON for analysis
799
+ - Building campaign finance databases from raw FEC filings
800
+ - ETL pipelines for loading FEC data into SQL/NoSQL databases
801
+ - Data validation and quality checking of FEC submissions
802
+ - Research on campaign contributions and political spending
803
+
804
+ **What we use it for:**
805
+ - Parsing FEC bulk data downloads from https://www.fec.gov/data/browse-data/?tab=bulk-data
806
+ - Converting .fec electronic filings to structured formats
807
+ - Loading campaign finance data into our data lake
808
+ - Cross-referencing campaign contributions with nonprofit advocacy spending
809
+
810
+ **BibTeX:**
811
+ ```bibtex
812
+ @software{fecfile,
813
+ title = {fecfile: Python FEC Filing Parser},
814
+ author = {Sonderegger, Evan},
815
+ year = {2024},
816
+ url = {https://github.com/esonderegger/fecfile},
817
+ license = {MIT},
818
+ note = {Python library for parsing Federal Election Commission electronic filing formats}
819
+ }
820
+ ```
821
+
822
+ **Related FEC Resources:**
823
+ - **FEC Bulk Data:** https://www.fec.gov/data/browse-data/?tab=bulk-data
824
+ - **FEC Data Catalog:** https://www.fec.gov/data/
825
+ - **OpenFEC API:** https://api.open.fec.gov/developers/
826
+ - **FEC Filing Formats:** https://www.fec.gov/data/browse-data/?tab=bulk-data
827
+
828
+ **Attribution:**
829
+ When using fecfile in your research or applications, cite both:
830
+ 1. The fecfile library (Evan Sonderegger)
831
+ 2. The Federal Election Commission as the original data source
832
+
833
+ ---
834
+
835
  ### **IRS Exempt Organizations Business Master File (EO-BMF)**
836
 
837
  Basic nonprofit registration data (name, EIN, address, NTEE code).
api/routes/auth.py CHANGED
@@ -81,6 +81,37 @@ class UserResponse(BaseModel):
81
 
82
 
83
  # Helper functions
84
  def get_or_create_user(
85
  db: Session,
86
  email: str,
@@ -184,8 +215,8 @@ async def oauth_login(
184
  db.add(oauth_state)
185
  db.commit()
186
 
187
- # Build callback URL using API_BASE_URL to ensure correct protocol (http vs https)
188
- base_url = os.getenv('API_BASE_URL', 'http://localhost:8000')
189
  callback_url = f"{base_url}/api/auth/callback/{provider}"
190
 
191
  # Build authorization URL
@@ -204,6 +235,7 @@ async def oauth_login(
204
  @router.get("/callback/{provider}", name="oauth_callback")
205
  async def oauth_callback(
206
  provider: str,
 
207
  code: Optional[str] = None,
208
  state: Optional[str] = None,
209
  error: Optional[str] = None,
@@ -233,10 +265,8 @@ async def oauth_callback(
233
  client_id = os.getenv(config['client_id_env'])
234
  client_secret = os.getenv(config['client_secret_env'])
235
 
236
- # Build callback URL (must match the one sent to authorize)
237
- from fastapi import Request
238
- # We need to reconstruct the callback URL - for now use a simple approach
239
- base_url = os.getenv('API_BASE_URL', 'http://localhost:8000')
240
  callback_url = f"{base_url}/api/auth/callback/{provider}"
241
 
242
  # Exchange code for access token
 
81
 
82
 
83
  # Helper functions
84
+ def get_base_url(request: Request) -> str:
85
+     """
86
+     Get the base URL from the request, honoring reverse-proxy headers.
87
+
88
+     In production (HuggingFace Spaces with nginx reverse proxy):
89
+     - Returns: https://www.communityone.com
90
+
91
+     In local development:
92
+     - Returns: http://localhost:8000
93
+     """
94
+     # Check for explicit API_BASE_URL override first
95
+     if base_url := os.getenv('API_BASE_URL'):
96
+         # Ignore the override when it points at a local address
97
+         if 'localhost' not in base_url and '127.0.0.1' not in base_url:
98
+             return base_url
99
+
100
+     # Detect from request headers (handles nginx reverse proxy)
101
+     scheme = request.headers.get('x-forwarded-proto', request.url.scheme)
102
+     host = request.headers.get('x-forwarded-host', request.headers.get('host', request.url.netloc))
103
+
104
+     # Clean up host (remove port if it's standard)
105
+     if ':' in host:
106
+         # rsplit treats everything before the last colon as the hostname
107
+         hostname, port = host.rsplit(':', 1)
108
+         # Remove standard ports
109
+         if (scheme == 'https' and port == '443') or (scheme == 'http' and port == '80'):
110
+             host = hostname
111
+
112
+     return f"{scheme}://{host}"
113
+
114
+
115
  def get_or_create_user(
116
  db: Session,
117
  email: str,
 
215
  db.add(oauth_state)
216
  db.commit()
217
 
218
+ # Build callback URL dynamically from request (handles both local and production)
219
+ base_url = get_base_url(request)
220
  callback_url = f"{base_url}/api/auth/callback/{provider}"
221
 
222
  # Build authorization URL
 
235
  @router.get("/callback/{provider}", name="oauth_callback")
236
  async def oauth_callback(
237
  provider: str,
238
+ request: Request,
239
  code: Optional[str] = None,
240
  state: Optional[str] = None,
241
  error: Optional[str] = None,
 
265
  client_id = os.getenv(config['client_id_env'])
266
  client_secret = os.getenv(config['client_secret_env'])
267
 
268
+ # Build callback URL dynamically from request (must match the one sent to authorize)
269
+ base_url = get_base_url(request)
270
  callback_url = f"{base_url}/api/auth/callback/{provider}"
271
 
272
  # Exchange code for access token
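
The proxy-aware resolution that `get_base_url` performs can be tried outside FastAPI. The sketch below is a framework-free restatement, not the code from this commit: it assumes headers arrive as a plain mapping with lowercase keys, and `resolve_base_url` together with its `fallback_scheme`, `fallback_host`, and `env_override` parameters are hypothetical names introduced only for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_base_url(
    headers: Mapping[str, str],
    fallback_scheme: str = "http",
    fallback_host: str = "localhost:8000",
    env_override: Optional[str] = None,
) -> str:
    """Rebuild the public base URL from reverse-proxy headers (illustrative only)."""
    # An explicit override wins unless it points at a local address,
    # mirroring the API_BASE_URL check in the diff above.
    override = env_override if env_override is not None else os.getenv("API_BASE_URL")
    if override and "localhost" not in override and "127.0.0.1" not in override:
        return override

    scheme = headers.get("x-forwarded-proto", fallback_scheme)
    host = headers.get("x-forwarded-host", headers.get("host", fallback_host))

    # Drop the port only when it is the default port for the scheme.
    name, sep, port = host.rpartition(":")
    if sep and ((scheme == "https" and port == "443") or (scheme == "http" and port == "80")):
        host = name

    return f"{scheme}://{host}"
```

Both OAuth endpoints must derive the same callback URL, so a helper of this shape is called once in the login route and once in the callback route. Because `X-Forwarded-*` headers can be supplied by a client, they are only trustworthy when the reverse proxy overwrites them.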
scripts/datasources/fec/README.md CHANGED
@@ -94,6 +94,122 @@ python bulk_download_fec.py --dry-run
94
  └── ...
95
  ```
96
 
97
  ### `fec_integration.py`
98
  Integrate FEC API data for real-time queries.
99
 
 
94
  └── ...
95
  ```
96
 
97
+ ### `unzip_fec_data.py` (High-Performance Edition)
98
+ Unzip all FEC bulk data files with parallel processing and 7-Zip support for maximum speed.
99
+
100
+ **Performance Modes:**
101
+ - **Parallel Processing**: 4-8x faster with `--workers 8`
102
+ - **7-Zip Extraction**: 2-3x faster than Python zipfile
103
+ - **Combined**: 10-15x faster with `--method 7z --workers 8`
104
+
105
+ **Features:**
106
+ - Multiple extraction methods (Python zipfile, 7-Zip, auto-detect)
107
+ - Parallel processing with configurable worker count
108
+ - Maintains same folder hierarchy as source
109
+ - Resume support (skip already unzipped files)
110
+ - Progress tracking and logging
111
+ - Optional: Remove ZIP files after extraction
112
+ - Filter by category or year
113
+
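
The resume behaviour listed above can be pictured as a lookup in a JSON log of completed archives. The following is a minimal sketch under stated assumptions, not the script's own code: `load_log` and `already_unzipped` are hypothetical helper names, and the log layout (`completed_files` mapping an archive key to an `extracted_files` list) is taken from the `unzip_log.json` structure that `unzip_fec_data.py` writes.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_log(log_file: Path) -> Dict[str, dict]:
    """Return the completed-files mapping, or an empty one if no log exists yet."""
    if log_file.exists():
        return json.loads(log_file.read_text()).get("completed_files", {})
    return {}


def already_unzipped(zip_key: str, dest_dir: Path, completed: Dict[str, dict]) -> bool:
    """True only if the archive is logged and every recorded file is still on disk."""
    entry = completed.get(zip_key)
    if entry is None:
        return False
    files: List[str] = entry.get("extracted_files", [])
    # An entry without a recorded file list is treated as not yet extracted.
    return bool(files) and all((dest_dir / name).exists() for name in files)
```

Checking the files on disk as well as the log means an archive is extracted again if its output was deleted after the log entry was written.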
114
+ **Usage:**
115
+ ```bash
116
+ # RECOMMENDED: Unzip latest 2 years only with 8 workers (FAST & QUICK)
117
+ python unzip_fec_data.py --latest 2 --workers 8 --base-dir /mnt/d/fec_data
118
+
119
+ # FASTEST: Use 7-Zip with 8 parallel workers (10-15x faster, all years)
120
+ python unzip_fec_data.py --method 7z --workers 8 --base-dir /mnt/d/fec_data
121
+
122
+ # Fast: Use parallel workers only (4-8x faster)
123
+ python unzip_fec_data.py --workers 8 --base-dir /mnt/d/fec_data
124
+
125
+ # Moderate: Use 7-Zip single-threaded (2-3x faster)
126
+ python unzip_fec_data.py --method 7z --base-dir /mnt/d/fec_data
127
+
128
+ # Default: Python zipfile single-threaded (portable but slow)
129
+ python unzip_fec_data.py --base-dir /mnt/d/fec_data
130
+
131
+ # Auto-detect best method and optimal workers
132
+ python unzip_fec_data.py --method auto --workers 0 --base-dir /mnt/d/fec_data
133
+
134
+ # Unzip specific category with parallel workers
135
+ python unzip_fec_data.py --category candidate-master --workers 4
136
+
137
+ # Unzip specific years with parallel workers
138
+ python unzip_fec_data.py --years 2020,2022,2024 --workers 4
139
+
140
+ # Unzip latest 5 years only (auto-detects 2020-2024)
141
+ python unzip_fec_data.py --latest 5 --workers 8
142
+
143
+ # Resume interrupted extraction
144
+ python unzip_fec_data.py --resume --workers 8
145
+
146
+ # Dry run (show what would be unzipped)
147
+ python unzip_fec_data.py --dry-run
148
+
149
+ # Remove ZIP files after successful extraction (saves 50% disk space)
150
+ python unzip_fec_data.py --remove-zips --workers 8
151
+ ```
152
+
153
+ **Installation for 7-Zip (optional but recommended):**
154
+ ```bash
155
+ # Ubuntu/Debian
156
+ sudo apt-get install p7zip-full
157
+
158
+ # macOS
159
+ brew install p7zip
160
+
161
+ # Verify installation
162
+ 7z --help
163
+ ```
164
+
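
Whether 7-Zip is usable can be checked from Python before any extraction starts. This is a small sketch of that check, assuming the binary is named `7z` and is on `PATH`; `pick_extractor` is a hypothetical name, and the fallback rule follows the behaviour described for `--method` (use 7-Zip when requested or auto-detected, otherwise Python's `zipfile`).

```python
import shutil


def pick_extractor(method: str = "auto") -> str:
    """Return '7z' when the 7z binary should be used, otherwise 'python'."""
    has_7z = shutil.which("7z") is not None
    if method in ("7z", "auto") and has_7z:
        return "7z"
    # 'python', an unknown method, or a missing 7z binary all fall back to zipfile.
    return "python"
```

Falling back rather than failing keeps the script portable on machines where `p7zip` has not been installed.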
165
+ **Output Structure:**
166
+ ```
167
+ /mnt/d/fec_data/
168
+ ├── bulk-downloads/          # Original ZIP files (source)
169
+ │   ├── candidate-master/
170
+ │   │   ├── 1980/cn80.zip
171
+ │   │   └── 2024/cn24.zip
172
+ │   └── ...
173
+ └── unzipped/                # Unzipped CSV/TXT files (destination)
174
+     ├── candidate-master/
175
+     │   ├── 1980/
176
+     │   │   ├── cn80/
177
+     │   │   │   ├── cn.txt
178
+     │   │   │   ├── cn_header_file.csv
179
+     │   │   │   └── ...
180
+     │   └── 2024/
181
+     │       └── cn24/
182
+     │           ├── cn.txt
183
+     │           └── ...
184
+     ├── contributions-by-individuals/
185
+     │   └── 2024/
186
+     │       └── indiv24/
187
+     │           ├── indiv.txt
188
+     │           ├── indiv_header_file.csv
189
+     │           └── ...
190
+     └── ...
191
+ ```
192
+
193
+ **Workflow:**
194
+ 1. Download FEC bulk data: `python bulk_download_fec.py --base-dir /mnt/d/fec_data`
195
+ 2. **QUICK START** - Unzip latest 2 years only: `python unzip_fec_data.py --latest 2 --workers 8 --base-dir /mnt/d/fec_data`
196
+ - OR **FULL** - Unzip all files (FAST): `python unzip_fec_data.py --method 7z --workers 8 --base-dir /mnt/d/fec_data`
197
+ 3. (Optional) Remove ZIPs to save space: Add `--remove-zips` flag to step 2
198
+
199
+ **Performance Comparison:**
200
+
201
+ | Method | Workers | Approx. speedup | Est. time (100 files) |
202
+ |--------|---------|-----------------|-----------------------|
203
+ | Python zipfile | 1 | 1x | ~100 min |
204
+ | Python zipfile | 8 | 4-8x | ~13-25 min |
205
+ | 7-Zip | 1 | 2-3x | ~33-50 min |
206
+ | 7-Zip | 8 | 10-15x | ~7-10 min ⚡ |
207
+
208
+ **Recommended Settings:**
209
+ - **Maximum speed**: `--method 7z --workers 8` (requires 7z installed)
210
+ - **Good balance**: `--workers 4` (no additional software needed)
211
+ - **Portable**: Default (works everywhere, no setup)
212
+
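
The parallel mode compared above amounts to handing each archive to a pool of workers. The sketch below illustrates the idea with a thread pool so that it stays self-contained; the script itself uses `ProcessPoolExecutor`, and `extract_one` and `extract_all` are hypothetical names rather than functions from `unzip_fec_data.py`. It makes no claim about the speedups quoted in the table.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Dict, List


def extract_one(zip_path: Path, dest_root: Path) -> int:
    """Extract one archive into dest_root/<zip stem>/ and return the member count."""
    dest_dir = dest_root / zip_path.stem
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        return len(zf.namelist())


def extract_all(zip_paths: List[Path], dest_root: Path, workers: int = 4) -> Dict[str, int]:
    """Extract archives concurrently; maps archive file name to member count."""
    results: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract_one, p, dest_root): p for p in zip_paths}
        for future in as_completed(futures):
            # future.result() re-raises any exception raised inside the worker.
            results[futures[future].name] = future.result()
    return results
```

Each archive gets its own destination folder named after the ZIP stem, matching the output structure shown earlier, so workers never write to the same directory.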
213
  ### `fec_integration.py`
214
  Integrate FEC API data for real-time queries.
215
 
scripts/datasources/fec/unzip_fec_data.py ADDED
@@ -0,0 +1,671 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ FEC Bulk Data Unzipper (High-Performance Edition)
4
+
5
+ Unzips all FEC bulk data files downloaded by bulk_download_fec.py
6
+ from D:/fec_data/bulk-downloads/ to D:/fec_data/unzipped/
7
+
8
+ Supports multiple extraction methods for maximum speed:
9
+ - Python zipfile (default, portable)
10
+ - 7-Zip (2-3x faster if installed)
11
+ - Parallel processing (4-8x faster with multiple workers)
12
+
13
+ Directory Structure:
14
+ D:/fec_data/
15
+ ├── bulk-downloads/          # Original ZIP files (source)
16
+ │   ├── candidate-master/
17
+ │   │   ├── 1980/cn80.zip
18
+ │   │   └── 2024/cn24.zip
19
+ │   ├── contributions-by-individuals/
20
+ │   │   └── 2024/indiv24.zip
21
+ │   └── ...
22
+ └── unzipped/                # Unzipped CSV files (destination)
23
+     ├── candidate-master/
24
+     │   ├── 1980/
25
+     │   │   ├── cn80.txt
26
+     │   │   ├── cn_header_file.csv
27
+     │   │   └── ...
28
+     │   └── 2024/
29
+     │       ├── cn24.txt
30
+     │       └── ...
31
+     ├── contributions-by-individuals/
32
+     │   └── 2024/
33
+     │       ├── indiv24.txt
34
+     │       ├── indiv_header_file.csv
35
+     │       └── ...
36
+     └── ...
37
+
38
+ Usage:
39
+ # Quick start: Unzip only the latest 2 years with 8 workers (RECOMMENDED)
40
+ python unzip_fec_data.py --latest 2 --workers 8 --base-dir /mnt/d/fec_data
41
+
42
+ # Fast: Use 8 parallel workers (4-8x faster)
43
+ python unzip_fec_data.py --workers 8 --base-dir /mnt/d/fec_data
44
+
45
+ # Fastest: Use 7-Zip with 8 workers (10-15x faster if 7z installed)
46
+ python unzip_fec_data.py --method 7z --workers 8 --base-dir /mnt/d/fec_data
47
+
48
+ # Default (single-threaded Python)
49
+ python unzip_fec_data.py --base-dir /mnt/d/fec_data
50
+
51
+ # Specific category only
52
+ python unzip_fec_data.py --category candidate-master --workers 4
53
+
54
+ # Specific years only
55
+ python unzip_fec_data.py --years 2020,2022,2024 --workers 4
56
+
57
+ # Latest 5 years only
58
+ python unzip_fec_data.py --latest 5 --workers 8
59
+
60
+ # Resume interrupted extraction
61
+ python unzip_fec_data.py --resume --workers 8
62
+
63
+ # Dry run (show what would be unzipped)
64
+ python unzip_fec_data.py --dry-run
65
+
66
+ # Remove ZIP files after successful extraction (saves 50% disk space)
67
+ python unzip_fec_data.py --remove-zips --workers 8
68
+ """
69
+
70
+ import argparse
71
+ import json
72
+ import zipfile
73
+ import sys
74
+ import subprocess
75
+ import shutil
76
+ from pathlib import Path
77
+ from typing import Dict, List, Optional, Set, Tuple
78
+ from datetime import datetime
79
+ from loguru import logger
80
+ from tqdm import tqdm
81
+ from concurrent.futures import ProcessPoolExecutor, as_completed
82
+ from functools import partial
83
+
84
+
85
+ def _unzip_worker(args: Tuple[Path, Path, bool, Path, bool, bool, str]) -> Tuple[bool, Path, Path, int]:
86
+ """
87
+ Worker function for parallel unzipping (must be at module level for pickling)
88
+
89
+ Args:
90
+ args: Tuple of (zip_path, dest_dir, dry_run, base_dir, use_7z, remove_zips, method)
91
+
92
+ Returns:
93
+ Tuple of (success, zip_path, dest_dir, file_count)
94
+ """
95
+ zip_path, dest_dir, dry_run, base_dir, use_7z, remove_zips, method = args
96
+
97
+ if dry_run:
98
+ return True, zip_path, dest_dir, 0
99
+
100
+ try:
101
+ # Create destination directory
102
+ dest_dir.mkdir(parents=True, exist_ok=True)
103
+
104
+ # Extract using chosen method
105
+ if use_7z:
106
+ # Use 7-Zip
107
+ result = subprocess.run(
108
+ ['7z', 'x', str(zip_path), f'-o{dest_dir}', '-y'],
109
+ capture_output=True,
110
+ text=True,
111
+ check=True
112
+ )
113
+ file_list = [str(f.relative_to(dest_dir)) for f in dest_dir.rglob('*') if f.is_file()]
114
+ else:
115
+ # Use Python zipfile
116
+ with zipfile.ZipFile(zip_path, 'r') as zf:
117
+ file_list = zf.namelist()
118
+ zf.extractall(dest_dir)
119
+
120
+ file_count = len(file_list)
121
+
122
+ # Remove ZIP file if requested
123
+ removed = False
124
+ if remove_zips:
125
+ zip_path.unlink()
126
+ removed = True
127
+
128
+ return True, zip_path, dest_dir, file_count
129
+
130
+ except Exception as e:
131
+ return False, zip_path, dest_dir, 0
132
+
133
+
134
+ class FECBulkUnzipper:
135
+ """Unzip FEC bulk data files with parallel processing and 7-Zip support"""
136
+
137
+ def __init__(
138
+ self,
139
+ base_dir: Path,
140
+ resume: bool = False,
141
+ remove_zips: bool = False,
142
+ method: str = 'python',
143
+ workers: int = 1
144
+ ):
145
+ """
146
+ Initialize FEC bulk unzipper
147
+
148
+ Args:
149
+ base_dir: Base directory containing bulk-downloads/ (e.g., D:/fec_data/)
150
+ resume: Skip already unzipped files
151
+ remove_zips: Remove ZIP files after successful extraction
152
+ method: Extraction method ('python', '7z', or 'auto')
153
+ workers: Number of parallel workers (1 = single-threaded)
154
+ """
155
+ self.base_dir = Path(base_dir)
156
+ self.bulk_dir = self.base_dir / "bulk-downloads"
157
+ self.unzipped_dir = self.base_dir / "unzipped"
158
+ self.log_file = self.base_dir / "unzip_log.json"
159
+ self.resume = resume
160
+ self.remove_zips = remove_zips
161
+ self.method = method
162
+ self.workers = workers
163
+
164
+ # Validate source directory exists
165
+ if not self.bulk_dir.exists():
166
+ logger.error(f"❌ Source directory not found: {self.bulk_dir}")
167
+ logger.info("💡 Run bulk_download_fec.py first to download FEC data")
168
+ sys.exit(1)
169
+
170
+ # Create destination directory
171
+ self.unzipped_dir.mkdir(parents=True, exist_ok=True)
172
+
173
+ # Detect extraction method
174
+ self.use_7z = self._detect_extraction_method()
175
+
176
+ # Load unzip log
177
+ self.unzip_log = self._load_log()
178
+
179
+ # Statistics
180
+ self.stats = {
181
+ 'total_zips': 0,
182
+ 'unzipped': 0,
183
+ 'skipped': 0,
184
+ 'failed': 0,
185
+ 'removed': 0,
186
+ }
187
+
188
+     def _detect_extraction_method(self) -> bool:
189
+         """Detect if 7-Zip is available and choose best method"""
190
+         if self.method == 'python':
191
+             logger.info("📦 Using Python zipfile (portable)")
192
+             return False
193
+
194
+         if self.method == '7z':
195
+             if shutil.which('7z'):
196
+                 logger.info("⚡ Using 7-Zip (2-3x faster)")
197
+                 return True
198
+             else:
199
+                 logger.warning("⚠️ 7z not found, falling back to Python zipfile")
200
+                 logger.info("💡 Install with: sudo apt-get install p7zip-full")
201
+                 return False
202
+
203
+         if self.method == 'auto':
204
+             if shutil.which('7z'):
205
+                 logger.info("⚡ Using 7-Zip (auto-detected, 2-3x faster)")
206
+                 return True
207
+             else:
208
+                 logger.info("📦 Using Python zipfile (7z not found)")
209
+                 return False
210
+
211
+         logger.warning(f"⚠️ Unknown method '{self.method}', using Python zipfile")
212
+         return False
213
+
226
+ def _load_log(self) -> Dict:
227
+ """Load unzip log"""
228
+ if self.log_file.exists():
229
+ with open(self.log_file) as f:
230
+ return json.load(f)
231
+ return {
232
+ 'started': datetime.now().isoformat(),
233
+ 'last_updated': None,
234
+ 'completed_files': {},
235
+ 'failed_files': {},
236
+ }
237
+
238
+ def _save_log(self):
239
+ """Save unzip log"""
240
+ self.unzip_log['last_updated'] = datetime.now().isoformat()
241
+ with open(self.log_file, 'w') as f:
242
+ json.dump(self.unzip_log, f, indent=2)
243
+
244
+ def _is_unzipped(self, zip_path: Path, dest_dir: Path) -> bool:
245
+ """Check if ZIP file is already unzipped"""
246
+ if not self.resume:
247
+ return False
248
+
249
+ # Check if in completed log
250
+ zip_key = str(zip_path.relative_to(self.base_dir))
251
+ if zip_key in self.unzip_log['completed_files']:
252
+ unzip_info = self.unzip_log['completed_files'][zip_key]
253
+
254
+ # Verify destination directory exists and has files
255
+ if dest_dir.exists() and any(dest_dir.iterdir()):
256
+ # Check if all expected files exist
257
+ expected_files = unzip_info.get('extracted_files', [])
258
+ if expected_files:
259
+ all_exist = all(
260
+ (dest_dir / f).exists()
261
+ for f in expected_files
262
+ )
263
+ if all_exist:
264
+ return True
265
+
266
+ return False
267
+
268
+ def _unzip_with_python(self, zip_path: Path, dest_dir: Path) -> Tuple[bool, List[str]]:
269
+ """Unzip using Python's zipfile module"""
270
+ with zipfile.ZipFile(zip_path, 'r') as zf:
271
+ file_list = zf.namelist()
272
+ zf.extractall(dest_dir)
273
+ return True, file_list
274
+
275
+ def _unzip_with_7z(self, zip_path: Path, dest_dir: Path) -> Tuple[bool, List[str]]:
276
+ """Unzip using 7-Zip (2-3x faster)"""
277
+ try:
278
+ # Run 7z extract command
279
+ result = subprocess.run(
280
+ ['7z', 'x', str(zip_path), f'-o{dest_dir}', '-y'],
281
+ capture_output=True,
282
+ text=True,
283
+ check=True
284
+ )
285
+
286
+ # Get list of extracted files from dest_dir
287
+ file_list = [
288
+ str(f.relative_to(dest_dir))
289
+ for f in dest_dir.rglob('*')
290
+ if f.is_file()
291
+ ]
292
+
293
+ return True, file_list
294
+
295
+ except subprocess.CalledProcessError as e:
296
+ logger.error(f"7z extraction failed: {e.stderr}")
297
+ return False, []
298
+
299
+ def _unzip_file(
300
+ self,
301
+ zip_path: Path,
302
+ dest_dir: Path,
303
+ dry_run: bool = False
304
+ ) -> bool:
305
+ """
306
+ Unzip a single file
307
+
308
+ Args:
309
+ zip_path: Path to ZIP file
310
+ dest_dir: Destination directory
311
+ dry_run: If True, don't actually unzip
312
+
313
+ Returns:
314
+ True if successful, False otherwise
315
+ """
316
+ zip_key = str(zip_path.relative_to(self.base_dir))
317
+
318
+ # Check if already unzipped
319
+ if self._is_unzipped(zip_path, dest_dir):
320
+ self.stats['skipped'] += 1
321
+ return True
322
+
323
+ if dry_run:
324
+ logger.info(f"🔍 Would unzip: {zip_path} → {dest_dir}")
325
+ return True
326
+
327
+ try:
328
+ # Create destination directory
329
+ dest_dir.mkdir(parents=True, exist_ok=True)
330
+
331
+ # Extract using chosen method
332
+ if self.use_7z:
333
+ success, file_list = self._unzip_with_7z(zip_path, dest_dir)
334
+ else:
335
+ success, file_list = self._unzip_with_python(zip_path, dest_dir)
336
+
337
+ if not success:
338
+ self.stats['failed'] += 1
339
+ return False
340
+
341
+ # Log success
342
+ self.unzip_log['completed_files'][zip_key] = {
343
+ 'zip_path': str(zip_path),
344
+ 'dest_dir': str(dest_dir),
345
+ 'extracted_files': file_list[:100], # Limit log size
346
+ 'file_count': len(file_list),
347
+ 'unzipped_at': datetime.now().isoformat(),
348
+ }
349
+
350
+ self.stats['unzipped'] += 1
351
+
352
+ # Remove ZIP file if requested
353
+ if self.remove_zips:
354
+ zip_path.unlink()
355
+ self.stats['removed'] += 1
356
+
357
+ return True
358
+
359
+ except Exception as e:
360
+ logger.error(f"❌ Failed to unzip {zip_path.name}: {e}")
361
+ self.unzip_log['failed_files'][zip_key] = {
362
+ 'error': str(e),
363
+ 'failed_at': datetime.now().isoformat(),
364
+ }
365
+ self.stats['failed'] += 1
366
+ return False
367
+
368
+ def find_zip_files(
369
+ self,
370
+ categories: Optional[Set[str]] = None,
371
+ years: Optional[Set[str]] = None
372
+ ) -> List[Path]:
373
+ """
374
+ Find all ZIP files in bulk-downloads directory
375
+
376
+ Args:
377
+ categories: Optional set of categories to filter (e.g., {'candidate-master'})
378
+ years: Optional set of years to filter (e.g., {'2020', '2022', '2024'})
379
+
380
+ Returns:
381
+ List of ZIP file paths
382
+ """
383
+ zip_files = []
384
+
385
+ # Recursively find all .zip files
386
+ for zip_path in self.bulk_dir.rglob("*.zip"):
387
+ # Filter by category
388
+ if categories:
389
+ # Get category from path (e.g., bulk-downloads/candidate-master/2024/cn24.zip)
390
+ relative_path = zip_path.relative_to(self.bulk_dir)
391
+ category = relative_path.parts[0] if len(relative_path.parts) > 0 else None
392
+
393
+ if category not in categories:
394
+ continue
395
+
396
+ # Filter by year
397
+ if years:
398
+ # Get year from path (e.g., bulk-downloads/candidate-master/2024/cn24.zip)
399
+ relative_path = zip_path.relative_to(self.bulk_dir)
400
+ year = relative_path.parts[1] if len(relative_path.parts) > 1 else None
401
+
402
+ if year not in years:
403
+ continue
404
+
405
+ zip_files.append(zip_path)
406
+
407
+ return sorted(zip_files)
408
+
409
+ def unzip_all(
410
+ self,
411
+ categories: Optional[Set[str]] = None,
412
+ years: Optional[Set[str]] = None,
413
+ dry_run: bool = False
414
+ ):
415
+ """
416
+ Unzip all FEC bulk data files (with optional parallel processing)
417
+
418
+ Args:
419
+ categories: Optional set of categories to filter
420
+ years: Optional set of years to filter
421
+ dry_run: If True, don't actually unzip
422
+ """
423
+ logger.info("=" * 70)
424
+ logger.info("FEC BULK DATA UNZIPPER (HIGH-PERFORMANCE EDITION)")
425
+ logger.info("=" * 70)
426
+ logger.info(f"📂 Source: {self.bulk_dir}")
427
+ logger.info(f"📁 Destination: {self.unzipped_dir}")
428
+ logger.info(f"⚙️ Method: {'7-Zip' if self.use_7z else 'Python zipfile'}")
429
+ logger.info(f"👷 Workers: {self.workers} {'(parallel)' if self.workers > 1 else '(single-threaded)'}")
430
+ if categories:
431
+ logger.info(f"📋 Categories: {', '.join(sorted(categories))}")
432
+ if years:
433
+ logger.info(f"📅 Years: {', '.join(sorted(years))}")
434
+ if dry_run:
435
+ logger.warning("🔍 DRY RUN MODE - No files will be unzipped")
436
+ logger.info("")
437
+
438
+ # Find all ZIP files
439
+ zip_files = self.find_zip_files(categories=categories, years=years)
440
+ self.stats['total_zips'] = len(zip_files)
441
+
442
+ if not zip_files:
443
+ logger.warning("⚠️ No ZIP files found")
444
+ return
445
+
446
+ logger.info(f"Found {len(zip_files)} ZIP files")
447
+ logger.info("")
448
+
449
+ # Prepare unzip tasks
450
+ tasks = []
451
+ for zip_path in zip_files:
452
+ relative_path = zip_path.relative_to(self.bulk_dir)
453
+ dest_dir = self.unzipped_dir / relative_path.parent / zip_path.stem
454
+ tasks.append((zip_path, dest_dir, dry_run))
455
+
456
+ # Execute unzipping (parallel or sequential)
457
+ if self.workers > 1 and not dry_run:
458
+ logger.info(f"🚀 Starting parallel extraction with {self.workers} workers")
459
+ self._unzip_parallel(tasks)
460
+ else:
461
+ logger.info("📦 Starting sequential extraction")
462
+ for zip_path, dest_dir, dry_run in tqdm(tasks, desc="Unzipping", unit="file"):
463
+ self._unzip_file(zip_path, dest_dir, dry_run)
464
+
465
+ # Save log periodically
466
+ if not dry_run and self.stats['unzipped'] % 10 == 0:
467
+ self._save_log()
468
+
469
+ # Save final log
470
+ if not dry_run:
471
+ self._save_log()
472
+
473
+ # Print summary
474
+ logger.info("")
475
+ logger.info("=" * 70)
476
+ logger.info("SUMMARY")
477
+ logger.info("=" * 70)
478
+ logger.info(f"📊 Total ZIP files: {self.stats['total_zips']}")
479
+ logger.info(f"✅ Unzipped: {self.stats['unzipped']}")
480
+ logger.info(f"⏭️ Skipped: {self.stats['skipped']}")
481
+ logger.info(f"❌ Failed: {self.stats['failed']}")
482
+ if self.remove_zips:
483
+ logger.info(f"🗑️ Removed: {self.stats['removed']}")
484
+ logger.info("")
485
+
486
+ if self.stats['failed'] > 0:
487
+ logger.warning("⚠️ Some files failed to unzip. Check unzip_log.json for details.")
+
+    def _unzip_parallel(self, tasks: List[Tuple[Path, Path, bool]]):
+        """Unzip files in parallel using ProcessPoolExecutor"""
+        # Prepare tasks with all necessary args for module-level worker
+        worker_tasks = [
+            (zip_path, dest_dir, dry_run, self.base_dir, self.use_7z, self.remove_zips, self.method)
+            for zip_path, dest_dir, dry_run in tasks
+        ]
+
+        with ProcessPoolExecutor(max_workers=self.workers) as executor:
+            # Submit all tasks
+            futures = {executor.submit(_unzip_worker, task): task[0] for task in worker_tasks}
+
+            # Track progress with tqdm
+            with tqdm(total=len(futures), desc="Unzipping (parallel)", unit="file") as pbar:
+                for future in as_completed(futures):
+                    zip_path = futures[future]
+                    try:
+                        success, zip_path_result, dest_dir, file_count = future.result()
+                        if success:
+                            self.stats['unzipped'] += 1
+
+                            # Log to unzip_log
+                            zip_key = str(zip_path_result.relative_to(self.base_dir))
+                            self.unzip_log['completed_files'][zip_key] = {
+                                'zip_path': str(zip_path_result),
+                                'dest_dir': str(dest_dir),
+                                'file_count': file_count,
+                                'unzipped_at': datetime.now().isoformat(),
+                            }
+
+                            if self.remove_zips:
+                                self.stats['removed'] += 1
+                        else:
+                            self.stats['failed'] += 1
+                            logger.error(f"❌ Failed to unzip {zip_path.name}")
+
+                            # Log failure
+                            zip_key = str(zip_path_result.relative_to(self.base_dir))
+                            self.unzip_log['failed_files'][zip_key] = {
+                                'error': 'Extraction failed',
+                                'failed_at': datetime.now().isoformat(),
+                            }
+                    except Exception as e:
+                        logger.error(f"❌ Worker exception for {zip_path.name}: {e}")
+                        self.stats['failed'] += 1
+
+                        # Record the exception too, so unzip_log.json lists every failure
+                        self.unzip_log['failed_files'][str(zip_path)] = {
+                            'error': str(e),
+                            'failed_at': datetime.now().isoformat(),
+                        }
+
+                    pbar.update(1)
+
+                    # Save log periodically (skip while nothing has been unzipped yet)
+                    if self.stats['unzipped'] and self.stats['unzipped'] % 10 == 0:
+                        self._save_log()
+
+
+def main():
+    """Main entry point"""
+    parser = argparse.ArgumentParser(
+        description="Unzip FEC bulk data files (High-Performance Edition)",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog=__doc__
+    )
+
+    parser.add_argument(
+        '--base-dir',
+        type=Path,
+        default=Path('D:/fec_data'),
+        help='Base directory containing bulk-downloads/ (default: D:/fec_data)'
+    )
+
+    parser.add_argument(
+        '--category',
+        type=str,
+        help='Specific category to unzip (e.g., candidate-master, contributions-by-individuals)'
+    )
+
+    parser.add_argument(
+        '--years',
+        type=str,
+        help='Comma-separated list of years to unzip (e.g., 2020,2022,2024)'
+    )
+
+    parser.add_argument(
+        '--latest',
+        type=int,
+        help='Only unzip the latest N years (e.g., --latest 2 for most recent 2 years)'
+    )
+
+    parser.add_argument(
+        '--workers',
+        type=int,
+        default=1,
+        help='Number of parallel workers (default: 1; 0 = auto-detect as CPU count - 1; 4-8 recommended)'
+    )
+
+    parser.add_argument(
+        '--method',
+        type=str,
+        default='auto',
+        choices=['python', '7z', 'auto'],
+        help='Extraction method: python (portable), 7z (2-3x faster), auto (use 7z if available)'
+    )
+
+    parser.add_argument(
+        '--resume',
+        action='store_true',
+        help='Skip already unzipped files'
+    )
+
+    parser.add_argument(
+        '--dry-run',
+        action='store_true',
+        help='Show what would be unzipped without actually unzipping'
+    )
+
+    parser.add_argument(
+        '--remove-zips',
+        action='store_true',
+        help='Remove ZIP files after successful extraction (saves 50%% disk space)'
+    )
+
+    args = parser.parse_args()
+
+    # Parse categories and years (tolerate spaces around commas, e.g. "2020, 2022")
+    categories = {args.category} if args.category else None
+    years = {y.strip() for y in args.years.split(',') if y.strip()} if args.years else None
+
+    # Handle --latest option (auto-determine latest N years)
+    if args.latest is not None:
+        if args.latest < 1:
+            logger.error("❌ --latest must be a positive integer")
+            sys.exit(1)
+
+        if args.years:
+            logger.error("❌ Cannot use both --years and --latest options together")
+            sys.exit(1)
+
+        # Find all available years in the bulk-downloads directory
+        base_dir = Path(args.base_dir)
+        bulk_dir = base_dir / "bulk-downloads"
+
+        if not bulk_dir.exists():
+            logger.error(f"❌ Bulk downloads directory not found: {bulk_dir}")
+            sys.exit(1)
+
+        # Scan for all year directories
+        available_years = set()
+        for category_dir in bulk_dir.iterdir():
+            if category_dir.is_dir():
+                for year_dir in category_dir.iterdir():
+                    if year_dir.is_dir() and year_dir.name.isdigit():
+                        available_years.add(year_dir.name)
+
+        if not available_years:
+            logger.error("❌ No year directories found in bulk-downloads")
+            sys.exit(1)
+
+        # Get latest N years (sort numerically, not lexicographically)
+        sorted_years = sorted(available_years, key=int, reverse=True)
+        latest_years = sorted_years[:args.latest]
+        years = set(latest_years)
+
+        logger.info(f"📅 Auto-selected latest {args.latest} years: {', '.join(latest_years)}")
+        logger.info("")
+
+    # Auto-detect optimal worker count if requested
+    if args.workers == 0:
+        args.workers = max(1, cpu_count() - 1)
+        logger.info(f"Auto-detected {args.workers} workers (CPU count: {cpu_count()})")
+
+    # Create unzipper
+    unzipper = FECBulkUnzipper(
+        base_dir=args.base_dir,
+        resume=args.resume,
+        remove_zips=args.remove_zips,
+        method=args.method,
+        workers=args.workers
+    )
+
+    # Unzip all files
+    unzipper.unzip_all(
+        categories=categories,
+        years=years,
+        dry_run=args.dry_run
+    )
+
+
+if __name__ == '__main__':
+    main()
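
The `--latest` handling in `main()` scans `bulk-downloads/<category>/<year>/` and keeps the N most recent year names. As a minimal sketch, that step could be pulled into a standalone helper that is easier to test in isolation. The name `latest_years` and its signature are hypothetical and not part of this commit; the directory layout is assumed from the scan in `main()`.

```python
from pathlib import Path
from typing import List


def latest_years(bulk_dir: Path, n: int) -> List[str]:
    """Return the n most recent year directory names under bulk_dir.

    Hypothetical helper: assumes the bulk_dir/<category>/<year>/ layout
    that main() scans when --latest is given.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")

    # Collect every all-digit directory name one level below each category.
    years = {
        year_dir.name
        for category_dir in bulk_dir.iterdir()
        if category_dir.is_dir()
        for year_dir in category_dir.iterdir()
        if year_dir.is_dir() and year_dir.name.isdigit()
    }

    # Sort by numeric value so the order does not depend on string length.
    return sorted(years, key=int, reverse=True)[:n]
```

A caller would pass the `bulk-downloads` directory and the value of `--latest`, then wrap the result in a `set` before handing it to `unzip_all(years=...)`.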