# Unified Nonprofit Data Management

**Single Source of Truth**: `data/gold/nonprofits_organizations.parquet`

## 🎯 **New Workflow**

### ✅ **DO THIS** (Single Unified File)

```bash
# Check stats
python scripts/manage_nonprofits.py stats

# Enrich a subset (updates main file in place)
python scripts/manage_nonprofits.py enrich-990 --states AL --sample 100

# Enrich specific orgs
python scripts/manage_nonprofits.py enrich-990 --ein-list eins.txt

# Enrich all Alabama + Michigan health nonprofits
python scripts/manage_nonprofits.py enrich-990 --states AL MI --ntee E

# BigQuery enrichment
python scripts/manage_nonprofits.py enrich-bigquery --states AL
# ... run SQL in BigQuery web UI, export CSV ...
python scripts/manage_nonprofits.py merge-bigquery
```

### ❌ **DON'T DO THIS** (Creates Separate Files)

```bash
# OLD WAY - Creates proliferation of files
python scripts/enrich_nonprofits_gt990.py \
    --input data/gold/nonprofits_tuscaloosa.parquet \
    --output data/gold/nonprofits_tuscaloosa_form990.parquet  # ❌ Extra file!
```

## 📊 **Key Commands**

### Show Statistics

```bash
python scripts/manage_nonprofits.py stats
```

Output:
```
📊 TOTAL: 1,952,238 organizations
💰 ENRICHMENT STATUS:
   Form 990 data: 307 (0.0%)
   BigQuery data: 0 (0.0%)
📝 MISSION STATEMENTS:
   At least one source: 299 (0.0%)
```
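
The `0.0%` figures above are a rounding effect, not an empty dataset. The short check below redoes the arithmetic with the counts from the sample output; it assumes the `stats` command formats percentages to one decimal place, which is inferred from the output rather than confirmed from the script.

```python
# Counts copied from the sample `stats` output above.
total = 1_952_238      # organizations in the main file
with_990 = 307         # rows with Form 990 data
with_mission = 299     # rows with at least one mission statement

for label, count in [("Form 990 data", with_990), ("Mission statements", with_mission)]:
    share = count / total
    # One decimal place rounds the share down to 0.0%; three decimals reveal it.
    print(f"{label}: {count} ({share:.1%} at one decimal, {share:.3%} at three)")
```

Coverage will keep reading as `0.0%` until roughly a thousand organizations have been enriched, so compare the raw counts when checking that a small test run worked.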

### Enrich Incrementally

**By State:**
```bash
python scripts/manage_nonprofits.py enrich-990 --states AL
# Updates main file with 62K Alabama nonprofits enriched
```

**By NTEE Category:**
```bash
python scripts/manage_nonprofits.py enrich-990 --ntee E
# Updates main file with 45K health nonprofits enriched
```

**Combined Filters:**
```bash
python scripts/manage_nonprofits.py enrich-990 --states AL MI --ntee E
# Updates main file with Alabama + Michigan health orgs
```

**Test on Sample:**
```bash
python scripts/manage_nonprofits.py enrich-990 --sample 100
# Updates main file with 100 random orgs enriched (for testing)
```

**Specific EINs:**
```bash
# Create file with EINs (one per line)
echo "631024890" > eins.txt
echo "631041304" >> eins.txt

python scripts/manage_nonprofits.py enrich-990 --ein-list eins.txt
```
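
Before passing a hand-built list to `--ein-list`, it can help to validate it. The sketch below is a hypothetical pre-flight check, not part of `manage_nonprofits.py`: the `load_eins` helper is an illustrative name, and it assumes EINs are nine digits, optionally written with a hyphen.

```python
import tempfile
from pathlib import Path


def load_eins(path):
    """Read EINs from a text file (one per line), skipping blanks and duplicates."""
    seen = set()
    eins = []
    for line in Path(path).read_text().splitlines():
        ein = line.strip().replace("-", "")  # accept both 631024890 and 63-1024890
        if not ein or ein in seen:
            continue
        if not (ein.isdigit() and len(ein) == 9):
            raise ValueError(f"Not a 9-digit EIN: {line!r}")
        seen.add(ein)
        eins.append(ein)
    return eins


# Demonstration on a throwaway file containing a blank line and a duplicate.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "eins.txt"
    sample.write_text("631024890\n63-1041304\n\n631024890\n")
    eins = load_eins(sample)

print(eins)
```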

## 🔄 **How In-Place Updates Work**

1. **Load** full dataset (1.9M orgs)
2. **Filter** to subset (e.g., Alabama = 62K orgs)
3. **Enrich** only the filtered subset
4. **Merge** back:
   - Remove old data for those EINs
   - Add newly enriched data
   - Sort by EIN
5. **Save** back to same file

**Result:** Only one file, incrementally enriched!
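
The merge step above can be sketched with pandas as follows. This is a minimal illustration, not the code in `manage_nonprofits.py`: the `merge_enriched` helper and the three-row sample frame are made up, and the constant assignment stands in for the real Form 990 lookup.

```python
import pandas as pd


def merge_enriched(main: pd.DataFrame, enriched: pd.DataFrame) -> pd.DataFrame:
    """Replace rows in `main` whose EIN appears in `enriched`, then sort by EIN."""
    # Remove old data for the EINs that were just enriched.
    untouched = main[~main["ein"].isin(enriched["ein"])]
    # Add the newly enriched rows and sort by EIN.
    merged = pd.concat([untouched, enriched], ignore_index=True)
    return merged.sort_values("ein").reset_index(drop=True)


# Step 1 stand-in: a tiny in-memory "full dataset" (made-up rows).
main = pd.DataFrame({
    "ein": ["111", "222", "333"],
    "state": ["AL", "MI", "AL"],
    "form_990_status": [None, None, None],
})
subset = main[main["state"] == "AL"].copy()  # step 2: filter to a subset
subset["form_990_status"] = "found"          # step 3: placeholder for enrichment
result = merge_enriched(main, subset)        # step 4: merge back
print(result)
```

In a real run, steps 1 and 5 would wrap this with `pd.read_parquet(...)` and `result.to_parquet(...)` on the same path, which is what keeps everything in a single file.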

## 🗑️ **Cleanup Old Files**

### Preview Cleanup (Dry Run)

```bash
python scripts/cleanup_nonprofit_files.py
```

Output:
```
🔄 Found enriched data: nonprofits_tuscaloosa_form990.parquet
   ✅ Merged 921 enriched organizations

🗑️  Found 9 file(s) to clean up:
   - nonprofits_tuscaloosa.parquet (0.0 MB)
   - nonprofits_tuscaloosa_form990.parquet (0.1 MB)
   - /tmp/test_*.parquet (0.1 MB)
   
⚠️  DRY RUN - No files will be deleted
   Run with --execute to actually delete files
```

### Execute Cleanup

```bash
python scripts/cleanup_nonprofit_files.py --execute
```

**What it does:**
- ✅ Merges any enrichment from old files into main file
- ✅ Deletes old/redundant files
- ✅ Leaves only `data/gold/nonprofits_organizations.parquet`
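
The dry-run versus `--execute` behavior follows a common pattern: report what would be deleted, and delete only when explicitly asked. The sketch below shows that pattern on a throwaway file. It is an assumption about the general shape, not the contents of `cleanup_nonprofit_files.py`, and the `cleanup` function is a hypothetical name.

```python
import tempfile
from pathlib import Path


def cleanup(paths, execute=False):
    """List the given files; delete them only when execute=True."""
    removed = []
    for path in paths:
        size_mb = path.stat().st_size / 1_000_000
        print(f"   - {path.name} ({size_mb:.1f} MB)")
        if execute:
            path.unlink()
            removed.append(path)
    if not execute:
        print("DRY RUN - no files deleted; pass execute=True to delete")
    return removed


# Demonstration in a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "nonprofits_tuscaloosa.parquet"
    old.write_bytes(b"placeholder")
    dry = cleanup([old])                    # reports only
    exists_after_dry = old.exists()
    wet = cleanup([old], execute=True)      # actually deletes
    exists_after_execute = old.exists()
```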

## 📁 **File Organization**

### ✅ **Keep (Single Source of Truth)**

```
data/gold/nonprofits_organizations.parquet  # 1.9M orgs, incrementally enriched
```

### 🗑️ **Remove (Old Workflow)**

```
data/gold/nonprofits_tuscaloosa.parquet              # ❌ Subset
data/gold/nonprofits_tuscaloosa_form990.parquet      # ❌ Enriched subset  
data/gold/nonprofits_990_enriched.parquet            # ❌ Another version
/tmp/test_*.parquet                                   # ❌ Test files
```

## 🎨 **Progressive Enrichment Strategy**

Enrich the dataset in phases so that each run stays within API rate limits:

### Phase 1: Test Sample (TODAY)
```bash
python scripts/manage_nonprofits.py enrich-990 --sample 1000
# Test with 1K random orgs
```

### Phase 2: Priority States (WEEK 1)
```bash
python scripts/manage_nonprofits.py enrich-990 --states AL MI
# Enrich Alabama + Michigan (118K orgs)
```

### Phase 3: Priority NTEE (WEEK 2)
```bash
python scripts/manage_nonprofits.py enrich-990 --ntee E P
# Health + Human Services (199K orgs)
```

### Phase 4: Remaining States (MONTH 1)
```bash
# Enrich 5-10 states per day
for state in CA TX NY FL PA; do
    python scripts/manage_nonprofits.py enrich-990 --states "$state"
    echo "Completed $state"
    sleep 3600  # Wait 1 hour between states
done
```

### Phase 5: BigQuery Layer (MONTH 2)
```bash
# Add missions + websites from BigQuery
python scripts/manage_nonprofits.py enrich-bigquery
# ... export CSV ...
python scripts/manage_nonprofits.py merge-bigquery
```

## 🔍 **Advanced Usage**

### Check Enrichment Status by State

```bash
python -c "
import pandas as pd
df = pd.read_parquet('data/gold/nonprofits_organizations.parquet')
enriched = df[df['form_990_status'] == 'found']
by_state = enriched.groupby('state').size().sort_values(ascending=False).head(10)
print('Top 10 states by Form 990 coverage:')
print(by_state)
"
```
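
The snippet above ranks states by the number of enriched organizations. To see coverage as a share of each state's organizations, a variant along these lines may be more informative. It is a sketch on a made-up five-row frame; only the `state` and `form_990_status` column names and the `'found'` value come from the snippet above.

```python
import pandas as pd

# Made-up rows standing in for the main parquet file.
df = pd.DataFrame({
    "state": ["AL", "AL", "AL", "MI", "MI"],
    "form_990_status": ["found", None, "found", None, None],
})

# True where Form 990 data was found; missing statuses count as not enriched.
found = df["form_990_status"].eq("found")

by_state = df.assign(enriched=found).groupby("state")["enriched"].agg(["sum", "size"])
by_state.columns = ["enriched", "total"]
by_state["pct"] = 100 * by_state["enriched"] / by_state["total"]
by_state = by_state.sort_values("pct", ascending=False)
print(by_state)
```

Swapping the sample frame for `pd.read_parquet('data/gold/nonprofits_organizations.parquet')` applies the same calculation to the real file.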

### Export EINs Needing Enrichment

```bash
python -c "
import pandas as pd
df = pd.read_parquet('data/gold/nonprofits_organizations.parquet')
# Alabama health orgs without 990 data
needs_enrichment = df[
    (df['state'] == 'AL') & 
    (df['ntee_code'].str.startswith('E', na=False)) &
    (df['form_990_status'].isna())
]['ein']
needs_enrichment.to_csv('alabama_health_needs_enrichment.txt', index=False, header=False)
print(f'Exported {len(needs_enrichment)} EINs')
"
```

## 💡 **Benefits of Unified File**

1. ✅ **Single source of truth** - No confusion about which file is current
2. ✅ **Incremental updates** - Add enrichment data without duplicating base data
3. ✅ **Smaller disk usage** - No duplicate base data across files
4. ✅ **Easier tracking** - One file to track, version, backup
5. ✅ **Simpler workflows** - No need to manage file versions
6. ✅ **Better for git** - One file to track changes in (with git-lfs)

## 📝 **Migration Checklist**

- [x] **Merge** existing enrichment data into main file
- [x] **Verify** enrichment was merged (run `stats`)
- [ ] **Clean up** old files with `--execute`
- [ ] **Update** any scripts/docs referencing old files
- [ ] **Test** new workflow with small sample
- [ ] **Document** team workflow

## 🆘 **Troubleshooting**

### "Main file not found"

```bash
# Create main file from EO-BMF data
python pipeline/create_gold_tables.py --nonprofits-only
```

### "Lost my enrichment data!"

Enrichment stored in old files is not lost: the cleanup script merges it into the main file before deleting anything. Run the script without `--execute` first to preview what will be merged and deleted.

### "Want to keep a backup"

```bash
# Backup before cleanup
cp data/gold/nonprofits_organizations.parquet \
   data/gold/nonprofits_organizations.$(date +%Y%m%d).parquet.bak
```

## 📚 **Related Documentation**

- [Form 990 Enrichment](website/docs/data-sources/form-990-xml.md)
- [BigQuery Integration](docs/BIGQUERY_ENRICHMENT.md)
- [Charity Navigator](website/docs/data-sources/charity-navigator.md)