# Unified Nonprofit Data Management
**Single Source of Truth**: `data/gold/nonprofits_organizations.parquet`
## **New Workflow**
### ✅ **DO THIS** (Single Unified File)
```bash
# Check stats
python scripts/manage_nonprofits.py stats
# Enrich a subset (updates main file in place)
python scripts/manage_nonprofits.py enrich-990 --states AL --sample 100
# Enrich specific orgs
python scripts/manage_nonprofits.py enrich-990 --ein-list eins.txt
# Enrich all Alabama + Michigan health nonprofits
python scripts/manage_nonprofits.py enrich-990 --states AL MI --ntee E
# BigQuery enrichment
python scripts/manage_nonprofits.py enrich-bigquery --states AL
# ... run SQL in BigQuery web UI, export CSV ...
python scripts/manage_nonprofits.py merge-bigquery
```
### ❌ **DON'T DO THIS** (Creates Separate Files)
```bash
# OLD WAY - creates a new file for every subset
python scripts/enrich_nonprofits_gt990.py \
    --input data/gold/nonprofits_tuscaloosa.parquet \
    --output data/gold/nonprofits_tuscaloosa_form990.parquet  # ❌ Extra file!
```
## **Key Commands**
### Show Statistics
```bash
python scripts/manage_nonprofits.py stats
```
Output:
```
TOTAL: 1,952,238 organizations
ENRICHMENT STATUS:
  Form 990 data: 307 (0.0%)
  BigQuery data: 0 (0.0%)
MISSION STATEMENTS:
  At least one source: 299 (0.0%)
```
### Enrich Incrementally
**By State:**
```bash
python scripts/manage_nonprofits.py enrich-990 --states AL
# Updates main file with 62K Alabama nonprofits enriched
```
**By NTEE Category:**
```bash
python scripts/manage_nonprofits.py enrich-990 --ntee E
# Updates main file with 45K health nonprofits enriched
```
**Combined Filters:**
```bash
python scripts/manage_nonprofits.py enrich-990 --states AL MI --ntee E
# Updates main file with Alabama + Michigan health orgs
```
**Test on Sample:**
```bash
python scripts/manage_nonprofits.py enrich-990 --sample 100
# Updates main file with 100 random orgs enriched (for testing)
```
**Specific EINs:**
```bash
# Create file with EINs (one per line)
echo "631024890" > eins.txt
echo "631041304" >> eins.txt
python scripts/manage_nonprofits.py enrich-990 --ein-list eins.txt
```
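How `manage_nonprofits.py` parses the `--ein-list` file is not shown here. As a minimal sketch, assuming one EIN per line, a reader could look like the following. The `load_ein_list` helper, the hyphen stripping, and the de-duplication are illustrative assumptions, not the script's confirmed behavior.

```python
import tempfile
from pathlib import Path


def load_ein_list(path):
    """Read one EIN per line; return unique 9-digit strings in file order.

    Hypothetical helper: the real script may parse the file differently.
    """
    eins = []
    seen = set()
    for raw in Path(path).read_text().splitlines():
        ein = raw.strip().replace("-", "")  # accept "63-1024890" as well
        if not ein:
            continue  # skip blank lines
        if not (ein.isascii() and ein.isdigit() and len(ein) == 9):
            raise ValueError(f"Not a 9-digit EIN: {raw!r}")
        if ein not in seen:
            seen.add(ein)
            eins.append(ein)
    return eins


# Demo with a throwaway file containing a hyphenated EIN, a blank line,
# and a duplicate
with tempfile.TemporaryDirectory() as tmp:
    ein_file = Path(tmp) / "eins.txt"
    ein_file.write_text("631024890\n63-1041304\n\n631024890\n")
    eins = load_ein_list(ein_file)

print(eins)  # ['631024890', '631041304']
```

Keeping EINs as strings avoids losing leading zeros, which would happen if they were parsed as integers.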
## **How In-Place Updates Work**
1. **Load** the full dataset (1.9M orgs)
2. **Filter** to a subset (e.g., Alabama = 62K orgs)
3. **Enrich** only the filtered subset
4. **Merge** back:
   - Remove old data for those EINs
   - Add newly enriched data
   - Sort by EIN
5. **Save** back to the same file
**Result:** Only one file, incrementally enriched!
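The merge-back step can be sketched in pandas roughly as follows. This is an illustration under assumptions, not the code in `manage_nonprofits.py`: the `merge_enriched` helper and the toy rows are hypothetical, and only the `ein`, `state`, and `form_990_status` column names come from the examples in this document.

```python
import pandas as pd


def merge_enriched(full: pd.DataFrame, enriched: pd.DataFrame) -> pd.DataFrame:
    """Replace rows in `full` whose EIN appears in `enriched`, then sort by EIN."""
    # Remove old data for the EINs that were just enriched
    untouched = full[~full["ein"].isin(enriched["ein"])]
    # Add the newly enriched rows
    merged = pd.concat([untouched, enriched], ignore_index=True)
    # Sort by EIN so the saved file has a stable row order
    return merged.sort_values("ein").reset_index(drop=True)


# Toy data standing in for the 1.9M-row main file
full = pd.DataFrame({
    "ein": ["631024890", "631041304", "380000001"],
    "state": ["AL", "AL", "MI"],
    "form_990_status": [None, None, "found"],
})

# Filter to a subset and pretend the enrichment step filled in a status
subset = full[full["state"] == "AL"].copy()
subset["form_990_status"] = "found"

result = merge_enriched(full, subset)
print(result)
# The real workflow would then write `result` back to the same parquet path.
```

Because rows are dropped by EIN before the enriched rows are appended, re-running enrichment on the same subset does not create duplicates.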
## **Cleanup Old Files**
### Preview Cleanup (Dry Run)
```bash
python scripts/cleanup_nonprofit_files.py
```
Output:
```
Found enriched data: nonprofits_tuscaloosa_form990.parquet
✅ Merged 921 enriched organizations
Found 9 file(s) to clean up:
  - nonprofits_tuscaloosa.parquet (0.0 MB)
  - nonprofits_tuscaloosa_form990.parquet (0.1 MB)
  - /tmp/test_*.parquet (0.1 MB)
⚠️ DRY RUN - No files will be deleted
   Run with --execute to actually delete files
```
### Execute Cleanup
```bash
python scripts/cleanup_nonprofit_files.py --execute
```
**What it does:**
- ✅ Merges any enrichment from old files into main file
- ✅ Deletes old/redundant files
- ✅ Leaves only `data/gold/nonprofits_organizations.parquet`
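The dry-run-by-default pattern can be sketched as below. This is a simplified illustration, not the contents of `cleanup_nonprofit_files.py`: the `find_and_clean` function and the `nonprofits_*.parquet` glob are assumptions, and the merge step is left out so the example stays focused on the delete guard.

```python
import tempfile
from pathlib import Path


def find_and_clean(gold_dir, keep_name, execute=False):
    """List redundant nonprofits_*.parquet files; delete them only if execute=True."""
    candidates = sorted(
        p for p in Path(gold_dir).glob("nonprofits_*.parquet")
        if p.name != keep_name  # never touch the single source of truth
    )
    for path in candidates:
        size_mb = path.stat().st_size / 1_000_000
        print(f"  - {path.name} ({size_mb:.1f} MB)")
        if execute:
            path.unlink()
    if not execute:
        print("DRY RUN - no files deleted")
    return candidates


# Demo in a throwaway directory with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    gold = Path(tmp)
    main = "nonprofits_organizations.parquet"
    for name in (main,
                 "nonprofits_tuscaloosa.parquet",
                 "nonprofits_tuscaloosa_form990.parquet"):
        (gold / name).write_bytes(b"placeholder")

    previewed = [p.name for p in find_and_clean(gold, main)]  # dry run
    after_dry_run = sorted(p.name for p in gold.iterdir())

    find_and_clean(gold, main, execute=True)                  # real deletion
    after_execute = sorted(p.name for p in gold.iterdir())
```

Making deletion opt-in means a mistyped command can only print a list, never remove data.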
## **File Organization**
### ✅ **Keep (Single Source of Truth)**
```
data/gold/nonprofits_organizations.parquet # 1.9M orgs, incrementally enriched
```
### **Remove (Old Workflow)**
```
data/gold/nonprofits_tuscaloosa.parquet          # ❌ Subset
data/gold/nonprofits_tuscaloosa_form990.parquet  # ❌ Enriched subset
data/gold/nonprofits_990_enriched.parquet        # ❌ Another version
/tmp/test_*.parquet                              # ❌ Test files
```
## **Progressive Enrichment Strategy**
Enrich the dataset in phases to stay within API rate limits:
### Phase 1: Test Sample (TODAY)
```bash
python scripts/manage_nonprofits.py enrich-990 --sample 1000
# Test with 1K random orgs
```
### Phase 2: Priority States (WEEK 1)
```bash
python scripts/manage_nonprofits.py enrich-990 --states AL MI
# Enrich Alabama + Michigan (118K orgs)
```
### Phase 3: Priority NTEE (WEEK 2)
```bash
python scripts/manage_nonprofits.py enrich-990 --ntee E P
# Health + Human Services (199K orgs)
```
### Phase 4: Remaining States (MONTH 1)
```bash
# Enrich 5-10 states per day
for state in CA TX NY FL PA; do
    python scripts/manage_nonprofits.py enrich-990 --states "$state"
    echo "Completed $state"
    sleep 3600  # Wait 1 hour between states
done
```
### Phase 5: BigQuery Layer (MONTH 2)
```bash
# Add missions + websites from BigQuery
python scripts/manage_nonprofits.py enrich-bigquery
# ... export CSV ...
python scripts/manage_nonprofits.py merge-bigquery
```
## **Advanced Usage**
### Check Enrichment Status by State
```bash
python -c "
import pandas as pd
df = pd.read_parquet('data/gold/nonprofits_organizations.parquet')
enriched = df[df['form_990_status'] == 'found']
by_state = enriched.groupby('state').size().sort_values(ascending=False).head(10)
print('Top 10 states by Form 990 coverage:')
print(by_state)
"
```
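The command above prints raw counts, so large states tend to rank first. To compare coverage as a share of each state's organizations, a sketch along these lines may help. The toy DataFrame is made up for illustration. Against the real file you would load `df` with `pd.read_parquet` as above, assuming the `state` and `form_990_status` columns used elsewhere in this document.

```python
import pandas as pd

# Toy rows standing in for the main parquet file
df = pd.DataFrame({
    "state": ["AL", "AL", "AL", "MI", "MI"],
    "form_990_status": ["found", None, "found", None, "found"],
})

# Flag enriched rows, then count enriched and total orgs per state
coverage = (
    df.assign(enriched=df["form_990_status"] == "found")
      .groupby("state")["enriched"]
      .agg(enriched="sum", total="size")
)
# Share of each state's orgs that have Form 990 data
coverage["pct"] = (100 * coverage["enriched"] / coverage["total"]).round(1)

print(coverage.sort_values("pct", ascending=False))
```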
### Export EINs Needing Enrichment
```bash
python -c "
import pandas as pd
df = pd.read_parquet('data/gold/nonprofits_organizations.parquet')
# Alabama health orgs without 990 data
needs_enrichment = df[
    (df['state'] == 'AL') &
    (df['ntee_code'].str.startswith('E', na=False)) &
    (df['form_990_status'].isna())
]['ein']
needs_enrichment.to_csv('alabama_health_needs_enrichment.txt', index=False, header=False)
print(f'Exported {len(needs_enrichment)} EINs')
"
```
## **Benefits of Unified File**
1. ✅ **Single source of truth** - No confusion about which file is current
2. ✅ **Incremental updates** - Add enrichment data without duplicating base data
3. ✅ **Smaller disk usage** - No duplicate base data across files
4. ✅ **Easier tracking** - One file to track, version, backup
5. ✅ **Simpler workflows** - No need to manage file versions
6. ✅ **Better for git** - One file to track changes in (with git-lfs)
## **Migration Checklist**
- [x] **Merge** existing enrichment data into main file
- [x] **Verify** enrichment was merged (run `stats`)
- [ ] **Clean up** old files with `--execute`
- [ ] **Update** any scripts/docs referencing old files
- [ ] **Test** new workflow with small sample
- [ ] **Document** team workflow
## **Troubleshooting**
### "Main file not found"
```bash
# Create main file from EO-BMF data
python pipeline/create_gold_tables.py --nonprofits-only
```
### "Lost my enrichment data!"
Don't worry! Run the cleanup script without `--execute` first. It merges any enrichment data from the old files into the main file before any files are deleted.
### "Want to keep a backup"
```bash
# Backup before cleanup
cp data/gold/nonprofits_organizations.parquet \
data/gold/nonprofits_organizations.$(date +%Y%m%d).parquet.bak
```
## **Related Documentation**
- [Form 990 Enrichment](website/docs/data-sources/form-990-xml.md)
- [BigQuery Integration](docs/BIGQUERY_ENRICHMENT.md)
- [Charity Navigator](website/docs/data-sources/charity-navigator.md)