Unified Nonprofit Data Management

Single Source of Truth: data/gold/nonprofits_organizations.parquet

🎯 New Workflow

✅ DO THIS (Single Unified File)

# Check stats
python scripts/manage_nonprofits.py stats

# Enrich a subset (updates main file in place)
python scripts/manage_nonprofits.py enrich-990 --states AL --sample 100

# Enrich specific orgs
python scripts/manage_nonprofits.py enrich-990 --ein-list eins.txt

# Enrich all Alabama + Michigan health nonprofits
python scripts/manage_nonprofits.py enrich-990 --states AL MI --ntee E

# BigQuery enrichment
python scripts/manage_nonprofits.py enrich-bigquery --states AL
# ... run SQL in BigQuery web UI, export CSV ...
python scripts/manage_nonprofits.py merge-bigquery

❌ DON'T DO THIS (Creates Separate Files)

# OLD WAY - creates a proliferation of files
python scripts/enrich_nonprofits_gt990.py \
    --input data/gold/nonprofits_tuscaloosa.parquet \
    --output data/gold/nonprofits_tuscaloosa_form990.parquet  # ❌ Extra file!

📊 Key Commands

Show Statistics

python scripts/manage_nonprofits.py stats

Output:

📊 TOTAL: 1,952,238 organizations
💰 ENRICHMENT STATUS:
   Form 990 data: 307 (0.0%)
   BigQuery data: 0 (0.0%)
📝 MISSION STATEMENTS:
   At least one source: 299 (0.0%)
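The percentages in this sample output are rounded to one decimal place, which is why 307 enriched organizations out of 1,952,238 print as 0.0%. A quick check of the arithmetic, as a sketch (the `pct` helper is hypothetical and not taken from `manage_nonprofits.py`):

```python
# Hypothetical helper: format a count as a percentage of the total.
def pct(count: int, total: int, decimals: int = 1) -> str:
    if total == 0:
        return "n/a"
    return f"{count / total:.{decimals}%}"

total = 1_952_238
print(pct(307, total))     # one decimal place, as in the stats output
print(pct(307, total, 3))  # more precision shows the actual coverage
```

With three decimals the coverage is visible as a small non-zero value, so a 0.0% line does not mean the enrichment was lost.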

Enrich Incrementally

By State:

python scripts/manage_nonprofits.py enrich-990 --states AL
# Updates main file with 62K Alabama nonprofits enriched

By NTEE Category:

python scripts/manage_nonprofits.py enrich-990 --ntee E
# Updates main file with 45K health nonprofits enriched

Combined Filters:

python scripts/manage_nonprofits.py enrich-990 --states AL MI --ntee E
# Updates main file with Alabama + Michigan health orgs
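Combining `--states` and `--ntee` narrows the selection: an organization must be in one of the listed states and have an NTEE code starting with one of the listed letters. A minimal pandas sketch of that selection, assuming the `state` and `ntee_code` columns used elsewhere in this document (the `filter_orgs` function and the sample rows are hypothetical, not the script's actual code):

```python
import pandas as pd

def filter_orgs(df, states=None, ntee_prefixes=None):
    """Keep rows matching any listed state AND any listed NTEE prefix."""
    mask = pd.Series(True, index=df.index)
    if states:
        mask &= df["state"].isin(states)
    if ntee_prefixes:
        ntee_mask = pd.Series(False, index=df.index)
        for prefix in ntee_prefixes:
            # na=False: organizations without an NTEE code never match
            ntee_mask |= df["ntee_code"].str.startswith(prefix, na=False)
        mask &= ntee_mask
    return df[mask]

# Made-up rows for illustration only
orgs = pd.DataFrame({
    "ein": ["630000001", "380000002", "630000003", "940000004"],
    "state": ["AL", "MI", "AL", "CA"],
    "ntee_code": ["E20", "E32", "P20", None],
})
subset = filter_orgs(orgs, states=["AL", "MI"], ntee_prefixes=["E"])
print(subset["ein"].tolist())
```

Only the two health organizations in Alabama and Michigan remain; the Alabama human-services row and the California row without an NTEE code are excluded.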

Test on Sample:

python scripts/manage_nonprofits.py enrich-990 --sample 100
# Updates main file with 100 random orgs enriched (for testing)

Specific EINs:

# Create file with EINs (one per line)
echo "631024890" > eins.txt
echo "631041304" >> eins.txt

python scripts/manage_nonprofits.py enrich-990 --ein-list eins.txt
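Before passing a long EIN list to the script, it can help to validate it. The sketch below checks that every line is a 9-digit EIN and drops duplicates. It is a hypothetical pre-check, not the script's own parser; tolerating hyphens and `#` comment lines is an assumption made here for convenience.

```python
import re

def parse_ein_list(text: str) -> list[str]:
    """Return unique 9-digit EINs, one per input line, in first-seen order."""
    eins: list[str] = []
    for line in text.splitlines():
        raw = line.strip()
        if not raw or raw.startswith("#"):
            continue  # skip blank lines and comments (assumed convention)
        digits = raw.replace("-", "")  # accept "63-1024890" style input
        if not re.fullmatch(r"\d{9}", digits):
            raise ValueError(f"Not a 9-digit EIN: {raw!r}")
        if digits not in eins:
            eins.append(digits)
    return eins

print(parse_ein_list("631024890\n631041304\n"))
```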

🔄 How In-Place Updates Work

  1. Load full dataset (1.9M orgs)
  2. Filter to subset (e.g., Alabama = 62K orgs)
  3. Enrich only the filtered subset
  4. Merge back:
    • Remove old data for those EINs
    • Add newly enriched data
    • Sort by EIN
  5. Save back to same file

Result: Only one file, incrementally enriched!
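The merge-back in step 4 can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the script's actual implementation: `merge_enriched` is a hypothetical name, the rows are made up, and the enrichment itself is replaced by a placeholder assignment.

```python
import pandas as pd

def merge_enriched(full: pd.DataFrame, enriched: pd.DataFrame) -> pd.DataFrame:
    """Replace rows in `full` whose EIN appears in `enriched`, then sort by EIN."""
    # Step 4a: remove old data for the EINs that were just enriched
    untouched = full[~full["ein"].isin(enriched["ein"])]
    # Step 4b: add the newly enriched rows
    merged = pd.concat([untouched, enriched], ignore_index=True)
    # Step 4c: sort by EIN so the row order is stable between runs
    return merged.sort_values("ein").reset_index(drop=True)

# Tiny stand-in for the 1.9M-row dataset (EINs are made up)
full = pd.DataFrame({
    "ein": ["100000001", "100000002", "100000003"],
    "state": ["AL", "MI", "AL"],
    "form_990_status": [None, None, None],
})

# Steps 2-3: filter to a subset and enrich only that subset
subset = full[full["state"] == "AL"].copy()
subset["form_990_status"] = "found"  # placeholder for the real enrichment

result = merge_enriched(full, subset)
print(result)
# Step 5 would write `result` back to the same path, for example:
# result.to_parquet("data/gold/nonprofits_organizations.parquet", index=False)
```

Because rows are removed by EIN before the enriched rows are appended, re-running enrichment on the same subset replaces those rows instead of duplicating them.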

🗑️ Cleanup Old Files

Preview Cleanup (Dry Run)

python scripts/cleanup_nonprofit_files.py

Output:

🔄 Found enriched data: nonprofits_tuscaloosa_form990.parquet
   ✅ Merged 921 enriched organizations

🗑️  Found 9 file(s) to clean up:
   - nonprofits_tuscaloosa.parquet (0.0 MB)
   - nonprofits_tuscaloosa_form990.parquet (0.1 MB)
   - /tmp/test_*.parquet (0.1 MB)
   
⚠️  DRY RUN - No files will be deleted
   Run with --execute to actually delete files

Execute Cleanup

python scripts/cleanup_nonprofit_files.py --execute

What it does:

  • ✅ Merges any enrichment from old files into main file
  • ✅ Deletes old/redundant files
  • ✅ Leaves only data/gold/nonprofits_organizations.parquet
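The dry-run versus `--execute` distinction can be illustrated with a short sketch. It covers only the listing and deletion steps and omits the merge of old enrichment data that the real script performs first. The `cleanup` function and the `nonprofits_*.parquet` glob pattern are assumptions for illustration, not the contents of `cleanup_nonprofit_files.py`.

```python
import tempfile
from pathlib import Path

MAIN_FILE = "nonprofits_organizations.parquet"  # the single source of truth

def cleanup(gold_dir: Path, execute: bool = False) -> list[Path]:
    """Report (and optionally delete) nonprofits_*.parquet files other than the main file."""
    # Assumption: redundant files are identified by this glob pattern
    targets = sorted(
        p for p in gold_dir.glob("nonprofits_*.parquet") if p.name != MAIN_FILE
    )
    print(f"Found {len(targets)} file(s) to clean up:")
    for path in targets:
        print(f"   - {path.name} ({path.stat().st_size / 1_000_000:.1f} MB)")
        if execute:
            path.unlink()
    if not execute:
        print("DRY RUN - no files deleted")
    return targets

# Demo against a throwaway directory so no real data is touched
with tempfile.TemporaryDirectory() as tmp:
    gold = Path(tmp)
    for name in (MAIN_FILE, "nonprofits_tuscaloosa.parquet",
                 "nonprofits_tuscaloosa_form990.parquet"):
        (gold / name).write_bytes(b"")
    cleanup(gold)                # dry run: lists two files, deletes nothing
    cleanup(gold, execute=True)  # deletes the two redundant files
    remaining = sorted(p.name for p in gold.iterdir())
    print(remaining)
```

Defaulting to a dry run means a mistyped command cannot delete anything; deletion requires an explicit opt-in.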

📁 File Organization

✅ Keep (Single Source of Truth)

data/gold/nonprofits_organizations.parquet  # 1.9M orgs, incrementally enriched

🗑️ Remove (Old Workflow)

data/gold/nonprofits_tuscaloosa.parquet              # ❌ Subset
data/gold/nonprofits_tuscaloosa_form990.parquet      # ❌ Enriched subset  
data/gold/nonprofits_990_enriched.parquet            # ❌ Another version
/tmp/test_*.parquet                                   # ❌ Test files

🎨 Progressive Enrichment Strategy

Enrich the dataset in stages to stay within API rate limits:

Phase 1: Test Sample (TODAY)

python scripts/manage_nonprofits.py enrich-990 --sample 1000
# Test with 1K random orgs

Phase 2: Priority States (WEEK 1)

python scripts/manage_nonprofits.py enrich-990 --states AL MI
# Enrich Alabama + Michigan (118K orgs)

Phase 3: Priority NTEE (WEEK 2)

python scripts/manage_nonprofits.py enrich-990 --ntee E P
# Health + Human Services (199K orgs)

Phase 4: Remaining States (MONTH 1)

# Enrich 5-10 states per day
for state in CA TX NY FL PA; do
    # Report completion only if the enrichment command succeeded
    python scripts/manage_nonprofits.py enrich-990 --states "$state" \
        && echo "Completed $state"
    sleep 3600  # Wait 1 hour between states
done

Phase 5: BigQuery Layer (MONTH 2)

# Add missions + websites from BigQuery
python scripts/manage_nonprofits.py enrich-bigquery
# ... export CSV ...
python scripts/manage_nonprofits.py merge-bigquery

🔍 Advanced Usage

Check Enrichment Status by State

python -c "
import pandas as pd
df = pd.read_parquet('data/gold/nonprofits_organizations.parquet')
enriched = df[df['form_990_status'] == 'found']
by_state = enriched.groupby('state').size().sort_values(ascending=False).head(10)
print('Top 10 states by Form 990 coverage:')
print(by_state)
"

Export EINs Needing Enrichment

python -c "
import pandas as pd
df = pd.read_parquet('data/gold/nonprofits_organizations.parquet')
# Alabama health orgs without 990 data
needs_enrichment = df[
    (df['state'] == 'AL') & 
    (df['ntee_code'].str.startswith('E', na=False)) &
    (df['form_990_status'].isna())
]['ein']
needs_enrichment.to_csv('alabama_health_needs_enrichment.txt', index=False, header=False)
print(f'Exported {len(needs_enrichment)} EINs')
"

💡 Benefits of Unified File

  1. ✅ Single source of truth - No confusion about which file is current
  2. ✅ Incremental updates - Add enrichment data without duplicating base data
  3. ✅ Smaller disk usage - No duplicate base data across files
  4. ✅ Easier tracking - One file to track, version, backup
  5. ✅ Simpler workflows - No need to manage file versions
  6. ✅ Better for git - One file to track changes in (with git-lfs)

📝 Migration Checklist

  • Merge existing enrichment data into main file
  • Verify enrichment was merged (run stats)
  • Clean up old files with --execute
  • Update any scripts/docs referencing old files
  • Test new workflow with small sample
  • Document team workflow

🆘 Troubleshooting

"Main file not found"

# Create main file from EO-BMF data
python pipeline/create_gold_tables.py --nonprofits-only

"Lost my enrichment data!"

Don't worry! Run the cleanup script without --execute first. It merges any enrichment data from old files into the main file before any files are deleted.

"Want to keep a backup"

# Backup before cleanup
cp data/gold/nonprofits_organizations.parquet \
   data/gold/nonprofits_organizations.$(date +%Y%m%d).parquet.bak

📚 Related Documentation