Spaces:
Running on CPU Upgrade
Running on CPU Upgrade
| sidebar_position: 4 | |
| # IRS Bulk Data Integration | |
| Access **ALL 1.9M+ U.S. nonprofits** using the IRS Exempt Organizations Business Master File (EO-BMF). | |
| ## π― Why Use IRS Bulk Data? | |
| | Feature | ProPublica API | **IRS EO-BMF** | | |
| |---------|----------------|----------------| | |
| | **Coverage** | 25 results per request | **1,952,238 total** | | |
| | **Alabama nonprofits** | 25 | **26,148** | | |
| | **Pagination** | β Not available | β Complete dataset | | |
| | **Speed** | Slow (25 at a time) | β Fast (bulk download) | | |
| | **Cost** | Free | Free | | |
| | **Update frequency** | Real-time | Monthly | | |
| | **Data source** | IRS Form 990 | IRS official registry | | |
| **Result: IRS gives you 1,000x more data!** π | |
| --- | |
| ## π Data Source | |
| **IRS Exempt Organizations Business Master File (EO-BMF)** | |
| - **URL**: https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf | |
| - **Format**: CSV (pipe-delimited available too) | |
| - **Record Count**: 1,952,238 organizations (as of April 2026) | |
| - **Update Frequency**: Monthly | |
| - **License**: Public domain (U.S. government data) | |
| ### Regional Files | |
| The IRS provides 4 regional CSV files for faster download: | |
| 1. **Region 1 (Northeast)**: CT, ME, MA, NH, NJ, NY, RI, VT β 277,214 orgs | |
| 2. **Region 2 (Mid-Atlantic & Great Lakes)**: DE, DC, IL, IN, IA, KY, MD, MI, MN, NE, NC, ND, OH, PA, SC, SD, VA, WV, WI β 717,691 orgs | |
| 3. **Region 3 (Gulf Coast & Pacific)**: AL, AK, AZ, AR, CA, CO, FL, GA, HI, ID, KS, LA, MS, MO, MT, NV, NM, OK, OR, TX, TN, UT, WA, WY β 952,412 orgs | |
| 4. **Region 4 (All other)**: International, Puerto Rico β 4,921 orgs | |
| --- | |
| ## π Quick Start | |
| ### Download All 1.9M+ Nonprofits | |
| ```bash | |
| # Download ALL U.S. nonprofits (4 regional files) | |
| python scripts/create_all_gold_tables.py \ | |
| --nonprofits-only \ | |
| --use-irs \ | |
| --download-all-irs | |
| # Creates 4 gold tables: | |
| # - nonprofits_organizations.parquet (1.9M+ records) | |
| # - nonprofits_financials.parquet | |
| # - nonprofits_programs.parquet | |
| # - nonprofits_locations.parquet | |
| ``` | |
| **Download time**: ~30 seconds (first time), then instant (cached) | |
| --- | |
| ### Download Specific States | |
| ```bash | |
| # Download Alabama nonprofits only | |
| python scripts/create_all_gold_tables.py \ | |
| --nonprofits-only \ | |
| --states AL \ | |
| --use-irs | |
| # Result: 26,148 Alabama nonprofits | |
| ``` | |
| ```bash | |
| # Download multiple states | |
| python scripts/create_all_gold_tables.py \ | |
| --nonprofits-only \ | |
| --states AL GA FL MS TN \ | |
| --use-irs | |
| # Result: ~100,000+ nonprofits from 5 states | |
| ``` | |
| --- | |
| ### Filter by NTEE Code | |
| ```bash | |
| # Get only health organizations (NTEE E) from Alabama | |
| python scripts/create_all_gold_tables.py \ | |
| --nonprofits-only \ | |
| --states AL \ | |
| --ntee-codes E \ | |
| --use-irs | |
| # Result: 509 health nonprofits in Alabama | |
| ``` | |
| ```bash | |
| # Get health + human services from all states | |
| python scripts/create_all_gold_tables.py \ | |
| --nonprofits-only \ | |
| --ntee-codes E P \ | |
| --use-irs \ | |
| --download-all-irs | |
| # Result: ~400,000+ health & human service orgs nationwide | |
| ``` | |
| --- | |
| ## π» Python API Usage | |
| ### Example 1: Download All Regions | |
| ```python | |
| from discovery.irs_bmf_ingestion import IRSBMFIngestion | |
| irs = IRSBMFIngestion() | |
| # Download all 1.9M+ nonprofits (4 regional files) | |
| df = irs.download_all_regions() | |
| print(f"Downloaded {len(df):,} nonprofits") | |
| # Output: Downloaded 1,952,238 nonprofits | |
| # Data is automatically cached to: data/cache/irs_bmf/all_regions_combined.parquet | |
| # Future runs will load from cache (instant!) | |
| ``` | |
| ### Example 2: Download Specific State | |
| ```python | |
| from discovery.irs_bmf_ingestion import IRSBMFIngestion | |
| irs = IRSBMFIngestion() | |
| # Download Alabama | |
| df_alabama = irs.download_state_file("AL") | |
| print(f"Alabama: {len(df_alabama):,} nonprofits") | |
| # Output: Alabama: 26,148 nonprofits | |
| # Download California | |
| df_california = irs.download_state_file("CA") | |
| print(f"California: {len(df_california):,} nonprofits") | |
| # Output: California: ~200,000 nonprofits | |
| ``` | |
| ### Example 3: Filter by NTEE Code | |
| ```python | |
| from discovery.irs_bmf_ingestion import IRSBMFIngestion | |
| irs = IRSBMFIngestion() | |
| # Download all regions | |
| df_all = irs.download_all_regions() | |
| # Filter to health organizations (NTEE E) | |
| df_health = irs.filter_by_ntee(df_all, ["E"]) | |
| print(f"Health organizations: {len(df_health):,}") | |
| # Output: Health organizations: ~80,000 | |
| # Filter to multiple NTEE codes | |
| df_community = irs.filter_by_ntee(df_all, ["E", "P", "K", "L", "S", "W"]) | |
| print(f"Community service orgs: {len(df_community):,}") | |
| # Output: Community service orgs: ~600,000 | |
| ``` | |
| ### Example 4: Combine State + NTEE Filtering | |
| ```python | |
| from discovery.irs_bmf_ingestion import IRSBMFIngestion | |
| irs = IRSBMFIngestion() | |
| # Download Alabama | |
| df = irs.download_state_file("AL") | |
| # Filter to health orgs | |
| health = irs.filter_by_ntee(df, ["E"]) | |
| # Convert to ProPublica format | |
| standardized = irs.standardize_to_propublica_format(health) | |
| # Save to gold table | |
| standardized.to_parquet("data/gold/alabama_health_nonprofits.parquet") | |
| ``` | |
| --- | |
| ## π Data Schema | |
| ### IRS EO-BMF Columns | |
| The IRS provides 28 columns per organization: | |
| | Column | Description | Example | | |
| |--------|-------------|---------| | |
| | `ein` | Employer Identification Number | `630123456` | | |
| | `name` | Organization name | `Good Samaritan Health Clinic` | | |
| | `street` | Street address | `123 Main St` | | |
| | `city` | City | `Birmingham` | | |
| | `state` | 2-letter state code | `AL` | | |
| | `zip` | ZIP code | `35203` | | |
| | `ntee_cd` | NTEE classification code | `E30` (Ambulatory Health) | | |
| | `subsection` | 501(c) subsection | `03` = 501(c)(3) | | |
| | `asset_amt` | Asset amount | `4467751` | | |
| | `income_amt` | Income amount | `2500000` | | |
| | `revenue_amt` | Revenue amount (Form 990) | `2500000` | | |
| | `ruling` | Month/year of ruling letter | `200501` (Jan 2005) | | |
| | `deductibility` | Deductibility status code | `1` = Deductible | | |
| | `foundation` | Foundation code | `15` = Public charity | | |
| | `activity` | Activity codes | `000` | | |
| | `organization` | Organization code | `1` = Corporation | | |
| | `status` | Exempt org status code | `1` = Unconditional | | |
| | ... | 13 more columns | ... | | |
| **Full data dictionary**: https://www.irs.gov/pub/foia/ig/tege/eo-info.pdf | |
| --- | |
| ## π Integration with Existing Pipeline | |
| The IRS ingestion module integrates seamlessly with our existing ProPublica-based pipeline: | |
| ```python | |
| from pipeline.create_nonprofits_gold_tables import NonprofitGoldTableCreator | |
| # Create pipeline with IRS support | |
| creator = NonprofitGoldTableCreator() | |
| # Option 1: Use IRS for specific states | |
| creator.create_all_gold_tables( | |
| states=["AL", "GA", "FL"], | |
| use_irs_data=True # β Use IRS instead of ProPublica | |
| ) | |
| # Option 2: Download ALL nonprofits | |
| creator.create_all_gold_tables( | |
| use_irs_data=True, | |
| download_all_irs=True # β Get all 1.9M+ orgs | |
| ) | |
| # Option 3: Filter by NTEE codes | |
| creator.create_all_gold_tables( | |
| states=["AL"], | |
| ntee_codes=["E", "P"], # Health + Human Services | |
| use_irs_data=True | |
| ) | |
| ``` | |
| ### Standardization | |
| IRS data is automatically converted to ProPublica-compatible format: | |
| ```python | |
| # IRS columns β ProPublica schema | |
| { | |
| 'ein': df.get('ein'), | |
| 'name': df.get('name'), | |
| 'city': df.get('city'), | |
| 'state': df.get('state'), | |
| 'ntee_code': df.get('ntee_cd'), | |
| 'asset_amount': df.get('asset_amt'), | |
| 'income_amount': df.get('income_amt'), | |
| 'street_address': df.get('street'), | |
| 'zip_code': df.get('zip'), | |
| 'data_source': 'IRS_EO_BMF' # Track source | |
| } | |
| ``` | |
| This allows you to: | |
| - β Mix IRS + ProPublica data | |
| - β Use same gold table schema | |
| - β Switch between sources without changing downstream code | |
| --- | |
| ## π NTEE Codes Reference | |
| Common NTEE codes for community services: | |
| | Code | Category | Example Organizations | | |
| |------|----------|----------------------| | |
| | **E** | Health | Hospitals, clinics, mental health | | |
| | **E30** | Ambulatory Health Center | Community health centers | | |
| | **E32** | School-Based Health Care | School clinics | | |
| | **E60** | Health Support Services | Medical equipment, patient support | | |
| | **E70** | Public Health Program | Disease prevention, health education | | |
| | **P** | Human Services | Food banks, shelters, counseling | | |
| | **P20** | Human Service Organizations | Multi-service agencies | | |
| | **K** | Food, Agriculture | Food pantries, nutrition programs | | |
| | **L** | Housing, Shelter | Homeless shelters, affordable housing | | |
| | **S** | Community Improvement | Community development, civic groups | | |
| | **W** | Public Affairs | Advocacy, civil rights, voting | | |
| **Full NTEE taxonomy**: https://nccs.urban.org/project/national-taxonomy-exempt-entities-ntee-codes | |
| --- | |
| ## π Performance Benchmarks | |
| Tested on standard cloud VM (4 vCPU, 16 GB RAM): | |
| | Operation | Time | Records | File Size | | |
| |-----------|------|---------|-----------| | |
| | Download Region 1 | ~4 sec | 277,214 | 25 MB | | |
| | Download Region 2 | ~3 sec | 717,691 | 60 MB | | |
| | Download Region 3 | ~5 sec | 952,412 | 80 MB | | |
| | Download Region 4 | ~1 sec | 4,921 | 1 MB | | |
| | **Download ALL 4 regions** | **~30 sec** | **1,952,238** | **170 MB** | | |
| | Load from cache (parquet) | ~1 sec | 1,952,238 | 120 MB | | |
| | Filter by NTEE (health) | ~2 sec | ~80,000 | 6 MB | | |
| | Create 4 gold tables (AL) | ~6 sec | 26,148 | 4 MB | | |
| | Create 4 gold tables (ALL) | ~5 min | 1,952,238 | 250 MB | | |
| **Recommendation**: Always download all regions on first run, then filter locally. Much faster than downloading individual states! | |
| --- | |
| ## π When to Use IRS vs ProPublica | |
| ### Use IRS EO-BMF When: | |
| β You need comprehensive coverage (all nonprofits in a state) | |
| β You're doing bulk analysis (e.g., "all health orgs in Southeast") | |
| β You need offline access to data | |
| β You want faster performance (bulk downloads) | |
| β You're building a complete nonprofit directory | |
| ### Use ProPublica API When: | |
| β You need real-time updates (IRS is monthly) | |
| β You want detailed Form 990 financial breakdowns | |
| β You need executive compensation data | |
| β You want mission statements (IRS doesn't have these) | |
| β You're searching for a specific organization by name | |
| ### Best Practice: Use Both! | |
| 1. **IRS** for bulk discovery and coverage | |
| 2. **ProPublica** for enrichment with detailed financials | |
| ```python | |
| # 1. Download all Alabama orgs from IRS | |
| irs = IRSBMFIngestion() | |
| df_all = irs.download_state_file("AL") # 26,148 orgs | |
| # 2. Enrich top 100 with ProPublica details | |
| propublica = NonprofitDiscovery() | |
| for ein in df_all.head(100)['ein']: | |
| details = propublica.get_propublica_org_details(ein) | |
| # details contains mission, programs, detailed financials | |
| ``` | |
| --- | |
| ## π§ Troubleshooting | |
| ### Download Fails with Timeout Error | |
| ```python | |
| # Increase timeout | |
| irs = IRSBMFIngestion() | |
| # Edit download_regional_file() timeout parameter (default: 300 seconds) | |
| ``` | |
| ### Out of Memory Error | |
| ```python | |
| # Process states individually instead of all regions | |
| for state in ["AL", "GA", "FL"]: | |
| df = irs.download_state_file(state) | |
| # Process each state separately | |
| ``` | |
| ### Need Fresh Data | |
| ```python | |
| # Force refresh (bypass cache) | |
| df = irs.download_all_regions(force_refresh=True) | |
| ``` | |
| --- | |
| ## π Related Documentation | |
| - [ProPublica API](./citations.md#propublica-nonprofit-explorer) β Alternative API-based source | |
| - [Nonprofit Discovery Module](../../discovery/README_NONPROFIT_DISCOVERY.md) β ProPublica integration | |
| - [Gold Table Pipeline](../guides/gold-table-pipeline.md) β How data flows to gold tables | |
| - [NTEE Codes Reference](https://nccs.urban.org/project/national-taxonomy-exempt-entities-ntee-codes) β Understanding nonprofit classifications | |
| --- | |
| ## π― Citation | |
| When using IRS EO-BMF data in publications: | |
| ```bibtex | |
| @misc{irs_eobmf_2026, | |
| title = {Exempt Organizations Business Master File Extract (EO-BMF)}, | |
| author = {{Internal Revenue Service}}, | |
| year = {2026}, | |
| month = {April}, | |
| url = {https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf}, | |
| note = {Accessed: 2026-04-27. Record count: 1,952,238 organizations.} | |
| } | |
| ``` | |
| --- | |
| ## β¨ Key Takeaways | |
| π― **IRS EO-BMF provides ALL 1.9M+ U.S. nonprofits** | |
| β‘ **1,000x more data than ProPublica API per request** | |
| πΎ **Downloads in ~30 seconds, cached for instant future access** | |
| π **Seamlessly integrates with existing pipeline** | |
| π **Updated monthly by the IRS** | |
| π **Completely free, public domain data** | |
| **Start using it today!** | |
| ```bash | |
| python scripts/create_all_gold_tables.py --nonprofits-only --use-irs --download-all-irs | |
| ``` | |