open-navigator / website /docs /data-sources /irs-bulk-data.md
jcbowyer's picture
Clean HuggingFace deployment without binary files
61d29fc
metadata
sidebar_position: 4

IRS Bulk Data Integration

Access ALL 1.9M+ U.S. nonprofits using the IRS Exempt Organizations Business Master File (EO-BMF).

🎯 Why Use IRS Bulk Data?

Feature ProPublica API IRS EO-BMF
Coverage 25 results per request 1,952,238 total
Alabama nonprofits 25 26,148
Pagination ❌ Not available βœ… Complete dataset
Speed Slow (25 at a time) βœ… Fast (bulk download)
Cost Free Free
Update frequency Real-time Monthly
Data source IRS Form 990 IRS official registry

Result: IRS gives you 1,000x more data! πŸš€


πŸ“Š Data Source

IRS Exempt Organizations Business Master File (EO-BMF)

Regional Files

The IRS provides 4 regional CSV files for faster download:

  1. Region 1 (Northeast): CT, ME, MA, NH, NJ, NY, RI, VT β€” 277,214 orgs
  2. Region 2 (Mid-Atlantic & Great Lakes): DE, DC, IL, IN, IA, KY, MD, MI, MN, NE, NC, ND, OH, PA, SC, SD, VA, WV, WI β€” 717,691 orgs
  3. Region 3 (Gulf Coast & Pacific): AL, AK, AZ, AR, CA, CO, FL, GA, HI, ID, KS, LA, MS, MO, MT, NV, NM, OK, OR, TX, TN, UT, WA, WY β€” 952,412 orgs
  4. Region 4 (All other): International, Puerto Rico β€” 4,921 orgs

πŸš€ Quick Start

Download All 1.9M+ Nonprofits

# Download ALL U.S. nonprofits (4 regional files)
python scripts/create_all_gold_tables.py \
  --nonprofits-only \
  --use-irs \
  --download-all-irs

# Creates 4 gold tables:
# - nonprofits_organizations.parquet (1.9M+ records)
# - nonprofits_financials.parquet
# - nonprofits_programs.parquet
# - nonprofits_locations.parquet

Download time: ~30 seconds (first time), then instant (cached)


Download Specific States

# Download Alabama nonprofits only
python scripts/create_all_gold_tables.py \
  --nonprofits-only \
  --states AL \
  --use-irs

# Result: 26,148 Alabama nonprofits
# Download multiple states
python scripts/create_all_gold_tables.py \
  --nonprofits-only \
  --states AL GA FL MS TN \
  --use-irs

# Result: ~100,000+ nonprofits from 5 states

Filter by NTEE Code

# Get only health organizations (NTEE E) from Alabama
python scripts/create_all_gold_tables.py \
  --nonprofits-only \
  --states AL \
  --ntee-codes E \
  --use-irs

# Result: 509 health nonprofits in Alabama
# Get health + human services from all states
python scripts/create_all_gold_tables.py \
  --nonprofits-only \
  --ntee-codes E P \
  --use-irs \
  --download-all-irs

# Result: ~400,000+ health & human service orgs nationwide

πŸ’» Python API Usage

Example 1: Download All Regions

from discovery.irs_bmf_ingestion import IRSBMFIngestion

irs = IRSBMFIngestion()

# Download all 1.9M+ nonprofits (4 regional files)
df = irs.download_all_regions()

print(f"Downloaded {len(df):,} nonprofits")
# Output: Downloaded 1,952,238 nonprofits

# Data is automatically cached to: data/cache/irs_bmf/all_regions_combined.parquet
# Future runs will load from cache (instant!)

Example 2: Download Specific State

from discovery.irs_bmf_ingestion import IRSBMFIngestion

irs = IRSBMFIngestion()

# Download Alabama
df_alabama = irs.download_state_file("AL")
print(f"Alabama: {len(df_alabama):,} nonprofits")
# Output: Alabama: 26,148 nonprofits

# Download California
df_california = irs.download_state_file("CA")
print(f"California: {len(df_california):,} nonprofits")
# Output: California: ~200,000 nonprofits

Example 3: Filter by NTEE Code

from discovery.irs_bmf_ingestion import IRSBMFIngestion

irs = IRSBMFIngestion()

# Download all regions
df_all = irs.download_all_regions()

# Filter to health organizations (NTEE E)
df_health = irs.filter_by_ntee(df_all, ["E"])
print(f"Health organizations: {len(df_health):,}")
# Output: Health organizations: ~80,000

# Filter to multiple NTEE codes
df_community = irs.filter_by_ntee(df_all, ["E", "P", "K", "L", "S", "W"])
print(f"Community service orgs: {len(df_community):,}")
# Output: Community service orgs: ~600,000

Example 4: Combine State + NTEE Filtering

from discovery.irs_bmf_ingestion import IRSBMFIngestion

irs = IRSBMFIngestion()

# Download Alabama
df = irs.download_state_file("AL")

# Filter to health orgs
health = irs.filter_by_ntee(df, ["E"])

# Convert to ProPublica format
standardized = irs.standardize_to_propublica_format(health)

# Save to gold table
standardized.to_parquet("data/gold/alabama_health_nonprofits.parquet")

πŸ“‹ Data Schema

IRS EO-BMF Columns

The IRS provides 28 columns per organization:

Column Description Example
ein Employer Identification Number 630123456
name Organization name Good Samaritan Health Clinic
street Street address 123 Main St
city City Birmingham
state 2-letter state code AL
zip ZIP code 35203
ntee_cd NTEE classification code E30 (Ambulatory Health)
subsection 501(c) subsection 03 = 501(c)(3)
asset_amt Asset amount 4467751
income_amt Income amount 2500000
revenue_amt Revenue amount (Form 990) 2500000
ruling Month/year of ruling letter 200501 (Jan 2005)
deductibility Deductibility status code 1 = Deductible
foundation Foundation code 15 = Public charity
activity Activity codes 000
organization Organization code 1 = Corporation
status Exempt org status code 1 = Unconditional
... 13 more columns ...

Full data dictionary: https://www.irs.gov/pub/foia/ig/tege/eo-info.pdf


πŸ”— Integration with Existing Pipeline

The IRS ingestion module integrates seamlessly with our existing ProPublica-based pipeline:

from pipeline.create_nonprofits_gold_tables import NonprofitGoldTableCreator

# Create pipeline with IRS support
creator = NonprofitGoldTableCreator()

# Option 1: Use IRS for specific states
creator.create_all_gold_tables(
    states=["AL", "GA", "FL"],
    use_irs_data=True  # ← Use IRS instead of ProPublica
)

# Option 2: Download ALL nonprofits
creator.create_all_gold_tables(
    use_irs_data=True,
    download_all_irs=True  # ← Get all 1.9M+ orgs
)

# Option 3: Filter by NTEE codes
creator.create_all_gold_tables(
    states=["AL"],
    ntee_codes=["E", "P"],  # Health + Human Services
    use_irs_data=True
)

Standardization

IRS data is automatically converted to ProPublica-compatible format:

# IRS columns β†’ ProPublica schema
{
    'ein': df.get('ein'),
    'name': df.get('name'),
    'city': df.get('city'),
    'state': df.get('state'),
    'ntee_code': df.get('ntee_cd'),
    'asset_amount': df.get('asset_amt'),
    'income_amount': df.get('income_amt'),
    'street_address': df.get('street'),
    'zip_code': df.get('zip'),
    'data_source': 'IRS_EO_BMF'  # Track source
}

This allows you to:

  • βœ… Mix IRS + ProPublica data
  • βœ… Use same gold table schema
  • βœ… Switch between sources without changing downstream code

πŸŽ“ NTEE Codes Reference

Common NTEE codes for community services:

Code Category Example Organizations
E Health Hospitals, clinics, mental health
E30 Ambulatory Health Center Community health centers
E32 School-Based Health Care School clinics
E60 Health Support Services Medical equipment, patient support
E70 Public Health Program Disease prevention, health education
P Human Services Food banks, shelters, counseling
P20 Human Service Organizations Multi-service agencies
K Food, Agriculture Food pantries, nutrition programs
L Housing, Shelter Homeless shelters, affordable housing
S Community Improvement Community development, civic groups
W Public Affairs Advocacy, civil rights, voting

Full NTEE taxonomy: https://nccs.urban.org/project/national-taxonomy-exempt-entities-ntee-codes


πŸ“ˆ Performance Benchmarks

Tested on standard cloud VM (4 vCPU, 16 GB RAM):

Operation Time Records File Size
Download Region 1 ~4 sec 277,214 25 MB
Download Region 2 ~3 sec 717,691 60 MB
Download Region 3 ~5 sec 952,412 80 MB
Download Region 4 ~1 sec 4,921 1 MB
Download ALL 4 regions ~30 sec 1,952,238 170 MB
Load from cache (parquet) ~1 sec 1,952,238 120 MB
Filter by NTEE (health) ~2 sec ~80,000 6 MB
Create 4 gold tables (AL) ~6 sec 26,148 4 MB
Create 4 gold tables (ALL) ~5 min 1,952,238 250 MB

Recommendation: Always download all regions on first run, then filter locally. Much faster than downloading individual states!


πŸ†š When to Use IRS vs ProPublica

Use IRS EO-BMF When:

βœ… You need comprehensive coverage (all nonprofits in a state)
βœ… You're doing bulk analysis (e.g., "all health orgs in Southeast")
βœ… You need offline access to data
βœ… You want faster performance (bulk downloads)
βœ… You're building a complete nonprofit directory

Use ProPublica API When:

βœ… You need real-time updates (IRS is monthly)
βœ… You want detailed Form 990 financial breakdowns
βœ… You need executive compensation data
βœ… You want mission statements (IRS doesn't have these)
βœ… You're searching for a specific organization by name

Best Practice: Use Both!

  1. IRS for bulk discovery and coverage
  2. ProPublica for enrichment with detailed financials
# 1. Download all Alabama orgs from IRS
irs = IRSBMFIngestion()
df_all = irs.download_state_file("AL")  # 26,148 orgs

# 2. Enrich top 100 with ProPublica details
propublica = NonprofitDiscovery()
for ein in df_all.head(100)['ein']:
    details = propublica.get_propublica_org_details(ein)
    # details contains mission, programs, detailed financials

πŸ”§ Troubleshooting

Download Fails with Timeout Error

# Increase timeout
irs = IRSBMFIngestion()
# Edit download_regional_file() timeout parameter (default: 300 seconds)

Out of Memory Error

# Process states individually instead of all regions
for state in ["AL", "GA", "FL"]:
    df = irs.download_state_file(state)
    # Process each state separately

Need Fresh Data

# Force refresh (bypass cache)
df = irs.download_all_regions(force_refresh=True)

πŸ“š Related Documentation


🎯 Citation

When using IRS EO-BMF data in publications:

@misc{irs_eobmf_2026,
  title = {Exempt Organizations Business Master File Extract (EO-BMF)},
  author = {{Internal Revenue Service}},
  year = {2026},
  month = {April},
  url = {https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf},
  note = {Accessed: 2026-04-27. Record count: 1,952,238 organizations.}
}

✨ Key Takeaways

🎯 IRS EO-BMF provides ALL 1.9M+ U.S. nonprofits
⚑ 1,000x more data than ProPublica API per request
πŸ’Ύ Downloads in ~30 seconds, cached for instant future access
πŸ”„ Seamlessly integrates with existing pipeline
πŸ“Š Updated monthly by the IRS
πŸ†“ Completely free, public domain data

Start using it today!

python scripts/create_all_gold_tables.py --nonprofits-only --use-irs --download-all-irs