Spaces:

CommunityOne
/

open-navigator

Running on CPU Upgrade

App Files Files Community

open-navigator / website /docs /data-sources /form-990-xml.md

jcbowyer

Clean HuggingFace deployment without binary files

61d29fc 28 days ago

preview code

raw

history blame contribute delete

24.8 kB

	---
	sidebar_position: 12
	---

	# Form 990 XML Data (GivingTuesday Data Lake)

	Extract detailed financial data from IRS Form 990 XML filings using [GivingTuesday's 990 Data Infrastructure](https://990data.givingtuesday.org/).

	## 🎯 Overview

	Current data: IRS EO-BMF CSV files (basic info - name, EIN, address, NTEE code)
	Enhancement: Form 990 XML filings from GivingTuesday Data Lake (detailed financials - revenue, expenses, programs, grants)

	### What is the GivingTuesday 990 Data Lake?

	The [990 Data Infrastructure](https://990data.givingtuesday.org/) is a collaborative data lake of clean, standardized 990 data in XML format maintained by GivingTuesday. This is the rawest form of 990 data in GivingTuesday's infrastructure.

	Data Lake Structure:
	- Bucket: `gt990datalake-rawdata` (AWS S3, us-east-1 Virginia)
	- Access: Public, no AWS credentials required (`--no-sign-request`)
	- E-filed 990s: `EfileData/XmlFiles/` (individual XML returns)
	- Indices: `Indices/990xmls/` (CSV files listing all available 990s)

	Console Access: [https://us-east-1.console.aws.amazon.com/s3/buckets/gt990datalake-rawdata](https://us-east-1.console.aws.amazon.com/s3/buckets/gt990datalake-rawdata)

	### What's the Difference?

	\| Data Source \| Type \| Records \| Data Richness \| Access Method \| Best For \|
	\|-------------\|------\|---------\|---------------\|---------------\|----------\|
	\| EO-BMF CSV ✅ Currently using \| Basic registry \| 1.9M+ \| ⭐ Low \| Direct download \| Initial org list \|
	\| Google BigQuery ⚡ Recommended \| SQL queries \| 5M+ \| ⭐⭐⭐⭐⭐ High \| SQL (serverless) \| Bulk mission/website extraction \|
	\| GivingTuesday Data Lake 🚀 Advanced \| XML files \| 5.4M+ \| ⭐⭐⭐⭐⭐ Very High \| S3 download \| Detailed parsing, custom fields \|

	## 📊 What Additional Data You Can Get

	### From Form 990 XML:
	- Financials: Total revenue, program revenue, contributions, grants, investment income
	- Expenses: Total expenses, program expenses, administrative, fundraising
	- Assets: Total assets, liabilities, net assets
	- Programs: Program service descriptions, accomplishments, expenses per program
	- Governance: Board members, officer compensation, key employees
	- Grants: Grants awarded, grant recipients
	- Mission: Detailed mission statement and program descriptions
	- Activities: Legislative activities, political expenditures, lobbying

	Example: Instead of just knowing "Alabama Oral Health Foundation exists," you get:
	- Revenue: $2.5M
	- Program expenses: $1.8M
	- Grants awarded: $500K to 10 community health centers
	- Mission: "Improve oral health access in underserved communities"
	- Officers: CEO Sarah Johnson ($150K salary)
	- Website: https://alabamaoralhealth.org

	---

	## ⚡ Google BigQuery (Recommended for Bulk Queries)

	Fastest way to enrich 1M+ organizations with missions and websites!

	### Why BigQuery?

	Google Cloud hosts the complete IRS 990 dataset in BigQuery - a serverless SQL database that lets you query 5 million Form 990s in seconds without downloading any files.

	Key advantages:
	- ✅ No downloads: Query directly in the cloud
	- ✅ Blazing fast: Bulk queries complete in <30 seconds
	- ✅ Free tier: First 1 TB/month is free (enough for most research)
	- ✅ SQL interface: Easy to extract specific fields
	- ✅ No infrastructure: Serverless, nothing to manage

	Cost: Form 990 text fields are small - you can query all 1.9M nonprofits for ~$0 using the free tier.

	### Quick Start

	1. Set up Google Cloud (one-time)
	```bash
	# Install Google Cloud SDK
	# Visit: https://cloud.google.com/sdk/docs/install

	# Authenticate
	gcloud auth login

	# Set project
	gcloud config set project YOUR_PROJECT_ID
	```

	2. Extract mission statements & websites for all Alabama health orgs

	```sql
	-- Query in BigQuery Console or via bq CLI
	SELECT
	ein,
	organization_name,
	website_address_txt AS website,
	activity_or_mission_desc AS mission,
	total_revenue_current_year AS revenue,
	total_expenses_current_year AS expenses,
	tax_period
	FROM `bigquery-public-data.irs_990.irs_990_2023`
	WHERE state = 'AL'
	AND ntee_code LIKE 'E%'
	AND activity_or_mission_desc IS NOT NULL
	ORDER BY total_revenue_current_year DESC
	LIMIT 10000;
	```

	3. Run query from Python

	```python
	from google.cloud import bigquery
	import pandas as pd

	# Initialize BigQuery client
	client = bigquery.Client()

	# Query for Alabama + Michigan health nonprofits with missions
	query = """
	SELECT
	ein,
	organization_name,
	website_address_txt AS website,
	activity_or_mission_desc AS mission,
	total_revenue_current_year AS revenue,
	total_expenses_current_year AS expenses,
	total_assets_eoy AS assets,
	state,
	tax_period
	FROM `bigquery-public-data.irs_990.irs_990_2023`
	WHERE state IN ('AL', 'MI')
	AND ntee_code LIKE 'E%'
	AND total_revenue_current_year > 0
	AND activity_or_mission_desc IS NOT NULL
	"""

	# Execute query and load to DataFrame
	df = client.query(query).to_dataframe()
	print(f"Retrieved {len(df):,} organizations with missions")

	# Clean XML tags from mission text
	import re
	df['mission_clean'] = df['mission'].str.replace(r'<[^>]+>', '', regex=True).str.strip()

	# Save locally
	df.to_parquet('data/gold/nonprofits_990_bigquery.parquet')
	```

	4. Merge with existing nonprofit data

	```python
	import pandas as pd

	# Load your existing nonprofit data (from IRS EO-BMF)
	orgs = pd.read_parquet('data/gold/nonprofits_organizations.parquet')
	print(f"Existing orgs: {len(orgs):,}")

	# Load BigQuery results with missions & websites
	bq_data = pd.read_parquet('data/gold/nonprofits_990_bigquery.parquet')
	print(f"BigQuery results: {len(bq_data):,}")

	# Merge on EIN
	enriched = orgs.merge(
	bq_data[['ein', 'mission_clean', 'website', 'revenue', 'expenses', 'assets']],
	on='ein',
	how='left',
	suffixes=('', '_990')
	)

	# Fill missing data from 990 fields
	if 'mission' not in enriched.columns:
	enriched['mission'] = enriched['mission_clean']
	if 'website' not in enriched.columns:
	enriched['website'] = enriched['website_990']

	# Show enrichment stats
	missions_added = enriched['mission'].notna().sum()
	websites_added = enriched['website'].notna().sum()
	print(f"✅ Missions: {missions_added:,} ({100*missions_added/len(enriched):.1f}%)")
	print(f"✅ Websites: {websites_added:,} ({100*websites_added/len(enriched):.1f}%)")

	# Save enriched dataset
	enriched.to_parquet('data/gold/nonprofits_enriched_bigquery.parquet')
	print(f"💾 Saved {len(enriched):,} enriched organizations")
	```

	Expected results:
	- 30-50% of orgs will have missions (larger orgs file Form 990)
	- 20-40% will have websites listed
	- 100% will have EIN matching for revenue/expense data


	### BigQuery Table Structure

	The IRS 990 dataset is organized into multiple tables matching Form 990 schedules:

	#### Master Index Table
	`bigquery-public-data.irs_990.irs_990_index`
	- Links all returns together
	- Fields: `ein`, `organization_name`, `tax_period`, `return_id` (foreign key)

	#### Main Return Tables (by year)
	`bigquery-public-data.irs_990.irs_990_YYYY` - Full Form 990
	- Mission: `activity_or_mission_desc` ⭐
	- Website: `website_address_txt` ⭐
	- Financials: `total_revenue_current_year`, `total_expenses_current_year`
	- Assets: `total_assets_eoy`, `total_liabilities_eoy`
	- State/NTEE: `state`, `ntee_code`

	`bigquery-public-data.irs_990.irs_990_ez_YYYY` - Form 990-EZ (smaller orgs)
	- Mission: `mission_description` ⭐
	- Website: `website_address_txt` ⭐
	- Financials: `total_revenue`, `total_expenses`

	`bigquery-public-data.irs_990.irs_990_pf_YYYY` - Form 990-PF (Private Foundations)
	- Grants: Largest grants awarded (for grantmakers)
	- Financials: Foundation-specific fields

	#### Schedule Tables (Detailed Information)
	`bigquery-public-data.irs_990.irs_990_schedule_a_YYYY`
	- Public charity status and public support calculations

	`bigquery-public-data.irs_990.irs_990_schedule_j_YYYY`
	- Executive compensation (CEO, CFO, board member salaries) 💰

	`bigquery-public-data.irs_990.irs_990_schedule_r_YYYY`
	- Related organizations and transactions

	### Complete Field Mapping

	\| Data Point \| Table Name \| Field Name \| Notes \|
	\|------------\|------------\|------------\|-------\|
	\| Organization Name \| `irs_990_index` \| `organization_name` \| Master list \|
	\| EIN (Primary Key) \| All tables \| `ein` \| 9-digit ID \|
	\| Mission (990-EZ) \| `irs_990_ez_YYYY` \| `mission_description` \| Smaller orgs \|
	\| Mission (Full 990) \| `irs_990_YYYY` \| `activity_or_mission_desc` \| Larger orgs \|
	\| Website URL \| `irs_990_YYYY`, `irs_990_ez_YYYY` \| `website_address_txt` \| Both forms \|
	\| Total Revenue \| `irs_990_YYYY` \| `total_revenue_current_year` \| Annual revenue \|
	\| Total Expenses \| `irs_990_YYYY` \| `total_expenses_current_year` \| Annual expenses \|
	\| Program Expenses \| `irs_990_YYYY` \| `program_service_revenue` \| Program revenue \|
	\| Assets \| `irs_990_YYYY` \| `total_assets_eoy` \| End of year \|
	\| Liabilities \| `irs_990_YYYY` \| `total_liabilities_eoy` \| End of year \|
	\| Executive Salaries \| `irs_990_schedule_j_YYYY` \| Compensation fields \| CEO, CFO pay \|
	\| Grants Paid \| `irs_990_pf_YYYY` \| Grant fields \| For foundations \|
	\| Tax Period \| All tables \| `tax_period` \| YYYYMMDD format \|
	\| State \| `irs_990_YYYY` \| `state` \| 2-letter code \|

	### Query Examples for All Key Fields

	Extract mission, website, AND revenue from both 990 and 990-EZ:

	```sql
	-- Combine Full 990 and 990-EZ for complete coverage
	WITH full_990 AS (
	SELECT
	ein,
	activity_or_mission_desc AS mission,
	website_address_txt AS website,
	total_revenue_current_year AS revenue,
	total_expenses_current_year AS expenses,
	'990' AS form_type
	FROM `bigquery-public-data.irs_990.irs_990_2023`
	WHERE state IN ('AL', 'MI')
	AND activity_or_mission_desc IS NOT NULL
	),
	ez_990 AS (
	SELECT
	ein,
	mission_description AS mission,
	website_address_txt AS website,
	total_revenue AS revenue,
	total_expenses AS expenses,
	'990-EZ' AS form_type
	FROM `bigquery-public-data.irs_990.irs_990_ez_2023`
	WHERE state IN ('AL', 'MI')
	AND mission_description IS NOT NULL
	)
	SELECT * FROM full_990
	UNION ALL
	SELECT * FROM ez_990
	ORDER BY revenue DESC;
	```

	Add executive compensation:

	```sql
	-- Get mission + CEO salary
	SELECT
	f.ein,
	f.activity_or_mission_desc AS mission,
	f.website_address_txt AS website,
	f.total_revenue_current_year AS revenue,
	j.compensation_amount AS ceo_compensation
	FROM `bigquery-public-data.irs_990.irs_990_2023` f
	LEFT JOIN `bigquery-public-data.irs_990.irs_990_schedule_j_2023` j
	ON f.ein = j.ein
	WHERE f.state = 'AL'
	AND j.title_txt LIKE '%CEO%' OR j.title_txt LIKE '%President%'
	ORDER BY revenue DESC
	LIMIT 100;
	```

	### Data Cleaning Tips

	The catch: Some fields have messy XML tags embedded (like `<MissionDesc>`). Clean with regex:

	```python
	import re
	import pandas as pd

	# Clean XML tags from missions
	df['mission_clean'] = df['mission'].str.replace(r'<[^>]+>', '', regex=True)

	# Trim whitespace
	df['mission_clean'] = df['mission_clean'].str.strip()

	# Remove common artifacts
	df['mission_clean'] = df['mission_clean'].str.replace(r'\s+', ' ', regex=True)
	```

	### Cost Estimation

	Free tier: 1 TB of queries per month (resets monthly)

	Typical query costs:
	- Extract missions for 1M orgs: ~50 GB scanned = FREE
	- Extract all fields for 1M orgs: ~200 GB scanned = FREE
	- Full table scan of all years: ~2 TB = $10 (one-time cost)

	Tip: Use `WHERE` clauses to filter by state/NTEE to reduce data scanned.

	---

	## 🚀 GivingTuesday Data Lake (For Advanced Parsing)

	### Option 1: Via AWS Console (Free Account)

	1. Visit [aws.amazon.com](https://aws.amazon.com) and create a free AWS account (requires CC for validation, but no charges for accessing this data)
	2. Log in to AWS Console
	3. Open the data lake: [https://s3.console.aws.amazon.com/s3/buckets/gt990datalake-rawdata/?region=us-east-1&tab=objects](https://s3.console.aws.amazon.com/s3/buckets/gt990datalake-rawdata/?region=us-east-1&tab=objects)

	### Option 2: Via Command Line (Recommended for Automation)

	Prerequisites: Install [AWS CLI](https://aws.amazon.com/cli/)

	```bash
	# List main bucket contents
	aws s3 ls gt990datalake-rawdata --no-sign-request

	# List indices (CSV files listing all 990s)
	aws s3 ls gt990datalake-rawdata/Indices/990xmls/ --no-sign-request

	# Download the latest index
	aws s3 cp \
	s3://gt990datalake-rawdata/Indices/990xmls/index_all_years_efiledata_xmls_created_on_2023-10-29.csv \
	data/cache/form990_index.csv \
	--no-sign-request

	# Download a specific 990 XML
	aws s3 cp \
	s3://gt990datalake-rawdata/EfileData/XmlFiles/[OBJECT_ID]_public.xml \
	data/cache/form_990_xml/ \
	--no-sign-request
	```

	### Option 3: Automated Python Integration

	Use our enrichment script that automates downloading and parsing:

	```python
	import boto3
	from botocore import UNSIGNED
	from botocore.config import Config
	import pandas as pd

	# Configure S3 client for GivingTuesday Data Lake (no credentials needed)
	s3 = boto3.client(
	's3',
	region_name='us-east-1',
	config=Config(signature_version=UNSIGNED)
	)

	# Download index to find available 990s
	index_response = s3.get_object(
	Bucket='gt990datalake-rawdata',
	Key='Indices/990xmls/index_all_years_efiledata_xmls_created_on_2023-10-29.csv'
	)
	index_df = pd.read_csv(index_response['Body'])

	print(f"Available 990s: {len(index_df):,}")
	print(f"Columns: {index_df.columns.tolist()}")

	# Find 990s for a specific EIN
	ein = "123456789"
	org_filings = index_df[index_df['EIN'] == ein]

	# Download latest filing
	if len(org_filings) > 0:
	latest = org_filings.iloc[0]
	xml_key = f"EfileData/XmlFiles/{latest['OBJECT_ID']}_public.xml"

	xml_obj = s3.get_object(Bucket='gt990datalake-rawdata', Key=xml_key)
	xml_content = xml_obj['Body'].read()

	# Parse with xmltodict (simplified approach)
	import xmltodict
	doc = xmltodict.parse(xml_content)
	# Extract fields from doc['Return']['ReturnData']['IRS990']
	```

	## 🤖 Automated Enrichment Script

	We provide [scripts/enrich_nonprofits_gt990.py](../../scripts/enrich_nonprofits_gt990.py) - a complete automated solution.

	### Quick Start

	Step 1: Download Index (One-Time Setup)
	```bash
	# Install dependencies
	pip install boto3 xmltodict pandas pyarrow tqdm loguru

	# Download GivingTuesday Data Lake index (~200MB CSV, 1M+ records)
	python scripts/enrich_nonprofits_gt990.py --download-index
	```

	This creates `data/cache/form990_gt_index.parquet` for fast EIN→OBJECT_ID lookups.

	Step 2: Enrich Your Data
	```bash
	# Enrich all Tuscaloosa nonprofits
	python scripts/enrich_nonprofits_gt990.py \
	--input data/gold/nonprofits_tuscaloosa.parquet \
	--output data/gold/nonprofits_tuscaloosa_form990.parquet \
	--concurrent 20

	# Enrich Alabama + Michigan health orgs
	python scripts/enrich_nonprofits_gt990.py \
	--input data/gold/nonprofits_organizations.parquet \
	--output data/gold/nonprofits_990_enriched.parquet \
	--states AL MI \
	--ntee E \
	--concurrent 50

	# Test with sample
	python scripts/enrich_nonprofits_gt990.py \
	--input data/gold/nonprofits_organizations.parquet \
	--output /tmp/test_990.parquet \
	--sample 100
	```

	### Features

	✅ Index-based lookup: Uses OBJECT_ID from GivingTuesday index (no filename guessing)
	✅ Async/parallel: Process 20-50 organizations concurrently
	✅ Smart caching: JSON cache prevents re-downloading same 990s
	✅ Automatic retries: Handles S3 errors gracefully
	✅ Progress tracking: tqdm progress bar with ETA
	✅ Comprehensive logging: Detailed logs with statistics

	### Performance

	- Speed: ~2-3 sec/org (download + parse)
	- Concurrent=20: ~450 orgs/hour
	- Concurrent=50: ~1,100 orgs/hour
	- 1,000 orgs @ 50% success: ~15-20 minutes

	### Enriched Fields

	The script adds these columns to your DataFrame:

	```python
	form_990_status # 'found', 'not_found', or 'parse_error'
	form_990_tax_year # e.g., 202312
	form_990_filing_type # 990, 990EZ, or 990PF
	form_990_total_revenue # Total revenue
	form_990_total_expenses # Total expenses
	form_990_net_income # Revenue - Expenses
	form_990_contributions # Donations and grants received
	form_990_program_revenue # Revenue from programs/services
	form_990_investment_income # Investment income
	form_990_program_expenses # Program service expenses
	form_990_admin_expenses # Administrative expenses
	form_990_fundraising_expenses # Fundraising expenses
	form_990_total_assets # Total assets
	form_990_total_liabilities # Total liabilities
	form_990_net_assets # Assets - Liabilities
	form_990_grants_paid # Grants awarded to others
	form_990_mission # Mission statement
	form_990_last_updated # Timestamp of enrichment

	# Filter to your state
	alabama_filings = filings_index[filings_index['State'] == 'AL']
	print(f"Alabama filings: {len(alabama_filings):,}")

	# Sample columns
	# EIN, OrganizationName, State, URL, SubmittedOn, TaxPeriod
	```

	## 🔗 Integration with Current Data

	### Enrich Existing Nonprofits

	```python
	import pandas as pd
	from form990_parser import Form990Parser
	import boto3

	# Load your current nonprofit data
	orgs = pd.read_parquet('data/gold/nonprofits_organizations.parquet')

	# Filter to Alabama health organizations (NTEE code E)
	health_orgs = orgs[
	(orgs['state'] == 'AL') &
	(orgs['ntee_code'].str.startswith('E', na=False))
	]

	print(f"Alabama health nonprofits: {len(health_orgs):,}")

	# Enrich with Form 990 data
	s3 = boto3.client('s3', region_name='us-east-1')
	parser = Form990Parser()

	enriched = []
	for idx, org in health_orgs.iterrows():
	ein = org['ein']

	# Try to find most recent 990
	try:
	# Construct likely S3 key (simplified - actual naming varies)
	key = f"{ein}_202312_990.xml"

	xml_obj = s3.get_object(Bucket='irs-form-990', Key=key)
	xml_content = xml_obj['Body'].read()

	filing_data = parser.parse_xml(xml_content)

	# Merge with org data
	org_enriched = org.to_dict()
	org_enriched.update(filing_data)
	enriched.append(org_enriched)

	except Exception as e:
	# No 990 found for this org
	continue

	enriched_df = pd.DataFrame(enriched)
	enriched_df.to_parquet('data/gold/nonprofits_alabama_health_990.parquet')

	print(f"Enriched {len(enriched_df):,} organizations with Form 990 data")
	```

	## 💾 Data Schema

	### Form 990 Parser Output

	```python
	{
	# Basic Info
	'ein': '123456789',
	'organization_name': 'Alabama Oral Health Foundation',
	'tax_year': 2023,
	'tax_period': '202312',

	# Financials
	'total_revenue': 2500000,
	'total_expenses': 2100000,
	'net_income': 400000,
	'total_assets': 5000000,
	'total_liabilities': 500000,

	# Revenue Breakdown
	'contributions_grants': 1200000,
	'program_service_revenue': 800000,
	'investment_income': 300000,
	'other_revenue': 200000,

	# Expense Breakdown
	'program_expenses': 1800000,
	'administrative_expenses': 200000,
	'fundraising_expenses': 100000,

	# Programs
	'program_service_descriptions': [
	{
	'description': 'Community dental clinics',
	'expenses': 1000000,
	'grants': 200000
	},
	{
	'description': 'School fluoride programs',
	'expenses': 500000,
	'grants': 100000
	}
	],

	# Governance
	'officers': [
	{
	'name': 'Sarah Johnson',
	'title': 'CEO',
	'compensation': 150000
	},
	{
	'name': 'John Smith',
	'title': 'CFO',
	'compensation': 120000
	}
	],

	# Mission
	'mission_statement': 'Improve oral health access in underserved communities...',
	'program_accomplishments': 'Served 10,000 patients in 2023...'
	}
	```

	## 📈 Performance Considerations

	### Data Volume

	- Form 990 XMLs: ~300,000 new filings per year
	- Average XML size: 500KB - 5MB
	- Total storage: ~500GB for all historical 990s (2011-present)

	### Processing Speed

	Sequential (current approach):
	- Download + parse: ~2-5 seconds per 990
	- 300,000 filings × 3 sec = 250 hours 😱

	Async parallel (recommended):
	- 50 concurrent workers
	- 300,000 filings × 3 sec / 50 = 5 hours ⚡

	### Smart Strategies

	1. Filter first: Only download 990s for organizations you care about
	```python
	# Only health orgs in your states
	health_eins = orgs[
	(orgs['state'].isin(['AL', 'MI'])) &
	(orgs['ntee_code'].str.startswith('E'))
	]['ein'].tolist()

	# Result: ~50,000 instead of 300,000 = 1 hour
	```

	2. Use index files: Download the index first, filter, then fetch XMLs
	```python
	# Get index
	index = pd.read_json('https://s3.amazonaws.com/irs-form-990/index_2023.json')

	# Filter to your EINs
	relevant = index[index['EIN'].isin(health_eins)]

	# Only download these
	for url in relevant['URL']:
	# download and parse
	```

	3. Cache aggressively: 990s don't change after filing
	```python
	cache_dir = Path('data/cache/form_990_xml')
	cache_file = cache_dir / f"{ein}_{tax_year}.parquet"

	if cache_file.exists():
	return pd.read_parquet(cache_file)
	else:
	# download, parse, cache
	```

	## 🎯 Use Cases

	### 1. Financial Health Analysis

	```python
	# Which nonprofits are most financially stable?
	df['efficiency_ratio'] = df['program_expenses'] / df['total_expenses']
	df['reserve_months'] = df['net_assets'] / (df['total_expenses'] / 12)

	efficient = df[df['efficiency_ratio'] > 0.75] # >75% on programs
	print(f"Efficient organizations: {len(efficient):,}")
	```

	### 2. Grant Research

	```python
	# Who's giving grants in oral health?
	grantmakers = df[
	(df['grants_paid'] > 0) &
	(df['ntee_code'].str.startswith('E'))
	]

	print(f"Oral health grantmakers: {len(grantmakers):,}")
	print(f"Total grants: ${grantmakers['grants_paid'].sum():,.0f}")
	```

	### 3. Program Discovery

	```python
	# Find organizations running specific programs
	fluoride_programs = df[
	df['program_service_descriptions'].str.contains('fluoride', case=False, na=False)
	]

	print(f"Orgs with fluoride programs: {len(fluoride_programs):,}")
	```

	## 🚀 Next Steps

	### Quick Test

	```bash
	# Install dependencies
	pip install form-990-xml-parser boto3

	# Test with single organization
	python -c "
	from form990_parser import Form990Parser
	import boto3

	s3 = boto3.client('s3', region_name='us-east-1')
	ein = '631307851' # Delta Dental of Alabama (example)

	try:
	# Try to fetch 2023 filing
	key = f'{ein}_202312_990.xml'
	obj = s3.get_object(Bucket='irs-form-990', Key=key)

	parser = Form990Parser()
	data = parser.parse_xml(obj['Body'].read())

	print(f'Organization: {data.get(\"organization_name\")}')
	print(f'Revenue: \${data.get(\"total_revenue\", 0):,.0f}')
	print(f'Assets: \${data.get(\"total_assets\", 0):,.0f}')
	except Exception as e:
	print(f'Error: {e}')
	"
	```

	### Full Integration

	Create a new enrichment pipeline:

	```bash
	# Create new script
	python scripts/enrich_nonprofits_form990.py \
	--input data/gold/nonprofits_organizations.parquet \
	--output data/gold/nonprofits_organizations_990.parquet \
	--states AL MI \
	--ntee E \
	--concurrent 50
	```

	## 📚 Resources

	- Giving Tuesday GitHub: https://github.com/Giving-Tuesday
	- Form 990 XML Parser: https://github.com/Giving-Tuesday/form-990-xml-parser
	- Form 990 XML Mapper: https://github.com/Giving-Tuesday/form-990-xml-mapper
	- IRS 990 AWS Bucket: https://registry.opendata.aws/irs990/
	- IRS Index Files: https://s3.amazonaws.com/irs-form-990/index_YYYY.json
	- 990 Schema Documentation: https://www.irs.gov/e-file-providers/current-valid-xml-schemas-and-business-rules

	## ❓ FAQ

	### Q: Are we currently downloading XML?

	A: No. Currently using EO-BMF CSV files (basic data). Form 990 XML would be an enhancement for detailed financials.

	### Q: Can we use Giving Tuesday libraries?

	A: Yes! They're open source and designed exactly for this purpose. Would provide much richer data.

	### Q: How much data is it?

	A:
	- All 990s (2011-present): ~500GB
	- Alabama only: ~5GB
	- Alabama health orgs: ~500MB
	- Very manageable!

	### Q: What's the license?

	A: Public domain (U.S. Government data) + Giving Tuesday tools are open source

	### Q: Integration effort?

	A: Low - can reuse existing async enrichment patterns. Estimated: 1-2 days for initial integration.