Spaces:
Running on CPU Upgrade
Running on CPU Upgrade
| sidebar_position: 12 | |
| # Form 990 XML Data (GivingTuesday Data Lake) | |
| Extract detailed financial data from IRS Form 990 XML filings using [GivingTuesday's 990 Data Infrastructure](https://990data.givingtuesday.org/). | |
| ## ๐ฏ Overview | |
| **Current data:** IRS EO-BMF CSV files (basic info - name, EIN, address, NTEE code) | |
| **Enhancement:** Form 990 XML filings from GivingTuesday Data Lake (detailed financials - revenue, expenses, programs, grants) | |
| ### What is the GivingTuesday 990 Data Lake? | |
| The [990 Data Infrastructure](https://990data.givingtuesday.org/) is a collaborative data lake of clean, standardized 990 data in XML format maintained by GivingTuesday. This is the rawest form of 990 data in GivingTuesday's infrastructure. | |
| **Data Lake Structure:** | |
| - **Bucket**: `gt990datalake-rawdata` (AWS S3, us-east-1 Virginia) | |
| - **Access**: Public, no AWS credentials required (`--no-sign-request`) | |
| - **E-filed 990s**: `EfileData/XmlFiles/` (individual XML returns) | |
| - **Indices**: `Indices/990xmls/` (CSV files listing all available 990s) | |
| **Console Access**: [https://us-east-1.console.aws.amazon.com/s3/buckets/gt990datalake-rawdata](https://us-east-1.console.aws.amazon.com/s3/buckets/gt990datalake-rawdata) | |
| ### What's the Difference? | |
| | Data Source | Type | Records | Data Richness | Access Method | Best For | | |
| |-------------|------|---------|---------------|---------------|----------| | |
| | **EO-BMF CSV** โ Currently using | Basic registry | 1.9M+ | โญ Low | Direct download | Initial org list | | |
| | **Google BigQuery** โก Recommended | SQL queries | 5M+ | โญโญโญโญโญ High | SQL (serverless) | **Bulk mission/website extraction** | | |
| | **GivingTuesday Data Lake** ๐ Advanced | XML files | 5.4M+ | โญโญโญโญโญ Very High | S3 download | Detailed parsing, custom fields | | |
| ## ๐ What Additional Data You Can Get | |
| ### From Form 990 XML: | |
| - **Financials**: Total revenue, program revenue, contributions, grants, investment income | |
| - **Expenses**: Total expenses, program expenses, administrative, fundraising | |
| - **Assets**: Total assets, liabilities, net assets | |
| - **Programs**: Program service descriptions, accomplishments, expenses per program | |
| - **Governance**: Board members, officer compensation, key employees | |
| - **Grants**: Grants awarded, grant recipients | |
| - **Mission**: Detailed mission statement and program descriptions | |
| - **Activities**: Legislative activities, political expenditures, lobbying | |
| **Example:** Instead of just knowing "Alabama Oral Health Foundation exists," you get: | |
| - Revenue: $2.5M | |
| - Program expenses: $1.8M | |
| - Grants awarded: $500K to 10 community health centers | |
| - Mission: "Improve oral health access in underserved communities" | |
| - Officers: CEO Sarah Johnson ($150K salary) | |
| - Website: https://alabamaoralhealth.org | |
| --- | |
| ## โก Google BigQuery (Recommended for Bulk Queries) | |
| **Fastest way to enrich 1M+ organizations with missions and websites!** | |
| ### Why BigQuery? | |
| Google Cloud hosts the complete IRS 990 dataset in BigQuery - a serverless SQL database that lets you query **5 million Form 990s in seconds** without downloading any files. | |
| **Key advantages:** | |
| - โ **No downloads**: Query directly in the cloud | |
| - โ **Blazing fast**: Bulk queries complete in <30 seconds | |
| - โ **Free tier**: First 1 TB/month is free (enough for most research) | |
| - โ **SQL interface**: Easy to extract specific fields | |
| - โ **No infrastructure**: Serverless, nothing to manage | |
| **Cost:** Form 990 text fields are small - you can query **all 1.9M nonprofits for ~$0** using the free tier. | |
| ### Quick Start | |
| **1. Set up Google Cloud (one-time)** | |
| ```bash | |
| # Install Google Cloud SDK | |
| # Visit: https://cloud.google.com/sdk/docs/install | |
| # Authenticate | |
| gcloud auth login | |
| # Set project | |
| gcloud config set project YOUR_PROJECT_ID | |
| ``` | |
| **2. Extract mission statements & websites for all Alabama health orgs** | |
| ```sql | |
| -- Query in BigQuery Console or via bq CLI | |
| SELECT | |
| ein, | |
| organization_name, | |
| website_address_txt AS website, | |
| activity_or_mission_desc AS mission, | |
| total_revenue_current_year AS revenue, | |
| total_expenses_current_year AS expenses, | |
| tax_period | |
| FROM `bigquery-public-data.irs_990.irs_990_2023` | |
| WHERE state = 'AL' | |
| AND ntee_code LIKE 'E%' | |
| AND activity_or_mission_desc IS NOT NULL | |
| ORDER BY total_revenue_current_year DESC | |
| LIMIT 10000; | |
| ``` | |
| **3. Run query from Python** | |
| ```python | |
| from google.cloud import bigquery | |
| import pandas as pd | |
| # Initialize BigQuery client | |
| client = bigquery.Client() | |
| # Query for Alabama + Michigan health nonprofits with missions | |
| query = """ | |
| SELECT | |
| ein, | |
| organization_name, | |
| website_address_txt AS website, | |
| activity_or_mission_desc AS mission, | |
| total_revenue_current_year AS revenue, | |
| total_expenses_current_year AS expenses, | |
| total_assets_eoy AS assets, | |
| state, | |
| tax_period | |
| FROM `bigquery-public-data.irs_990.irs_990_2023` | |
| WHERE state IN ('AL', 'MI') | |
| AND ntee_code LIKE 'E%' | |
| AND total_revenue_current_year > 0 | |
| AND activity_or_mission_desc IS NOT NULL | |
| """ | |
| # Execute query and load to DataFrame | |
| df = client.query(query).to_dataframe() | |
| print(f"Retrieved {len(df):,} organizations with missions") | |
| # Clean XML tags from mission text | |
| import re | |
| df['mission_clean'] = df['mission'].str.replace(r'<[^>]+>', '', regex=True).str.strip() | |
| # Save locally | |
| df.to_parquet('data/gold/nonprofits_990_bigquery.parquet') | |
| ``` | |
| **4. Merge with existing nonprofit data** | |
| ```python | |
| import pandas as pd | |
| # Load your existing nonprofit data (from IRS EO-BMF) | |
| orgs = pd.read_parquet('data/gold/nonprofits_organizations.parquet') | |
| print(f"Existing orgs: {len(orgs):,}") | |
| # Load BigQuery results with missions & websites | |
| bq_data = pd.read_parquet('data/gold/nonprofits_990_bigquery.parquet') | |
| print(f"BigQuery results: {len(bq_data):,}") | |
| # Merge on EIN | |
| enriched = orgs.merge( | |
| bq_data[['ein', 'mission_clean', 'website', 'revenue', 'expenses', 'assets']], | |
| on='ein', | |
| how='left', | |
| suffixes=('', '_990') | |
| ) | |
| # Fill missing data from 990 fields | |
| if 'mission' not in enriched.columns: | |
| enriched['mission'] = enriched['mission_clean'] | |
| if 'website' not in enriched.columns: | |
| enriched['website'] = enriched['website_990'] | |
| # Show enrichment stats | |
| missions_added = enriched['mission'].notna().sum() | |
| websites_added = enriched['website'].notna().sum() | |
| print(f"โ Missions: {missions_added:,} ({100*missions_added/len(enriched):.1f}%)") | |
| print(f"โ Websites: {websites_added:,} ({100*websites_added/len(enriched):.1f}%)") | |
| # Save enriched dataset | |
| enriched.to_parquet('data/gold/nonprofits_enriched_bigquery.parquet') | |
| print(f"๐พ Saved {len(enriched):,} enriched organizations") | |
| ``` | |
| **Expected results:** | |
| - 30-50% of orgs will have missions (larger orgs file Form 990) | |
| - 20-40% will have websites listed | |
| - 100% will have EIN matching for revenue/expense data | |
| ### BigQuery Table Structure | |
| The IRS 990 dataset is organized into **multiple tables** matching Form 990 schedules: | |
| #### Master Index Table | |
| **`bigquery-public-data.irs_990.irs_990_index`** | |
| - Links all returns together | |
| - Fields: `ein`, `organization_name`, `tax_period`, `return_id` (foreign key) | |
| #### Main Return Tables (by year) | |
| **`bigquery-public-data.irs_990.irs_990_YYYY`** - Full Form 990 | |
| - **Mission**: `activity_or_mission_desc` โญ | |
| - **Website**: `website_address_txt` โญ | |
| - **Financials**: `total_revenue_current_year`, `total_expenses_current_year` | |
| - **Assets**: `total_assets_eoy`, `total_liabilities_eoy` | |
| - **State/NTEE**: `state`, `ntee_code` | |
| **`bigquery-public-data.irs_990.irs_990_ez_YYYY`** - Form 990-EZ (smaller orgs) | |
| - **Mission**: `mission_description` โญ | |
| - **Website**: `website_address_txt` โญ | |
| - **Financials**: `total_revenue`, `total_expenses` | |
| **`bigquery-public-data.irs_990.irs_990_pf_YYYY`** - Form 990-PF (Private Foundations) | |
| - **Grants**: Largest grants awarded (for grantmakers) | |
| - **Financials**: Foundation-specific fields | |
| #### Schedule Tables (Detailed Information) | |
| **`bigquery-public-data.irs_990.irs_990_schedule_a_YYYY`** | |
| - Public charity status and public support calculations | |
| **`bigquery-public-data.irs_990.irs_990_schedule_j_YYYY`** | |
| - **Executive compensation** (CEO, CFO, board member salaries) ๐ฐ | |
| **`bigquery-public-data.irs_990.irs_990_schedule_r_YYYY`** | |
| - Related organizations and transactions | |
| ### Complete Field Mapping | |
| | Data Point | Table Name | Field Name | Notes | | |
| |------------|------------|------------|-------| | |
| | **Organization Name** | `irs_990_index` | `organization_name` | Master list | | |
| | **EIN** (Primary Key) | All tables | `ein` | 9-digit ID | | |
| | **Mission (990-EZ)** | `irs_990_ez_YYYY` | `mission_description` | Smaller orgs | | |
| | **Mission (Full 990)** | `irs_990_YYYY` | `activity_or_mission_desc` | Larger orgs | | |
| | **Website URL** | `irs_990_YYYY`, `irs_990_ez_YYYY` | `website_address_txt` | Both forms | | |
| | **Total Revenue** | `irs_990_YYYY` | `total_revenue_current_year` | Annual revenue | | |
| | **Total Expenses** | `irs_990_YYYY` | `total_expenses_current_year` | Annual expenses | | |
| | **Program Expenses** | `irs_990_YYYY` | `program_service_revenue` | Program revenue | | |
| | **Assets** | `irs_990_YYYY` | `total_assets_eoy` | End of year | | |
| | **Liabilities** | `irs_990_YYYY` | `total_liabilities_eoy` | End of year | | |
| | **Executive Salaries** | `irs_990_schedule_j_YYYY` | Compensation fields | CEO, CFO pay | | |
| | **Grants Paid** | `irs_990_pf_YYYY` | Grant fields | For foundations | | |
| | **Tax Period** | All tables | `tax_period` | YYYYMMDD format | | |
| | **State** | `irs_990_YYYY` | `state` | 2-letter code | | |
| ### Query Examples for All Key Fields | |
| **Extract mission, website, AND revenue from both 990 and 990-EZ:** | |
| ```sql | |
| -- Combine Full 990 and 990-EZ for complete coverage | |
| WITH full_990 AS ( | |
| SELECT | |
| ein, | |
| activity_or_mission_desc AS mission, | |
| website_address_txt AS website, | |
| total_revenue_current_year AS revenue, | |
| total_expenses_current_year AS expenses, | |
| '990' AS form_type | |
| FROM `bigquery-public-data.irs_990.irs_990_2023` | |
| WHERE state IN ('AL', 'MI') | |
| AND activity_or_mission_desc IS NOT NULL | |
| ), | |
| ez_990 AS ( | |
| SELECT | |
| ein, | |
| mission_description AS mission, | |
| website_address_txt AS website, | |
| total_revenue AS revenue, | |
| total_expenses AS expenses, | |
| '990-EZ' AS form_type | |
| FROM `bigquery-public-data.irs_990.irs_990_ez_2023` | |
| WHERE state IN ('AL', 'MI') | |
| AND mission_description IS NOT NULL | |
| ) | |
| SELECT * FROM full_990 | |
| UNION ALL | |
| SELECT * FROM ez_990 | |
| ORDER BY revenue DESC; | |
| ``` | |
| **Add executive compensation:** | |
| ```sql | |
| -- Get mission + CEO salary | |
| SELECT | |
| f.ein, | |
| f.activity_or_mission_desc AS mission, | |
| f.website_address_txt AS website, | |
| f.total_revenue_current_year AS revenue, | |
| j.compensation_amount AS ceo_compensation | |
| FROM `bigquery-public-data.irs_990.irs_990_2023` f | |
| LEFT JOIN `bigquery-public-data.irs_990.irs_990_schedule_j_2023` j | |
| ON f.ein = j.ein | |
| WHERE f.state = 'AL' | |
| AND j.title_txt LIKE '%CEO%' OR j.title_txt LIKE '%President%' | |
| ORDER BY revenue DESC | |
| LIMIT 100; | |
| ``` | |
| ### Data Cleaning Tips | |
| **The catch:** Some fields have messy XML tags embedded (like `<MissionDesc>`). Clean with regex: | |
| ```python | |
| import re | |
| import pandas as pd | |
| # Clean XML tags from missions | |
| df['mission_clean'] = df['mission'].str.replace(r'<[^>]+>', '', regex=True) | |
| # Trim whitespace | |
| df['mission_clean'] = df['mission_clean'].str.strip() | |
| # Remove common artifacts | |
| df['mission_clean'] = df['mission_clean'].str.replace(r'\s+', ' ', regex=True) | |
| ``` | |
| ### Cost Estimation | |
| **Free tier:** 1 TB of queries per month (resets monthly) | |
| **Typical query costs:** | |
| - Extract missions for 1M orgs: **~50 GB scanned = FREE** | |
| - Extract all fields for 1M orgs: **~200 GB scanned = FREE** | |
| - Full table scan of all years: **~2 TB = $10** (one-time cost) | |
| **Tip:** Use `WHERE` clauses to filter by state/NTEE to reduce data scanned. | |
| --- | |
| ## ๐ GivingTuesday Data Lake (For Advanced Parsing) | |
| ### Option 1: Via AWS Console (Free Account) | |
| 1. Visit [aws.amazon.com](https://aws.amazon.com) and create a free AWS account (requires CC for validation, but no charges for accessing this data) | |
| 2. Log in to AWS Console | |
| 3. Open the data lake: [https://s3.console.aws.amazon.com/s3/buckets/gt990datalake-rawdata/?region=us-east-1&tab=objects](https://s3.console.aws.amazon.com/s3/buckets/gt990datalake-rawdata/?region=us-east-1&tab=objects) | |
| ### Option 2: Via Command Line (Recommended for Automation) | |
| **Prerequisites**: Install [AWS CLI](https://aws.amazon.com/cli/) | |
| ```bash | |
| # List main bucket contents | |
| aws s3 ls gt990datalake-rawdata --no-sign-request | |
| # List indices (CSV files listing all 990s) | |
| aws s3 ls gt990datalake-rawdata/Indices/990xmls/ --no-sign-request | |
| # Download the latest index | |
| aws s3 cp \ | |
| s3://gt990datalake-rawdata/Indices/990xmls/index_all_years_efiledata_xmls_created_on_2023-10-29.csv \ | |
| data/cache/form990_index.csv \ | |
| --no-sign-request | |
| # Download a specific 990 XML | |
| aws s3 cp \ | |
| s3://gt990datalake-rawdata/EfileData/XmlFiles/[OBJECT_ID]_public.xml \ | |
| data/cache/form_990_xml/ \ | |
| --no-sign-request | |
| ``` | |
| ### Option 3: Automated Python Integration | |
| Use our enrichment script that automates downloading and parsing: | |
| ```python | |
| import boto3 | |
| from botocore import UNSIGNED | |
| from botocore.config import Config | |
| import pandas as pd | |
| # Configure S3 client for GivingTuesday Data Lake (no credentials needed) | |
| s3 = boto3.client( | |
| 's3', | |
| region_name='us-east-1', | |
| config=Config(signature_version=UNSIGNED) | |
| ) | |
| # Download index to find available 990s | |
| index_response = s3.get_object( | |
| Bucket='gt990datalake-rawdata', | |
| Key='Indices/990xmls/index_all_years_efiledata_xmls_created_on_2023-10-29.csv' | |
| ) | |
| index_df = pd.read_csv(index_response['Body']) | |
| print(f"Available 990s: {len(index_df):,}") | |
| print(f"Columns: {index_df.columns.tolist()}") | |
| # Find 990s for a specific EIN | |
| ein = "123456789" | |
| org_filings = index_df[index_df['EIN'] == ein] | |
| # Download latest filing | |
| if len(org_filings) > 0: | |
| latest = org_filings.iloc[0] | |
| xml_key = f"EfileData/XmlFiles/{latest['OBJECT_ID']}_public.xml" | |
| xml_obj = s3.get_object(Bucket='gt990datalake-rawdata', Key=xml_key) | |
| xml_content = xml_obj['Body'].read() | |
| # Parse with xmltodict (simplified approach) | |
| import xmltodict | |
| doc = xmltodict.parse(xml_content) | |
| # Extract fields from doc['Return']['ReturnData']['IRS990'] | |
| ``` | |
| ## ๐ค Automated Enrichment Script | |
| We provide **[scripts/enrich_nonprofits_gt990.py](../../scripts/enrich_nonprofits_gt990.py)** - a complete automated solution. | |
| ### Quick Start | |
| **Step 1: Download Index (One-Time Setup)** | |
| ```bash | |
| # Install dependencies | |
| pip install boto3 xmltodict pandas pyarrow tqdm loguru | |
| # Download GivingTuesday Data Lake index (~200MB CSV, 1M+ records) | |
| python scripts/enrich_nonprofits_gt990.py --download-index | |
| ``` | |
| This creates `data/cache/form990_gt_index.parquet` for fast EINโOBJECT_ID lookups. | |
| **Step 2: Enrich Your Data** | |
| ```bash | |
| # Enrich all Tuscaloosa nonprofits | |
| python scripts/enrich_nonprofits_gt990.py \ | |
| --input data/gold/nonprofits_tuscaloosa.parquet \ | |
| --output data/gold/nonprofits_tuscaloosa_form990.parquet \ | |
| --concurrent 20 | |
| # Enrich Alabama + Michigan health orgs | |
| python scripts/enrich_nonprofits_gt990.py \ | |
| --input data/gold/nonprofits_organizations.parquet \ | |
| --output data/gold/nonprofits_990_enriched.parquet \ | |
| --states AL MI \ | |
| --ntee E \ | |
| --concurrent 50 | |
| # Test with sample | |
| python scripts/enrich_nonprofits_gt990.py \ | |
| --input data/gold/nonprofits_organizations.parquet \ | |
| --output /tmp/test_990.parquet \ | |
| --sample 100 | |
| ``` | |
| ### Features | |
| โ **Index-based lookup**: Uses OBJECT_ID from GivingTuesday index (no filename guessing) | |
| โ **Async/parallel**: Process 20-50 organizations concurrently | |
| โ **Smart caching**: JSON cache prevents re-downloading same 990s | |
| โ **Automatic retries**: Handles S3 errors gracefully | |
| โ **Progress tracking**: tqdm progress bar with ETA | |
| โ **Comprehensive logging**: Detailed logs with statistics | |
| ### Performance | |
| - **Speed**: ~2-3 sec/org (download + parse) | |
| - **Concurrent=20**: ~450 orgs/hour | |
| - **Concurrent=50**: ~1,100 orgs/hour | |
| - **1,000 orgs @ 50% success**: ~15-20 minutes | |
| ### Enriched Fields | |
| The script adds these columns to your DataFrame: | |
| ```python | |
| form_990_status # 'found', 'not_found', or 'parse_error' | |
| form_990_tax_year # e.g., 202312 | |
| form_990_filing_type # 990, 990EZ, or 990PF | |
| form_990_total_revenue # Total revenue | |
| form_990_total_expenses # Total expenses | |
| form_990_net_income # Revenue - Expenses | |
| form_990_contributions # Donations and grants received | |
| form_990_program_revenue # Revenue from programs/services | |
| form_990_investment_income # Investment income | |
| form_990_program_expenses # Program service expenses | |
| form_990_admin_expenses # Administrative expenses | |
| form_990_fundraising_expenses # Fundraising expenses | |
| form_990_total_assets # Total assets | |
| form_990_total_liabilities # Total liabilities | |
| form_990_net_assets # Assets - Liabilities | |
| form_990_grants_paid # Grants awarded to others | |
| form_990_mission # Mission statement | |
| form_990_last_updated # Timestamp of enrichment | |
| # Filter to your state | |
| alabama_filings = filings_index[filings_index['State'] == 'AL'] | |
| print(f"Alabama filings: {len(alabama_filings):,}") | |
| # Sample columns | |
| # EIN, OrganizationName, State, URL, SubmittedOn, TaxPeriod | |
| ``` | |
| ## ๐ Integration with Current Data | |
| ### Enrich Existing Nonprofits | |
| ```python | |
| import pandas as pd | |
| from form990_parser import Form990Parser | |
| import boto3 | |
| # Load your current nonprofit data | |
| orgs = pd.read_parquet('data/gold/nonprofits_organizations.parquet') | |
| # Filter to Alabama health organizations (NTEE code E) | |
| health_orgs = orgs[ | |
| (orgs['state'] == 'AL') & | |
| (orgs['ntee_code'].str.startswith('E', na=False)) | |
| ] | |
| print(f"Alabama health nonprofits: {len(health_orgs):,}") | |
| # Enrich with Form 990 data | |
| s3 = boto3.client('s3', region_name='us-east-1') | |
| parser = Form990Parser() | |
| enriched = [] | |
| for idx, org in health_orgs.iterrows(): | |
| ein = org['ein'] | |
| # Try to find most recent 990 | |
| try: | |
| # Construct likely S3 key (simplified - actual naming varies) | |
| key = f"{ein}_202312_990.xml" | |
| xml_obj = s3.get_object(Bucket='irs-form-990', Key=key) | |
| xml_content = xml_obj['Body'].read() | |
| filing_data = parser.parse_xml(xml_content) | |
| # Merge with org data | |
| org_enriched = org.to_dict() | |
| org_enriched.update(filing_data) | |
| enriched.append(org_enriched) | |
| except Exception as e: | |
| # No 990 found for this org | |
| continue | |
| enriched_df = pd.DataFrame(enriched) | |
| enriched_df.to_parquet('data/gold/nonprofits_alabama_health_990.parquet') | |
| print(f"Enriched {len(enriched_df):,} organizations with Form 990 data") | |
| ``` | |
| ## ๐พ Data Schema | |
| ### Form 990 Parser Output | |
| ```python | |
| { | |
| # Basic Info | |
| 'ein': '123456789', | |
| 'organization_name': 'Alabama Oral Health Foundation', | |
| 'tax_year': 2023, | |
| 'tax_period': '202312', | |
| # Financials | |
| 'total_revenue': 2500000, | |
| 'total_expenses': 2100000, | |
| 'net_income': 400000, | |
| 'total_assets': 5000000, | |
| 'total_liabilities': 500000, | |
| # Revenue Breakdown | |
| 'contributions_grants': 1200000, | |
| 'program_service_revenue': 800000, | |
| 'investment_income': 300000, | |
| 'other_revenue': 200000, | |
| # Expense Breakdown | |
| 'program_expenses': 1800000, | |
| 'administrative_expenses': 200000, | |
| 'fundraising_expenses': 100000, | |
| # Programs | |
| 'program_service_descriptions': [ | |
| { | |
| 'description': 'Community dental clinics', | |
| 'expenses': 1000000, | |
| 'grants': 200000 | |
| }, | |
| { | |
| 'description': 'School fluoride programs', | |
| 'expenses': 500000, | |
| 'grants': 100000 | |
| } | |
| ], | |
| # Governance | |
| 'officers': [ | |
| { | |
| 'name': 'Sarah Johnson', | |
| 'title': 'CEO', | |
| 'compensation': 150000 | |
| }, | |
| { | |
| 'name': 'John Smith', | |
| 'title': 'CFO', | |
| 'compensation': 120000 | |
| } | |
| ], | |
| # Mission | |
| 'mission_statement': 'Improve oral health access in underserved communities...', | |
| 'program_accomplishments': 'Served 10,000 patients in 2023...' | |
| } | |
| ``` | |
| ## ๐ Performance Considerations | |
| ### Data Volume | |
| - **Form 990 XMLs**: ~300,000 new filings per year | |
| - **Average XML size**: 500KB - 5MB | |
| - **Total storage**: ~500GB for all historical 990s (2011-present) | |
| ### Processing Speed | |
| **Sequential (current approach):** | |
| - Download + parse: ~2-5 seconds per 990 | |
| - 300,000 filings ร 3 sec = **250 hours** ๐ฑ | |
| **Async parallel (recommended):** | |
| - 50 concurrent workers | |
| - 300,000 filings ร 3 sec / 50 = **5 hours** โก | |
| ### Smart Strategies | |
| 1. **Filter first**: Only download 990s for organizations you care about | |
| ```python | |
| # Only health orgs in your states | |
| health_eins = orgs[ | |
| (orgs['state'].isin(['AL', 'MI'])) & | |
| (orgs['ntee_code'].str.startswith('E')) | |
| ]['ein'].tolist() | |
| # Result: ~50,000 instead of 300,000 = 1 hour | |
| ``` | |
| 2. **Use index files**: Download the index first, filter, then fetch XMLs | |
| ```python | |
| # Get index | |
| index = pd.read_json('https://s3.amazonaws.com/irs-form-990/index_2023.json') | |
| # Filter to your EINs | |
| relevant = index[index['EIN'].isin(health_eins)] | |
| # Only download these | |
| for url in relevant['URL']: | |
| # download and parse | |
| ``` | |
| 3. **Cache aggressively**: 990s don't change after filing | |
| ```python | |
| cache_dir = Path('data/cache/form_990_xml') | |
| cache_file = cache_dir / f"{ein}_{tax_year}.parquet" | |
| if cache_file.exists(): | |
| return pd.read_parquet(cache_file) | |
| else: | |
| # download, parse, cache | |
| ``` | |
| ## ๐ฏ Use Cases | |
| ### 1. Financial Health Analysis | |
| ```python | |
| # Which nonprofits are most financially stable? | |
| df['efficiency_ratio'] = df['program_expenses'] / df['total_expenses'] | |
| df['reserve_months'] = df['net_assets'] / (df['total_expenses'] / 12) | |
| efficient = df[df['efficiency_ratio'] > 0.75] # >75% on programs | |
| print(f"Efficient organizations: {len(efficient):,}") | |
| ``` | |
| ### 2. Grant Research | |
| ```python | |
| # Who's giving grants in oral health? | |
| grantmakers = df[ | |
| (df['grants_paid'] > 0) & | |
| (df['ntee_code'].str.startswith('E')) | |
| ] | |
| print(f"Oral health grantmakers: {len(grantmakers):,}") | |
| print(f"Total grants: ${grantmakers['grants_paid'].sum():,.0f}") | |
| ``` | |
| ### 3. Program Discovery | |
| ```python | |
| # Find organizations running specific programs | |
| fluoride_programs = df[ | |
| df['program_service_descriptions'].str.contains('fluoride', case=False, na=False) | |
| ] | |
| print(f"Orgs with fluoride programs: {len(fluoride_programs):,}") | |
| ``` | |
| ## ๐ Next Steps | |
| ### Quick Test | |
| ```bash | |
| # Install dependencies | |
| pip install form-990-xml-parser boto3 | |
| # Test with single organization | |
| python -c " | |
| from form990_parser import Form990Parser | |
| import boto3 | |
| s3 = boto3.client('s3', region_name='us-east-1') | |
| ein = '631307851' # Delta Dental of Alabama (example) | |
| try: | |
| # Try to fetch 2023 filing | |
| key = f'{ein}_202312_990.xml' | |
| obj = s3.get_object(Bucket='irs-form-990', Key=key) | |
| parser = Form990Parser() | |
| data = parser.parse_xml(obj['Body'].read()) | |
| print(f'Organization: {data.get(\"organization_name\")}') | |
| print(f'Revenue: \${data.get(\"total_revenue\", 0):,.0f}') | |
| print(f'Assets: \${data.get(\"total_assets\", 0):,.0f}') | |
| except Exception as e: | |
| print(f'Error: {e}') | |
| " | |
| ``` | |
| ### Full Integration | |
| Create a new enrichment pipeline: | |
| ```bash | |
| # Create new script | |
| python scripts/enrich_nonprofits_form990.py \ | |
| --input data/gold/nonprofits_organizations.parquet \ | |
| --output data/gold/nonprofits_organizations_990.parquet \ | |
| --states AL MI \ | |
| --ntee E \ | |
| --concurrent 50 | |
| ``` | |
| ## ๐ Resources | |
| - **Giving Tuesday GitHub**: https://github.com/Giving-Tuesday | |
| - **Form 990 XML Parser**: https://github.com/Giving-Tuesday/form-990-xml-parser | |
| - **Form 990 XML Mapper**: https://github.com/Giving-Tuesday/form-990-xml-mapper | |
| - **IRS 990 AWS Bucket**: https://registry.opendata.aws/irs990/ | |
| - **IRS Index Files**: https://s3.amazonaws.com/irs-form-990/index_YYYY.json | |
| - **990 Schema Documentation**: https://www.irs.gov/e-file-providers/current-valid-xml-schemas-and-business-rules | |
| ## โ FAQ | |
| ### Q: Are we currently downloading XML? | |
| **A: No.** Currently using EO-BMF CSV files (basic data). Form 990 XML would be an enhancement for detailed financials. | |
| ### Q: Can we use Giving Tuesday libraries? | |
| **A: Yes!** They're open source and designed exactly for this purpose. Would provide much richer data. | |
| ### Q: How much data is it? | |
| **A:** | |
| - All 990s (2011-present): ~500GB | |
| - Alabama only: ~5GB | |
| - Alabama health orgs: ~500MB | |
| - Very manageable! | |
| ### Q: What's the license? | |
| **A:** Public domain (U.S. Government data) + Giving Tuesday tools are open source | |
| ### Q: Integration effort? | |
| **A:** Low - can reuse existing async enrichment patterns. Estimated: 1-2 days for initial integration. | |