Spaces:
Running on CPU Upgrade
Running on CPU Upgrade
| # NCCS (National Center for Charitable Statistics) Data Scripts | |
| Scripts for downloading and working with nonprofit data from the National Center for Charitable Statistics at the Urban Institute. | |
| ## Data Source | |
| - **Organization**: National Center for Charitable Statistics (NCCS), Urban Institute | |
| - **Website**: https://nccs.urban.org/ | |
| - **Catalog**: https://urbaninstitute.github.io/nccs/catalogs/catalog-bmf.html | |
| - **Coverage**: Tax-exempt organizations (1989-present) | |
| - **Data Types**: Unified BMF, Transformed BMF, Raw BMF archives | |
| ## Scripts | |
| ### `bulk_download_nccs.py` | |
| Download all NCCS BMF (Business Master File) datasets with organized directory structure. | |
| **Features:** | |
| - Downloads Unified BMF (by state or full file) | |
| - Downloads Transformed BMF (monthly cleaned data) | |
| - Downloads Raw BMF archives (unmodified IRS files) | |
| - Resume interrupted downloads | |
| - Progress tracking and logging | |
| - Filter by states or months | |
| **Usage:** | |
| ```bash | |
| # Download everything to /mnt/d/nccs_data/ | |
| python bulk_download_nccs.py | |
| # Download to custom directory | |
| python bulk_download_nccs.py --base-dir /path/to/directory | |
| # Download only Unified BMF | |
| python bulk_download_nccs.py --dataset unified | |
| # Download specific states only | |
| python bulk_download_nccs.py --dataset unified --states CA,NY,TX,FL | |
| # Download only recent transformed BMF | |
| python bulk_download_nccs.py --dataset transformed --months 2025_12,2026_01 | |
| # Skip full unified file (only download state files) | |
| python bulk_download_nccs.py --dataset unified --no-full --states CA,TX | |
| # Resume interrupted download | |
| python bulk_download_nccs.py --resume | |
| # Dry run (show what would be downloaded) | |
| python bulk_download_nccs.py --dry-run | |
| ``` | |
| **Output Structure:** | |
| ``` | |
| /mnt/d/nccs_data/ | |
| βββ unified-bmf/ | |
| β βββ v1.2/ | |
| β βββ full/ | |
| β β βββ UNIFIED_BMF_V1.2.csv (All states combined) | |
| β βββ by-state/ | |
| β β βββ AL.csv | |
| β β βββ CA.csv | |
| β β βββ NY.csv | |
| β βββ ... (56 files: 50 states + DC + 5 territories) | |
| β βββ data-dictionary/ | |
| β βββ harmonized_data_dictionary.xlsx | |
| βββ transformed-bmf/ | |
| β βββ 2023_06/ | |
| β β βββ bmf_2023_06_processed.csv | |
| β β βββ bmf_2023_06_data_dictionary.csv | |
| β βββ 2025_12/ | |
| β β βββ bmf_2025_12_processed.csv | |
| β β βββ bmf_2025_12_data_dictionary.csv | |
| β βββ ... (Monthly from June 2023-Jan 2026) | |
| βββ raw-bmf/ | |
| β βββ 2023-06-BMF.csv | |
| β βββ 2025-12-BMF.csv | |
| β βββ ... (Monthly from June 2023-Jan 2026) | |
| βββ download_log.json | |
| ``` | |
| ## Datasets | |
| ### Unified BMF (Recommended for Longitudinal Analysis) | |
| **What it is:** | |
| - Consolidates all historical BMF releases into a single file | |
| - One row per organization that has ever held tax-exempt status | |
| - Enables longitudinal analysis without merging multiple annual files | |
| **Key Features:** | |
| - `ORG_YEAR_FIRST` and `ORG_YEAR_LAST` variables tracking organizational lifecycle | |
| - Most recent address geocoded to Census block | |
| - FIPS codes at block, tract, county, and state levels | |
| - Metropolitan area codes using current CBSA definitions | |
| - AI/Lakehouse optimized format | |
| **Coverage:** 1989 through mid-2025 (update pending) | |
| **Use When:** | |
| - You need to track organizations over time | |
| - Building historical sampling frames | |
| - Linking nonprofit data to Census geographies | |
| - Analyzing organizational entry/exit patterns | |
| - Metropolitan vs rural nonprofit analysis | |
| **File Sizes:** | |
| - Full file: ~1.5 GB (all states combined) | |
| - By state: 0.1 MB (territories) to 149.5 MB (California) | |
| - **Note:** 'ZZ' (Unmapped) is not available as a separate file from NCCS | |
| ### Transformed BMF (Recommended for Current Analysis) | |
| **What it is:** | |
| - Monthly IRS releases with standardized cleaning and validation | |
| - Consistent column names and quality flags | |
| - Documented transformations | |
| **Key Features:** | |
| - Standardized field names | |
| - Quality flags identifying potential data issues | |
| - Documentation of all transformations applied | |
| - Monthly updates | |
| **Coverage:** June 2023 to present (monthly snapshots) | |
| **Use When:** | |
| - You need current BMF data with consistent formatting | |
| - You want documented quality checks | |
| - Working with monthly snapshots | |
| **File Sizes:** ~50-150 MB per month | |
| ### Raw BMF Archives (For Replication Studies) | |
| **What it is:** | |
| - Unmodified monthly BMF files as released by the IRS | |
| - Original IRS schema and variable names | |
| **Coverage:** June 2023 to present (monthly snapshots) | |
| **Use When:** | |
| - Replicating analysis built on raw IRS files | |
| - Need data exactly as IRS published it | |
| - Require specific point-in-time snapshot | |
| **File Sizes:** ~100-200 MB per month | |
| ## Key Data Fields | |
| ### Geographic Fields (Unified BMF) | |
| - **FIPS Codes**: Block, Tract, County, State | |
| - **CBSA Codes**: Core Based Statistical Area (Metropolitan/Rural) | |
| - **Geocoded Address**: Census block level precision | |
| ### Temporal Fields (Unified BMF) | |
| - **ORG_YEAR_FIRST**: When organization first appeared in BMF | |
| - **ORG_YEAR_LAST**: When organization last appeared (or current if still active) | |
| ### Organization Fields | |
| - **EIN**: Employer Identification Number (unique ID) | |
| - **NAME**: Organization name | |
| - **NTEE_CODE**: National Taxonomy of Exempt Entities classification | |
| - **SUBSECTION**: IRS subsection (501(c)(3), 501(c)(4), etc.) | |
| - **FINANCIAL_DATA**: Revenue, assets, expenses | |
| - **ADDRESS**: Street, city, state, ZIP | |
| ## Census Integration | |
| The Unified BMF is specifically designed for Census data integration: | |
| ```python | |
| import pandas as pd | |
| # Load Unified BMF for a state | |
| bmf = pd.read_csv('/mnt/d/nccs_data/unified-bmf/v1.2/by-state/CA.csv') | |
| # Load Census data (example: ACS demographic data) | |
| census = pd.read_csv('census_tract_data.csv') | |
| # Merge on FIPS tract code | |
| merged = bmf.merge(census, left_on='FIPS_TRACT', right_on='GEOID', how='left') | |
| # Now analyze nonprofits by demographic characteristics | |
| analysis = merged.groupby('NTEE_CODE').agg({ | |
| 'MEDIAN_INCOME': 'mean', | |
| 'POPULATION': 'sum', | |
| 'EIN': 'count' | |
| }) | |
| ``` | |
| ## Related Resources | |
| - **NCCS Data Archive**: https://nccs.urban.org/nccs-data-archive | |
| - **NCCS Census Crosswalk**: For aggregating to additional geographic levels | |
| - **BMF Processing Guide**: https://urbaninstitute.github.io/nccs-data-bmf/ | |
| - **Source Code**: https://github.com/UrbanInstitute/nccs-data-bmf | |
| - **IRS Data Dictionary**: https://www.irs.gov/pub/irs-soi/eo-info.pdf | |
| ## Attribution | |
| When using NCCS data, please cite: | |
| 1. **National Center for Charitable Statistics**, Urban Institute | |
| 2. **IRS Business Master File** (original data source) | |
| 3. Specify the **data vintage/update date** used | |
| **Example Citation:** | |
| ``` | |
| National Center for Charitable Statistics (2026). Unified Business Master File (BMF), v1.2. | |
| Retrieved from https://nccs.urban.org/. Original data: IRS Exempt Organizations Business Master File. | |
| ``` | |
| ## Support | |
| - **NCCS Contact**: https://nccs.urban.org/nccs/contact/ | |
| - **Documentation Issues**: https://github.com/UrbanInstitute/nccs-data-bmf/issues | |