jcbowyer's picture
Clean HuggingFace deployment without binary files
61d29fc
# NCCS (National Center for Charitable Statistics) Data Scripts
Scripts for downloading and working with nonprofit data from the National Center for Charitable Statistics at the Urban Institute.
## Data Source
- **Organization**: National Center for Charitable Statistics (NCCS), Urban Institute
- **Website**: https://nccs.urban.org/
- **Catalog**: https://urbaninstitute.github.io/nccs/catalogs/catalog-bmf.html
- **Coverage**: Tax-exempt organizations (1989-present)
- **Data Types**: Unified BMF, Transformed BMF, Raw BMF archives
## Scripts
### `bulk_download_nccs.py`
Download all NCCS BMF (Business Master File) datasets with organized directory structure.
**Features:**
- Downloads Unified BMF (by state or full file)
- Downloads Transformed BMF (monthly cleaned data)
- Downloads Raw BMF archives (unmodified IRS files)
- Resume interrupted downloads
- Progress tracking and logging
- Filter by states or months
**Usage:**
```bash
# Download everything to /mnt/d/nccs_data/
python bulk_download_nccs.py
# Download to custom directory
python bulk_download_nccs.py --base-dir /path/to/directory
# Download only Unified BMF
python bulk_download_nccs.py --dataset unified
# Download specific states only
python bulk_download_nccs.py --dataset unified --states CA,NY,TX,FL
# Download only recent transformed BMF
python bulk_download_nccs.py --dataset transformed --months 2025_12,2026_01
# Skip full unified file (only download state files)
python bulk_download_nccs.py --dataset unified --no-full --states CA,TX
# Resume interrupted download
python bulk_download_nccs.py --resume
# Dry run (show what would be downloaded)
python bulk_download_nccs.py --dry-run
```
**Output Structure:**
```
/mnt/d/nccs_data/
β”œβ”€β”€ unified-bmf/
β”‚ └── v1.2/
β”‚ β”œβ”€β”€ full/
β”‚ β”‚ └── UNIFIED_BMF_V1.2.csv (All states combined)
β”‚ β”œβ”€β”€ by-state/
β”‚ β”‚ β”œβ”€β”€ AL.csv
β”‚ β”‚ β”œβ”€β”€ CA.csv
β”‚ β”‚ β”œβ”€β”€ NY.csv
β”‚ └── ... (56 files: 50 states + DC + 5 territories)
β”‚ └── data-dictionary/
β”‚ └── harmonized_data_dictionary.xlsx
β”œβ”€β”€ transformed-bmf/
β”‚ β”œβ”€β”€ 2023_06/
β”‚ β”‚ β”œβ”€β”€ bmf_2023_06_processed.csv
β”‚ β”‚ └── bmf_2023_06_data_dictionary.csv
β”‚ β”œβ”€β”€ 2025_12/
β”‚ β”‚ β”œβ”€β”€ bmf_2025_12_processed.csv
β”‚ β”‚ └── bmf_2025_12_data_dictionary.csv
β”‚ └── ... (Monthly from June 2023-Jan 2026)
β”œβ”€β”€ raw-bmf/
β”‚ β”œβ”€β”€ 2023-06-BMF.csv
β”‚ β”œβ”€β”€ 2025-12-BMF.csv
β”‚ └── ... (Monthly from June 2023-Jan 2026)
└── download_log.json
```
## Datasets
### Unified BMF (Recommended for Longitudinal Analysis)
**What it is:**
- Consolidates all historical BMF releases into a single file
- One row per organization that has ever held tax-exempt status
- Enables longitudinal analysis without merging multiple annual files
**Key Features:**
- `ORG_YEAR_FIRST` and `ORG_YEAR_LAST` variables tracking organizational lifecycle
- Most recent address geocoded to Census block
- FIPS codes at block, tract, county, and state levels
- Metropolitan area codes using current CBSA definitions
- AI/Lakehouse optimized format
**Coverage:** 1989 through mid-2025 (update pending)
**Use When:**
- You need to track organizations over time
- Building historical sampling frames
- Linking nonprofit data to Census geographies
- Analyzing organizational entry/exit patterns
- Metropolitan vs rural nonprofit analysis
**File Sizes:**
- Full file: ~1.5 GB (all states combined)
- By state: 0.1 MB (territories) to 149.5 MB (California)
- **Note:** 'ZZ' (Unmapped) is not available as a separate file from NCCS
### Transformed BMF (Recommended for Current Analysis)
**What it is:**
- Monthly IRS releases with standardized cleaning and validation
- Consistent column names and quality flags
- Documented transformations
**Key Features:**
- Standardized field names
- Quality flags identifying potential data issues
- Documentation of all transformations applied
- Monthly updates
**Coverage:** June 2023 to present (monthly snapshots)
**Use When:**
- You need current BMF data with consistent formatting
- You want documented quality checks
- Working with monthly snapshots
**File Sizes:** ~50-150 MB per month
### Raw BMF Archives (For Replication Studies)
**What it is:**
- Unmodified monthly BMF files as released by the IRS
- Original IRS schema and variable names
**Coverage:** June 2023 to present (monthly snapshots)
**Use When:**
- Replicating analysis built on raw IRS files
- Need data exactly as IRS published it
- Require specific point-in-time snapshot
**File Sizes:** ~100-200 MB per month
## Key Data Fields
### Geographic Fields (Unified BMF)
- **FIPS Codes**: Block, Tract, County, State
- **CBSA Codes**: Core Based Statistical Area (Metropolitan/Rural)
- **Geocoded Address**: Census block level precision
### Temporal Fields (Unified BMF)
- **ORG_YEAR_FIRST**: When organization first appeared in BMF
- **ORG_YEAR_LAST**: When organization last appeared (or current if still active)
### Organization Fields
- **EIN**: Employer Identification Number (unique ID)
- **NAME**: Organization name
- **NTEE_CODE**: National Taxonomy of Exempt Entities classification
- **SUBSECTION**: IRS subsection (501(c)(3), 501(c)(4), etc.)
- **FINANCIAL_DATA**: Revenue, assets, expenses
- **ADDRESS**: Street, city, state, ZIP
## Census Integration
The Unified BMF is specifically designed for Census data integration:
```python
import pandas as pd
# Load Unified BMF for a state
bmf = pd.read_csv('/mnt/d/nccs_data/unified-bmf/v1.2/by-state/CA.csv')
# Load Census data (example: ACS demographic data)
census = pd.read_csv('census_tract_data.csv')
# Merge on FIPS tract code
merged = bmf.merge(census, left_on='FIPS_TRACT', right_on='GEOID', how='left')
# Now analyze nonprofits by demographic characteristics
analysis = merged.groupby('NTEE_CODE').agg({
'MEDIAN_INCOME': 'mean',
'POPULATION': 'sum',
'EIN': 'count'
})
```
## Related Resources
- **NCCS Data Archive**: https://nccs.urban.org/nccs-data-archive
- **NCCS Census Crosswalk**: For aggregating to additional geographic levels
- **BMF Processing Guide**: https://urbaninstitute.github.io/nccs-data-bmf/
- **Source Code**: https://github.com/UrbanInstitute/nccs-data-bmf
- **IRS Data Dictionary**: https://www.irs.gov/pub/irs-soi/eo-info.pdf
## Attribution
When using NCCS data, please cite:
1. **National Center for Charitable Statistics**, Urban Institute
2. **IRS Business Master File** (original data source)
3. Specify the **data vintage/update date** used
**Example Citation:**
```
National Center for Charitable Statistics (2026). Unified Business Master File (BMF), v1.2.
Retrieved from https://nccs.urban.org/. Original data: IRS Exempt Organizations Business Master File.
```
## Support
- **NCCS Contact**: https://nccs.urban.org/nccs/contact/
- **Documentation Issues**: https://github.com/UrbanInstitute/nccs-data-bmf/issues