open-navigator / docs /CENSUS_DATA_FIX.md
jcbowyer's picture
Deploy: Consolidated gold tables, fixed nginx docs routing
896453f verified
# Census Bureau Data URL Fix
## Problem
The original Census Bureau data URLs were returning 404 errors because the data structure changed.
## Solution
### Updated URLs (2022 Census of Governments)
The Census Bureau publishes data as **ZIP files containing Excel spreadsheets**, not direct CSV files.
**New URLs:**
- **Counties**: https://www2.census.gov/programs-surveys/gus/tables/2022/cog2022_cg2200org05.zip
- **Municipalities**: https://www2.census.gov/programs-surveys/gus/tables/2022/cog2022_cg2200org06.zip
- **School Districts**: https://www2.census.gov/programs-surveys/gus/tables/2022/cog2022_cg2200org09.zip
- **Special Districts**: https://www2.census.gov/programs-surveys/gus/tables/2022/cog2022_cg2200org08.zip
### Required Dependencies
To process Excel files from Census Bureau:
```bash
pip install openpyxl
```
### How It Works
1. **Downloads ZIP file** from Census Bureau
2. **Extracts Excel file** (.xlsx) from ZIP
3. **Converts to CSV** using pandas
4. **Caches locally** (7-day cache)
### Installation
```bash
source venv/bin/activate
pip install pyspark delta-spark openpyxl
```
### Usage
```bash
python main.py discover-jurisdictions --limit 10
```
The system will:
- Download Census ZIP files automatically
- Extract and convert Excel → CSV
- Cache for 7 days to avoid re-downloading
- Process jurisdiction data into Delta Lake
---
## Data Source Reference
**Official Page**: https://www.census.gov/data/tables/2022/econ/gus/2022-governments.html
**Available Tables:**
- Table 2: Local Governments by Type and State
- Table 5: County Governments by Population-Size Group
- Table 6: Subcounty General-Purpose Governments
- Table 8: Special District Governments by Function
- Table 9: Public School Systems by Type
**Update Frequency**: Census of Governments runs every 5 years (2017, 2022, 2027...)
**Next Update**: 2027 Census of Governments
---
## Troubleshooting
### Missing openpyxl
```
ModuleNotFoundError: No module named 'openpyxl'
```
**Fix**: `pip install openpyxl`
### ZIP Extraction Fails
Check disk space in `data/cache/census/` directory
### Still Getting 404
The Census Bureau may have moved files. Check:
https://www.census.gov/programs-surveys/gus/data/datasets.html
---
## Alternative: Manual Download
If automated download fails:
1. Visit: https://www.census.gov/data/tables/2022/econ/gus/2022-governments.html
2. Download ZIP files manually
3. Extract Excel files
4. Place in `data/cache/census/` as:
- `counties_20260421.csv`
- `municipalities_20260421.csv`
- etc.
The system will use cached files automatically.