open-navigator / website /docs /development /enhancements.md
jcbowyer's picture
Clean HuggingFace deployment without binary files
61d29fc
# βœ… Enhancement Complete: Official Data Sources Integration
## Summary
Enhanced the **Jurisdiction Discovery System** with **official, free, public datasets** as recommended by professional data engineering best practices.
---
## 🎯 What Was Added
### New Data Source: NCES Common Core of Data (CCD)
**Added Module:** [discovery/nces_ingestion.py](../discovery/nces_ingestion.py)
**Provides:**
- 13,000+ school district records
- Physical addresses and phone numbers
- **Website URLs** (when available in NCES data!)
- Enrollment and demographic data
- NCES IDs for standardized identification
**Why Added:**
> "Since one of your goals is tracking school dental screenings, you need a dedicated list of school board domains, as these are often separate from city governments."
**Usage:**
```python
from discovery.nces_ingestion import NCESSchoolDistrictIngestion
nces = NCESSchoolDistrictIngestion()
districts_df = await nces.ingest_school_districts()
```
---
## πŸ“Š Complete Data Source Lineup
| Source | Coverage | Cost | Update Frequency |
|--------|----------|------|------------------|
| **CISA .gov Domains** | 15,000+ domains | $0 | Daily |
| **Census Bureau GID** | 90,735 jurisdictions | $0 | Annual |
| **NCES CCD** | 13,000+ school districts | $0 | Annual |
**Total API costs: $0** πŸŽ‰
---
## πŸ“ Files Created/Updated
### New Files
- βœ… [discovery/nces_ingestion.py](../discovery/nces_ingestion.py) - NCES data ingestion module (~250 lines)
- βœ… [docs/DATA_SOURCES.md](DATA_SOURCES.md) - Complete data source documentation
### Updated Files
- βœ… [discovery/__init__.py](../discovery/__init__.py) - Added NCES to imports
- βœ… [README.md](../README.md) - Updated with all three official sources
- βœ… [docs/JURISDICTION_DISCOVERY.md](JURISDICTION_DISCOVERY.md) - Enhanced data sources section
---
## πŸ›οΈ Official Data Sources (As Recommended)
### 1. CISA .gov Domain Master List ⭐
**URL:** https://github.com/cisagov/dotgov-data
**Maintained By:** Cybersecurity and Infrastructure Security Agency
**Why:**
> "The most authoritative source for government URLs is CISA. They maintain a daily-updated repository of every registered .gov domain."
**Implementation:** βœ… Already using in [gsa_domains.py](../discovery/gsa_domains.py)
### 2. Census Bureau Government Integrated Directory (GID)
**URL:** https://www.census.gov/programs-surveys/gus.html
**Maintained By:** U.S. Census Bureau
**Why:**
> "The Census Bureau GID provides a list of all 90,000+ legal government units. You can join this against the CISA list to find 'missing' URLs."
**Implementation:** βœ… Already using in [census_ingestion.py](../discovery/census_ingestion.py)
### 3. NCES Common Core of Data (CCD) ⭐ **NEW**
**URL:** https://nces.ed.gov/ccd/
**Maintained By:** National Center for Education Statistics
**Why:**
> "You need a dedicated list of school board domains, as these are often separate from city governments."
**Implementation:** βœ… **Newly added** in [nces_ingestion.py](../discovery/nces_ingestion.py)
### 4. Future Enhancement: State and Local Government on the Net
**URL:** https://www.statelocalgov.net/
**Purpose:** Directory of non-.gov government sites
**Status:** πŸ“ Documented as future enhancement
**Use Case:** Fallback for municipalities using .org, .net, .us domains
---
## πŸ” Enhanced Coverage
### Non-.gov Domain Support
Our URL patterns already cover non-.gov domains:
**Counties:**
```python
"sacramentocounty.org" # confidence: 0.6
"sacramento.ca.us" # confidence: 0.7
```
**Cities:**
```python
"cityname.us" # confidence: 0.7
"cityname.org" # confidence: 0.6
```
**School Districts:**
```python
"districtschools.net" # confidence: 0.75
"districtschools.org" # confidence: 0.8
"district.k12.state.us" # confidence: 0.85
```
---
## πŸ“‹ Scraping Strategy (Your Guidance)
### Step 1: Ingest (Bronze Layer)
```bash
python main.py discover-jurisdictions --limit 100
```
**Pulls:**
- βœ… CISA `current-full.csv` β†’ `bronze/gov_domains`
- βœ… Census Bureau GID CSVs β†’ `bronze/jurisdictions/*`
- βœ… NCES CCD β†’ `bronze/nces_school_districts` πŸ†•
### Step 2: Filter (Silver Layer)
```python
# Filter for local governments
local_govs = df.filter(
col("Domain Type").isin(["City", "County", "School District"])
)
```
### Step 3: Crawl
```bash
python main.py scrape-batch --source discovered --limit 50
```
**Points Scrapy agents at:**
- URLs from CISA registry
- URLs from pattern matching
- URLs from NCES data (when available) πŸ†•
### Step 4: Keyword Hunt
**Agent searches for:**
- "Minutes" pages
- "Agendas" pages
- "Meetings" pages
- "Water" + "Fluoride" content 🦷
---
## πŸš€ Next Steps
### 1. Install Dependencies (if needed)
```bash
pip install -r requirements.txt
```
### 2. Test NCES Integration
```bash
python -c "
from discovery.nces_ingestion import NCESSchoolDistrictIngestion
print('βœ… NCES module ready')
"
```
### 3. Run Discovery with All Sources
```bash
# Test run
python main.py discover-jurisdictions --limit 100
# View results
python main.py discovery-stats
```
### 4. Full Production Run
Use Databricks notebook with all three data sources integrated.
---
## πŸ’° Cost Analysis
**Before (Deprecated Approach):**
- Google Custom Search API: ~$150 per discovery run
- Bing Search API: ~$90 per discovery run
- **Total: $240+**
**After (Official Sources):**
- CISA .gov domains: **$0**
- Census Bureau GID: **$0**
- NCES CCD: **$0**
- Pattern matching: **$0**
- **Total: $0** πŸŽ‰
**Savings: $240+ per discovery run** βœ…
---
## πŸ“š Documentation
- **Data Sources:** [DATA_SOURCES.md](DATA_SOURCES.md) - Complete documentation of all official sources
- **Discovery Guide:** [JURISDICTION_DISCOVERY.md](JURISDICTION_DISCOVERY.md) - Technical details
- **Setup Guide:** [JURISDICTION_DISCOVERY_SETUP.md](JURISDICTION_DISCOVERY_SETUP.md) - Quick start
- **Deployment:** [JURISDICTION_DISCOVERY_DEPLOYMENT.md](JURISDICTION_DISCOVERY_DEPLOYMENT.md) - Production deployment
---
## βœ… Verification
All official data sources now integrated:
- [x] CISA .gov Domain Master List (cisagov/dotgov-data)
- [x] Census Bureau GID (90,735 jurisdictions)
- [x] NCES Common Core of Data (13,000+ school districts)
- [x] Non-.gov domain patterns (.org, .net, .us)
- [x] Complete documentation of sources
- [x] Zero external API costs
---
## πŸ™ Credits
**Thank you for the excellent guidance on official data sources!**
This system now uses **exactly the sources recommended by professional data engineers** to map the U.S. government landscape:
βœ… CISA - Most authoritative for .gov domains
βœ… Census Bureau - Complete government unit list
βœ… NCES - Dedicated school district data
βœ… Pattern Matching - Vendor-neutral URL discovery
**The "Finder & Fixer" is now powered entirely by official, free, public datasets!** 🦷✨
---
**Ready to discover 90,000+ government websites using authoritative sources with $0 in API costs!** πŸš€