
Official Data Sources for Jurisdiction Discovery

This document credits the official, free, public datasets used by the Oral Health Policy Pulse jurisdiction discovery system.


🏛️ Primary Data Sources

1. CISA .gov Domain Master List ⭐ Most Authoritative

Source: Cybersecurity and Infrastructure Security Agency (CISA)
URL: https://github.com/cisagov/dotgov-data
File: current-full.csv (updated daily!)

What It Contains:

  • 15,000+ registered .gov domains
  • Domain Type: City, County, State, Tribal, School District
  • Organization names and locations
  • Security contacts and registration dates

Why We Use It:

"The most authoritative source for government URLs is CISA. They maintain a daily-updated repository of every registered .gov domain."

How We Use It:

# Direct download from GitHub
import asyncio
from discovery.gsa_domains import GSADomainList

async def main():
    gsa = GSADomainList()
    return await gsa.download_domain_list()

domains_df = asyncio.run(main())  # await needs an async context

Lakehouse Strategy:

  1. Ingest to Bronze Layer (bronze/gov_domains)
  2. Filter by Domain Type for targeted scraping (City, County)
  3. Use for exact matching (confidence: 0.95-1.0)
  4. Use for fuzzy matching with 75%+ similarity
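
The exact and fuzzy matching steps could be sketched roughly as follows. This is a minimal illustration, not the project's matcher: `normalize`, `match_domain`, and the choice of `difflib.SequenceMatcher` as the similarity measure are assumptions. Only the 75% threshold and the idea that exact matches carry the highest confidence come from the list above.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and keep only letters and digits ('Austin Texas' -> 'austintexas')."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_domain(jurisdiction: str, domains: list[str], threshold: float = 0.75):
    """Return (domain, confidence) for the best match, or None if nothing reaches the threshold."""
    target = normalize(jurisdiction)
    best = None
    for domain in domains:
        label = normalize(domain.rsplit(".", 1)[0])  # drop the TLD before comparing
        if label == target:
            return domain, 1.0  # exact match: highest confidence
        score = SequenceMatcher(None, target, label).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (domain, score)
    return best
```

Calling `match_domain("Springfield", ["springfieldmo.gov"])` would return the domain with a similarity score below 1.0, while an unrelated domain list returns `None`.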

2. U.S. Census Bureau - Government Integrated Directory (GID)

Source: U.S. Census Bureau, Government Statistics
URL: https://www.census.gov/programs-surveys/gus.html
Dataset: 2022 Census of Governments

What It Contains:

  • 90,735 total government units
    • 3,143 counties
    • 19,495 municipalities (cities/towns)
    • 16,504 townships
    • 13,051 school districts
    • 38,542 special districts
  • FIPS codes (standardized IDs)
  • Population data
  • Geographic hierarchy (state, county, place)

Why We Use It:

"The Census Bureau GID provides a list of all 90,000+ legal government units. You can join this against the CISA list to find 'missing' URLs that your agent needs to hunt for."

How We Use It:

import asyncio
from discovery.census_ingestion import CensusGovernmentIngestion

async def main():
    census = CensusGovernmentIngestion()
    return await census.ingest_all_jurisdictions()

dfs = asyncio.run(main())  # await needs an async context

Lakehouse Strategy:

  1. Ingest to Bronze Layer (bronze/jurisdictions/{type})
  2. Create unified view with all jurisdiction types
  3. Join with CISA to identify missing URLs
  4. Prioritize by population for scraping
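
Steps 3 and 4 amount to an anti-join followed by a sort. The sketch below shows the idea in plain Python so it runs without Spark. The record fields (`name`, `state`, `population`, `organization`) and the sample rows are made up for illustration and are not the actual Census or CISA column names.

```python
def find_missing_urls(census_units, cisa_domains):
    """Return Census units with no matching CISA record, largest population first."""
    # Key both sides on (lowercased name, state) so the comparison is case-insensitive.
    known = {(d["organization"].lower(), d["state"]) for d in cisa_domains}
    missing = [
        unit for unit in census_units
        if (unit["name"].lower(), unit["state"]) not in known
    ]
    return sorted(missing, key=lambda unit: unit["population"], reverse=True)


# Hypothetical sample rows, for illustration only.
census_units = [
    {"name": "City of Alpha", "state": "TX", "population": 120_000},
    {"name": "Town of Beta", "state": "TX", "population": 4_500},
    {"name": "City of Gamma", "state": "OH", "population": 60_000},
]
cisa_domains = [
    {"organization": "City of Alpha", "state": "TX", "domain": "alpha.gov"},
]

to_hunt = find_missing_urls(census_units, cisa_domains)
```

In a Spark job the same logic would typically be a `left_anti` join on the shared key columns followed by an `orderBy` on population.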

3. NCES Common Core of Data (CCD)

Source: National Center for Education Statistics (NCES)
URL: https://nces.ed.gov/ccd/
Dataset: Local Education Agency (LEA) Universe Survey

What It Contains:

  • 13,000+ school districts
  • Official district names and NCES IDs
  • Physical addresses and phone numbers
  • Website URLs (when available)
  • Enrollment and demographic data
  • District type (Regular, Charter, etc.)

Why We Use It:

"Since one of your goals is tracking school dental screenings, you need a dedicated list of school board domains, as these are often separate from city governments."

How We Use It:

import asyncio
from discovery.nces_ingestion import NCESSchoolDistrictIngestion

async def main():
    nces = NCESSchoolDistrictIngestion()
    return await nces.ingest_school_districts()

districts_df = asyncio.run(main())  # await needs an async context

Lakehouse Strategy:

  1. Ingest to Bronze Layer (bronze/nces_school_districts)
  2. Extract provided URLs (many NCES records include website field!)
  3. Use district names to generate URL patterns for missing sites
  4. Common pattern: {district}.k12.{state}.us
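
A rough sketch of how the `{district}.k12.{state}.us` pattern might be filled in from a district name. The function name `k12_candidates` and the list of words stripped from the name are assumptions; real district hostnames vary widely, so the output is only a list of candidates to verify.

```python
import re

# Generic words dropped before building the hostname label (an assumed list).
_GENERIC = r"\b(independent|unified|public|school|schools|district)\b"


def k12_candidates(district_name: str, state_abbr: str) -> list[str]:
    """Build candidate URLs following the {district}.k12.{state}.us pattern."""
    base = re.sub(_GENERIC, " ", district_name.lower())
    slug = re.sub(r"[^a-z0-9]+", "", base)  # squeeze into a DNS-safe label
    if not slug:
        return []
    host = f"{slug}.k12.{state_abbr.lower()}.us"
    return [f"https://{host}", f"https://www.{host}"]
```

For example, `k12_candidates("Springfield Public Schools", "MO")` yields `https://springfield.k12.mo.us` and its `www.` variant.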

📋 Summary Table: Where to Pull the Lists

| Jurisdiction Type | Primary Free Source | Format | Coverage |
|---|---|---|---|
| All Official .gov | CISA dotgov-data | CSV / GitHub | 15,000+ domains |
| School Districts | NCES CCD Data | CSV | 13,000+ districts |
| Counties/Cities | Census Bureau GID | CSV | 22,638 jurisdictions |
| Townships | Census Bureau GID | CSV | 16,504 townships |
| Special Districts | Census Bureau GID | CSV | 38,542 districts |
| State Legislatures | LegiScan API | JSON / API | 50 states |

🔍 Scraping Strategy (Based on Your Guidance)

Step 1: Ingest

python main.py init  # Initialize Delta Lake
python main.py discover-jurisdictions --limit 100  # Test run

Pulls:

  • ✅ current-full.csv from CISA → Bronze layer
  • ✅ Census GID CSVs → Bronze layer
  • ✅ NCES CCD data → Bronze layer

Step 2: Filter

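# Build the Silver layer input from the Bronze table
# (assumes `spark` is an existing SparkSession configured for Delta Lake)
from pyspark.sql.functions import col

df = spark.read.format("delta").load("bronze/gov_domains")

# Filter for local governments
local_govs = df.filter(
    col("Domain Type").isin(["City", "County", "School District"])
)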

Result: ~8,000-10,000 high-priority targets

Step 3: Crawl

python main.py scrape-batch --source discovered --limit 50

Points Scrapy agents at discovered URLs:

  • Homepage URLs from CISA + pattern matching
  • Verified with HTTP HEAD/GET requests
  • Prioritized by population and domain type
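
The HEAD/GET verification step could look something like the sketch below, using only the standard library. `url_is_live` is a hypothetical helper, not the project's crawler code. The `opener` parameter exists so the check can be exercised without network access.

```python
import http.client
import urllib.error
import urllib.request


def url_is_live(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the URL answers HEAD (or, failing that, GET) with a status below 400."""
    for method in ("HEAD", "GET"):  # some servers reject HEAD, so fall back to GET
        try:
            request = urllib.request.Request(url, method=method)
            with opener(request, timeout=timeout) as response:
                if response.status < 400:
                    return True
        except (OSError, ValueError, http.client.HTTPException):
            # URLError/HTTPError and timeouts are OSError subclasses;
            # ValueError covers malformed URLs.
            continue
    return False
```

Trying HEAD first keeps the check cheap, and the GET fallback covers servers that answer HEAD with an error.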

Step 4: Keyword Hunt

Agent searches for:

  • "Minutes" pages
  • "Agendas" pages
  • "Meetings" pages
  • "Water" + "Fluoride" content

CMS Detection:

  • Granicus
  • CivicClerk
  • Municode
  • Legistar
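
As a minimal sketch of the keyword hunt and CMS detection, the function below scans a page's HTML for meeting-related links, water/fluoride mentions, and vendor fingerprints. The keyword stems, the vendor host fragments in `CMS_SIGNATURES`, and the `scan_page` helper are assumptions chosen for illustration. A production crawler would use a proper HTML parser instead of a regular expression.

```python
import re

# Stems so that "Agenda"/"Agendas" and "Meeting"/"Meetings" both match.
KEYWORDS = ("minutes", "agenda", "meeting")

# Assumed host fragments for each vendor; not taken from the project.
CMS_SIGNATURES = {
    "Granicus": "granicus.com",
    "CivicClerk": "civicclerk.com",
    "Municode": "municode.com",
    "Legistar": "legistar.com",
}


def scan_page(html: str) -> dict:
    """Collect meeting-related links, a fluoride flag, and detected CMS vendors."""
    text = html.lower()
    links = re.findall(r'href="([^"]+)"', html, flags=re.IGNORECASE)
    return {
        "meeting_links": [
            link for link in links if any(k in link.lower() for k in KEYWORDS)
        ],
        # "fluorid" covers both "fluoride" and "fluoridation".
        "mentions_fluoride": "water" in text and "fluorid" in text,
        "cms": [name for name, sig in CMS_SIGNATURES.items() if sig in text],
    }
```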

πŸš€ Non-.gov Coverage

Many smaller municipalities use non-.gov domains:

  • .org (e.g., cityofsomewhere.org)
  • .us (e.g., somewhere.ca.us)
  • .net (e.g., districtschools.net)

Our URL patterns cover these:

# Pattern generation includes:
patterns = [
    "https://cityname.gov",       # Primary
    "https://cityname.us",        # Alternative
    "https://cityname.org",       # Non-profit
    "https://cityname.net",       # Legacy
]
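
One way the pattern list above might be generated from a jurisdiction name is sketched here. `candidate_urls` is a hypothetical helper, and the `cityof` prefix is borrowed from the `cityofsomewhere.org` example. The candidates are guesses that still need to be verified before scraping.

```python
import re

TLDS = ("gov", "us", "org", "net")  # same priority order as the pattern list


def candidate_urls(city_name: str) -> list[str]:
    """Generate candidate homepage URLs for a city, highest-priority TLD first."""
    slug = re.sub(r"[^a-z0-9]+", "", city_name.lower())
    if not slug:
        return []
    labels = (slug, f"cityof{slug}")
    return [f"https://{label}.{tld}" for tld in TLDS for label in labels]
```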

Future Enhancement:


💰 Cost: $0

All data sources are free and publicly available:

| Source | Cost | Update Frequency |
|---|---|---|
| CISA dotgov-data | $0 | Daily |
| Census Bureau GID | $0 | Annual |
| NCES CCD | $0 | Annual |
| Pattern Matching | $0 | On-demand |

Total API costs: $0 🎉

Compare to deprecated approach:

  • Google Custom Search API: $5/1000 queries = ~$150
  • Bing Search API: $7/1000 queries = ~$90

Savings: $240+ per discovery run ✅


📚 References


✅ Credits

System Architecture: Medallion Architecture (Bronze → Silver → Gold)
Data Engineering Pattern: Delta Lake + PySpark
Sustainable Approach: No deprecated search APIs
Guidance Source: Professional data engineering best practices

Thank you for the excellent guidance on official data sources! 🙏

This system now uses the exact sources recommended by data engineers to map the U.S. government landscape. 🦷✨