Spaces:
Running on CPU Upgrade
Running on CPU Upgrade
| # Official Data Sources for Jurisdiction Discovery | |
| This document credits the **official, free, public datasets** used by the Oral Health Policy Pulse jurisdiction discovery system. | |
| --- | |
| ## ποΈ Primary Data Sources | |
| ### 1. CISA .gov Domain Master List β **Most Authoritative** | |
| **Source:** Cybersecurity and Infrastructure Security Agency (CISA) | |
| **URL:** https://github.com/cisagov/dotgov-data | |
| **File:** `current-full.csv` (updated daily!) | |
| **What It Contains:** | |
| - **15,000+ registered .gov domains** | |
| - Domain Type: City, County, State, Tribal, School District | |
| - Organization names and locations | |
| - Security contacts and registration dates | |
| **Why We Use It:** | |
| > "The most authoritative source for government URLs is CISA. They maintain a daily-updated repository of every registered .gov domain." | |
| **How We Use It:** | |
| ```python | |
| # Direct download from GitHub | |
| from discovery.gsa_domains import GSADomainList | |
| gsa = GSADomainList() | |
| domains_df = await gsa.download_domain_list() | |
| ``` | |
| **Lakehouse Strategy:** | |
| 1. Ingest to **Bronze Layer** (`bronze/gov_domains`) | |
| 2. Filter by `Domain Type` for targeted scraping (City, County) | |
| 3. Use for **exact matching** (confidence: 0.95-1.0) | |
| 4. Use for **fuzzy matching** with 75%+ similarity | |
| --- | |
| ### 2. U.S. Census Bureau - Government Integrated Directory (GID) | |
| **Source:** U.S. Census Bureau, Government Statistics | |
| **URL:** https://www.census.gov/programs-surveys/gus.html | |
| **Dataset:** 2022 Census of Governments | |
| **What It Contains:** | |
| - **90,735 total government units** | |
| - 3,143 counties | |
| - 19,495 municipalities (cities/towns) | |
| - 16,504 townships | |
| - 13,051 school districts | |
| - 38,542 special districts | |
| - FIPS codes (standardized IDs) | |
| - Population data | |
| - Geographic hierarchy (state, county, place) | |
| **Why We Use It:** | |
| > "The Census Bureau GID provides a list of all 90,000+ legal government units. You can join this against the CISA list to find 'missing' URLs that your agent needs to hunt for." | |
| **How We Use It:** | |
| ```python | |
| from discovery.census_ingestion import CensusGovernmentIngestion | |
| census = CensusGovernmentIngestion() | |
| dfs = await census.ingest_all_jurisdictions() | |
| ``` | |
| **Lakehouse Strategy:** | |
| 1. Ingest to **Bronze Layer** (`bronze/jurisdictions/{type}`) | |
| 2. Create **unified view** with all jurisdiction types | |
| 3. **Join with CISA** to identify missing URLs | |
| 4. Prioritize by population for scraping | |
| --- | |
| ### 3. NCES Common Core of Data (CCD) | |
| **Source:** National Center for Education Statistics (NCES) | |
| **URL:** https://nces.ed.gov/ccd/ | |
| **Dataset:** Local Education Agency (LEA) Universe Survey | |
| **What It Contains:** | |
| - **13,000+ school districts** | |
| - Official district names and NCES IDs | |
| - Physical addresses and phone numbers | |
| - **Website URLs** (when available) | |
| - Enrollment and demographic data | |
| - District type (Regular, Charter, etc.) | |
| **Why We Use It:** | |
| > "Since one of your goals is tracking school dental screenings, you need a dedicated list of school board domains, as these are often separate from city governments." | |
| **How We Use It:** | |
| ```python | |
| from discovery.nces_ingestion import NCESSchoolDistrictIngestion | |
| nces = NCESSchoolDistrictIngestion() | |
| districts_df = await nces.ingest_school_districts() | |
| ``` | |
| **Lakehouse Strategy:** | |
| 1. Ingest to **Bronze Layer** (`bronze/nces_school_districts`) | |
| 2. Extract **provided URLs** (many NCES records include website field!) | |
| 3. Use district names to **generate URL patterns** for missing sites | |
| 4. Common pattern: `{district}.k12.{state}.us` | |
| --- | |
| ## π Summary Table: Where to Pull the Lists | |
| | Jurisdiction Type | Primary Free Source | Format | Coverage | | |
| |-------------------|---------------------|--------|----------| | |
| | **All Official .gov** | CISA dotgov-data | CSV / GitHub | 15,000+ domains | | |
| | **School Districts** | NCES CCD Data | CSV | 13,000+ districts | | |
| | **Counties/Cities** | Census Bureau GID | CSV | 22,638 jurisdictions | | |
| | **Townships** | Census Bureau GID | CSV | 16,504 townships | | |
| | **Special Districts** | Census Bureau GID | CSV | 38,542 districts | | |
| | **State Legislatures** | LegiScan API | JSON / API | 50 states | | |
| --- | |
| ## π Scraping Strategy (Based on Your Guidance) | |
| ### Step 1: Ingest | |
| ```bash | |
| python main.py init # Initialize Delta Lake | |
| python main.py discover-jurisdictions --limit 100 # Test run | |
| ``` | |
| **Pulls:** | |
| - β `current-full.csv` from CISA β Bronze layer | |
| - β Census GID CSVs β Bronze layer | |
| - β NCES CCD data β Bronze layer | |
| ### Step 2: Filter | |
| ```python | |
| # Create Silver layer table | |
| df = spark.read.format("delta").load("bronze/gov_domains") | |
| # Filter for local governments | |
| local_govs = df.filter( | |
| col("Domain Type").isin(["City", "County", "School District"]) | |
| ) | |
| ``` | |
| **Result:** ~8,000-10,000 high-priority targets | |
| ### Step 3: Crawl | |
| ```bash | |
| python main.py scrape-batch --source discovered --limit 50 | |
| ``` | |
| **Points Scrapy agents at discovered URLs:** | |
| - Homepage URLs from CISA + pattern matching | |
| - Verified with HTTP HEAD/GET requests | |
| - Prioritized by population and domain type | |
| ### Step 4: Keyword Hunt | |
| **Agent searches for:** | |
| - "Minutes" pages | |
| - "Agendas" pages | |
| - "Meetings" pages | |
| - "Water" + "Fluoride" content | |
| **CMS Detection:** | |
| - Granicus | |
| - CivicClerk | |
| - Municode | |
| - Legistar | |
| --- | |
| ## π Non-.gov Coverage | |
| **Many smaller municipalities use non-.gov domains:** | |
| - `.org` (e.g., `cityofsomewhere.org`) | |
| - `.us` (e.g., `somewhere.ca.us`) | |
| - `.net` (e.g., `districschools.net`) | |
| **Our URL patterns cover these:** | |
| ```python | |
| # Pattern generation includes: | |
| patterns = [ | |
| "https://cityname.gov", # Primary | |
| "https://cityname.us", # Alternative | |
| "https://cityname.org", # Non-profit | |
| "https://cityname.net", # Legacy | |
| ] | |
| ``` | |
| **Future Enhancement:** | |
| - [State and Local Government on the Net](https://www.statelocalgov.net/) | |
| - Could scrape this directory as fallback for missing URLs | |
| - Manually curated list of non-.gov government sites | |
| --- | |
| ## π° Cost: $0 | |
| All data sources are **free and publicly available**: | |
| | Source | Cost | Update Frequency | | |
| |--------|------|------------------| | |
| | CISA dotgov-data | **$0** | Daily | | |
| | Census Bureau GID | **$0** | Annual | | |
| | NCES CCD | **$0** | Annual | | |
| | Pattern Matching | **$0** | On-demand | | |
| **Total API costs:** **$0** π | |
| Compare to deprecated approach: | |
| - ~~Google Custom Search API: $5/1000 queries = ~$150~~ | |
| - ~~Bing Search API: $7/1000 queries = ~$90~~ | |
| **Savings: $240+ per discovery run** β | |
| --- | |
| ## π References | |
| - **CISA .gov Domains:** https://github.com/cisagov/dotgov-data | |
| - **Census Bureau GID:** https://www.census.gov/programs-surveys/gus.html | |
| - **NCES CCD:** https://nces.ed.gov/ccd/ | |
| - **State/Local Gov Directory:** https://www.statelocalgov.net/ | |
| - **LegiScan API:** https://legiscan.com/legiscan | |
| --- | |
| ## β Credits | |
| **System Architecture:** Medallion Architecture (Bronze β Silver β Gold) | |
| **Data Engineering Pattern:** Delta Lake + PySpark | |
| **Sustainable Approach:** No deprecated search APIs | |
| **Guidance Source:** Professional data engineering best practices | |
| **Thank you for the excellent guidance on official data sources!** π | |
| This system now uses **the exact sources recommended by data engineers** to map the U.S. government landscape. π¦·β¨ | |