Spaces:

CommunityOne
/

open-navigator

Running on CPU Upgrade

File size: 7,154 Bytes

896453f

# Official Data Sources for Jurisdiction Discovery

This document credits the **official, free, public datasets** used by the Oral Health Policy Pulse jurisdiction discovery system.

---

## 🏛️ Primary Data Sources

### 1. CISA .gov Domain Master List ⭐ **Most Authoritative**

**Source:** Cybersecurity and Infrastructure Security Agency (CISA)  
**URL:** https://github.com/cisagov/dotgov-data  
**File:** `current-full.csv` (updated daily!)

**What It Contains:**
- **15,000+ registered .gov domains**
- Domain Type: City, County, State, Tribal, School District
- Organization names and locations
- Security contacts and registration dates

**Why We Use It:**
> "The most authoritative source for government URLs is CISA. They maintain a daily-updated repository of every registered .gov domain."

**How We Use It:**
```python
# Direct download from GitHub
from discovery.gsa_domains import GSADomainList

gsa = GSADomainList()
domains_df = await gsa.download_domain_list()
```

**Lakehouse Strategy:**
1. Ingest to **Bronze Layer** (`bronze/gov_domains`)
2. Filter by `Domain Type` for targeted scraping (City, County)
3. Use for **exact matching** (confidence: 0.95-1.0)
4. Use for **fuzzy matching** with 75%+ similarity

---

### 2. U.S. Census Bureau - Government Integrated Directory (GID)

**Source:** U.S. Census Bureau, Government Statistics  
**URL:** https://www.census.gov/programs-surveys/gus.html  
**Dataset:** 2022 Census of Governments

**What It Contains:**
- **90,735 total government units**
  - 3,143 counties
  - 19,495 municipalities (cities/towns)
  - 16,504 townships
  - 13,051 school districts
  - 38,542 special districts
- FIPS codes (standardized IDs)
- Population data
- Geographic hierarchy (state, county, place)

**Why We Use It:**
> "The Census Bureau GID provides a list of all 90,000+ legal government units. You can join this against the CISA list to find 'missing' URLs that your agent needs to hunt for."

**How We Use It:**
```python
from discovery.census_ingestion import CensusGovernmentIngestion

census = CensusGovernmentIngestion()
dfs = await census.ingest_all_jurisdictions()
```

**Lakehouse Strategy:**
1. Ingest to **Bronze Layer** (`bronze/jurisdictions/{type}`)
2. Create **unified view** with all jurisdiction types
3. **Join with CISA** to identify missing URLs
4. Prioritize by population for scraping

---

### 3. NCES Common Core of Data (CCD)

**Source:** National Center for Education Statistics (NCES)  
**URL:** https://nces.ed.gov/ccd/  
**Dataset:** Local Education Agency (LEA) Universe Survey

**What It Contains:**
- **13,000+ school districts**
- Official district names and NCES IDs
- Physical addresses and phone numbers
- **Website URLs** (when available)
- Enrollment and demographic data
- District type (Regular, Charter, etc.)

**Why We Use It:**
> "Since one of your goals is tracking school dental screenings, you need a dedicated list of school board domains, as these are often separate from city governments."

**How We Use It:**
```python
from discovery.nces_ingestion import NCESSchoolDistrictIngestion

nces = NCESSchoolDistrictIngestion()
districts_df = await nces.ingest_school_districts()
```

**Lakehouse Strategy:**
1. Ingest to **Bronze Layer** (`bronze/nces_school_districts`)
2. Extract **provided URLs** (many NCES records include website field!)
3. Use district names to **generate URL patterns** for missing sites
4. Common pattern: `{district}.k12.{state}.us`

---

## 📋 Summary Table: Where to Pull the Lists

| Jurisdiction Type | Primary Free Source | Format | Coverage |
|-------------------|---------------------|--------|----------|
| **All Official .gov** | CISA dotgov-data | CSV / GitHub | 15,000+ domains |
| **School Districts** | NCES CCD Data | CSV | 13,000+ districts |
| **Counties/Cities** | Census Bureau GID | CSV | 22,638 jurisdictions |
| **Townships** | Census Bureau GID | CSV | 16,504 townships |
| **Special Districts** | Census Bureau GID | CSV | 38,542 districts |
| **State Legislatures** | LegiScan API | JSON / API | 50 states |

---

## 🔍 Scraping Strategy (Based on Your Guidance)

### Step 1: Ingest
```bash
python main.py init  # Initialize Delta Lake
python main.py discover-jurisdictions --limit 100  # Test run
```

**Pulls:**
- ✅ `current-full.csv` from CISA → Bronze layer
- ✅ Census GID CSVs → Bronze layer  
- ✅ NCES CCD data → Bronze layer

### Step 2: Filter
```python
# Create Silver layer table
df = spark.read.format("delta").load("bronze/gov_domains")

# Filter for local governments
local_govs = df.filter(
    col("Domain Type").isin(["City", "County", "School District"])
)
```

**Result:** ~8,000-10,000 high-priority targets

### Step 3: Crawl
```bash
python main.py scrape-batch --source discovered --limit 50
```

**Points Scrapy agents at discovered URLs:**
- Homepage URLs from CISA + pattern matching
- Verified with HTTP HEAD/GET requests
- Prioritized by population and domain type

### Step 4: Keyword Hunt
**Agent searches for:**
- "Minutes" pages
- "Agendas" pages  
- "Meetings" pages
- "Water" + "Fluoride" content

**CMS Detection:**
- Granicus
- CivicClerk
- Municode
- Legistar

---

## 🚀 Non-.gov Coverage

**Many smaller municipalities use non-.gov domains:**
- `.org` (e.g., `cityofsomewhere.org`)
- `.us` (e.g., `somewhere.ca.us`)
- `.net` (e.g., `districschools.net`)

**Our URL patterns cover these:**
```python
# Pattern generation includes:
patterns = [
    "https://cityname.gov",       # Primary
    "https://cityname.us",        # Alternative
    "https://cityname.org",       # Non-profit
    "https://cityname.net",       # Legacy
]
```

**Future Enhancement:**
- [State and Local Government on the Net](https://www.statelocalgov.net/)
- Could scrape this directory as fallback for missing URLs
- Manually curated list of non-.gov government sites

---

## 💰 Cost: $0

All data sources are **free and publicly available**:

| Source | Cost | Update Frequency |
|--------|------|------------------|
| CISA dotgov-data | **$0** | Daily |
| Census Bureau GID | **$0** | Annual |
| NCES CCD | **$0** | Annual |
| Pattern Matching | **$0** | On-demand |

**Total API costs:** **$0** 🎉

Compare to deprecated approach:
- ~~Google Custom Search API: $5/1000 queries = ~$150~~
- ~~Bing Search API: $7/1000 queries = ~$90~~

**Savings: $240+ per discovery run** ✅

---

## 📚 References

- **CISA .gov Domains:** https://github.com/cisagov/dotgov-data
- **Census Bureau GID:** https://www.census.gov/programs-surveys/gus.html
- **NCES CCD:** https://nces.ed.gov/ccd/
- **State/Local Gov Directory:** https://www.statelocalgov.net/
- **LegiScan API:** https://legiscan.com/legiscan

---

## ✅ Credits

**System Architecture:** Medallion Architecture (Bronze → Silver → Gold)  
**Data Engineering Pattern:** Delta Lake + PySpark  
**Sustainable Approach:** No deprecated search APIs  
**Guidance Source:** Professional data engineering best practices

**Thank you for the excellent guidance on official data sources!** 🙏

This system now uses **the exact sources recommended by data engineers** to map the U.S. government landscape. 🦷✨