	# Jurisdiction Discovery System

	## Overview

	The Jurisdiction Discovery System automatically identifies and tracks over 90,000 local government units across the United States, discovering their official websites and meeting minutes URLs using pattern-based matching and public datasets.

	## ✅ Sustainable Approach (No Deprecated APIs)

	This system uses vendor-neutral, production-ready methods:

	- ✅ Pattern Matching - Generate URLs from jurisdiction names using common government patterns
	- ✅ GSA .gov Registry - Direct matching with authoritative domain list
	- ✅ Web Crawling - Verify URLs and discover minutes pages
	- ✅ Public Datasets - Census Bureau + GSA official data

	Does NOT use:
	- ❌ Google Custom Search API (deprecated for production)
	- ❌ Bing Search API (legacy, not recommended)

	Benefits:
	- 🆓 Zero API costs - No search API fees
	- 🔒 Reliable - No search API rate limits or quotas
	- ♻️ Sustainable - Vendor-neutral, future-proof
	- 📊 Reproducible - Deterministic pattern matching

	## Architecture

	### Discovery Strategy

	```
	┌─────────────────────────────────────────────────────────┐
	│  BRONZE LAYER (Public Datasets)                         │
	├─────────────────────────────────────────────────────────┤
	│  Census Bureau GID:  90,735 jurisdictions               │
	│  GSA .gov Registry:  12,000+ validated domains          │
	└─────────────────────────────────────────────────────────┘
	                             │
	                             ▼
	┌─────────────────────────────────────────────────────────┐
	│  SILVER LAYER (Pattern-Based Discovery)                 │
	├─────────────────────────────────────────────────────────┤
	│                                                         │
	│  Strategy 1: GSA Domain Matching                        │
	│   • Direct lookup: "Sacramento County" → sacramento.gov │
	│   • Fuzzy matching with 75%+ similarity                 │
	│   • Confidence: 0.95-1.0                                │
	│                                                         │
	│  Strategy 2: URL Pattern Generation                     │
	│   • Counties: co.{name}.{state}.us, {name}county.gov    │
	│   • Cities: www.{name}.gov, cityof{name}.gov            │
	│   • Schools: {name}.k12.{state}.us, {name}schools.org   │
	│   • Verify accessibility with HTTP HEAD/GET             │
	│   • Confidence: 0.6-0.9                                 │
	│                                                         │
	│  Strategy 3: Web Crawling                               │
	│   • Find "minutes" or "agendas" links on homepage       │
	│   • Detect CMS platforms (Granicus, CivicClerk, etc.)   │
	│   • Confidence boost for .gov domains                   │
	│                                                         │
	└─────────────────────────────────────────────────────────┘
	                             │
	                             ▼
	┌─────────────────────────────────────────────────────────┐
	│  GOLD LAYER (Scraping Targets)                          │
	├─────────────────────────────────────────────────────────┤
	│   • Confidence score > 0.6                              │
	│   • Has minutes URL or known CMS                        │
	│   • Prioritized by population & domain quality          │
	│   • Ready for scraper agents                            │
	└─────────────────────────────────────────────────────────┘
	```
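To make Strategy 2 concrete, here is a minimal sketch of URL pattern generation. The pattern strings come from the diagram; the `URL_PATTERNS` table, the `slugify` and `candidate_urls` helpers, and the name-cleaning rules are assumptions made for illustration, not the project's actual implementation.

```python
import re

# Hypothetical pattern table, transcribed from the Strategy 2 examples.
URL_PATTERNS = {
    "county": ["co.{name}.{state}.us", "{name}county.gov"],
    "city": ["www.{name}.gov", "cityof{name}.gov"],
    "school_district": ["{name}.k12.{state}.us", "{name}schools.org"],
}

# Leading "City of ..." and trailing "... County" style words describe the
# jurisdiction type, not the place, so they are stripped before slugging.
TYPE_WORDS = re.compile(
    r"^(?:city|town|village) of\s+|\s+(?:county|school district|schools)$"
)


def slugify(name: str) -> str:
    """Lowercase the name, drop type words, and keep only letters and digits."""
    stripped = TYPE_WORDS.sub("", name.lower().strip())
    return re.sub(r"[^a-z0-9]", "", stripped)


def candidate_urls(name: str, state: str, jurisdiction_type: str) -> list[str]:
    """Return candidate homepage URLs for one jurisdiction, in pattern order."""
    slug = slugify(name)
    patterns = URL_PATTERNS.get(jurisdiction_type, [])
    return ["https://" + p.format(name=slug, state=state.lower()) for p in patterns]


print(candidate_urls("Sacramento County", "CA", "county"))
# ['https://co.sacramento.ca.us', 'https://sacramentocounty.gov']
```

Each candidate would still need the HTTP HEAD/GET check from the diagram before it is accepted; the sketch only covers the deterministic generation step.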

	## Usage

	```bash
	# Run discovery (no API keys needed!)
	python main.py discover-jurisdictions --limit 100

	# View statistics
	python main.py discovery-stats

	# Start scraping discovered sites
	python main.py scrape-batch --source discovered
	```

	## Performance

	### Expected Discovery Rates

	| Jurisdiction Type | Success Rate | Notes |
	|-------------------|--------------|-------|
	| Counties | 85-95% | Best coverage (official .gov domains) |
	| Cities > 10k pop | 75-90% | Good pattern matching |
	| School Districts | 70-85% | Consistent naming conventions |
	| Townships | 50-65% | More variation in URLs |

	### Benchmarks

	- 100 jurisdictions: ~3-5 minutes
	- 30,000 jurisdictions: ~12-18 hours
	- Total cost: $0 (no API fees)

	## References

	- Census Bureau GID: https://www.census.gov/programs-surveys/gus.html
	- GSA .gov Domains: https://github.com/cisagov/dotgov-data
	- Government URL Best Practices: https://digital.gov/topics/url-management/

	For detailed documentation, see [JURISDICTION_DISCOVERY_SETUP.md](JURISDICTION_DISCOVERY_SETUP.md)

	---

	Production-ready with zero external API dependencies! 🦷✨
	# Jurisdiction Discovery System

	## Overview

	The Jurisdiction Discovery System automatically identifies and tracks over 90,000 local government units across the United States, discovering their official websites and meeting minutes URLs.

	## Architecture

	### Medallion Pipeline

	```
	┌─────────────────────────────────────────────────────────┐
	│  BRONZE LAYER                                           │
	│  (Raw Data Ingestion)                                   │
	├─────────────────────────────────────────────────────────┤
	│                                                         │
	│  Census Bureau Data:                                    │
	│   • 3,143 counties                                      │
	│   • 19,495 municipalities                               │
	│   • 16,504 townships                                    │
	│   • 13,051 school districts                             │
	│   • 38,542 special districts                            │
	│   ────────────────                                      │
	│   Total: ~90,735 jurisdictions                          │
	│                                                         │
	│  GSA .gov Domain List:                                  │
	│   • 12,000+ validated .gov domains                      │
	│   • Domain type classification                          │
	│   • Organization mapping                                │
	│                                                         │
	└─────────────────────────────────────────────────────────┘
	                             │
	                             ▼
	┌─────────────────────────────────────────────────────────┐
	│  SILVER LAYER                                           │
	│  (URL Discovery & Validation)                           │
	├─────────────────────────────────────────────────────────┤
	│                                                         │
	│  For each jurisdiction:                                 │
	│   1. Search for official website (Google/Bing API)      │
	│   2. Validate against .gov domain list                  │
	│   3. Crawl homepage for "minutes" links                 │
	│   4. Detect CMS platform (Granicus, CivicClerk, etc.)   │
	│   5. Assign confidence score                            │
	│                                                         │
	│  Output: Discovered URLs with metadata                  │
	│                                                         │
	└─────────────────────────────────────────────────────────┘
	                             │
	                             ▼
	┌─────────────────────────────────────────────────────────┐
	│  GOLD LAYER                                             │
	│  (Scraping Targets)                                     │
	├─────────────────────────────────────────────────────────┤
	│                                                         │
	│  Filtered & Prioritized:                                │
	│   • Has minutes URL                                     │
	│   • Confidence score > 0.6                              │
	│   • Preferably .gov domain                              │
	│   • Population-weighted priority                        │
	│                                                         │
	│  Ready for: Scraper Agent                               │
	│                                                         │
	└─────────────────────────────────────────────────────────┘
	```
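The gold-layer filter in this diagram can be sketched in plain Python. This is an illustration under assumptions: the `DiscoveredUrl` class and `select_scraping_targets` function are hypothetical, the field names follow the silver-layer columns listed under Output Tables, and `population` is assumed to have been joined in from the Census data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscoveredUrl:
    # Hypothetical record type; field names mirror the silver-layer columns.
    jurisdiction_id: str
    homepage_url: str
    minutes_url: Optional[str]
    is_gov_domain: bool
    confidence_score: float
    population: int = 0  # assumed to be joined from the bronze Census tables


def select_scraping_targets(rows, min_confidence=0.6):
    """Keep rows that have a minutes URL and a confidence above the threshold,
    then order them .gov-first and by descending population."""
    kept = [
        r for r in rows
        if r.minutes_url and r.confidence_score > min_confidence
    ]
    # False sorts before True, so .gov rows (not is_gov_domain == False) lead.
    return sorted(kept, key=lambda r: (not r.is_gov_domain, -r.population))
```

Sorting on a tuple keeps the ".gov preferred, then population" ordering explicit without computing a numeric score; the pipeline itself may weight these factors differently.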

	## Data Sources

	### 1. U.S. Census Bureau - Government Integrated Directory (GID)

	URL: https://www.census.gov/programs-surveys/gus.html

	What it provides:
	- Complete list of all local governments
	- FIPS codes (Federal Information Processing Standards)
	- Population data
	- Functional status
	- Geographic hierarchy

	Usage:
	```python
	import asyncio
	from discovery.census_ingestion import CensusGovernmentIngestion

	async def main():
	    census = CensusGovernmentIngestion()
	    dataframes = await census.ingest_all_jurisdictions()

	asyncio.run(main())  # `await` is only valid inside a coroutine
	```

	### 2. GSA .gov Domain List

	URL: https://github.com/cisagov/dotgov-data

	What it provides:
	- All registered .gov domains (~12,000+)
	- Domain type (Federal, State, County, City, etc.)
	- Organization name and location
	- Security contact information

	Usage:
	```python
	import asyncio
	from discovery.gsa_domains import GSADomainList

	async def main():
	    gsa = GSADomainList()
	    csv_path = await gsa.download_domain_list()  # async download
	    return gsa.parse_domains(csv_path)  # synchronous parse

	domains_df = asyncio.run(main())
	```
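Once the domain list is parsed, jurisdiction names can be matched against organization names. The sketch below shows one way to apply a 75% similarity threshold. It assumes `difflib.SequenceMatcher` as the similarity measure (the source does not say which metric the project uses), and `normalize`, `best_domain_match`, and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(name.lower().split())


def best_domain_match(jurisdiction_name, domains, threshold=0.75):
    """Return (domain, similarity) for the closest organization name,
    or None when no candidate reaches the threshold.

    `domains` maps organization name -> domain, e.g. built from the parsed CSV.
    """
    target = normalize(jurisdiction_name)
    best = None
    for org_name, domain in domains.items():
        score = SequenceMatcher(None, target, normalize(org_name)).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (domain, score)
    return best


# Made-up sample data for illustration only
sample = {
    "Springfield County": "springfieldcounty.example",
    "Shelbyville": "shelbyville.example",
}
print(best_domain_match("springfield  county", sample))
```

The similarity score could double as an input to the confidence score, but how the project combines the two is not stated here.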

	### 3. Search API Integration

	Supported:
	- Google Custom Search API
	- Bing Search API

	Configuration:
	```bash
	# .env
	GOOGLE_SEARCH_API_KEY=your_key_here
	GOOGLE_SEARCH_ENGINE_ID=your_engine_id
	BING_SEARCH_API_KEY=your_key_here
	```

	## Usage

	### Quick Start

	```bash
	# Run complete discovery pipeline
	python -m discovery.discovery_pipeline

	# Or use CLI
	python main.py discover-jurisdictions --limit 100
	```

	### Step-by-Step

	```python
	import asyncio
	from discovery.discovery_pipeline import DiscoveryPipeline

	async def main():
	    pipeline = DiscoveryPipeline()

	    # 1. Bronze Layer: Ingest raw data
	    await pipeline.run_bronze_ingestion()

	    # 2. Silver Layer: Discover URLs
	    await pipeline.run_url_discovery(limit=100)  # limit for testing

	    # 3. Gold Layer: Create scraping targets (synchronous call)
	    pipeline.create_scraping_targets()

	    # Or, instead of steps 1-3, run all at once:
	    # await pipeline.run_full_pipeline(discovery_limit=100)

	asyncio.run(main())
	```

	## Output Tables

	### Bronze Layer

	```
	bronze/jurisdictions/
	├── counties/           # 3,143 U.S. counties
	├── municipalities/     # 19,495 cities/towns
	├── townships/          # 16,504 townships
	├── school_districts/   # 13,051 school districts
	├── special_districts/  # 38,542 special districts
	└── unified/            # Combined view

	bronze/gov_domains/     # 12,000+ .gov domains
	```

	### Silver Layer

	```
	silver/discovered_urls/ # URLs with metadata
	Columns:
	- jurisdiction_id (FIPS code)
	- jurisdiction_name
	- state
	- homepage_url
	- minutes_url
	- cms_platform
	- is_gov_domain
	- confidence_score
	- discovery_method
	- last_verified
	```

	### Gold Layer

	```
	gold/scraping_targets/ # Ready for scraping
	Columns:
	- jurisdiction_id
	- jurisdiction_name
	- jurisdiction_type
	- state
	- population
	- homepage_url
	- minutes_url
	- cms_platform
	- priority_score
	- scraping_status
	- last_scraped
	- documents_found
	```

	## URL Discovery Agent

	The `URLDiscoveryAgent` performs intelligent website discovery:

	### Discovery Strategy

	1. Search Phase
	```python
	# `jurisdiction_type` avoids shadowing the built-in `type`
	query = f"{jurisdiction_name} {state} {jurisdiction_type} official website"
	urls = await search_google(query)
	```

	2. Validation Phase
	```python
	is_valid = validate_against_gsa_domains(url)
	```

	3. Crawling Phase
	```python
	minutes_url = await crawl_for_minutes(homepage)
	```

	4. CMS Detection
	```python
	cms = detect_cms_platform(url, html)
	# Detects: Granicus, CivicClerk, Municode, Laserfiche, etc.
	```
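The crawling phase (step 3) can be sketched with the standard library alone. The code below is an assumption about how link discovery might work, not the project's `crawl_for_minutes`: it parses already-fetched HTML and returns links whose URL or anchor text mentions "minutes" or "agenda". `MinutesLinkParser` and `find_minutes_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

KEYWORDS = ("minutes", "agenda")


class MinutesLinkParser(HTMLParser):
    """Collect links whose href or anchor text mentions a keyword."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            haystack = (self._href + " " + "".join(self._text)).lower()
            if any(keyword in haystack for keyword in KEYWORDS):
                # Resolve relative hrefs against the homepage URL
                self.links.append(urljoin(self.base_url, self._href))
            self._href = None


def find_minutes_links(html, base_url):
    """Return absolute URLs of candidate minutes/agenda links, in page order."""
    parser = MinutesLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Fetching the homepage is left out on purpose so the parsing logic can be tested offline; a real crawler would also need timeouts, redirects, and robots.txt handling.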

	### Confidence Scoring

	```python
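	# Each conditional term is parenthesized so it is evaluated on its own;
	# without parentheses, `else` would absorb the terms that follow it.
	# Resulting range: 0.5 to 1.5.
	confidence = (
	    (1.0 if is_gov_domain else 0.5)
	    + (0.3 if has_minutes_url else 0.0)
	    + (0.2 if cms_detected else 0.0)
	)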
	```

	## CMS Platform Detection

	The system automatically detects major government CMS platforms:

	| CMS Platform | Signature URLs | Estimated Usage |
	|--------------|----------------|-----------------|
	| Granicus/Legistar | granicus.com, legistar.com | 4,000+ cities |
	| CivicClerk | civicclerk.com, civicweb.net | 2,500+ cities |
	| Municode | municode.com | 3,500+ cities |
	| Laserfiche | laserfiche.com | 1,200+ cities |
	| PrimeGov | primegov.com | 800+ cities |
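Signature-based detection can be as simple as a substring lookup. The sketch below is an assumption about how a function like `detect_cms_platform(url, html)` might work, using only the signature domains from the table; the real detector may inspect more signals.

```python
from typing import Optional

# Signature domains transcribed from the table; the dict itself is hypothetical.
CMS_SIGNATURES = {
    "Granicus/Legistar": ("granicus.com", "legistar.com"),
    "CivicClerk": ("civicclerk.com", "civicweb.net"),
    "Municode": ("municode.com",),
    "Laserfiche": ("laserfiche.com",),
    "PrimeGov": ("primegov.com",),
}


def detect_cms_platform(url: str, html: str) -> Optional[str]:
    """Return the first platform whose signature domain appears in the URL
    or page source, or None if no signature is found."""
    haystack = (url + " " + html).lower()
    for platform, signatures in CMS_SIGNATURES.items():
        if any(signature in haystack for signature in signatures):
            return platform
    return None


print(detect_cms_platform("https://example.legistar.com/Calendar.aspx", ""))
# Granicus/Legistar
```

Checking the page source as well as the URL catches sites that embed a vendor portal in an iframe or link out to it from the homepage.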

	## Prioritization Strategy

	Jurisdictions are prioritized based on:

	```python
	# Each conditional term is parenthesized so it is evaluated on its own
	priority_score = (
	    (100 if is_gov_domain else 50)
	    + (50 if cms_platform_detected else 0)
	    + int(confidence_score * 100)
	    + population_weight
	)
	```
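A worked example helps show how the terms add up. In this sketch each conditional term is parenthesized so it contributes independently. The source does not define `population_weight`, so the version here (one point per 10,000 residents, capped at 50) is purely a hypothetical stand-in.

```python
def priority_score(is_gov_domain, cms_platform_detected, confidence_score, population):
    """Sum the four priority terms for one jurisdiction."""
    # Hypothetical weighting: 1 point per 10,000 residents, capped at 50.
    population_weight = min(population // 10_000, 50)
    return (
        (100 if is_gov_domain else 50)
        + (50 if cms_platform_detected else 0)
        + int(confidence_score * 100)
        + population_weight
    )


# A .gov county of 250,000 with a detected CMS and confidence 0.75:
print(priority_score(True, True, 0.75, 250_000))   # 100 + 50 + 75 + 25 = 250
# A small non-.gov township with no CMS and confidence 0.5:
print(priority_score(False, False, 0.5, 4_000))    # 50 + 0 + 50 + 0 = 100
```

Under these assumptions a jurisdiction that has neither a .gov domain nor a detected CMS needs both high confidence and a large population to score above 150.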

	### Focus Areas

	High Priority:
	- All counties (public health authority)
	- Cities > 10,000 population (water fluoridation)
	- All school districts (dental screening programs)

	Medium Priority:
	- Cities 5,000-10,000 population
	- Special districts with health/water authority

	Low Priority:
	- Townships < 5,000 population
	- Non-health special districts
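The focus areas translate into a small classification function. This is a sketch: `priority_tier` and its type labels are hypothetical, and cases the list does not cover (cities under 5,000, townships of 5,000 or more) are assumed to fall into the low tier.

```python
def priority_tier(jurisdiction_type, population=0, health_or_water=False):
    """Map a jurisdiction to "high", "medium", or "low" per the focus areas."""
    if jurisdiction_type in ("county", "school_district"):
        return "high"  # all counties and all school districts
    if jurisdiction_type == "city":
        if population > 10_000:
            return "high"
        if population >= 5_000:
            return "medium"
        return "low"  # assumption: small cities are not listed explicitly
    if jurisdiction_type == "special_district":
        return "medium" if health_or_water else "low"
    # Townships under 5,000 are listed as low priority; larger townships and
    # unknown types are not covered by the list, so they default to low here.
    return "low"


print(priority_tier("city", population=7_500))  # medium
```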

	## API Requirements

	### Google Custom Search API

	Setup:
	1. Enable Custom Search API: https://console.cloud.google.com/
	2. Create Search Engine: https://cse.google.com/cse/all
	3. Get API Key and Engine ID

	Pricing: $5 per 1,000 queries (first 100/day free)

	### Bing Search API

	Setup:
	1. Create Azure account: https://azure.microsoft.com/
	2. Create Bing Search resource
	3. Get API key from Azure Portal

	Pricing: $3 per 1,000 queries

	### Cost Estimation

	For 30,000 jurisdictions:
	- Google: $150 (30k queries @ $5/1k)
	- Bing: $90 (30k queries @ $3/1k)

	Tip: Mix free tier + one paid service to reduce costs.
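The per-query arithmetic can be checked with a small helper. This is a sketch under stated assumptions: `search_cost` is a hypothetical function, the prices are the per-1,000 rates listed in the setup sections, and spreading the run over 30 days to collect Google's 100 free queries per day is an illustrative choice, not a documented workflow.

```python
def search_cost(queries, price_per_1k, free_queries=0):
    """Dollar cost of `queries` searches after subtracting free ones."""
    billable = max(0, queries - free_queries)
    return billable * price_per_1k / 1000


google = search_cost(30_000, price_per_1k=5)
bing = search_cost(30_000, price_per_1k=3)
# Mixed: 100 free Google queries/day over 30 days cover 3,000 jurisdictions,
# and the remaining 27,000 go through the cheaper paid service.
mixed = search_cost(30_000, price_per_1k=3, free_queries=100 * 30)
print(google, bing, mixed)  # 150.0 90.0 81.0
```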

	## Performance

	### Benchmarks

	- Census data download: ~30 seconds
	- GSA domain list download: ~5 seconds
	- URL discovery per jurisdiction: ~2-3 seconds
	- Full 100 jurisdiction discovery: ~5 minutes
	- Full 30,000 jurisdiction discovery: ~20-25 hours

	### Optimization

	Parallel Processing:
	```python
	batch_size = 10 # Process 10 jurisdictions simultaneously
	# Reduces 30,000 jurisdiction discovery to ~4-5 hours
	```
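One common way to cap concurrency at `batch_size` is an `asyncio.Semaphore`. The sketch below assumes discovery is exposed as a coroutine taking one jurisdiction; `discover_all` and `fake_discover` are hypothetical names, and the stand-in coroutine only sleeps instead of making network calls.

```python
import asyncio


async def discover_all(jurisdictions, discover_one, batch_size=10):
    """Run `discover_one` over all jurisdictions, at most `batch_size` at a time."""
    semaphore = asyncio.Semaphore(batch_size)

    async def bounded(jurisdiction):
        async with semaphore:
            return await discover_one(jurisdiction)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(j) for j in jurisdictions))


async def fake_discover(name):
    # Stand-in for the real per-jurisdiction discovery call
    await asyncio.sleep(0.01)
    return f"https://{name}.example"


results = asyncio.run(discover_all(["a", "b", "c"], fake_discover, batch_size=2))
print(results)
# ['https://a.example', 'https://b.example', 'https://c.example']
```

A semaphore keeps all slots busy as soon as any task finishes, whereas fixed batches wait for the slowest item in each batch before starting the next one.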

	Caching:
	- Census data: Cached for 7 days
	- GSA domains: Cached for 1 day
	- Search results: No cache (URLs change)
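A freshness check based on file modification time is one simple way to implement these cache lifetimes. The sketch assumes downloads are cached as local files; `CACHE_TTL` and `is_cache_fresh` are hypothetical names, and the actual cache mechanism is not described in this document.

```python
import os
import time

# TTLs in seconds, taken from the cache lifetimes listed above
CACHE_TTL = {
    "census": 7 * 24 * 3600,   # 7 days
    "gsa_domains": 24 * 3600,  # 1 day
}


def is_cache_fresh(path, source, now=None):
    """True if `path` exists and is younger than the TTL for `source`."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    age = now - os.path.getmtime(path)
    return age < CACHE_TTL[source]
```

A caller would download again only when `is_cache_fresh` returns `False`; the optional `now` argument exists so the check can be tested without waiting for files to age.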

	## Integration with Scraping

	Once discovery completes, scraping targets are ready:

	```python
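	from pyspark.sql.functions import col

	# Wrapper name is illustrative; `spark` and `scraper_agent` come from the caller
	async def scrape_high_priority(spark, scraper_agent):
	    # Load scraping targets
	    targets_df = spark.read.format("delta").load("gold/scraping_targets")

	    # Filter by priority
	    high_priority = targets_df.filter(col("priority_score") > 150)

	    # Pass to scraper agent (collect() pulls the matching rows to the driver)
	    for target in high_priority.collect():
	        await scraper_agent.scrape(
	            url=target.minutes_url,
	            jurisdiction_id=target.jurisdiction_id,
	            cms_platform=target.cms_platform,
	        )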
	```

	## Monitoring

	### Track Discovery Progress

	```sql
	-- Discovery success rate
	SELECT
	    COUNT(*) AS total,
	    COUNT(homepage_url) AS homepages_found,
	    COUNT(minutes_url) AS minutes_found,
	    AVG(confidence_score) AS avg_confidence
	FROM silver.discovered_urls;

	-- By state
	SELECT
	    state,
	    COUNT(*) AS total,
	    COUNT(minutes_url) AS with_minutes,
	    ROUND(COUNT(minutes_url) * 100.0 / COUNT(*), 1) AS success_rate
	FROM silver.discovered_urls
	GROUP BY state
	ORDER BY success_rate DESC;
	```

	### Track Scraping Status

	```sql
	-- Scraping progress
	SELECT
	    scraping_status,
	    COUNT(*) AS count
	FROM gold.scraping_targets
	GROUP BY scraping_status;

	-- Documents found
	SELECT
	    jurisdiction_type,
	    SUM(documents_found) AS total_docs
	FROM gold.scraping_targets
	GROUP BY jurisdiction_type;
	```

	## Troubleshooting

	### Issue: Low discovery rate

	Solution: Check search API keys and quotas

	```bash
	# Test Google API
	curl "https://www.googleapis.com/customsearch/v1?key=YOUR_KEY&cx=YOUR_CX&q=test"

	# Test Bing API
	curl -H "Ocp-Apim-Subscription-Key: YOUR_KEY" \
	"https://api.bing.microsoft.com/v7.0/search?q=test"
	```

	### Issue: Census download fails

	Solution: Use cached data or download manually

	```python
	# Manual download
	import asyncio
	from discovery.census_ingestion import CensusGovernmentIngestion

	async def main():
	    census = CensusGovernmentIngestion()
	    # Download a specific jurisdiction type
	    return await census.download_census_data("municipalities")

	csv_path = asyncio.run(main())
	```

	### Issue: Memory errors with large datasets

	Solution: Increase Spark memory or process in batches

	```python
	from pyspark.sql import SparkSession
	from pyspark.sql.functions import col

	# Increase memory
	spark = (
	    SparkSession.builder
	    .config("spark.driver.memory", "8g")
	    .config("spark.executor.memory", "8g")
	    .getOrCreate()
	)

	# Or process in state-by-state batches (run this loop inside an async function)
	for state in ["CA", "TX", "NY", ...]:
	    df = jurisdictions_df.filter(col("state") == state)
	    await discover_batch(df)
	```

	## Next Steps

	1. Run initial discovery
	```bash
	python main.py discover-jurisdictions --limit 1000
	```

	2. Review results
	```bash
	python main.py show-discovery-stats
	```

	3. Start scraping
	```bash
	python main.py scrape --source discovered
	```

	## References

	- Census Bureau GID: https://www.census.gov/programs-surveys/gus.html
	- GSA .gov Domains: https://github.com/cisagov/dotgov-data
	- Google Custom Search: https://developers.google.com/custom-search
	- Bing Search API: https://www.microsoft.com/en-us/bing/apis/bing-web-search-api

	---

	Your jurisdiction discovery system is now ready to identify tens of thousands of local governments for oral health policy monitoring! 🦷✨