open-navigator / docs /JURISDICTION_DISCOVERY_SETUP.md

jcbowyer

Deploy: Consolidated gold tables, fixed nginx docs routing

896453f verified about 1 month ago


	# Jurisdiction Discovery - Quick Start Guide

	## No External APIs Required! 🎉

	This discovery system uses pattern-based matching and public datasets only. No search API keys needed!

	## Quick Start

	### 1. Install Dependencies

	All required packages are in `requirements.txt`:

	```bash
	pip install -r requirements.txt
	```

	Key packages:
	- `httpx` - HTTP client for URL verification
	- `beautifulsoup4` - HTML parsing for web crawling
	- `pyspark` - Data processing
	- `delta-spark` - Delta Lake storage

	### 2. Initialize Delta Lake

	```bash
	python main.py init
	```

	### 3. Run Discovery

	```bash
	# Test with 100 jurisdictions
	python main.py discover-jurisdictions --limit 100

	# Single state
	python main.py discover-jurisdictions --state CA

	# Full discovery (~30k jurisdictions, 12-18 hours)
	python main.py discover-jurisdictions
	```

	### 4. View Results

	```bash
	python main.py discovery-stats
	```

	Expected output:
	```
	📊 Jurisdiction Discovery Statistics

	Bronze Layer (Raw Data):
	Total jurisdictions: 90,735
	- county: 3,143
	- municipality: 19,495
	- school_district: 13,051

	Silver Layer (Discovered URLs):
	Total discoveries: 87
	Homepages found: 78 (89.7%)
	Minutes URLs found: 65 (74.7%)
	Avg confidence: 0.82

	Gold Layer (Scraping Targets):
	Total targets: 65
	High priority: 42
	```

	### 5. Start Scraping

	```bash
	python main.py scrape-batch --source discovered --limit 50
	```

	## How It Works

	### Strategy 1: GSA Domain Matching

	The system directly matches jurisdiction names to the GSA .gov registry:

	```
	"Sacramento County" → normalized: "sacramento"
	GSA lookup → "sacramento.gov" ✓
	Confidence: 1.0
	```
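The lookup above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: `GSA_DOMAINS`, `normalize_name`, and `match_gsa` are hypothetical names, the two-entry registry stands in for the full GSA .gov domain list, and the set of generic words stripped during normalization is an assumption.

```python
import re
from typing import Optional, Tuple

# Hypothetical excerpt of the .gov registry, keyed by normalized name.
# A real run would load the full GSA domain list instead.
GSA_DOMAINS = {
    "sacramento": "sacramento.gov",
    "fresno": "fresno.gov",
}

# Generic words that carry no identifying information (assumed list).
_PREFIX = re.compile(r"\b(city|town|village) of\b")
_SUFFIX = re.compile(r"\b(county|city|town|township|village|borough|parish)\b")


def normalize_name(name: str) -> str:
    """Lowercase the name, drop generic words, keep letters and digits only."""
    lowered = name.lower()
    lowered = _PREFIX.sub(" ", lowered)   # "city of fresno" -> " fresno"
    lowered = _SUFFIX.sub(" ", lowered)   # "sacramento county" -> "sacramento "
    return re.sub(r"[^a-z0-9]", "", lowered)


def match_gsa(name: str) -> Optional[Tuple[str, float]]:
    """Return (domain, confidence) for an exact registry match, else None."""
    domain = GSA_DOMAINS.get(normalize_name(name))
    return (domain, 1.0) if domain else None
```

A name-only key is ambiguous for common names (there are many Springfields), so a fuller version would probably need to include the state in the lookup key.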

	### Strategy 2: URL Pattern Generation

	Common government URL patterns are tested:

	Counties:
	- `co.{name}.{state}.us`
	- `{name}county.gov`

	Cities:
	- `www.{name}.gov`
	- `cityof{name}.gov`

	School Districts:
	- `{name}.k12.{state}.us`
	- `{name}schools.org`

	Example:
	```
	"Fresno" (municipality, CA)
	Test: https://www.fresno.gov → ✓ Found
	Confidence: 0.9
	```
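As a rough sketch of how candidate URLs could be generated from the patterns listed above: the templates come from this guide, but `PATTERNS`, `slugify`, and `candidate_urls` are hypothetical names, and every confidence value other than the 0.9 shown for `www.{name}.gov` is an illustrative assumption. Verifying that a candidate actually resolves (for example with `httpx`) is left out.

```python
import re

# Templates from the lists above. Confidence values are illustrative,
# except 0.9 for "www.{name}.gov", which matches the Fresno example.
PATTERNS = {
    "county": [("co.{name}.{state}.us", 0.85), ("{name}county.gov", 0.9)],
    "municipality": [("www.{name}.gov", 0.9), ("cityof{name}.gov", 0.9)],
    "school_district": [("{name}.k12.{state}.us", 0.85), ("{name}schools.org", 0.7)],
}


def slugify(name: str) -> str:
    """Lowercase the name and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def candidate_urls(name: str, jurisdiction_type: str, state: str):
    """Return a list of (url, confidence) pairs to test, in pattern order."""
    slug = slugify(name)
    return [
        (f"https://{template.format(name=slug, state=state.lower())}", confidence)
        for template, confidence in PATTERNS.get(jurisdiction_type, [])
    ]
```

For example, `candidate_urls("Fresno", "municipality", "CA")` yields `https://www.fresno.gov` first, followed by `https://cityoffresno.gov`.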

	### Strategy 3: Web Crawling

	Once a homepage is found:
	1. Crawl the homepage for links containing "minutes" or "agendas"
	2. Detect CMS platforms (Granicus, CivicClerk, etc.)
	3. Boost confidence for .gov domains
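The three crawling steps above might look roughly like the sketch below. It uses the standard library's `html.parser` so that it runs without extra packages (the project itself lists `beautifulsoup4` for HTML parsing). All names here are hypothetical, the CMS host markers are illustrative, and the 0.05 boost for .gov domains is an assumed value.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

KEYWORDS = ("minutes", "agenda")
# Illustrative host substrings for CMS detection (assumed, not exhaustive).
CMS_MARKERS = {"granicus.com": "Granicus", "civicclerk.com": "CivicClerk"}


class MeetingLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags whose text or href mentions a keyword."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).lower()
            url = urljoin(self.base_url, self._href)
            if any(k in label or k in url.lower() for k in KEYWORDS):
                self.links.append(url)
            self._href = None


def find_meeting_links(html, base_url):
    parser = MeetingLinkParser(base_url)
    parser.feed(html)
    return parser.links


def detect_cms(urls):
    """Return the first CMS platform whose marker appears in a link host."""
    for url in urls:
        host = urlparse(url).netloc.lower()
        for marker, platform in CMS_MARKERS.items():
            if marker in host:
                return platform
    return None


def adjust_confidence(base, homepage_url, boost=0.05):
    """Raise confidence slightly for .gov homepages, capped at 1.0."""
    host = urlparse(homepage_url).netloc.lower()
    return min(1.0, base + boost) if host.endswith(".gov") else base
```

Matching on both the link text and the URL catches links labelled "Meetings" that point at a `/minutes` path, at the cost of a few false positives that a later review step would need to filter.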

	## Performance

	### Expected Results

	- Counties: 85-95% discovery rate
	- Cities > 10k: 75-90% discovery rate
	- School Districts: 70-85% discovery rate
	- Processing Time: ~3-5 min per 100 jurisdictions
	- Total Cost: $0 (no API fees!)

	### Optimization

	Parallel Processing:
	```bash
	# Process multiple states in parallel
	for state in CA TX NY FL PA; do
	    python main.py discover-jurisdictions --state "$state" &
	done
	wait
	```

	Databricks Notebook:
	For production runs, use the Databricks notebook:
	1. Upload `notebooks/Jurisdiction_Discovery.py`
	2. Create cluster (2-4 workers)
	3. Run with Spark parallel processing

	## Troubleshooting

	### Low Discovery Rate

	Check if URL patterns need adjustment for specific regions:

	```python
	# In discovery/url_discovery_agent.py
	# Add regional patterns, e.g.:
	if state == "MA":  # Massachusetts has unique patterns
	    patterns.extend([
	        (f"https://www.{name_slug}.ma.us", 0.85),
	    ])
	```

	### Memory Errors

	Process in smaller batches:

	```bash
	# By state
	python main.py discover-jurisdictions --state CA

	# Or by type
	python main.py discover-jurisdictions --type county
	```

	### Census Download Fails

	Census data is cached for 7 days by default, so a failed download only matters once the cache has expired. To download the data manually:

	1. Download from: https://www.census.gov/programs-surveys/gus.html
	2. Place in `data/cache/census/`
	3. Rerun discovery
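To check whether a manually placed file would still count as fresh under the 7-day rule, a small helper like the one below could be used. This is a sketch under assumptions: the guide does not say how the cache age is measured, so the helper uses the file's modification time, and `is_cache_fresh` is a hypothetical name.

```python
import time
from pathlib import Path

CACHE_DIR = Path("data/cache/census")   # cache location named in this guide
MAX_AGE_SECONDS = 7 * 24 * 60 * 60      # 7 days


def is_cache_fresh(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the file exists and was modified less than max_age seconds ago."""
    if not path.exists():
        return False
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) < max_age


def stale_files(cache_dir=CACHE_DIR):
    """List cached files older than the limit (empty if the directory is missing)."""
    if not cache_dir.is_dir():
        return []
    return [p for p in cache_dir.iterdir() if p.is_file() and not is_cache_fresh(p)]
```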

	## Next Steps

	1. Test Discovery: Run with `--limit 100`
	2. Review Results: Check `discovery-stats`
	3. Full Run: Remove limit for production
	4. Start Scraping: Use discovered URLs
	5. Schedule Re-Discovery: Monthly updates

	## Cost

	Total: $0 🎉

	- No API fees
	- Uses free public datasets
	- Only local/cloud compute costs

	Compare to legacy approach:
	- ~~Google Search API: $150~~
	- ~~Bing Search API: $90~~
	- Pattern Matching: $0

	---

	Ready to discover 90,000+ government websites with no external API keys! 🚀
	# Jurisdiction Discovery System - Setup Guide

	## Quick Start

	### 1. Configure Search APIs

	The discovery system requires search API keys to find government websites. You can use either Google Custom Search or Bing Search (or both for redundancy).

	#### Option A: Google Custom Search API

	1. Enable the API
	- Visit [Google Cloud Console](https://console.cloud.google.com/)
	- Create a new project or select existing
	- Enable "Custom Search API"

	2. Create API Key
	- Go to "Credentials" → "Create Credentials" → "API Key"
	- Copy your API key

	3. Create Search Engine
	- Visit [Google Custom Search](https://cse.google.com/cse/all)
	- Click "Add" to create new search engine
	- Set "Sites to search" to: `*.gov` (to focus on government sites)
	- Copy your "Search Engine ID"

	4. Add to .env
	```bash
	GOOGLE_SEARCH_API_KEY=your_google_api_key
	GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
	```

	Pricing: First 100 queries/day free, then $5 per 1,000 queries

	#### Option B: Bing Search API

	1. Create Azure Account
	- Visit [Azure Portal](https://portal.azure.com/)
	- Create account (free tier available)

	2. Create Bing Search Resource
	- Click "Create a resource" → Search for "Bing Search v7"
	- Select pricing tier (F1 free tier: 1k queries/month)
	- Create resource

	3. Get API Key
	- Go to your Bing Search resource
	- Click "Keys and Endpoint"
	- Copy one of the keys

	4. Add to .env
	```bash
	BING_SEARCH_API_KEY=your_bing_api_key
	```

	Pricing: Free tier: 1,000 queries/month; Paid: $3 per 1,000 queries
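A quick way to confirm which providers are usable is to check the three variables named above. The sketch assumes the `.env` values have already been loaded into the process environment (for example by a loader such as python-dotenv); `configured_providers` is a hypothetical helper, not part of the project.

```python
import os


def configured_providers(env=None):
    """Return the search providers whose required variables are all set."""
    env = os.environ if env is None else env
    providers = []
    # Google needs both the API key and the search engine ID.
    if env.get("GOOGLE_SEARCH_API_KEY") and env.get("GOOGLE_SEARCH_ENGINE_ID"):
        providers.append("google")
    if env.get("BING_SEARCH_API_KEY"):
        providers.append("bing")
    return providers
```

An empty list means neither option is fully configured, which is worth checking before starting a long discovery run.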

	### 2. Install Dependencies

	All required packages are already in `requirements.txt`:

	```bash
	pip install -r requirements.txt
	```

	Key packages for discovery:
	- `httpx==0.27.0` - Async HTTP client
	- `beautifulsoup4==4.12.2` - HTML parsing
	- `pyspark==3.5.0` - Data processing
	- `delta-spark==3.0.0` - Delta Lake

	### 3. Initialize Delta Lake

	```bash
	python main.py init
	```

	This creates the necessary Delta Lake tables.

	### 4. Run Discovery Pipeline

	#### Test Run (100 jurisdictions)

	```bash
	python main.py discover-jurisdictions --limit 100
	```

	Expected output:
	```
	📊 Bronze Layer Complete:
	Total records: 90,735
	Counties: 3,143
	Municipalities: 19,495
	...

	📊 URL Discovery Complete:
	Attempted: 100
	Successful: 87
	Homepages found: 87
	Minutes URLs found: 65
	Avg confidence: 0.72

	📊 Gold Layer Complete:
	Scraping targets created: 65
	High priority (>150): 42
	...

	✅ Discovery Complete!
	```

	#### State-Specific Discovery

	```bash
	python main.py discover-jurisdictions --state CA
	```

	#### Full Production Run

	```bash
	# Discovers all ~30,000 high-priority jurisdictions
	# Takes 4-6 hours with parallel processing
	python main.py discover-jurisdictions
	```

	### 5. View Statistics

	```bash
	python main.py discovery-stats
	```

	Output:
	```
	📊 Jurisdiction Discovery Statistics

	Bronze Layer (Raw Data):
	Total jurisdictions: 90,735
	- county: 3,143
	- municipality: 19,495
	- school_district: 13,051
	- special_district: 38,542
	- township: 16,504

	Silver Layer (Discovered URLs):
	Total discoveries: 27,483
	Homepages found: 24,125 (87.8%)
	Minutes URLs found: 18,562 (67.5%)
	Avg confidence: 0.74

	Gold Layer (Scraping Targets):
	Total targets: 18,562
	High priority: 12,340
	- pending: 18,562
	```

	### 6. Start Scraping

	```bash
	# Scrape high-priority targets
	python main.py scrape-batch --source discovered --limit 50 --priority 150

	# Or scrape all pending targets (use with caution!)
	python main.py scrape-batch --source discovered --limit 1000
	```

	## Using Databricks Notebook

	For production deployment on Databricks:

	1. Upload Notebook
	```bash
	databricks workspace import notebooks/Jurisdiction_Discovery.py \
	    -l PYTHON \
	    -f SOURCE \
	    /Users/your-email@company.com/Jurisdiction_Discovery
	```

	2. Configure Secrets
	```bash
	# Create secret scope
	databricks secrets create-scope oral-health-app

	# Add API keys
	databricks secrets put-secret oral-health-app google-search-api-key
	databricks secrets put-secret oral-health-app google-search-engine-id
	databricks secrets put-secret oral-health-app bing-search-api-key
	```

	3. Create Cluster
	- Runtime: 14.3 LTS or higher
	- Node type: Standard_DS3_v2 (or similar)
	- Workers: 2-4 (for parallel processing)
	- Libraries: All from `requirements.txt`

	4. Run Notebook
	- Open notebook in Databricks workspace
	- Attach to cluster
	- Run all cells

	## Cost Estimation

	### API Costs

	For discovering 30,000 jurisdictions:

	| Provider | Free Tier | Paid Cost | Total Cost |
	|----------|-----------|-----------|------------|
	| Google | 100/day (3,000/month) | $5/1k | ~$135 |
	| Bing | 1,000/month | $3/1k | ~$87 |
	| Both | 4,000 free | Rest on Bing | ~$78 |

	Recommendation: Use both APIs to maximize free tier usage.
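The totals in the table follow from simple arithmetic. The sketch below reproduces them, assuming one search query per jurisdiction and a run that fits within a single month of free-tier quota; `api_cost` is a hypothetical helper.

```python
def api_cost(queries, free_queries, price_per_1k):
    """Cost in dollars after subtracting the free-tier allowance."""
    billable = max(0, queries - free_queries)
    return billable / 1000 * price_per_1k


google = api_cost(30_000, 3_000, 5.0)   # 27,000 billable at $5/1k -> 135.0
bing = api_cost(30_000, 1_000, 3.0)     # 29,000 billable at $3/1k -> 87.0
both = api_cost(30_000, 4_000, 3.0)     # 26,000 billable on Bing  -> 78.0
```

If discovery needs more than one query per jurisdiction, the costs scale roughly in proportion, so these figures are best read as a lower bound.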

	### Compute Costs

	Local Development:
	- Free (uses local resources)
	- ~4-6 hours for full discovery

	Databricks:
	- Cluster: ~$2-4/hour
	- Total: ~$8-24 for full discovery
	- Can use spot instances to reduce cost

	### Re-discovery Schedule

	- Monthly: Catch URL changes and new jurisdictions
	- Cost: ~$10-20/month (many URLs cached)

	## Troubleshooting

	### Low Discovery Rate

	Problem: Only finding 30-40% of URLs

	Solutions:
	1. Check API keys are correct
	2. Verify API quotas not exceeded
	3. Review failed discoveries:
	```python
	from pyspark.sql.functions import col
	silver_df = spark.read.format("delta").load("silver/discovered_urls")
	failed = silver_df.filter(col("homepage_url").isNull())
	failed.show(20, truncate=False)
	```

	### Memory Errors

	Problem: Out of memory during discovery

	Solutions:
	1. Process by state:
	```bash
	for state in CA TX NY FL PA OH IL MI NC GA; do
	    python main.py discover-jurisdictions --state "$state"
	done
	```

	2. Increase Spark memory:
	```python
	from pyspark.sql import SparkSession

	# Raise driver and executor memory when building the session
	spark = SparkSession.builder \
	    .config("spark.driver.memory", "8g") \
	    .config("spark.executor.memory", "8g") \
	    .getOrCreate()
	```

	3. Use Databricks cluster (more memory available)

	### API Rate Limits

	Problem: Hitting rate limits too quickly

	Solutions:
	1. Reduce batch size in `url_discovery_agent.py`:
	```python
	batch_size = 5 # Instead of 10
	```

	2. Add delays between batches:
	```python
	await asyncio.sleep(1) # After each batch
	```

	3. Use both Google and Bing to distribute load
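The first two solutions (smaller batches and a pause between them) can be combined in one loop. The following is a minimal sketch, not the code in `url_discovery_agent.py`: `process_in_batches` and its `worker` argument are hypothetical, and the defaults simply mirror the values suggested above.

```python
import asyncio


async def process_in_batches(items, worker, batch_size=5, delay=1.0):
    """Run `worker` on each item, at most `batch_size` at a time,
    sleeping `delay` seconds between batches to stay under rate limits."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # Items within a batch run concurrently; results keep input order.
        results.extend(await asyncio.gather(*(worker(item) for item in batch)))
        if start + batch_size < len(items):
            await asyncio.sleep(delay)   # no pause after the final batch
    return results
```

It would be driven with something like `asyncio.run(process_in_batches(jurisdictions, discover_one))`, where `discover_one` is whatever coroutine performs a single lookup.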

	### Census Data Download Fails

	Problem: Census Bureau site unreachable

	Solutions:
	1. Use cached data (automatically cached for 7 days)
	2. Manual download:
	```bash
	# Download files manually from Census Bureau
	# Place in data/cache/census/
	```

	3. Check Census Bureau site status: https://www.census.gov/programs-surveys/gus.html

	## Monitoring Progress

	### Check Discovery Status

	```sql
	-- In Databricks SQL or Spark
	SELECT
	    state,
	    COUNT(*) as total,
	    COUNT(homepage_url) as found,
	    ROUND(COUNT(homepage_url) * 100.0 / COUNT(*), 1) as success_rate
	FROM silver.discovered_urls
	GROUP BY state
	ORDER BY success_rate DESC;
	```

	### Track Scraping Progress

	```sql
	SELECT
	    scraping_status,
	    COUNT(*) as count,
	    ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM gold.scraping_targets), 1) as pct
	FROM gold.scraping_targets
	GROUP BY scraping_status;
	```

	## Next Steps

	Once discovery is complete:

	1. Review High-Priority Targets
	- Check for false positives
	- Validate CMS platform detection

	2. Start Scraping
	- Begin with top 100 high-priority sites
	- Monitor document quality
	- Adjust priority scores as needed

	3. Schedule Automation
	- Set up monthly re-discovery job
	- Monitor for new jurisdictions
	- Track URL changes

	4. Integration
	- Connect to existing scraper agents
	- Feed documents to classification pipeline
	- Generate advocacy opportunities

	## Support

	For issues or questions:
	- GitHub Issues: [github.com/getcommunityone/open-navigator-for-engagement/issues](https://github.com/getcommunityone/open-navigator-for-engagement/issues)
	- Documentation: [JURISDICTION_DISCOVERY.md](JURISDICTION_DISCOVERY.md)

	---

	Ready to discover 90,000+ government websites! 🦷✨