open-navigator / docs /JURISDICTION_DISCOVERY_SETUP.md
jcbowyer's picture
Deploy: Consolidated gold tables, fixed nginx docs routing
896453f verified
# Jurisdiction Discovery - Quick Start Guide
## No External APIs Required! 🎉
This discovery system uses **pattern-based matching** and **public datasets** only. No search API keys needed!
## Quick Start
### 1. Install Dependencies
All required packages are in `requirements.txt`:
```bash
pip install -r requirements.txt
```
Key packages:
- `httpx` - HTTP client for URL verification
- `beautifulsoup4` - HTML parsing for web crawling
- `pyspark` - Data processing
- `delta-spark` - Delta Lake storage
### 2. Initialize Delta Lake
```bash
python main.py init
```
### 3. Run Discovery
```bash
# Test with 100 jurisdictions
python main.py discover-jurisdictions --limit 100
# Single state
python main.py discover-jurisdictions --state CA
# Full discovery (~30k jurisdictions, 12-18 hours)
python main.py discover-jurisdictions
```
### 4. View Results
```bash
python main.py discovery-stats
```
Expected output:
```
📊 Jurisdiction Discovery Statistics
Bronze Layer (Raw Data):
Total jurisdictions: 90,735
- county: 3,143
- municipality: 19,495
- school_district: 13,051
Silver Layer (Discovered URLs):
Total discoveries: 87
Homepages found: 78 (89.7%)
Minutes URLs found: 65 (74.7%)
Avg confidence: 0.82
Gold Layer (Scraping Targets):
Total targets: 65
High priority: 42
```
### 5. Start Scraping
```bash
python main.py scrape-batch --source discovered --limit 50
```
## How It Works
### Strategy 1: GSA Domain Matching
The system directly matches jurisdiction names to the GSA .gov registry:
```python
"Sacramento County" → normalized: "sacramento"
GSA lookup → "sacramento.gov" ✓
Confidence: 1.0
```
### Strategy 2: URL Pattern Generation
Common government URL patterns are tested:
**Counties:**
- `co.{name}.{state}.us`
- `{name}county.gov`
**Cities:**
- `www.{name}.gov`
- `cityof{name}.gov`
**School Districts:**
- `{name}.k12.{state}.us`
- `{name}schools.org`
**Example:**
```python
"Fresno" (municipality, CA)
Test: https://www.fresno.gov → ✓ Found
Confidence: 0.9
```
### Strategy 3: Web Crawling
Once a homepage is found:
1. Crawl for "minutes", "agendas" links
2. Detect CMS platforms (Granicus, CivicClerk, etc.)
3. Boost confidence for .gov domains
## Performance
### Expected Results
- **Counties**: 85-95% discovery rate
- **Cities > 10k**: 75-90% discovery rate
- **School Districts**: 70-85% discovery rate
- **Processing Time**: ~3-5 min per 100 jurisdictions
- **Total Cost**: $0 (no API fees!)
### Optimization
**Parallel Processing:**
```bash
# Process multiple states in parallel
for state in CA TX NY FL PA; do
python main.py discover-jurisdictions --state $state &
done
wait
```
**Databricks Notebook:**
For production runs, use the Databricks notebook:
1. Upload `notebooks/Jurisdiction_Discovery.py`
2. Create cluster (2-4 workers)
3. Run with Spark parallel processing
## Troubleshooting
### Low Discovery Rate
Check if URL patterns need adjustment for specific regions:
```python
# In discovery/url_discovery_agent.py
# Add regional patterns, e.g.:
if state == "MA": # Massachusetts has unique patterns
patterns.extend([
(f"https://www.{name_slug}.ma.us", 0.85),
])
```
### Memory Errors
Process in smaller batches:
```bash
# By state
python main.py discover-jurisdictions --state CA
# Or by type
python main.py discover-jurisdictions --type county
```
### Census Download Fails
Cached for 7 days by default. For manual download:
1. Download from: https://www.census.gov/programs-surveys/gus.html
2. Place in `data/cache/census/`
3. Rerun discovery
## Next Steps
1. **Test Discovery**: Run with `--limit 100`
2. **Review Results**: Check `discovery-stats`
3. **Full Run**: Remove limit for production
4. **Start Scraping**: Use discovered URLs
5. **Schedule Re-Discovery**: Monthly updates
## Cost
**Total: $0** 🎉
- No API fees
- Uses free public datasets
- Only local/cloud compute costs
Compare to legacy approach:
- ~~Google Search API: $150~~
- ~~Bing Search API: $90~~
- **Pattern Matching: $0**
---
**Ready to discover 90,000+ government websites with zero external dependencies!** 🚀
# Jurisdiction Discovery System - Setup Guide
## Quick Start
### 1. Configure Search APIs
The discovery system requires search API keys to find government websites. You can use either Google Custom Search or Bing Search (or both for redundancy).
#### Option A: Google Custom Search API
1. **Enable the API**
- Visit [Google Cloud Console](https://console.cloud.google.com/)
- Create a new project or select existing
- Enable "Custom Search API"
2. **Create API Key**
- Go to "Credentials" → "Create Credentials" → "API Key"
- Copy your API key
3. **Create Search Engine**
- Visit [Google Custom Search](https://cse.google.com/cse/all)
- Click "Add" to create new search engine
- Set "Sites to search" to: `*.gov` (to focus on government sites)
- Copy your "Search Engine ID"
4. **Add to .env**
```bash
GOOGLE_SEARCH_API_KEY=your_google_api_key
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
```
**Pricing:** First 100 queries/day free, then $5 per 1,000 queries
#### Option B: Bing Search API
1. **Create Azure Account**
- Visit [Azure Portal](https://portal.azure.com/)
- Create account (free tier available)
2. **Create Bing Search Resource**
- Click "Create a resource" → Search for "Bing Search v7"
- Select pricing tier (F1 free tier: 1k queries/month)
- Create resource
3. **Get API Key**
- Go to your Bing Search resource
- Click "Keys and Endpoint"
- Copy one of the keys
4. **Add to .env**
```bash
BING_SEARCH_API_KEY=your_bing_api_key
```
**Pricing:** Free tier: 1,000 queries/month; Paid: $3 per 1,000 queries
### 2. Install Dependencies
All required packages are already in `requirements.txt`:
```bash
pip install -r requirements.txt
```
Key packages for discovery:
- `httpx==0.27.0` - Async HTTP client
- `beautifulsoup4==4.12.2` - HTML parsing
- `pyspark==3.5.0` - Data processing
- `delta-spark==3.0.0` - Delta Lake
### 3. Initialize Delta Lake
```bash
python main.py init
```
This creates the necessary Delta Lake tables.
### 4. Run Discovery Pipeline
#### Test Run (100 jurisdictions)
```bash
python main.py discover-jurisdictions --limit 100
```
Expected output:
```
📊 Bronze Layer Complete:
Total records: 90,735
Counties: 3,143
Municipalities: 19,495
...
📊 URL Discovery Complete:
Attempted: 100
Successful: 87
Homepages found: 87
Minutes URLs found: 65
Avg confidence: 0.72
📊 Gold Layer Complete:
Scraping targets created: 65
High priority (>150): 42
...
✅ Discovery Complete!
```
#### State-Specific Discovery
```bash
python main.py discover-jurisdictions --state CA
```
#### Full Production Run
```bash
# Discovers all ~30,000 high-priority jurisdictions
# Takes 4-6 hours with parallel processing
python main.py discover-jurisdictions
```
### 5. View Statistics
```bash
python main.py discovery-stats
```
Output:
```
📊 Jurisdiction Discovery Statistics
Bronze Layer (Raw Data):
Total jurisdictions: 90,735
- county: 3,143
- municipality: 19,495
- school_district: 13,051
- special_district: 38,542
- township: 16,504
Silver Layer (Discovered URLs):
Total discoveries: 27,483
Homepages found: 24,125 (87.8%)
Minutes URLs found: 18,562 (67.5%)
Avg confidence: 0.74
Gold Layer (Scraping Targets):
Total targets: 18,562
High priority: 12,340
- pending: 18,562
```
### 6. Start Scraping
```bash
# Scrape high-priority targets
python main.py scrape-batch --source discovered --limit 50 --priority 150
# Or scrape all pending targets (use with caution!)
python main.py scrape-batch --source discovered --limit 1000
```
## Using Databricks Notebook
For production deployment on Databricks:
1. **Upload Notebook**
```bash
databricks workspace import notebooks/Jurisdiction_Discovery.py \
-l PYTHON \
-f SOURCE \
/Users/your-email@company.com/Jurisdiction_Discovery
```
2. **Configure Secrets**
```bash
# Create secret scope
databricks secrets create-scope oral-health-app
# Add API keys
databricks secrets put-secret oral-health-app google-search-api-key
databricks secrets put-secret oral-health-app google-search-engine-id
databricks secrets put-secret oral-health-app bing-search-api-key
```
3. **Create Cluster**
- Runtime: 14.3 LTS or higher
- Node type: Standard_DS3_v2 (or similar)
- Workers: 2-4 (for parallel processing)
- Libraries: All from `requirements.txt`
4. **Run Notebook**
- Open notebook in Databricks workspace
- Attach to cluster
- Run all cells
## Cost Estimation
### API Costs
For discovering 30,000 jurisdictions:
| Provider | Free Tier | Paid Cost | Total Cost |
|----------|-----------|-----------|------------|
| Google | 100/day (3,000/month) | $5/1k | ~$135 |
| Bing | 1,000/month | $3/1k | ~$87 |
| **Both** | 4,000 free | Rest on Bing | ~$78 |
**Recommendation:** Use both APIs to maximize free tier usage.
### Compute Costs
**Local Development:**
- Free (uses local resources)
- ~4-6 hours for full discovery
**Databricks:**
- Cluster: ~$2-4/hour
- Total: ~$8-24 for full discovery
- Can use spot instances to reduce cost
### Re-discovery Schedule
- **Monthly**: Catch URL changes and new jurisdictions
- **Cost**: ~$10-20/month (many URLs cached)
## Troubleshooting
### Low Discovery Rate
**Problem:** Only finding 30-40% of URLs
**Solutions:**
1. Check API keys are correct
2. Verify API quotas not exceeded
3. Review failed discoveries:
```python
from pyspark.sql.functions import col
silver_df = spark.read.format("delta").load("silver/discovered_urls")
failed = silver_df.filter(col("homepage_url").isNull())
failed.show(20, truncate=False)
```
### Memory Errors
**Problem:** Out of memory during discovery
**Solutions:**
1. Process by state:
```bash
for state in CA TX NY FL PA OH IL MI NC GA; do
python main.py discover-jurisdictions --state $state
done
```
2. Increase Spark memory:
```python
spark = SparkSession.builder \
.config("spark.driver.memory", "8g") \
.config("spark.executor.memory", "8g") \
.getOrCreate()
```
3. Use Databricks cluster (more memory available)
### API Rate Limits
**Problem:** Hitting rate limits too quickly
**Solutions:**
1. Reduce batch size in `url_discovery_agent.py`:
```python
batch_size = 5 # Instead of 10
```
2. Add delays between batches:
```python
await asyncio.sleep(1) # After each batch
```
3. Use both Google and Bing to distribute load
### Census Data Download Fails
**Problem:** Census Bureau site unreachable
**Solutions:**
1. Use cached data (automatically cached for 7 days)
2. Manual download:
```bash
# Download files manually from Census Bureau
# Place in data/cache/census/
```
3. Check Census Bureau site status: https://www.census.gov/programs-surveys/gus.html
## Monitoring Progress
### Check Discovery Status
```sql
-- In Databricks SQL or Spark
SELECT
state,
COUNT(*) as total,
COUNT(homepage_url) as found,
ROUND(COUNT(homepage_url) * 100.0 / COUNT(*), 1) as success_rate
FROM silver.discovered_urls
GROUP BY state
ORDER BY success_rate DESC;
```
### Track Scraping Progress
```sql
SELECT
scraping_status,
COUNT(*) as count,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM gold.scraping_targets), 1) as pct
FROM gold.scraping_targets
GROUP BY scraping_status;
```
## Next Steps
Once discovery is complete:
1. **Review High-Priority Targets**
- Check for false positives
- Validate CMS platform detection
2. **Start Scraping**
- Begin with top 100 high-priority sites
- Monitor document quality
- Adjust priority scores as needed
3. **Schedule Automation**
- Set up monthly re-discovery job
- Monitor for new jurisdictions
- Track URL changes
4. **Integration**
- Connect to existing scraper agents
- Feed documents to classification pipeline
- Generate advocacy opportunities
## Support
For issues or questions:
- GitHub Issues: [github.com/getcommunityone/open-navigator-for-engagement/issues](https://github.com/getcommunityone/open-navigator-for-engagement/issues)
- Documentation: [JURISDICTION_DISCOVERY.md](JURISDICTION_DISCOVERY.md)
---
**Ready to discover 90,000+ government websites!** 🦷✨