# Jurisdiction Discovery - Quick Start Guide ## No External APIs Required! 🎉 This discovery system uses **pattern-based matching** and **public datasets** only. No search API keys needed! ## Quick Start ### 1. Install Dependencies All required packages are in `requirements.txt`: ```bash pip install -r requirements.txt ``` Key packages: - `httpx` - HTTP client for URL verification - `beautifulsoup4` - HTML parsing for web crawling - `pyspark` - Data processing - `delta-spark` - Delta Lake storage ### 2. Initialize Delta Lake ```bash python main.py init ``` ### 3. Run Discovery ```bash # Test with 100 jurisdictions python main.py discover-jurisdictions --limit 100 # Single state python main.py discover-jurisdictions --state CA # Full discovery (~30k jurisdictions, 12-18 hours) python main.py discover-jurisdictions ``` ### 4. View Results ```bash python main.py discovery-stats ``` Expected output: ``` 📊 Jurisdiction Discovery Statistics Bronze Layer (Raw Data): Total jurisdictions: 90,735 - county: 3,143 - municipality: 19,495 - school_district: 13,051 Silver Layer (Discovered URLs): Total discoveries: 87 Homepages found: 78 (89.7%) Minutes URLs found: 65 (74.7%) Avg confidence: 0.82 Gold Layer (Scraping Targets): Total targets: 65 High priority: 42 ``` ### 5. Start Scraping ```bash python main.py scrape-batch --source discovered --limit 50 ``` ## How It Works ### Strategy 1: GSA Domain Matching The system directly matches jurisdiction names to the GSA .gov registry: ```python "Sacramento County" → normalized: "sacramento" GSA lookup → "sacramento.gov" ✓ Confidence: 1.0 ``` ### Strategy 2: URL Pattern Generation Common government URL patterns are tested: **Counties:** - `co.{name}.{state}.us` - `{name}county.gov` **Cities:** - `www.{name}.gov` - `cityof{name}.gov` **School Districts:** - `{name}.k12.{state}.us` - `{name}schools.org` **Example:** ```python "Fresno" (municipality, CA) Test: https://www.fresno.gov → ✓ Found Confidence: 0.9 ``` ### Strategy 3: Web Crawling Once a homepage is found: 1. Crawl for "minutes", "agendas" links 2. Detect CMS platforms (Granicus, CivicClerk, etc.) 3. Boost confidence for .gov domains ## Performance ### Expected Results - **Counties**: 85-95% discovery rate - **Cities > 10k**: 75-90% discovery rate - **School Districts**: 70-85% discovery rate - **Processing Time**: ~3-5 min per 100 jurisdictions - **Total Cost**: $0 (no API fees!) ### Optimization **Parallel Processing:** ```bash # Process multiple states in parallel for state in CA TX NY FL PA; do python main.py discover-jurisdictions --state $state & done wait ``` **Databricks Notebook:** For production runs, use the Databricks notebook: 1. Upload `notebooks/Jurisdiction_Discovery.py` 2. Create cluster (2-4 workers) 3. Run with Spark parallel processing ## Troubleshooting ### Low Discovery Rate Check if URL patterns need adjustment for specific regions: ```python # In discovery/url_discovery_agent.py # Add regional patterns, e.g.: if state == "MA": # Massachusetts has unique patterns patterns.extend([ (f"https://www.{name_slug}.ma.us", 0.85), ]) ``` ### Memory Errors Process in smaller batches: ```bash # By state python main.py discover-jurisdictions --state CA # Or by type python main.py discover-jurisdictions --type county ``` ### Census Download Fails Cached for 7 days by default. For manual download: 1. Download from: https://www.census.gov/programs-surveys/gus.html 2. Place in `data/cache/census/` 3. Rerun discovery ## Next Steps 1. **Test Discovery**: Run with `--limit 100` 2. **Review Results**: Check `discovery-stats` 3. **Full Run**: Remove limit for production 4. **Start Scraping**: Use discovered URLs 5. **Schedule Re-Discovery**: Monthly updates ## Cost **Total: $0** 🎉 - No API fees - Uses free public datasets - Only local/cloud compute costs Compare to legacy approach: - ~~Google Search API: $150~~ - ~~Bing Search API: $90~~ - **Pattern Matching: $0** --- **Ready to discover 90,000+ government websites with zero external dependencies!** 🚀 # Jurisdiction Discovery System - Setup Guide ## Quick Start ### 1. Configure Search APIs The discovery system requires search API keys to find government websites. You can use either Google Custom Search or Bing Search (or both for redundancy). #### Option A: Google Custom Search API 1. **Enable the API** - Visit [Google Cloud Console](https://console.cloud.google.com/) - Create a new project or select existing - Enable "Custom Search API" 2. **Create API Key** - Go to "Credentials" → "Create Credentials" → "API Key" - Copy your API key 3. **Create Search Engine** - Visit [Google Custom Search](https://cse.google.com/cse/all) - Click "Add" to create new search engine - Set "Sites to search" to: `*.gov` (to focus on government sites) - Copy your "Search Engine ID" 4. **Add to .env** ```bash GOOGLE_SEARCH_API_KEY=your_google_api_key GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id ``` **Pricing:** First 100 queries/day free, then $5 per 1,000 queries #### Option B: Bing Search API 1. **Create Azure Account** - Visit [Azure Portal](https://portal.azure.com/) - Create account (free tier available) 2. **Create Bing Search Resource** - Click "Create a resource" → Search for "Bing Search v7" - Select pricing tier (F1 free tier: 1k queries/month) - Create resource 3. **Get API Key** - Go to your Bing Search resource - Click "Keys and Endpoint" - Copy one of the keys 4. **Add to .env** ```bash BING_SEARCH_API_KEY=your_bing_api_key ``` **Pricing:** Free tier: 1,000 queries/month; Paid: $3 per 1,000 queries ### 2. Install Dependencies All required packages are already in `requirements.txt`: ```bash pip install -r requirements.txt ``` Key packages for discovery: - `httpx==0.27.0` - Async HTTP client - `beautifulsoup4==4.12.2` - HTML parsing - `pyspark==3.5.0` - Data processing - `delta-spark==3.0.0` - Delta Lake ### 3. Initialize Delta Lake ```bash python main.py init ``` This creates the necessary Delta Lake tables. ### 4. Run Discovery Pipeline #### Test Run (100 jurisdictions) ```bash python main.py discover-jurisdictions --limit 100 ``` Expected output: ``` 📊 Bronze Layer Complete: Total records: 90,735 Counties: 3,143 Municipalities: 19,495 ... 📊 URL Discovery Complete: Attempted: 100 Successful: 87 Homepages found: 87 Minutes URLs found: 65 Avg confidence: 0.72 📊 Gold Layer Complete: Scraping targets created: 65 High priority (>150): 42 ... ✅ Discovery Complete! ``` #### State-Specific Discovery ```bash python main.py discover-jurisdictions --state CA ``` #### Full Production Run ```bash # Discovers all ~30,000 high-priority jurisdictions # Takes 4-6 hours with parallel processing python main.py discover-jurisdictions ``` ### 5. View Statistics ```bash python main.py discovery-stats ``` Output: ``` 📊 Jurisdiction Discovery Statistics Bronze Layer (Raw Data): Total jurisdictions: 90,735 - county: 3,143 - municipality: 19,495 - school_district: 13,051 - special_district: 38,542 - township: 16,504 Silver Layer (Discovered URLs): Total discoveries: 27,483 Homepages found: 24,125 (87.8%) Minutes URLs found: 18,562 (67.5%) Avg confidence: 0.74 Gold Layer (Scraping Targets): Total targets: 18,562 High priority: 12,340 - pending: 18,562 ``` ### 6. Start Scraping ```bash # Scrape high-priority targets python main.py scrape-batch --source discovered --limit 50 --priority 150 # Or scrape all pending targets (use with caution!) python main.py scrape-batch --source discovered --limit 1000 ``` ## Using Databricks Notebook For production deployment on Databricks: 1. **Upload Notebook** ```bash databricks workspace import notebooks/Jurisdiction_Discovery.py \ -l PYTHON \ -f SOURCE \ /Users/your-email@company.com/Jurisdiction_Discovery ``` 2. **Configure Secrets** ```bash # Create secret scope databricks secrets create-scope oral-health-app # Add API keys databricks secrets put-secret oral-health-app google-search-api-key databricks secrets put-secret oral-health-app google-search-engine-id databricks secrets put-secret oral-health-app bing-search-api-key ``` 3. **Create Cluster** - Runtime: 14.3 LTS or higher - Node type: Standard_DS3_v2 (or similar) - Workers: 2-4 (for parallel processing) - Libraries: All from `requirements.txt` 4. **Run Notebook** - Open notebook in Databricks workspace - Attach to cluster - Run all cells ## Cost Estimation ### API Costs For discovering 30,000 jurisdictions: | Provider | Free Tier | Paid Cost | Total Cost | |----------|-----------|-----------|------------| | Google | 100/day (3,000/month) | $5/1k | ~$135 | | Bing | 1,000/month | $3/1k | ~$87 | | **Both** | 4,000 free | Rest on Bing | ~$78 | **Recommendation:** Use both APIs to maximize free tier usage. ### Compute Costs **Local Development:** - Free (uses local resources) - ~4-6 hours for full discovery **Databricks:** - Cluster: ~$2-4/hour - Total: ~$8-24 for full discovery - Can use spot instances to reduce cost ### Re-discovery Schedule - **Monthly**: Catch URL changes and new jurisdictions - **Cost**: ~$10-20/month (many URLs cached) ## Troubleshooting ### Low Discovery Rate **Problem:** Only finding 30-40% of URLs **Solutions:** 1. Check API keys are correct 2. Verify API quotas not exceeded 3. Review failed discoveries: ```python from pyspark.sql.functions import col silver_df = spark.read.format("delta").load("silver/discovered_urls") failed = silver_df.filter(col("homepage_url").isNull()) failed.show(20, truncate=False) ``` ### Memory Errors **Problem:** Out of memory during discovery **Solutions:** 1. Process by state: ```bash for state in CA TX NY FL PA OH IL MI NC GA; do python main.py discover-jurisdictions --state $state done ``` 2. Increase Spark memory: ```python spark = SparkSession.builder \ .config("spark.driver.memory", "8g") \ .config("spark.executor.memory", "8g") \ .getOrCreate() ``` 3. Use Databricks cluster (more memory available) ### API Rate Limits **Problem:** Hitting rate limits too quickly **Solutions:** 1. Reduce batch size in `url_discovery_agent.py`: ```python batch_size = 5 # Instead of 10 ``` 2. Add delays between batches: ```python await asyncio.sleep(1) # After each batch ``` 3. Use both Google and Bing to distribute load ### Census Data Download Fails **Problem:** Census Bureau site unreachable **Solutions:** 1. Use cached data (automatically cached for 7 days) 2. Manual download: ```bash # Download files manually from Census Bureau # Place in data/cache/census/ ``` 3. Check Census Bureau site status: https://www.census.gov/programs-surveys/gus.html ## Monitoring Progress ### Check Discovery Status ```sql -- In Databricks SQL or Spark SELECT state, COUNT(*) as total, COUNT(homepage_url) as found, ROUND(COUNT(homepage_url) * 100.0 / COUNT(*), 1) as success_rate FROM silver.discovered_urls GROUP BY state ORDER BY success_rate DESC; ``` ### Track Scraping Progress ```sql SELECT scraping_status, COUNT(*) as count, ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM gold.scraping_targets), 1) as pct FROM gold.scraping_targets GROUP BY scraping_status; ``` ## Next Steps Once discovery is complete: 1. **Review High-Priority Targets** - Check for false positives - Validate CMS platform detection 2. **Start Scraping** - Begin with top 100 high-priority sites - Monitor document quality - Adjust priority scores as needed 3. **Schedule Automation** - Set up monthly re-discovery job - Monitor for new jurisdictions - Track URL changes 4. **Integration** - Connect to existing scraper agents - Feed documents to classification pipeline - Generate advocacy opportunities ## Support For issues or questions: - GitHub Issues: [github.com/getcommunityone/open-navigator-for-engagement/issues](https://github.com/getcommunityone/open-navigator-for-engagement/issues) - Documentation: [JURISDICTION_DISCOVERY.md](JURISDICTION_DISCOVERY.md) --- **Ready to discover 90,000+ government websites!** 🦷✨