| # Jurisdiction Discovery - Quick Start Guide | |
| ## No External APIs Required! 🎉 | |
| This discovery system uses **pattern-based matching** and **public datasets** only. No search API keys needed! | |
| ## Quick Start | |
| ### 1. Install Dependencies | |
| All required packages are in `requirements.txt`: | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| Key packages: | |
| - `httpx` - HTTP client for URL verification | |
| - `beautifulsoup4` - HTML parsing for web crawling | |
| - `pyspark` - Data processing | |
| - `delta-spark` - Delta Lake storage | |
| ### 2. Initialize Delta Lake | |
| ```bash | |
| python main.py init | |
| ``` | |
| ### 3. Run Discovery | |
| ```bash | |
| # Test with 100 jurisdictions | |
| python main.py discover-jurisdictions --limit 100 | |
| # Single state | |
| python main.py discover-jurisdictions --state CA | |
| # Full discovery (~30k jurisdictions, 12-18 hours) | |
| python main.py discover-jurisdictions | |
| ``` | |
| ### 4. View Results | |
| ```bash | |
| python main.py discovery-stats | |
| ``` | |
| Expected output: | |
| ``` | |
| 📊 Jurisdiction Discovery Statistics | |
| Bronze Layer (Raw Data): | |
| Total jurisdictions: 90,735 | |
| - county: 3,143 | |
| - municipality: 19,495 | |
| - school_district: 13,051 | |
| - special_district: 38,542 | |
| - township: 16,504 | |
| Silver Layer (Discovered URLs): | |
| Total discoveries: 87 | |
| Homepages found: 78 (89.7%) | |
| Minutes URLs found: 65 (74.7%) | |
| Avg confidence: 0.82 | |
| Gold Layer (Scraping Targets): | |
| Total targets: 65 | |
| High priority: 42 | |
| ``` | |
| ### 5. Start Scraping | |
| ```bash | |
| python main.py scrape-batch --source discovered --limit 50 | |
| ``` | |
| ## How It Works | |
| ### Strategy 1: GSA Domain Matching | |
| The system directly matches jurisdiction names to the GSA .gov registry: | |
| ``` | |
| "Sacramento County" → normalized: "sacramento" | |
| GSA lookup → "sacramento.gov" ✓ | |
| Confidence: 1.0 | |
| ``` | |
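The matching step above could be sketched roughly as follows. This is a minimal illustration, assuming the GSA registry has already been loaded into an in-memory set of domain names. The suffix list, the `normalize_name` and `match_gsa_domain` helpers, and the sample registry are all hypothetical and are not taken from the project's code.

```python
import re
from typing import Optional, Set

# Hypothetical suffixes stripped during normalization (an assumption).
SUFFIXES = ("county", "city", "town", "village", "township")


def normalize_name(name: str) -> str:
    """Lowercase, drop a trailing jurisdiction suffix, remove non-alphanumerics."""
    words = name.lower().split()
    if len(words) > 1 and words[-1] in SUFFIXES:
        words = words[:-1]
    return re.sub(r"[^a-z0-9]", "", "".join(words))


def match_gsa_domain(name: str, registry: Set[str]) -> Optional[str]:
    """Return '<slug>.gov' if it appears in the registry, else None."""
    candidate = f"{normalize_name(name)}.gov"
    return candidate if candidate in registry else None


# Sample registry for illustration only; the real list comes from the GSA dataset.
registry = {"sacramento.gov", "fresno.gov"}
print(match_gsa_domain("Sacramento County", registry))  # sacramento.gov
print(match_gsa_domain("Nowhere County", registry))     # None
```

An exact set lookup keeps the match deterministic, which is consistent with the 1.0 confidence shown above for a direct registry hit.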
| ### Strategy 2: URL Pattern Generation | |
| Common government URL patterns are tested: | |
| **Counties:** | |
| - `co.{name}.{state}.us` | |
| - `{name}county.gov` | |
| **Cities:** | |
| - `www.{name}.gov` | |
| - `cityof{name}.gov` | |
| **School Districts:** | |
| - `{name}.k12.{state}.us` | |
| - `{name}schools.org` | |
| **Example:** | |
| ``` | |
| "Fresno" (municipality, CA) | |
| Test: https://www.fresno.gov → ✓ Found | |
| Confidence: 0.9 | |
| ``` | |
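As a rough sketch of how the pattern lists above might be expanded into candidate URLs: the templates are copied from the lists, while the `PATTERNS` mapping, the `candidate_urls` helper, the `https://` prefix for every pattern, and the lowercasing of the state code are assumptions made for illustration.

```python
# Pattern templates taken from the lists above, grouped by jurisdiction type.
PATTERNS = {
    "county": ["co.{name}.{state}.us", "{name}county.gov"],
    "municipality": ["www.{name}.gov", "cityof{name}.gov"],
    "school_district": ["{name}.k12.{state}.us", "{name}schools.org"],
}


def candidate_urls(name_slug: str, state: str, jurisdiction_type: str) -> list:
    """Expand each template for the given type into an https:// URL."""
    templates = PATTERNS.get(jurisdiction_type, [])
    return [
        "https://" + t.format(name=name_slug, state=state.lower())
        for t in templates
    ]


print(candidate_urls("fresno", "CA", "municipality"))
# ['https://www.fresno.gov', 'https://cityoffresno.gov']
```

Each candidate would then be verified with an HTTP request; that step is left out here so the sketch runs without network access.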
| ### Strategy 3: Web Crawling | |
| Once a homepage is found: | |
| 1. Crawl for links that mention "minutes" or "agendas" | |
| 2. Detect CMS platforms (Granicus, CivicClerk, etc.) | |
| 3. Boost confidence for .gov domains | |
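The first two crawl steps could look something like the sketch below. The project lists `beautifulsoup4` for HTML parsing; this example swaps in the standard library's `html.parser` so it runs without extra packages. The keyword tuple, the `LinkCollector`, `find_minutes_links`, and `detect_cms` names, and the sample HTML are illustrative assumptions. The size of the `.gov` confidence boost is not stated above, so that step is omitted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

KEYWORDS = ("minutes", "agenda")        # link hints from step 1
CMS_HINTS = ("granicus", "civicclerk")  # platforms named in step 2


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_minutes_links(html: str, base_url: str) -> list:
    """Return absolute URLs whose href or link text mentions a keyword."""
    parser = LinkCollector()
    parser.feed(html)
    hits = []
    for href, text in parser.links:
        haystack = (href + " " + text).lower()
        if any(k in haystack for k in KEYWORDS):
            hits.append(urljoin(base_url, href))
    return hits


def detect_cms(urls: list):
    """Return the first CMS hint found in any URL's hostname, else None."""
    for url in urls:
        host = urlparse(url).netloc.lower()
        for hint in CMS_HINTS:
            if hint in host:
                return hint
    return None


# Made-up sample page; the Granicus URL is for illustration only.
page = (
    '<a href="/about">About</a>'
    '<a href="https://fresno.granicus.com/meetings">Meeting Minutes</a>'
)
links = find_minutes_links(page, "https://www.fresno.gov")
print(links)              # ['https://fresno.granicus.com/meetings']
print(detect_cms(links))  # granicus
```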
| ## Performance | |
| ### Expected Results | |
| - **Counties**: 85-95% discovery rate | |
| - **Cities (population > 10k)**: 75-90% discovery rate | |
| - **School Districts**: 70-85% discovery rate | |
| - **Processing Time**: ~3-5 min per 100 jurisdictions | |
| - **Total Cost**: $0 (no API fees!) | |
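To turn the per-100 processing rate into a full-run estimate, a quick calculation helps. It assumes strictly sequential processing and the ~30k jurisdiction count quoted for a full discovery run; the `estimate_hours` helper is illustrative.

```python
def estimate_hours(jurisdictions: int, minutes_per_100: float) -> float:
    """Sequential run time in hours at a given per-100 processing rate."""
    return jurisdictions / 100 * minutes_per_100 / 60


low = estimate_hours(30_000, 3)
high = estimate_hours(30_000, 5)
print(f"{low:.0f}-{high:.0f} hours")  # 15-25 hours
```

The sequential figure is somewhat higher than the 12-18 hours quoted for a full run, so treat both as rough estimates.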
| ### Optimization | |
| **Parallel Processing:** | |
| ```bash | |
| # Process multiple states in parallel | |
| for state in CA TX NY FL PA; do | |
|     python main.py discover-jurisdictions --state "$state" & | |
| done | |
| wait | |
| ``` | |
| **Databricks Notebook:** | |
| For production runs, use the Databricks notebook: | |
| 1. Upload `notebooks/Jurisdiction_Discovery.py` | |
| 2. Create cluster (2-4 workers) | |
| 3. Run with Spark parallel processing | |
| ## Troubleshooting | |
| ### Low Discovery Rate | |
| Check if URL patterns need adjustment for specific regions: | |
| ```python | |
| # In discovery/url_discovery_agent.py | |
| # Add regional patterns, e.g.: | |
| if state == "MA":  # Massachusetts has unique patterns | |
|     patterns.extend([ | |
|         (f"https://www.{name_slug}.ma.us", 0.85), | |
|     ]) | |
| ``` | |
| ### Memory Errors | |
| Process in smaller batches: | |
| ```bash | |
| # By state | |
| python main.py discover-jurisdictions --state CA | |
| # Or by type | |
| python main.py discover-jurisdictions --type county | |
| ``` | |
| ### Census Download Fails | |
| Census data is cached for 7 days by default, so a recent run can continue from the cache when the download fails. To download the data manually: | |
| 1. Download from: https://www.census.gov/programs-surveys/gus.html | |
| 2. Place in `data/cache/census/` | |
| 3. Rerun discovery | |
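To check whether a manually placed file would still count as fresh under the 7-day default, something like the following could be used. It assumes freshness is judged by file modification time, which the guide does not confirm; `is_cache_fresh` and the sample file name are hypothetical.

```python
import os
import tempfile
import time
from pathlib import Path

CACHE_DIR = Path("data/cache/census")  # directory named in step 2
MAX_AGE_DAYS = 7                       # default cache lifetime stated above


def is_cache_fresh(path: Path, max_age_days: int = MAX_AGE_DAYS) -> bool:
    """True if the file exists and was modified within max_age_days."""
    if not path.is_file():
        return False
    age_seconds = time.time() - path.stat().st_mtime
    return age_seconds < max_age_days * 86_400


# Demonstration with a temporary file standing in for a cached Census file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "census_sample.csv"  # hypothetical file name
    sample.write_text("placeholder")
    print(is_cache_fresh(sample))  # True
    ten_days_ago = time.time() - 10 * 86_400
    os.utime(sample, (ten_days_ago, ten_days_ago))
    print(is_cache_fresh(sample))  # False
```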
| ## Next Steps | |
| 1. **Test Discovery**: Run with `--limit 100` | |
| 2. **Review Results**: Check `discovery-stats` | |
| 3. **Full Run**: Remove limit for production | |
| 4. **Start Scraping**: Use discovered URLs | |
| 5. **Schedule Re-Discovery**: Monthly updates | |
| ## Cost | |
| **Total API cost: $0** 🎉 | |
| - No API fees | |
| - Uses free public datasets | |
| - Only local/cloud compute costs | |
| Compare to legacy approach: | |
| - ~~Google Search API: $150~~ | |
| - ~~Bing Search API: $90~~ | |
| - **Pattern Matching: $0** | |
| --- | |
| **Ready to discover 90,000+ government websites with zero external API dependencies!** 🚀 | |
| # Jurisdiction Discovery System - Setup Guide | |
| ## Quick Start | |
| ### 1. Configure Search APIs | |
| The discovery system requires search API keys to find government websites. You can use either Google Custom Search or Bing Search (or both for redundancy). | |
| #### Option A: Google Custom Search API | |
| 1. **Enable the API** | |
| - Visit [Google Cloud Console](https://console.cloud.google.com/) | |
| - Create a new project or select existing | |
| - Enable "Custom Search API" | |
| 2. **Create API Key** | |
| - Go to "Credentials" → "Create Credentials" → "API Key" | |
| - Copy your API key | |
| 3. **Create Search Engine** | |
| - Visit [Google Custom Search](https://cse.google.com/cse/all) | |
| - Click "Add" to create new search engine | |
| - Set "Sites to search" to: `*.gov` (to focus on government sites) | |
| - Copy your "Search Engine ID" | |
| 4. **Add to .env** | |
| ```bash | |
| GOOGLE_SEARCH_API_KEY=your_google_api_key | |
| GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id | |
| ``` | |
| **Pricing:** First 100 queries/day free, then $5 per 1,000 queries | |
| #### Option B: Bing Search API | |
| 1. **Create Azure Account** | |
| - Visit [Azure Portal](https://portal.azure.com/) | |
| - Create account (free tier available) | |
| 2. **Create Bing Search Resource** | |
| - Click "Create a resource" → Search for "Bing Search v7" | |
| - Select pricing tier (F1 free tier: 1k queries/month) | |
| - Create resource | |
| 3. **Get API Key** | |
| - Go to your Bing Search resource | |
| - Click "Keys and Endpoint" | |
| - Copy one of the keys | |
| 4. **Add to .env** | |
| ```bash | |
| BING_SEARCH_API_KEY=your_bing_api_key | |
| ``` | |
| **Pricing:** Free tier: 1,000 queries/month; Paid: $3 per 1,000 queries | |
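Before starting a run, it can be useful to confirm which providers are actually configured. The sketch below assumes the `.env` values have already been loaded into the process environment (for example by exporting them in the shell); the `configured_providers` helper is hypothetical and not part of the project.

```python
import os


def configured_providers(env=os.environ) -> list:
    """List the search providers whose credentials are present in env."""
    providers = []
    # Google needs both the API key and the search engine ID.
    if env.get("GOOGLE_SEARCH_API_KEY") and env.get("GOOGLE_SEARCH_ENGINE_ID"):
        providers.append("google")
    if env.get("BING_SEARCH_API_KEY"):
        providers.append("bing")
    return providers


# Example with an explicit mapping instead of the real environment.
print(configured_providers({"BING_SEARCH_API_KEY": "your_bing_api_key"}))  # ['bing']
```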
| ### 2. Install Dependencies | |
| All required packages are already in `requirements.txt`: | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| Key packages for discovery: | |
| - `httpx==0.27.0` - Async HTTP client | |
| - `beautifulsoup4==4.12.2` - HTML parsing | |
| - `pyspark==3.5.0` - Data processing | |
| - `delta-spark==3.0.0` - Delta Lake | |
| ### 3. Initialize Delta Lake | |
| ```bash | |
| python main.py init | |
| ``` | |
| This creates the necessary Delta Lake tables. | |
| ### 4. Run Discovery Pipeline | |
| #### Test Run (100 jurisdictions) | |
| ```bash | |
| python main.py discover-jurisdictions --limit 100 | |
| ``` | |
| Expected output: | |
| ``` | |
| 📊 Bronze Layer Complete: | |
| Total records: 90,735 | |
| Counties: 3,143 | |
| Municipalities: 19,495 | |
| ... | |
| 📊 URL Discovery Complete: | |
| Attempted: 100 | |
| Successful: 87 | |
| Homepages found: 87 | |
| Minutes URLs found: 65 | |
| Avg confidence: 0.72 | |
| 📊 Gold Layer Complete: | |
| Scraping targets created: 65 | |
| High priority (>150): 42 | |
| ... | |
| ✅ Discovery Complete! | |
| ``` | |
| #### State-Specific Discovery | |
| ```bash | |
| python main.py discover-jurisdictions --state CA | |
| ``` | |
| #### Full Production Run | |
| ```bash | |
| # Discovers all ~30,000 high-priority jurisdictions | |
| # Takes 4-6 hours with parallel processing | |
| python main.py discover-jurisdictions | |
| ``` | |
| ### 5. View Statistics | |
| ```bash | |
| python main.py discovery-stats | |
| ``` | |
| Output: | |
| ``` | |
| 📊 Jurisdiction Discovery Statistics | |
| Bronze Layer (Raw Data): | |
| Total jurisdictions: 90,735 | |
| - county: 3,143 | |
| - municipality: 19,495 | |
| - school_district: 13,051 | |
| - special_district: 38,542 | |
| - township: 16,504 | |
| Silver Layer (Discovered URLs): | |
| Total discoveries: 27,483 | |
| Homepages found: 24,125 (87.8%) | |
| Minutes URLs found: 18,562 (67.5%) | |
| Avg confidence: 0.74 | |
| Gold Layer (Scraping Targets): | |
| Total targets: 18,562 | |
| High priority: 12,340 | |
| - pending: 18,562 | |
| ``` | |
| ### 6. Start Scraping | |
| ```bash | |
| # Scrape high-priority targets | |
| python main.py scrape-batch --source discovered --limit 50 --priority 150 | |
| # Or scrape all pending targets (use with caution!) | |
| python main.py scrape-batch --source discovered --limit 1000 | |
| ``` | |
| ## Using Databricks Notebook | |
| For production deployment on Databricks: | |
| 1. **Upload Notebook** | |
| ```bash | |
| databricks workspace import notebooks/Jurisdiction_Discovery.py \ | |
| -l PYTHON \ | |
| -f SOURCE \ | |
| /Users/your-email@company.com/Jurisdiction_Discovery | |
| ``` | |
| 2. **Configure Secrets** | |
| ```bash | |
| # Create secret scope | |
| databricks secrets create-scope oral-health-app | |
| # Add API keys | |
| databricks secrets put-secret oral-health-app google-search-api-key | |
| databricks secrets put-secret oral-health-app google-search-engine-id | |
| databricks secrets put-secret oral-health-app bing-search-api-key | |
| ``` | |
| 3. **Create Cluster** | |
| - Runtime: 14.3 LTS or higher | |
| - Node type: Standard_DS3_v2 (or similar) | |
| - Workers: 2-4 (for parallel processing) | |
| - Libraries: All from `requirements.txt` | |
| 4. **Run Notebook** | |
| - Open notebook in Databricks workspace | |
| - Attach to cluster | |
| - Run all cells | |
| ## Cost Estimation | |
| ### API Costs | |
| For discovering 30,000 jurisdictions: | |
| | Provider | Free Tier | Paid Cost | Total Cost | | |
| |----------|-----------|-----------|------------| | |
| | Google | 100/day (3,000/month) | $5/1k | ~$135 | | |
| | Bing | 1,000/month | $3/1k | ~$87 | | |
| | **Both** | 4,000 free | Rest on Bing | ~$78 | | |
| **Recommendation:** Use both APIs to maximize free tier usage. | |
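The totals in the table can be reproduced with a short calculation. It assumes one search query per jurisdiction and that the monthly free allowances apply in full; the `api_cost` helper is illustrative.

```python
def api_cost(queries: int, free_queries: int, price_per_1k: float) -> float:
    """Cost in dollars after subtracting the free allowance."""
    billable = max(queries - free_queries, 0)
    return billable / 1000 * price_per_1k


QUERIES = 30_000  # assumes one search query per jurisdiction

google = api_cost(QUERIES, free_queries=3_000, price_per_1k=5)
bing = api_cost(QUERIES, free_queries=1_000, price_per_1k=3)
# Combined: both free tiers used first, remainder billed at Bing's rate.
both = api_cost(QUERIES, free_queries=4_000, price_per_1k=3)

print(google, bing, both)  # 135.0 87.0 78.0
```

Google's free allowance is 100 queries per day, so the 3,000 free queries are only available if the run is spread across a month.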
| ### Compute Costs | |
| **Local Development:** | |
| - Free (uses local resources) | |
| - ~4-6 hours for full discovery | |
| **Databricks:** | |
| - Cluster: ~$2-4/hour | |
| - Total: ~$8-24 for full discovery | |
| - Can use spot instances to reduce cost | |
| ### Re-discovery Schedule | |
| - **Monthly**: Catch URL changes and new jurisdictions | |
| - **Cost**: ~$10-20/month (many URLs cached) | |
| ## Troubleshooting | |
| ### Low Discovery Rate | |
| **Problem:** Only finding 30-40% of URLs | |
| **Solutions:** | |
| 1. Check that the API keys are correct | |
| 2. Verify that the API quotas have not been exceeded | |
| 3. Review failed discoveries: | |
| ```python | |
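| from pyspark.sql.functions import col | |
| # Assumes an active SparkSession named `spark` (predefined in Databricks notebooks) | |
| silver_df = spark.read.format("delta").load("silver/discovered_urls") | |
| # Rows with no homepage URL are the failed discoveries | |
| failed = silver_df.filter(col("homepage_url").isNull()) | |
| failed.show(20, truncate=False) | |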
| ``` | |
| ### Memory Errors | |
| **Problem:** Out of memory during discovery | |
| **Solutions:** | |
| 1. Process by state: | |
| ```bash | |
| for state in CA TX NY FL PA OH IL MI NC GA; do | |
|     python main.py discover-jurisdictions --state "$state" | |
| done | |
| ``` | |
| 2. Increase Spark memory: | |
| ```python | |
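| from pyspark.sql import SparkSession | |
| # Raise driver and executor memory before the session is created | |
| spark = SparkSession.builder \ | |
|     .config("spark.driver.memory", "8g") \ | |
|     .config("spark.executor.memory", "8g") \ | |
|     .getOrCreate() | |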
| ``` | |
| 3. Use Databricks cluster (more memory available) | |
| ### API Rate Limits | |
| **Problem:** Hitting rate limits too quickly | |
| **Solutions:** | |
| 1. Reduce batch size in `url_discovery_agent.py`: | |
| ```python | |
| batch_size = 5 # Instead of 10 | |
| ``` | |
| 2. Add delays between batches: | |
| ```python | |
| await asyncio.sleep(1) # After each batch | |
| ``` | |
| 3. Use both Google and Bing to distribute load | |
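The first two solutions can be combined into one throttled loop. This is a minimal sketch, not the code in `url_discovery_agent.py`: `discover_one` is a stand-in for the real discovery call, and `discover_in_batches` and its parameters are hypothetical names.

```python
import asyncio


async def discover_one(name: str) -> str:
    """Stand-in for a real discovery call (hypothetical); returns a dummy result."""
    await asyncio.sleep(0)
    return f"checked:{name}"


async def discover_in_batches(names, batch_size=5, delay_seconds=1.0):
    """Run discovery in fixed-size concurrent batches, pausing between batches."""
    results = []
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        results.extend(await asyncio.gather(*(discover_one(n) for n in batch)))
        if start + batch_size < len(names):
            await asyncio.sleep(delay_seconds)  # throttle before the next batch
    return results


names = [f"jurisdiction-{i}" for i in range(12)]
results = asyncio.run(discover_in_batches(names, batch_size=5, delay_seconds=0.01))
print(len(results))  # 12
```

A smaller `batch_size` lowers the number of concurrent requests, and a longer `delay_seconds` lowers the sustained request rate; the two can be tuned independently against a provider's quota.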
| ### Census Data Download Fails | |
| **Problem:** Census Bureau site unreachable | |
| **Solutions:** | |
| 1. Use cached data (automatically cached for 7 days) | |
| 2. Manual download: | |
| ```bash | |
| # Download files manually from Census Bureau | |
| # Place in data/cache/census/ | |
| ``` | |
| 3. Check Census Bureau site status: https://www.census.gov/programs-surveys/gus.html | |
| ## Monitoring Progress | |
| ### Check Discovery Status | |
| ```sql | |
| -- In Databricks SQL or Spark | |
| SELECT | |
| state, | |
| COUNT(*) as total, | |
| COUNT(homepage_url) as found, | |
| ROUND(COUNT(homepage_url) * 100.0 / COUNT(*), 1) as success_rate | |
| FROM silver.discovered_urls | |
| GROUP BY state | |
| ORDER BY success_rate DESC; | |
| ``` | |
| ### Track Scraping Progress | |
| ```sql | |
| SELECT | |
| scraping_status, | |
| COUNT(*) as count, | |
| ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM gold.scraping_targets), 1) as pct | |
| FROM gold.scraping_targets | |
| GROUP BY scraping_status; | |
| ``` | |
| ## Next Steps | |
| Once discovery is complete: | |
| 1. **Review High-Priority Targets** | |
| - Check for false positives | |
| - Validate CMS platform detection | |
| 2. **Start Scraping** | |
| - Begin with top 100 high-priority sites | |
| - Monitor document quality | |
| - Adjust priority scores as needed | |
| 3. **Schedule Automation** | |
| - Set up monthly re-discovery job | |
| - Monitor for new jurisdictions | |
| - Track URL changes | |
| 4. **Integration** | |
| - Connect to existing scraper agents | |
| - Feed documents to classification pipeline | |
| - Generate advocacy opportunities | |
| ## Support | |
| For issues or questions: | |
| - GitHub Issues: [github.com/getcommunityone/open-navigator-for-engagement/issues](https://github.com/getcommunityone/open-navigator-for-engagement/issues) | |
| - Documentation: [JURISDICTION_DISCOVERY.md](JURISDICTION_DISCOVERY.md) | |
| --- | |
| **Ready to discover 90,000+ government websites!** 🦷✨ | |