Spaces:
Running on CPU Upgrade
Running on CPU Upgrade
| # Jurisdiction Discovery System | |
| ## Overview | |
| The **Jurisdiction Discovery System** automatically identifies and tracks over 90,000 local government units across the United States, discovering their official websites and meeting minutes URLs using **pattern-based matching** and **public datasets**. | |
| ## ✅ Sustainable Approach (No Deprecated APIs) | |
| This system uses **vendor-neutral, production-ready methods**: | |
| ✅ **Pattern Matching** - Generate URLs from jurisdiction names using common government patterns | |
| ✅ **GSA .gov Registry** - Direct matching with authoritative domain list | |
| ✅ **Web Crawling** - Verify URLs and discover minutes pages | |
| ✅ **Public Datasets** - Census Bureau + GSA official data | |
| **Does NOT use:** | |
| ❌ Google Custom Search API (deprecated for production) | |
| ❌ Bing Search API (legacy, not recommended) | |
| **Benefits:** | |
| - 🆓 **Zero API costs** - No search API fees | |
| - 🔒 **Reliable** - No rate limits or quotas | |
| - ♻️ **Sustainable** - Vendor-neutral, future-proof | |
| - 📊 **Reproducible** - Deterministic pattern matching | |
| ## Architecture | |
| ### Discovery Strategy | |
| ``` | |
| ┌─────────────────────────────────────────────────────────┐ | |
| │ BRONZE LAYER (Public Datasets) │ | |
| ├─────────────────────────────────────────────────────────┤ | |
| │ Census Bureau GID: 90,735 jurisdictions │ | |
| │ GSA .gov Registry: 12,000+ validated domains │ | |
| └─────────────────────────────────────────────────────────┘ | |
| │ | |
| ▼ | |
| ┌─────────────────────────────────────────────────────────┐ | |
| │ SILVER LAYER (Pattern-Based Discovery) │ | |
| ├─────────────────────────────────────────────────────────┤ | |
| │ │ | |
| │ Strategy 1: GSA Domain Matching │ | |
| │ • Direct lookup: "Sacramento County" → sacramento.gov │ | |
| │ • Fuzzy matching with 75%+ similarity │ | |
| │ • Confidence: 0.95-1.0 │ | |
| │ │ | |
| │ Strategy 2: URL Pattern Generation │ | |
| │ • Counties: co.{name}.{state}.us, {name}county.gov │ | |
| │ • Cities: www.{name}.gov, cityof{name}.gov │ | |
| │ • Schools: {name}.k12.{state}.us, {name}schools.org │ | |
| │ • Verify accessibility with HTTP HEAD/GET │ | |
| │ • Confidence: 0.6-0.9 │ | |
| │ │ | |
| │ Strategy 3: Web Crawling │ | |
| │ • Find "minutes" or "agendas" links on homepage │ | |
| │ • Detect CMS platforms (Granicus, CivicClerk, etc.) │ | |
| │ • Confidence boost for .gov domains │ | |
| │ │ | |
| └─────────────────────────────────────────────────────────┘ | |
| │ | |
| ▼ | |
| ┌─────────────────────────────────────────────────────────┐ | |
| │ GOLD LAYER (Scraping Targets) │ | |
| ├─────────────────────────────────────────────────────────┤ | |
| │ • Confidence score > 0.6 │ | |
| │ • Has minutes URL or known CMS │ | |
| │ • Prioritized by population & domain quality │ | |
| │ • Ready for scraper agents │ | |
| └─────────────────────────────────────────────────────────┘ | |
| ``` | |
| ## Usage | |
| ```bash | |
| # Run discovery (no API keys needed!) | |
| python main.py discover-jurisdictions --limit 100 | |
| # View statistics | |
| python main.py discovery-stats | |
| # Start scraping discovered sites | |
| python main.py scrape-batch --source discovered | |
| ``` | |
| ## Performance | |
| ### Expected Discovery Rates | |
| | Jurisdiction Type | Success Rate | Notes | | |
| |-------------------|--------------|-------| | |
| | Counties | **85-95%** | Best coverage (official .gov domains) | | |
| | Cities > 10k pop | **75-90%** | Good pattern matching | | |
| | School Districts | **70-85%** | Consistent naming conventions | | |
| | Townships | **50-65%** | More variation in URLs | | |
| ### Benchmarks | |
| - **100 jurisdictions**: ~3-5 minutes | |
| - **30,000 jurisdictions**: ~12-18 hours | |
| - **Total cost**: **$0** (no API fees) | |
| ## References | |
| - Census Bureau GID: https://www.census.gov/programs-surveys/gus.html | |
| - GSA .gov Domains: https://github.com/cisagov/dotgov-data | |
| - Government URL Best Practices: https://digital.gov/topics/url-management/ | |
| For detailed documentation, see [JURISDICTION_DISCOVERY_SETUP.md](JURISDICTION_DISCOVERY_SETUP.md) | |
| --- | |
| **Production-ready with zero external API dependencies!** 🦷✨ | |
| # Jurisdiction Discovery System | |
| ## Overview | |
| The **Jurisdiction Discovery System** automatically identifies and tracks over 90,000 local government units across the United States, discovering their official websites and meeting minutes URLs. | |
| ## Architecture | |
| ### Medallion Pipeline | |
| ``` | |
| ┌─────────────────────────────────────────────────────────┐ | |
| │ BRONZE LAYER │ | |
| │ (Raw Data Ingestion) │ | |
| ├─────────────────────────────────────────────────────────┤ | |
| │ │ | |
| │ Census Bureau Data: │ | |
| │ • 3,143 counties │ | |
| │ • 19,495 municipalities │ | |
| │ • 16,504 townships │ | |
| │ • 13,051 school districts │ | |
| │ • 38,542 special districts │ | |
| │ ──────────────── │ | |
| │ Total: ~90,735 jurisdictions │ | |
| │ │ | |
| │ GSA .gov Domain List: │ | |
| │ • 12,000+ validated .gov domains │ | |
| │ • Domain type classification │ | |
| │ • Organization mapping │ | |
| │ │ | |
| └─────────────────────────────────────────────────────────┘ | |
| │ | |
| ▼ | |
| ┌─────────────────────────────────────────────────────────┐ | |
| │ SILVER LAYER │ | |
| │ (URL Discovery & Validation) │ | |
| ├─────────────────────────────────────────────────────────┤ | |
| │ │ | |
| │ For each jurisdiction: │ | |
| │ 1. Search for official website (Google/Bing API) │ | |
| │ 2. Validate against .gov domain list │ | |
| │ 3. Crawl homepage for "minutes" links │ | |
| │ 4. Detect CMS platform (Granicus, CivicClerk, etc.) │ | |
| │ 5. Assign confidence score │ | |
| │ │ | |
| │ Output: Discovered URLs with metadata │ | |
| │ │ | |
| └─────────────────────────────────────────────────────────┘ | |
| │ | |
| ▼ | |
| ┌─────────────────────────────────────────────────────────┐ | |
| │ GOLD LAYER │ | |
| │ (Scraping Targets) │ | |
| ├─────────────────────────────────────────────────────────┤ | |
| │ │ | |
| │ Filtered & Prioritized: │ | |
| │ • Has minutes URL │ | |
| │ • Confidence score > 0.6 │ | |
| │ • Preferably .gov domain │ | |
| │ • Population-weighted priority │ | |
| │ │ | |
| │ Ready for: Scraper Agent │ | |
| │ │ | |
| └─────────────────────────────────────────────────────────┘ | |
| ``` | |
| ## Data Sources | |
| ### 1. U.S. Census Bureau - Government Integrated Directory (GID) | |
| **URL:** https://www.census.gov/programs-surveys/gus.html | |
| **What it provides:** | |
| - Complete list of all local governments | |
| - FIPS codes (Federal Information Processing Standards) | |
| - Population data | |
| - Functional status | |
| - Geographic hierarchy | |
| **Usage:** | |
| ```python | |
| from discovery.census_ingestion import CensusGovernmentIngestion | |
| census = CensusGovernmentIngestion() | |
| dataframes = await census.ingest_all_jurisdictions() | |
| ``` | |
| ### 2. GSA .gov Domain List | |
| **URL:** https://github.com/cisagov/dotgov-data | |
| **What it provides:** | |
| - All registered .gov domains (~12,000+) | |
| - Domain type (Federal, State, County, City, etc.) | |
| - Organization name and location | |
| - Security contact information | |
| **Usage:** | |
| ```python | |
| from discovery.gsa_domains import GSADomainList | |
| gsa = GSADomainList() | |
| csv_path = await gsa.download_domain_list() | |
| domains_df = gsa.parse_domains(csv_path) | |
| ``` | |
| ### 3. Search API Integration | |
| **Supported:** | |
| - Google Custom Search API | |
| - Bing Search API | |
| **Configuration:** | |
| ```bash | |
| # .env | |
| GOOGLE_SEARCH_API_KEY=your_key_here | |
| GOOGLE_SEARCH_ENGINE_ID=your_engine_id | |
| BING_SEARCH_API_KEY=your_key_here | |
| ``` | |
| ## Usage | |
| ### Quick Start | |
| ```bash | |
| # Run complete discovery pipeline | |
| python -m discovery.discovery_pipeline | |
| # Or use CLI | |
| python main.py discover-jurisdictions --limit 100 | |
| ``` | |
| ### Step-by-Step | |
| ```python | |
| from discovery.discovery_pipeline import DiscoveryPipeline | |
| pipeline = DiscoveryPipeline() | |
| # 1. Bronze Layer: Ingest raw data | |
| await pipeline.run_bronze_ingestion() | |
| # 2. Silver Layer: Discover URLs | |
| await pipeline.run_url_discovery(limit=100) # limit for testing | |
| # 3. Gold Layer: Create scraping targets | |
| pipeline.create_scraping_targets() | |
| # Or run all at once | |
| await pipeline.run_full_pipeline(discovery_limit=100) | |
| ``` | |
| ## Output Tables | |
| ### Bronze Layer | |
| ``` | |
| bronze/jurisdictions/ | |
| ├── counties/ # 3,143 U.S. counties | |
| ├── municipalities/ # 19,495 cities/towns | |
| ├── townships/ # 16,504 townships | |
| ├── school_districts/ # 13,051 school districts | |
| ├── special_districts/ # 38,542 special districts | |
| └── unified/ # Combined view | |
| bronze/gov_domains/ # 12,000+ .gov domains | |
| ``` | |
| ### Silver Layer | |
| ``` | |
| silver/discovered_urls/ # URLs with metadata | |
| Columns: | |
| - jurisdiction_id (FIPS code) | |
| - jurisdiction_name | |
| - state | |
| - homepage_url | |
| - minutes_url | |
| - cms_platform | |
| - is_gov_domain | |
| - confidence_score | |
| - discovery_method | |
| - last_verified | |
| ``` | |
| ### Gold Layer | |
| ``` | |
| gold/scraping_targets/ # Ready for scraping | |
| Columns: | |
| - jurisdiction_id | |
| - jurisdiction_name | |
| - jurisdiction_type | |
| - state | |
| - population | |
| - homepage_url | |
| - minutes_url | |
| - cms_platform | |
| - priority_score | |
| - scraping_status | |
| - last_scraped | |
| - documents_found | |
| ``` | |
| ## URL Discovery Agent | |
| The `URLDiscoveryAgent` performs intelligent website discovery: | |
| ### Discovery Strategy | |
| 1. **Search Phase** | |
| ```python | |
| query = f"{jurisdiction_name} {state} {type} official website" | |
| urls = await search_google(query) | |
| ``` | |
| 2. **Validation Phase** | |
| ```python | |
| is_valid = validate_against_gsa_domains(url) | |
| ``` | |
| 3. **Crawling Phase** | |
| ```python | |
| minutes_url = await crawl_for_minutes(homepage) | |
| ``` | |
| 4. **CMS Detection** | |
| ```python | |
| cms = detect_cms_platform(url, html) | |
| # Detects: Granicus, CivicClerk, Municode, Laserfiche, etc. | |
| ``` | |
| ### Confidence Scoring | |
| ```python | |
| confidence = ( | |
| 1.0 if is_gov_domain else 0.5 + | |
| 0.3 if has_minutes_url else 0.0 + | |
| 0.2 if cms_detected else 0.0 | |
| ) | |
| ``` | |
| ## CMS Platform Detection | |
| The system automatically detects major government CMS platforms: | |
| | CMS Platform | Signature URLs | Estimated Usage | | |
| |--------------|----------------|-----------------| | |
| | Granicus/Legistar | granicus.com, legistar.com | 4,000+ cities | | |
| | CivicClerk | civicclerk.com, civicweb.net | 2,500+ cities | | |
| | Municode | municode.com | 3,500+ cities | | |
| | Laserfiche | laserfiche.com | 1,200+ cities | | |
| | PrimeGov | primegov.com | 800+ cities | | |
| ## Prioritization Strategy | |
| Jurisdictions are prioritized based on: | |
| ```python | |
| priority_score = ( | |
| 100 if is_gov_domain else 50 + | |
| 50 if cms_platform_detected else 0 + | |
| int(confidence_score * 100) + | |
| population_weight | |
| ) | |
| ``` | |
| ### Focus Areas | |
| **High Priority:** | |
| - All counties (public health authority) | |
| - Cities > 10,000 population (water fluoridation) | |
| - All school districts (dental screening programs) | |
| **Medium Priority:** | |
| - Cities 5,000-10,000 population | |
| - Special districts with health/water authority | |
| **Low Priority:** | |
| - Townships < 5,000 population | |
| - Non-health special districts | |
| ## API Requirements | |
| ### Google Custom Search API | |
| **Setup:** | |
| 1. Enable Custom Search API: https://console.cloud.google.com/ | |
| 2. Create Search Engine: https://cse.google.com/cse/all | |
| 3. Get API Key and Engine ID | |
| **Pricing:** $5 per 1,000 queries (first 100/day free) | |
| ### Bing Search API | |
| **Setup:** | |
| 1. Create Azure account: https://azure.microsoft.com/ | |
| 2. Create Bing Search resource | |
| 3. Get API key from Azure Portal | |
| **Pricing:** $3 per 1,000 queries | |
| ### Cost Estimation | |
| For 30,000 jurisdictions: | |
| - Google: $1,500 (30k queries @ $5/1k) | |
| - Bing: $900 (30k queries @ $3/1k) | |
| **Tip:** Mix free tier + one paid service to reduce costs. | |
| ## Performance | |
| ### Benchmarks | |
| - Census data download: ~30 seconds | |
| - GSA domain list download: ~5 seconds | |
| - URL discovery per jurisdiction: ~2-3 seconds | |
| - Full 100 jurisdiction discovery: ~5 minutes | |
| - Full 30,000 jurisdiction discovery: ~20-25 hours | |
| ### Optimization | |
| **Parallel Processing:** | |
| ```python | |
| batch_size = 10 # Process 10 jurisdictions simultaneously | |
| # Reduces 30,000 jurisdiction discovery to ~4-5 hours | |
| ``` | |
| **Caching:** | |
| - Census data: Cached for 7 days | |
| - GSA domains: Cached for 1 day | |
| - Search results: No cache (URLs change) | |
| ## Integration with Scraping | |
| Once discovery completes, scraping targets are ready: | |
| ```python | |
| # Load scraping targets | |
| targets_df = spark.read.format("delta").load("gold/scraping_targets") | |
| # Filter by priority | |
| high_priority = targets_df.filter(col("priority_score") > 150) | |
| # Pass to scraper agent | |
| for target in high_priority.collect(): | |
| await scraper_agent.scrape( | |
| url=target.minutes_url, | |
| jurisdiction_id=target.jurisdiction_id, | |
| cms_platform=target.cms_platform | |
| ) | |
| ``` | |
| ## Monitoring | |
| ### Track Discovery Progress | |
| ```sql | |
| -- Discovery success rate | |
| SELECT | |
| COUNT(*) as total, | |
| COUNT(homepage_url) as homepages_found, | |
| COUNT(minutes_url) as minutes_found, | |
| AVG(confidence_score) as avg_confidence | |
| FROM silver.discovered_urls; | |
| -- By state | |
| SELECT | |
| state, | |
| COUNT(*) as total, | |
| COUNT(minutes_url) as with_minutes, | |
| ROUND(COUNT(minutes_url) * 100.0 / COUNT(*), 1) as success_rate | |
| FROM silver.discovered_urls | |
| GROUP BY state | |
| ORDER BY success_rate DESC; | |
| ``` | |
| ### Track Scraping Status | |
| ```sql | |
| -- Scraping progress | |
| SELECT | |
| scraping_status, | |
| COUNT(*) as count | |
| FROM gold.scraping_targets | |
| GROUP BY scraping_status; | |
| -- Documents found | |
| SELECT | |
| jurisdiction_type, | |
| SUM(documents_found) as total_docs | |
| FROM gold.scraping_targets | |
| GROUP BY jurisdiction_type; | |
| ``` | |
| ## Troubleshooting | |
| ### Issue: Low discovery rate | |
| **Solution:** Check search API keys and quotas | |
| ```bash | |
| # Test Google API | |
| curl "https://www.googleapis.com/customsearch/v1?key=YOUR_KEY&cx=YOUR_CX&q=test" | |
| # Test Bing API | |
| curl -H "Ocp-Apim-Subscription-Key: YOUR_KEY" \ | |
| "https://api.bing.microsoft.com/v7.0/search?q=test" | |
| ``` | |
| ### Issue: Census download fails | |
| **Solution:** Use cached data or download manually | |
| ```python | |
| # Manual download | |
| from discovery.census_ingestion import CensusGovernmentIngestion | |
| census = CensusGovernmentIngestion() | |
| # Download specific type | |
| csv_path = await census.download_census_data("municipalities") | |
| ``` | |
| ### Issue: Memory errors with large datasets | |
| **Solution:** Increase Spark memory or process in batches | |
| ```python | |
| # Increase memory | |
| spark = SparkSession.builder \ | |
| .config("spark.driver.memory", "8g") \ | |
| .config("spark.executor.memory", "8g") \ | |
| .getOrCreate() | |
| # Or process in state-by-state batches | |
| for state in ["CA", "TX", "NY", ...]: | |
| df = jurisdictions_df.filter(col("state") == state) | |
| await discover_batch(df) | |
| ``` | |
| ## Next Steps | |
| 1. **Run initial discovery** | |
| ```bash | |
| python main.py discover-jurisdictions --limit 1000 | |
| ``` | |
| 2. **Review results** | |
| ```bash | |
| python main.py show-discovery-stats | |
| ``` | |
| 3. **Start scraping** | |
| ```bash | |
| python main.py scrape --source discovered | |
| ``` | |
| ## References | |
| - Census Bureau GID: https://www.census.gov/programs-surveys/gus.html | |
| - GSA .gov Domains: https://github.com/cisagov/dotgov-data | |
| - Google Custom Search: https://developers.google.com/custom-search | |
| - Bing Search API: https://www.microsoft.com/en-us/bing/apis/bing-web-search-api | |
| --- | |
| **Your jurisdiction discovery system is now ready to identify tens of thousands of local governments for oral health policy monitoring!** 🦷✨ | |