
Scale and Search Patterns: End-to-End Civic Tech Projects

This guide analyzes six civic tech projects focused on full-stack deployments, large-scale data aggregation, and public search portals: five new ones plus the Council Data Project (CDP), which is already integrated. They complement our existing integration (Civic Scraper, City Scrapers, CDP, Engagic, Councilmatic) with new patterns for:

  • 🤖 AI summarization (OpenTowns, MeetingBank)
  • 🔍 Multi-jurisdiction search (CivicBand, LocalView)
  • 🔔 Keyword alerting (OpenTowns)
  • 📊 Research-grade pipelines (LocalView, MeetingBank)
  • 🌍 International adaptability (OpenCouncil)

🎯 What's NEW vs. Our Existing Integration

| Pattern | Already Have | NEW from These Projects |
|---|---|---|
| Platform detection | ✅ Civic Scraper | - |
| Event schema | ✅ City Scrapers | - |
| Video ingestion | ✅ CDP | ✅ LocalView scale patterns |
| Matter tracking | ✅ Engagic | - |
| Search UX | ✅ Councilmatic | ✅ CivicBand cross-jurisdiction |
| AI Summarization | ❌ | ✅ OpenTowns, MeetingBank |
| Keyword Alerts | ❌ | ✅ OpenTowns |
| Scale (1,000+ jurisdictions) | ⚠️ Partial | ✅ CivicBand, LocalView |
| International patterns | ❌ | ✅ OpenCouncil |

📚 Project Analysis

1. Council Data Project (CDP) ⭐ Already Integrated

Status: Already documented in INTEGRATION_GUIDE.md

Key patterns we already use:

  • Video transcript ingestion
  • Searchable transcript storage
  • Event indexing pipeline

See: docs/INTEGRATION_GUIDE.md Section 4


2. OpenTowns 🆕 AI Summarization Pioneer

Website: https://opentowns.org
License: Open civic-tech (check specific repo)
Focus: Small towns, AI-generated summaries, keyword alerts

🔥 What to Adopt

A. AI Summarization Pattern

# They generate readable summaries from raw transcripts/PDFs
# Pattern: transcript → summary → key decisions

from openai import OpenAI
from models.meeting_event import MeetingEvent

def generate_meeting_summary(event: MeetingEvent, transcript: str) -> dict:
    """
    OpenTowns pattern: Generate human-readable meeting summaries.
    
    Returns:
        {
            'executive_summary': str,      # 2-3 sentences
            'key_decisions': list[str],     # Bullet points
            'health_policy_items': list[str],  # Filtered for oral health
            'next_actions': list[str]       # Follow-up items
        }
    """
    client = OpenAI()
    
    prompt = f"""
    Summarize this local government meeting for public understanding.
    
    Meeting: {event.title}
    Date: {event.start.strftime('%B %d, %Y')}
    Transcript (first 10,000 characters): {transcript[:10000]}
    
    Provide:
    1. Executive summary (2-3 sentences)
    2. Key decisions made (bullet points)
    3. Health policy items (if any)
    4. Next actions/follow-ups
    
    Focus on: What decisions were made? What happens next?
    """
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cost-effective for summaries
        messages=[
            {"role": "system", "content": "You are a civic engagement assistant helping residents understand local government."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3  # Lower for factual accuracy
    )
    
    # Parse response into structured format
    summary_text = response.choices[0].message.content
    
    return {
        'executive_summary': extract_section(summary_text, 'Executive summary'),
        'key_decisions': extract_bullets(summary_text, 'Key decisions'),
        'health_policy_items': extract_bullets(summary_text, 'Health policy'),
        'next_actions': extract_bullets(summary_text, 'Next actions'),
        'raw_summary': summary_text
    }
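
The function above calls `extract_section` and `extract_bullets` without defining them. The sketch below is one possible implementation, not OpenTowns code. Both helper bodies are hypothetical, and they assume the model answers with the numbered headings requested in the prompt and uses `-`, `*`, or `•` for bullets.

```python
import re
from typing import List

# A new section starts at a numbered line ("2. ..."), a markdown heading, or a bold line.
_NEXT_HEADING = re.compile(r"^(?:#+\s|\*\*|\d+[.)]\s)")


def extract_section(text: str, heading: str) -> str:
    """Return the lines under `heading`, stopping at the next heading (hypothetical helper)."""
    collected: List[str] = []
    in_section = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_section:
            if heading.lower() in stripped.lower():
                in_section = True
                # Keep any text that follows "Heading:" on the same line.
                rest = stripped.partition(":")[2].strip(" *")
                if rest:
                    collected.append(rest)
            continue
        if _NEXT_HEADING.match(stripped):
            break
        if stripped:
            collected.append(stripped)
    return "\n".join(collected)


def extract_bullets(text: str, heading: str) -> List[str]:
    """Return the bullet items under `heading` with their markers removed (hypothetical helper)."""
    items = []
    for line in extract_section(text, heading).splitlines():
        item = re.sub(r"^[-*•]\s*", "", line).strip()
        if item:
            items.append(item)
    return items
```

Free-text parsing like this is fragile. If the model drifts from the requested layout, the helpers return empty results, so keeping `raw_summary` alongside the parsed fields (as the function above does) is a sensible fallback.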

B. Keyword Alert System

# OpenTowns sends alerts when keywords appear in meetings
# Pattern: Watch list → match detection → user notification

from typing import List, Dict
import re

from models.meeting_event import MeetingEvent  # Needed for the generate_alert type hint

class KeywordAlertSystem:
    """
    OpenTowns pattern: Alert users when keywords appear in meetings.
    """
    
    # Oral health keyword categories
    KEYWORD_CATEGORIES = {
        'fluoridation': [
            'fluoride', 'fluoridation', 'water treatment',
            'community water fluoridation', 'CWF'
        ],
        'dental_access': [
            'dental', 'dentist', 'oral health', 'teeth',
            'medicaid dental', 'dental clinic'
        ],
        'public_health': [
            'health department', 'public health', 'CDC',
            'preventive care', 'health equity'
        ]
    }
    
    def detect_keywords(self, text: str) -> Dict[str, List[str]]:
        """
        Find all matching keywords in text.
        
        Returns: {'fluoridation': ['fluoride', 'CWF'], ...}
        """
        text_lower = text.lower()
        matches = {}
        
        for category, keywords in self.KEYWORD_CATEGORIES.items():
            found = []
            for keyword in keywords:
                # Word boundary matching
                pattern = r'\b' + re.escape(keyword.lower()) + r'\b'
                if re.search(pattern, text_lower):
                    found.append(keyword)
            
            if found:
                matches[category] = found
        
        return matches
    
    def generate_alert(self, event: MeetingEvent, matches: Dict[str, List[str]]) -> dict:
        """
        Create alert notification for users.
        """
        return {
            'alert_type': 'keyword_match',
            'jurisdiction': f"{event.jurisdiction_name}, {event.state_code}",
            'meeting_title': event.title,
            'meeting_date': event.start.isoformat(),
            'categories_matched': list(matches.keys()),
            'keywords_found': [kw for kws in matches.values() for kw in kws],
            'meeting_url': event.source,
            'priority': 'high' if 'fluoridation' in matches else 'medium'
        }
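
The class above covers the watch list and match detection, but not the final notification step. The sketch below shows one way to decide who should receive an alert. `Subscription` and `recipients_for_alert` are hypothetical names, and the alert dictionary is assumed to have the shape returned by `generate_alert`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Subscription:
    """Hypothetical watch-list entry: one user plus the categories and states they follow."""
    email: str
    categories: Set[str]
    states: Set[str] = field(default_factory=set)  # Empty set means "all states"


def recipients_for_alert(alert: Dict, subscriptions: List[Subscription]) -> List[str]:
    """Return the addresses whose watch list overlaps the alert's categories and state."""
    # generate_alert formats 'jurisdiction' as "Name, ST", so the state is the last part.
    state = alert["jurisdiction"].rsplit(", ", 1)[-1]
    matched = set(alert["categories_matched"])
    return [
        sub.email
        for sub in subscriptions
        if sub.categories & matched and (not sub.states or state in sub.states)
    ]
```

Delivery (email, webhook) is left out on purpose; the returned list can be handed to whatever sender the project adopts.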

Implementation Priority: 🔥 HIGH - Summaries make data usable for advocates


3. LocalView 🆕 Research-Grade Scale

Website: https://www.localview.net
Project page: https://mellonurbanism.harvard.edu/localview
License: Open-source data pipeline
Scale: Nationwide coverage, largest public dataset

🔥 What to Adopt

A. Scale Architecture Patterns

LocalView handles thousands of jurisdictions with:

  1. Batch processing (not real-time)
  2. Distributed storage (videos + transcripts)
  3. Quality metrics (completeness scoring)

# LocalView pattern: Process jurisdictions in batches with quality tracking

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class JurisdictionQuality:
    """
    LocalView pattern: Track data quality per jurisdiction.
    """
    jurisdiction_name: str
    state_code: str
    
    # Completeness metrics
    total_meetings_expected: int  # Based on calendar
    total_meetings_found: int
    meetings_with_agendas: int
    meetings_with_minutes: int
    meetings_with_videos: int
    meetings_with_transcripts: int
    
    # Freshness
    last_scraped: datetime
    last_meeting_found: Optional[datetime]
    scraping_frequency: str  # 'daily', 'weekly', 'monthly'
    
    # Health metrics
    consecutive_failures: int
    last_success: Optional[datetime]
    
    @property
    def completeness_score(self) -> float:
        """
        Overall data quality score (0-100).
        """
        if self.total_meetings_expected == 0:
            return 0.0
        
        found_rate = self.total_meetings_found / self.total_meetings_expected
        agenda_rate = self.meetings_with_agendas / max(self.total_meetings_found, 1)
        minutes_rate = self.meetings_with_minutes / max(self.total_meetings_found, 1)
        
        # Weighted average
        score = (
            found_rate * 40 +      # 40%: Finding meetings
            agenda_rate * 30 +      # 30%: Having agendas
            minutes_rate * 30       # 30%: Having minutes
        )
        
        return min(score, 100.0)  # Weights already sum to 100, so no further scaling
    
    @property
    def health_status(self) -> str:
        """
        Scraper health: healthy, degraded, failed
        """
        if self.consecutive_failures >= 5:
            return 'failed'
        elif self.consecutive_failures >= 2:
            return 'degraded'
        else:
            return 'healthy'
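
As a worked example of the weighting above, here is a standalone sketch of the same arithmetic. It assumes the weights are percentage points (40 + 30 + 30 = 100) and that the found rate is capped at 1.0; the function name and the sample counts are illustrative only.

```python
def completeness_score(expected: int, found: int, with_agendas: int, with_minutes: int) -> float:
    """Weighted completeness on a 0-100 scale (standalone version of the property above)."""
    if expected == 0:
        return 0.0
    found_rate = min(found / expected, 1.0)   # Cap so extra meetings cannot exceed 100%
    agenda_rate = with_agendas / max(found, 1)
    minutes_rate = with_minutes / max(found, 1)
    return found_rate * 40 + agenda_rate * 30 + minutes_rate * 30


# 18 of 24 expected meetings found, 12 with agendas, 9 with minutes:
#   0.75 * 40 + (12/18) * 30 + 0.5 * 30 = 30 + 20 + 15
print(round(completeness_score(24, 18, 12, 9), 1))  # 65.0
```

Note that videos and transcripts are tracked on the dataclass but do not enter the score; whether they should is a product decision this guide leaves open.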

B. Batch Processing Strategy

# LocalView processes in batches, not all-at-once

from pyspark.sql import SparkSession
from typing import Iterator

def process_jurisdictions_in_batches(
    spark: SparkSession,
    batch_size: int = 100,
    priority_filter: str = 'high'
) -> Iterator[dict]:
    """
    LocalView pattern: Process large numbers of jurisdictions efficiently.
    
    Strategy:
    1. Load high-priority jurisdictions first
    2. Process in batches to manage memory
    3. Track quality metrics per batch
    4. Resume from failures
    """
    # Load targets from Gold layer
    targets_df = spark.read.format("delta").load("data/delta/gold/scraping_targets")
    
    # Filter and sort
    priority_targets = targets_df \
        .filter(f"priority_tier = '{priority_filter}'") \
        .orderBy("priority_score", ascending=False)
    
    total_targets = priority_targets.count()
    
    # Process in batches
    for offset in range(0, total_targets, batch_size):
        batch_df = priority_targets.offset(offset).limit(batch_size)  # Skip first, then take; offset() needs PySpark 3.4+
        
        batch_results = {
            'batch_number': offset // batch_size + 1,
            'batch_size': batch_size,
            'jurisdictions_processed': 0,
            'meetings_found': 0,
            'errors': []
        }
        
        for row in batch_df.collect():
            try:
                # Scrape jurisdiction
                meetings = scrape_jurisdiction(row['url'], row['platform'])
                batch_results['jurisdictions_processed'] += 1
                batch_results['meetings_found'] += len(meetings)
                
            except Exception as e:
                batch_results['errors'].append({
                    'jurisdiction': row['jurisdiction_name'],
                    'error': str(e)
                })
        
        yield batch_results
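
The docstring above lists "Resume from failures" as a step, but the function does not show it. Below is a minimal sketch of one way to do it without Spark: record the last completed batch in a small JSON checkpoint and skip ahead on restart. The helper names and the checkpoint layout are assumptions, not LocalView's implementation.

```python
import json
from pathlib import Path
from typing import Iterator, List, Sequence, Tuple


def iter_batches(
    items: Sequence[str], batch_size: int, start_batch: int = 1
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (batch_number, batch) pairs, skipping batches before start_batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    first_offset = (start_batch - 1) * batch_size
    for offset in range(first_offset, len(items), batch_size):
        yield offset // batch_size + 1, list(items[offset:offset + batch_size])


def load_next_batch(checkpoint: Path) -> int:
    """Return the batch number to start from (1 when no checkpoint exists yet)."""
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())["last_completed_batch"] + 1
    return 1


def save_checkpoint(checkpoint: Path, batch_number: int) -> None:
    """Record that every batch up to batch_number finished."""
    checkpoint.write_text(json.dumps({"last_completed_batch": batch_number}))
```

A run would call `load_next_batch`, iterate with `iter_batches(..., start_batch=...)`, and call `save_checkpoint` after each batch. This only works if the jurisdiction list keeps the same order between runs, which is why the Spark version sorts by `priority_score` first.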

Implementation Priority: 🔥 HIGH - Essential for scaling to 32,333 municipalities


4. MeetingBank 🆕 Summarization Research

Website: https://meetingbank.github.io
GitHub: Linked from site
License: Open dataset
Focus: 6 cities, high-quality summarization benchmark

🔥 What to Adopt

A. Summarization Quality Benchmarks

MeetingBank is a benchmark dataset used in academic summarization research. It provides:

  • Gold-standard human summaries (for validation)
  • Multiple summary lengths (short, medium, long)
  • Evaluation metrics (ROUGE, BERTScore)

# MeetingBank pattern: Validate AI summaries against quality benchmarks

from typing import Any, Dict

class SummaryQualityValidator:
    """
    MeetingBank pattern: Ensure AI summaries meet quality standards.
    """
    
    # Quality thresholds from academic research
    MIN_ROUGE_L = 0.25  # ROUGE-L F1 score (not applied below; needs a reference summary)
    MIN_LENGTH_RATIO = 0.05  # Summary should be 5-20% of original
    MAX_LENGTH_RATIO = 0.20
    
    def validate_summary(self, original: str, summary: str) -> Dict[str, Any]:
        """
        Check if summary meets quality standards.
        """
        # Length checks
        orig_words = len(original.split())
        summ_words = len(summary.split())
        length_ratio = summ_words / orig_words if orig_words > 0 else 0
        
        # Basic quality checks
        checks = {
            'length_appropriate': self.MIN_LENGTH_RATIO <= length_ratio <= self.MAX_LENGTH_RATIO,
            'has_key_terms': self._check_key_terms(original, summary),
            'no_repetition': self._check_repetition(summary),
            'proper_structure': self._check_structure(summary),
        }
        
        return {
            'passes_validation': all(checks.values()),
            'checks': checks,
            'length_ratio': length_ratio,
            'word_count': summ_words,
            'quality_score': sum(checks.values()) / len(checks)
        }
    
    def _check_key_terms(self, original: str, summary: str) -> bool:
        """
        Ensure summary includes key terms from original.
        """
        # Extract important terms (simplified - use TF-IDF in production)
        orig_words = set(original.lower().split())
        summ_words = set(summary.lower().split())
        
        # At least 30% overlap of unique terms (an empty original cannot be validated)
        if not orig_words:
            return False
        overlap = len(orig_words & summ_words) / len(orig_words)
        return overlap >= 0.30
    
    def _check_repetition(self, summary: str) -> bool:
        """
        Check for excessive repetition (indicates poor quality).
        """
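        # Normalize and drop empty fragments (e.g. the one after the final period)
        sentences = [s.strip().lower() for s in summary.split('.') if s.strip()]
        unique_ratio = len(set(sentences)) / len(sentences) if sentences else 0
        return unique_ratio >= 0.80  # At least 80% unique sentences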
    
    def _check_structure(self, summary: str) -> bool:
        """
        Check for proper summary structure.
        """
        # Should have multiple sentences
        sentences = [s.strip() for s in summary.split('.') if s.strip()]
        return 2 <= len(sentences) <= 10

Implementation Priority: 🟡 MEDIUM - Important for quality, but MVP can use basic summaries


5. CivicBand 🆕 Multi-Jurisdiction Search

Website: https://civic.band
GitHub: Linked from site (Raft Foundation)
Scale: 1,000+ municipalities
Focus: Google-like search across jurisdictions

🔥 What to Adopt

A. Cross-Jurisdiction Search Architecture

CivicBand lets users search "fluoridation" and get results from all municipalities at once.

# CivicBand pattern: Federated search across jurisdictions

from elasticsearch import Elasticsearch  # Or Meilisearch for open-source
from typing import List, Dict
from models.meeting_event import MeetingEvent

class CrossJurisdictionSearch:
    """
    CivicBand pattern: Search meetings across all jurisdictions.
    """
    
    def __init__(self):
        # Use Meilisearch (open-source) or Elasticsearch
        self.es = Elasticsearch(['http://localhost:9200'])
        self.index_name = 'meeting_events'
    
    def index_meeting(self, event: MeetingEvent):
        """
        Add meeting to search index.
        """
        doc = {
            'id': event.id,
            'title': event.title,
            'description': event.description,
            'jurisdiction': event.jurisdiction_name,
            'state': event.state_code,
            'date': event.start.isoformat(),
            'full_text': self._build_searchable_text(event),
            'agenda_url': next((link.href for link in event.links if 'agenda' in link.title.lower()), None),
            'oral_health_relevant': event.oral_health_relevant,
            'keywords': event.keywords_found
        }
        
        self.es.index(index=self.index_name, id=event.id, document=doc)
    
    def search(
        self,
        query: str,
        states: List[str] = None,
        date_range: tuple = None,
        oral_health_only: bool = False
    ) -> List[Dict]:
        """
        Search across all jurisdictions.
        
        Example:
            search("fluoridation", states=['AL', 'GA'], oral_health_only=True)
        """
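        must_clauses = [
            {"multi_match": {
                "query": query,
                "fields": ["title^3", "description^2", "full_text"],  # Boost title matches
                "type": "best_fields"
            }}
        ]
        
        # Filters narrow the result set without affecting relevance scores
        filter_clauses = []
        
        # Filter by state (exact match on the keyword sub-field, as in the facet query;
        # assumes the default dynamic mapping, which lowercases the analyzed text field)
        if states:
            filter_clauses.append({"terms": {"state.keyword": states}})
        
        # Filter by date range
        if date_range:
            filter_clauses.append({
                "range": {"date": {"gte": date_range[0], "lte": date_range[1]}}
            })
        
        # Filter oral health only
        if oral_health_only:
            filter_clauses.append({"term": {"oral_health_relevant": True}})
        
        search_query = {
            "query": {"bool": {"must": must_clauses, "filter": filter_clauses}},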
            "size": 100,
            "highlight": {
                "fields": {
                    "title": {},
                    "description": {},
                    "full_text": {"fragment_size": 150}
                }
            },
            "sort": [
                {"_score": "desc"},
                {"date": "desc"}
            ]
        }
        
        results = self.es.search(index=self.index_name, body=search_query)
        
        return [{
            'jurisdiction': hit['_source']['jurisdiction'],
            'state': hit['_source']['state'],
            'title': hit['_source']['title'],
            'date': hit['_source']['date'],
            'snippet': hit.get('highlight', {}).get('full_text', [''])[0],
            'url': hit['_source']['agenda_url'],
            'relevance_score': hit['_score']
        } for hit in results['hits']['hits']]
    
    def _build_searchable_text(self, event: MeetingEvent) -> str:
        """
        Combine all text fields for indexing.
        """
        parts = [
            event.title or '',
            event.description or '',
            ' '.join(event.keywords_found),
            ' '.join(link.title for link in event.links)
        ]
        return ' '.join(parts)
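
The class above relies on whatever mapping Elasticsearch infers when the first document is indexed. The sketch below spells out an explicit mapping for the same document shape, so that exact-match fields are predictable. The constant and helper names are hypothetical; in the 8.x Python client a dictionary like this can be passed as the `mappings` argument of `indices.create`.

```python
# Hypothetical explicit mapping for the 'meeting_events' index used above.
MEETING_EVENTS_MAPPINGS = {
    "properties": {
        "title": {"type": "text"},
        "description": {"type": "text"},
        "full_text": {"type": "text"},
        # Analyzed for search, plus an untokenized sub-field for filters and facets.
        "jurisdiction": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "state": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "date": {"type": "date"},
        "agenda_url": {"type": "keyword"},
        "oral_health_relevant": {"type": "boolean"},
        "keywords": {"type": "keyword"},
    }
}


def exact_match_fields(mappings: dict) -> list:
    """List field paths that hold untokenized values (usable in terms filters and aggregations)."""
    paths = []
    for name, spec in mappings["properties"].items():
        if spec["type"] == "keyword":
            paths.append(name)
        for sub_name, sub_spec in spec.get("fields", {}).items():
            if sub_spec["type"] == "keyword":
                paths.append(f"{name}.{sub_name}")
    return paths
```

Keeping `jurisdiction` and `state` as text with a `keyword` sub-field mirrors what dynamic mapping would produce for strings, so queries that reference `jurisdiction.keyword` or `state.keyword` keep working.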

B. Jurisdiction Faceting

# CivicBand shows result counts by jurisdiction.
# Add this as a method of CrossJurisdictionSearch: it uses self.es and self.index_name.

def get_search_facets(self, query: str) -> Dict[str, int]:
    """
    Show how many results per jurisdiction.
    
    Example output:
        {
            'Birmingham, AL': 12,
            'Atlanta, GA': 8,
            'Montgomery, AL': 5
        }
    """
    search_query = {
        "query": {"multi_match": {"query": query, "fields": ["title", "full_text"]}},
        "size": 0,  # We only want aggregations
        "aggs": {
            "by_jurisdiction": {
                "terms": {
                    "field": "jurisdiction.keyword",
                    "size": 50  # Top 50 jurisdictions
                },
                "aggs": {
                    "by_state": {
                        "terms": {"field": "state.keyword"}
                    }
                }
            }
        }
    }
    
    results = self.es.search(index=self.index_name, body=search_query)
    
    facets = {}
    for bucket in results['aggregations']['by_jurisdiction']['buckets']:
        jurisdiction = bucket['key']
        count = bucket['doc_count']
        state = bucket['by_state']['buckets'][0]['key']
        facets[f"{jurisdiction}, {state}"] = count
    
    return facets

Implementation Priority: 🟡 MEDIUM - Valuable for end-users, but scraping comes first


6. OpenCouncil 🆕 International Adaptability

Website: https://opencouncil.gr
GitHub: https://github.com/schemalabz/opencouncil
License: Open-source
Focus: Greek councils, but adaptable to U.S.

🔥 What to Adopt

A. Internationalization Patterns

OpenCouncil works in Greece (different government structure). This teaches us:

  • Flexible schema (not hardcoded to U.S. structures)
  • Configurable jurisdiction types (councils, boards, commissions)
  • Multi-language support (not needed now, but good architecture)

# OpenCouncil pattern: Flexible jurisdiction configuration

from enum import Enum
from dataclasses import dataclass
from typing import List, Optional

class GovernmentLevel(Enum):
    """
    OpenCouncil pattern: Support multiple government structures.
    """
    MUNICIPAL = "municipal"          # City/town councils
    COUNTY = "county"                # County boards
    TOWNSHIP = "township"            # Township boards
    SCHOOL_DISTRICT = "school"       # School boards
    SPECIAL_DISTRICT = "special"     # Water, fire, etc.
    STATE = "state"                  # State agencies (future)

@dataclass
class JurisdictionConfig:
    """
    OpenCouncil pattern: Configure each jurisdiction's unique structure.
    """
    jurisdiction_name: str
    government_level: GovernmentLevel
    
    # Meeting schedule
    typical_meeting_frequency: str  # 'weekly', 'biweekly', 'monthly'
    typical_meeting_days: List[str]  # ['Monday', 'Thursday']
    typical_meeting_time: str  # '18:00'
    
    # Website structure
    calendar_url: Optional[str]
    agenda_url_pattern: Optional[str]  # Template: "https://example.gov/agenda-{date}"
    minutes_url_pattern: Optional[str]
    
    # Legislative bodies
    bodies: List[str]  # ['City Council', 'Planning Commission', 'Board of Health']
    
    # Custom fields
    metadata: dict  # For jurisdiction-specific data

# Example: Configure Birmingham, AL
BIRMINGHAM_CONFIG = JurisdictionConfig(
    jurisdiction_name="Birmingham",
    government_level=GovernmentLevel.MUNICIPAL,
    typical_meeting_frequency='biweekly',
    typical_meeting_days=['Tuesday'],
    typical_meeting_time='18:00',
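    calendar_url="https://birminghamal.gov/council/meetings",
    agenda_url_pattern=None,   # Required field with no default; pattern not known here
    minutes_url_pattern=None,  # Required field with no default; pattern not known here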
    bodies=['City Council', 'Board of Health', 'Planning Commission'],
    metadata={'population': 200733, 'oral_health_priority': 'high'}
)
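
The `agenda_url_pattern` and `minutes_url_pattern` fields hold templates such as "https://example.gov/agenda-{date}". The sketch below shows one way such a template could be expanded for a given meeting date. The helper name and the default date format are assumptions, since real sites format dates in many different ways.

```python
from datetime import date
from typing import Optional


def expand_url_pattern(
    pattern: Optional[str], meeting_date: date, date_format: str = "%Y-%m-%d"
) -> Optional[str]:
    """Fill the {date} placeholder of a URL template, or return None if no template is set."""
    if pattern is None:
        return None
    # str.format treats every {...} as a placeholder, so templates must not
    # contain other braces unless they are doubled ("{{" and "}}").
    return pattern.format(date=meeting_date.strftime(date_format))
```

For example, `expand_url_pattern("https://example.gov/agenda-{date}", date(2025, 3, 4))` returns `"https://example.gov/agenda-2025-03-04"`. A per-jurisdiction `date_format` could live in the `metadata` dictionary.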

Implementation Priority: 🟢 LOW - Good architecture, but not urgent


🎯 Implementation Roadmap

Phase 1: AI Summarization (OpenTowns pattern) 🔥

Priority: HIGH
Timeline: 1-2 weeks
Depends on: Existing OpenAI integration

# TODO: Implement in extraction/summarizer.py
- [ ] Generate executive summaries from meeting transcripts
- [ ] Extract key decisions as bullet points
- [ ] Identify health policy items
- [ ] Add quality validation (MeetingBank patterns)

Phase 2: Keyword Alerts (OpenTowns pattern) 🔥

Priority: HIGH
Timeline: 1 week
Depends on: Meeting data ingestion

# TODO: Implement in alerts/keyword_monitor.py
- [ ] Define oral health keyword categories
- [ ] Pattern matching with word boundaries
- [ ] Generate alerts for users
- [ ] Email/webhook notification system

Phase 3: Scale Architecture (LocalView pattern) 🔥

Priority: HIGH
Timeline: 2 weeks
Depends on: Platform scrapers

# TODO: Implement in discovery/batch_processor.py
- [ ] Quality metrics per jurisdiction
- [ ] Batch processing (100 at a time)
- [ ] Failure tracking and retry
- [ ] Completeness scoring

Phase 4: Multi-Jurisdiction Search (CivicBand pattern) 🟡

Priority: MEDIUM
Timeline: 2-3 weeks
Depends on: Significant meeting data

# TODO: Implement in search/federated_search.py
- [ ] Set up Elasticsearch or Meilisearch
- [ ] Index all meetings
- [ ] Cross-jurisdiction search API
- [ ] Jurisdiction faceting

Phase 5: Quality Validation (MeetingBank pattern) 🟡

Priority: MEDIUM
Timeline: 1 week
Depends on: AI summarization

# TODO: Implement in extraction/quality_validator.py
- [ ] Summary length validation
- [ ] Key term extraction
- [ ] Repetition detection
- [ ] Structure checking

Phase 6: Flexible Config (OpenCouncil pattern) 🟢

Priority: LOW
Timeline: 1 week
Depends on: None

# TODO: Implement in config/jurisdiction_configs.py
- [ ] Per-jurisdiction configuration
- [ ] Meeting schedule patterns
- [ ] Legislative body tracking

📊 Comparison with Existing Integration

| Capability | Original 5 Projects | New 6 Projects | Status |
|---|---|---|---|
| Platform detection | ✅ Civic Scraper | - | Complete |
| Event schema | ✅ City Scrapers | - | Complete |
| Video ingestion | ✅ CDP | ✅ LocalView (scale) | Need scale patterns |
| Matter tracking | ✅ Engagic | - | Complete |
| Person/vote tracking | ✅ Councilmatic | - | Roadmapped |
| AI Summarization | ❌ | ✅ OpenTowns, MeetingBank | TODO: High priority |
| Keyword Alerts | ❌ | ✅ OpenTowns | TODO: High priority |
| Cross-jurisdiction search | ⚠️ Basic | ✅ CivicBand | TODO: Medium priority |
| Quality metrics | ❌ | ✅ LocalView, MeetingBank | TODO: Medium priority |
| Batch processing | ⚠️ Basic | ✅ LocalView | TODO: High priority |

💻 Quick Start: Integrate Summarization

Here's how to add OpenTowns-style summarization right now:

# File: extraction/summarizer.py

from openai import OpenAI
from models.meeting_event import MeetingEvent
from config.settings import settings

client = OpenAI(api_key=settings.openai_api_key)

def summarize_meeting(event: MeetingEvent, full_text: str) -> dict:
    """
    Generate OpenTowns-style summary with oral health focus.
    """
    prompt = f"""
    You are summarizing a local government meeting for public health advocates.
    
    Meeting: {event.title}
    Jurisdiction: {event.jurisdiction_name}, {event.state_code}
    Date: {event.start.strftime('%B %d, %Y')}
    
    Full text (first 8000 chars):
    {full_text[:8000]}
    
    Provide:
    1. Executive Summary (2-3 sentences)
    2. Key Decisions (bullet list)
    3. Oral Health Items (if any - fluoridation, dental access, etc.)
    4. Next Actions (follow-ups, future meetings)
    
    Focus on: What was decided? What's happening next?
    """
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You summarize local government meetings for public understanding."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3
    )
    
    return {
        'summary': response.choices[0].message.content,
        'model': 'gpt-4o-mini',
        'tokens_used': response.usage.total_tokens
    }

# Usage:
# summary = summarize_meeting(event, full_transcript)
# event.description = summary['summary']
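
The quick-start function only sends the first 8,000 characters, so decisions made late in a long meeting never reach the model. One common workaround, sketched below, is to split the text into overlapping chunks, summarize each chunk, and then summarize the combined results. This is an assumption about how to extend the pattern, not something taken from OpenTowns, and `chunk_text` is a hypothetical helper.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> List[str]:
    """Split text into chunks of at most max_chars, each sharing `overlap` chars with the previous one."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # The chunk just added reached the end of the text
        start += max_chars - overlap  # Step back slightly so sentences cut at a boundary appear twice
    return chunks
```

Each chunk can be passed to `summarize_meeting` in turn. The overlap reduces the chance that a motion split across a chunk boundary is lost, at the cost of a few repeated tokens per call.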

🎬 Next Steps

  1. Implement AI summarization (OpenTowns pattern) → Makes data usable
  2. Add keyword alerts (OpenTowns pattern) → Engage advocates
  3. Add batch processing (LocalView pattern) → Scale to 1,000+ jurisdictions
  4. Build search interface (CivicBand pattern) → User discovery
  5. Add quality metrics (LocalView + MeetingBank) → Monitor data health


📝 License & Attribution

All patterns documented here are derived from open-source projects:

  • OpenTowns: Open civic-tech project
  • LocalView: Open-source (Harvard Mellon Urbanism)
  • MeetingBank: Open dataset
  • CivicBand: Open-source (Raft Foundation)
  • OpenCouncil: Open-source (MIT)
  • CDP: MIT License

When using code patterns, maintain attribution per each project's license.