# Integration Guide: Reusing Open-Source Municipal Scraping Logic

## Overview
This guide shows how to integrate proven patterns from established open-source projects into the Oral Health Policy Pulse scraping pipeline.

## Current State
✅ **You already have:**
- Census Gazetteer data with 85,302 jurisdictions (names + FIPS codes)
- GSA .gov domain matching
- 76 discovered URLs ready for scraping
- Legistar platform references in codebase
- Base ScraperAgent class in `agents/scraper.py`

---

## 1. Civic Scraper Integration
**Repository:** `biglocalnews/civic-scraper`
**License:** Apache 2.0 (✅ Compatible)

### What to Adopt:
#### A. Platform Detection Logic
```python
# They have excellent platform detection
# Location: civic_scraper/platforms/__init__.py
from typing import Optional

# Scraper classes below are the platform-specific classes from that package
PLATFORMS = {
    'legistar': LegistarScraper,
    'granicus': GranicusScraper,
    'calagenda': CalAgendaScraper,
    'civicplus': CivicPlusScraper
}

def detect_platform(url: str) -> Optional[str]:
    """Auto-detect which platform a URL uses"""
    if 'legistar.com' in url or '/Legistar/' in url:
        return 'legistar'
    if 'granicus.com' in url or '/Mediasite/' in url:
        return 'granicus'
    # ... more patterns
    return None  # Unknown platform

**Your Action:** Add `discovery/platform_detector.py` using their patterns

#### B. Document Downloader with Retry Logic
```python
# civic_scraper/download.py has robust downloading
# Features:
# - Exponential backoff
# - Content-type validation
# - Duplicate detection via hash
# - Progress tracking
import asyncio

import httpx

async def download_document(url: str, session: httpx.AsyncClient) -> bytes:
    """Download with retries and validation"""
    for attempt in range(3):
        try:
            response = await session.get(url, timeout=30.0)
            response.raise_for_status()
        except httpx.HTTPError:
            # Network or HTTP status failure: back off, then retry
            if attempt == 2:
                raise
            await asyncio.sleep(2 ** attempt)
            continue

        # Validate it's actually a document; retrying will not change the type
        content_type = response.headers.get('content-type', '')
        if 'pdf' in content_type or 'html' in content_type:
            return response.content
        raise ValueError(f"Unexpected content type {content_type!r} for {url}")
```

**Your Action:** Enhance `agents/scraper.py` with their retry patterns
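
The feature list above mentions duplicate detection via hash, but the download snippet does not show it. A minimal sketch of that idea follows, assuming an in-memory registry is enough for a single run; `content_hash` and `SeenDocuments` are hypothetical names, not part of civic-scraper.

```python
import hashlib
from typing import Set


def content_hash(content: bytes) -> str:
    """Return the SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(content).hexdigest()


class SeenDocuments:
    """In-memory registry of hashes for documents already stored (hypothetical helper)."""

    def __init__(self) -> None:
        self._hashes: Set[str] = set()

    def is_new(self, content: bytes) -> bool:
        """Record the content and report whether it was unseen before."""
        digest = content_hash(content)
        if digest in self._hashes:
            return False
        self._hashes.add(digest)
        return True
```

Calling `is_new(response_bytes)` before writing a file would skip byte-identical agendas that are linked from several pages. For runs that span multiple days, the hashes would need to be persisted, for example as a column in the Bronze layer.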

---

## 2. City Scrapers Integration
**Repository:** `city-scrapers/city-scrapers`
**License:** MIT (✅ Compatible)

### What to Adopt:
#### A. Standardized Event Schema
```python
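# They normalize all meeting data to a common format
# city_scrapers/core/models.py
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Optional

@dataclass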
class Event:
    title: str
    description: str
    classification: str  # "Board", "Commission", "Council"
    start: datetime
    end: Optional[datetime]
    all_day: bool
    location: Dict[str, Any]
    links: List[Dict[str, str]]  # [{"title": "Agenda", "href": "..."}]
    source: str
    
# Classification types they use:
CLASSIFICATIONS = [
    "Board",
    "Commission", 
    "Committee",
    "Council",
    "Town Hall",
    "Public Hearing"
]
```

**Your Action:** Create `models/meeting_event.py` with this schema for your Silver layer

#### B. Scraper Testing Framework
```python
# They have excellent test patterns
# tests/test_scrapers.py

def test_scraper():
    """Test with frozen HTML responses"""
    scraper = CityScraper()
    
    # Use saved HTML files to avoid live requests during testing
    with open('tests/fixtures/sample_calendar.html') as f:
        results = scraper.parse(f.read())
    
    assert len(results) > 0
    assert results[0].title
    assert results[0].source
```

**Your Action:** Add `tests/fixtures/` directory with sample HTML from different platforms

---

## 3. Council Data Project (CDP) Integration
**Repository:** `CouncilDataProject/cdp-scrapers`
**License:** MIT (✅ Compatible)

### What to Adopt:
#### A. Generic Ingestion Pipeline
```python
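# CDP has a beautiful generic scraper pipeline
# cdp_scrapers/scraper_utils.py
from dataclasses import dataclass
from datetime import datetime
from typing import Any, List, Optional

@dataclass
class Session:
    video_uri: Optional[str]
    session_datetime: datetime
    session_index: int
    caption_uri: Optional[str]

# Session is defined first so the annotation below resolves
@dataclass
class IngestionModel:
    """Standard format for ingested data"""
    sessions: List[Session]  # Individual meetings

@dataclass
class EventMinutesItem:
    name: str
    minutes_item: "MinutesItem"  # MinutesItem model not shown here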
    
def reduced_list(items: List[Any], key_attr: str) -> List[Any]:
    """Deduplicate items by a key attribute"""
    seen = set()
    result = []
    for item in items:
        key = getattr(item, key_attr)
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result
```

**Your Action:** Create `models/ingestion.py` based on their schemas

#### B. Video Transcript Integration (Future)
```python
# CDP processes meeting videos into searchable transcripts
# This is advanced but incredibly valuable

# They use:
# - AWS Transcribe / Google Speech-to-Text
# - Sentence indexing with timestamps
# - Speaker diarization (who said what)

# You could add this in Phase 2 after document scraping works
```

**Your Action:** Document in `docs/ROADMAP.md` for future implementation

---

## 4. Engagic Integration
**Repository:** `Engagic/engagic`
**License:** Check repo (likely AGPL)

### What to Adopt:
#### A. "Matter" Tracking Across Meetings
```python
# Engagic tracks individual legislative items across meetings
# This is PERFECT for oral health policy tracking
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Matter:
    matter_id: str
    matter_number: str  # "Bill 2024-001"
    title: str
    type: str  # "Ordinance", "Resolution", "Motion"
    first_introduced: datetime
    status: str  # "Introduced", "Committee", "Passed", "Failed"
    votes: List["Vote"]  # Vote: see the Councilmatic person/vote models
    related_documents: List[str]
    
# Track how a fluoridation ordinance evolves:
# Meeting 1: Introduced (just mentioned in minutes)
# Meeting 2: Committee review (document link added)
# Meeting 3: Public hearing (comments recorded)
# Meeting 4: Final vote (result captured)
```

**Your Action:** Create `models/matter.py` for tracking policy evolution
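
The four-meeting progression described in the comments above can be sketched as a small status timeline. This is an illustration under stated assumptions, not Engagic's implementation: `MatterHistory`, `record`, and `current_status` are hypothetical names, and the status strings are the ones listed in the `Matter` model.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass
class MatterHistory:
    """Hypothetical helper: status timeline for one legislative item."""
    matter_number: str
    title: str
    # (meeting_date, status) pairs in the order they were recorded
    events: List[Tuple[date, str]] = field(default_factory=list)

    def record(self, meeting_date: date, status: str) -> None:
        """Append a status observed at a meeting, skipping exact repeats."""
        if (meeting_date, status) not in self.events:
            self.events.append((meeting_date, status))

    @property
    def current_status(self) -> str:
        """Status from the most recent meeting, or 'Unknown' if none."""
        if not self.events:
            return "Unknown"
        return max(self.events, key=lambda e: e[0])[1]
```

Keeping the full list of events, rather than overwriting a single `status` field, preserves how long an item sat in committee, which is often the interesting signal for policy tracking. Because `current_status` picks the latest date, meetings can be scraped in any order.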

#### B. LLM-Powered Document Parsing
```python
# Engagic uses LLMs to extract structure from "blob" PDFs
# You already have OpenAI configured!
import json
from typing import Any, Dict, List

async def extract_agenda_items(pdf_text: str) -> List[Dict[str, Any]]:
    """Use GPT to extract structured items from unstructured text"""
    prompt = """
    Extract agenda items from this meeting minutes text.
    For each item, identify:
    - Item number
    - Title
    - Description  
    - Any votes or decisions
    - Keywords related to health, dental, fluoride, water, public health
    
    Return a JSON object of the form {"items": [...]}.
    """
    
    # openai_client is assumed to be your existing async OpenAI client
    response = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You extract structured data from government documents"},
            {"role": "user", "content": f"{prompt}\n\n{pdf_text}"}
        ],
        response_format={"type": "json_object"}
    )
    
    # JSON mode returns a single object, so the list is wrapped under "items"
    data = json.loads(response.choices[0].message.content)
    return data.get("items", [])
```

**Your Action:** Add `extraction/llm_parser.py` using your existing OpenAI setup
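
Model replies are not guaranteed to match the requested shape, so a validation step between the API call and the Silver layer is worth having. The sketch below runs without any network access; `parse_agenda_items` and the `item_number`/`title` field names are assumptions for illustration and should match whatever keys your prompt actually requests.

```python
import json
from typing import Any, Dict, List

# Assumed field names; align these with the keys requested in the prompt.
REQUIRED_KEYS = {"item_number", "title"}


def parse_agenda_items(raw: str) -> List[Dict[str, Any]]:
    """Parse a model reply and keep only well-formed agenda items.

    Accepts either a bare JSON array or an object wrapping the array
    under an "items" key; anything else yields an empty list.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if isinstance(data, dict):
        data = data.get("items", [])
    if not isinstance(data, list):
        return []
    return [
        item for item in data
        if isinstance(item, dict) and REQUIRED_KEYS.issubset(item)
    ]
```

Returning an empty list on malformed output keeps one bad document from stopping a batch run; logging the rejected reply alongside the source URL would make such cases easy to review later.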

---

## 5. Councilmatic Integration
**Repository:** `datamade/councilmatic-starter-template`
**License:** MIT (✅ Compatible)

### What to Adopt:
#### A. Person/Organization Tracking
```python
# Councilmatic tracks who voted on what
# Useful for understanding power dynamics around oral health policy
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Person:
    name: str
    role: str  # "Council Member", "Mayor", "Commissioner"
    district: Optional[str]
    party: Optional[str]
    
@dataclass
class Vote:
    motion: str
    option: str  # "yes", "no", "abstain"
    person: Person
    date: datetime
```

**Your Action:** Add to `models/governance.py`
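
Once individual votes are stored, a per-motion tally is the first thing most queries need. The sketch below works on the `option` strings from the `Vote` model; `tally_votes` and `motion_passed` are hypothetical helpers, and the simple-majority rule is an assumption, since real bodies may require a supermajority or count abstentions differently.

```python
from collections import Counter
from typing import Dict, Iterable


def tally_votes(options: Iterable[str]) -> Dict[str, int]:
    """Count votes per option, normalizing case and whitespace."""
    return dict(Counter(o.strip().lower() for o in options))


def motion_passed(options: Iterable[str]) -> bool:
    """Assumed simple-majority rule: more 'yes' than 'no'; abstentions ignored."""
    counts = tally_votes(options)
    return counts.get("yes", 0) > counts.get("no", 0)
```

With the dataclasses above, the input would be `[v.option for v in votes if v.motion == motion]`.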

#### B. Search Interface Patterns
```python
# They have excellent search UX
# filters.py shows what users want:

SEARCH_FILTERS = [
    "date_range",
    "topic",  # ["health", "water", "budget"]
    "organization",  # Which board/commission
    "document_type",  # ["agenda", "minutes", "transcript"]
    "status",  # ["pending", "passed", "failed"]
]

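# Your FastAPI endpoints could mirror this
from datetime import date
from typing import List, Optional

from fastapi import FastAPI, Query

app = FastAPI()  # or reuse your existing application instance

@app.get("/api/search")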
async def search_documents(
    query: str,
    topics: List[str] = Query(default=["oral_health", "fluoridation"]),
    date_from: Optional[date] = None,
    date_to: Optional[date] = None,
    state: Optional[str] = None
):
    """Search scraped documents with filters"""
    # Query your Delta Lake Gold layer
```

**Your Action:** Add to `api/routes/search.py` (create it if it doesn't exist)

---

## Implementation Priorities

### Phase 1: Foundation (Week 1)
- [ ] **Platform Detection** - Add `discovery/platform_detector.py` from Civic Scraper patterns
- [ ] **Standardized Schema** - Create `models/meeting_event.py` from City Scrapers
- [ ] **Enhanced Downloader** - Improve `agents/scraper.py` retry logic

### Phase 2: Scraping (Week 2-3)
- [ ] **Legistar Scraper** - Implement full Legistar support using Civic Scraper patterns
- [ ] **Generic HTML Parser** - Use BeautifulSoup patterns from City Scrapers
- [ ] **PDF Extraction** - Add pypdf (the maintained successor to PyPDF2) or pdfplumber support

### Phase 3: Intelligence (Week 4)
- [ ] **LLM Parser** - Add `extraction/llm_parser.py` from Engagic patterns
- [ ] **Matter Tracking** - Create `models/matter.py` for policy evolution
- [ ] **Keyword Detection** - Oral health, fluoridation, dental policy detection

### Phase 4: Scale (Week 5+)
- [ ] **Test All 76 URLs** - Run full scraper on discovered targets
- [ ] **Expand to All Municipalities** - Process all 32,333 jurisdictions
- [ ] **Video Transcripts** - CDP-style video processing (future)
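
The keyword detection item in Phase 3 could start as a plain regular-expression pass before any LLM is involved. The sketch below is one way to do that; the term list and the scoring rule are assumptions to be tuned against real agendas, and `find_keywords` is a hypothetical name.

```python
import re
from typing import List, Tuple

# Assumed starter vocabulary; tune against real agendas and minutes.
ORAL_HEALTH_TERMS = [
    "fluoride", "fluoridation", "dental", "dentist", "oral health",
]

# Whole-word, case-insensitive match on any term in the vocabulary
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in ORAL_HEALTH_TERMS) + r")\b",
    re.IGNORECASE,
)


def find_keywords(text: str) -> Tuple[List[str], float]:
    """Return sorted distinct matched terms and a crude 0-1 score.

    The score is the fraction of vocabulary terms found, which is a
    placeholder heuristic rather than a calibrated confidence.
    """
    found = sorted({m.group(1).lower() for m in _PATTERN.finditer(text)})
    return found, len(found) / len(ORAL_HEALTH_TERMS)
```

The two return values map onto the `keywords_found` and `confidence_score` fields of the meeting event model, with `oral_health_relevant` set whenever the list is non-empty.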

---

## Code Snippets to Add Now

### 1. Platform Detector
**File:** `discovery/platform_detector.py`
```python
"""
Platform detection for municipal websites.
Based on patterns from biglocalnews/civic-scraper.
"""
from typing import Optional
from urllib.parse import urlparse

PLATFORM_PATTERNS = {
    'legistar': [
        'legistar.com',
        '/Legistar/',
        '/LegislationDetail.aspx',
        '/Calendar.aspx'
    ],
    'granicus': [
        'granicus.com',
        '/Mediasite/',
        '/ViewPublisher.php'
    ],
    'municode': [
        'municode.com',
        '/meeting_minutes'
    ],
    'civicplus': [
        'civicplus.com',
        '/AgendaCenter/',
        '/DocumentCenter/'
    ]
}

def detect_platform(url: str) -> Optional[str]:
    """
    Detect which platform a municipality website uses.
    
    Args:
        url: Municipality website URL
        
    Returns:
        Platform name or None if unknown
    """
    url_lower = url.lower()
    
    for platform, patterns in PLATFORM_PATTERNS.items():
        if any(pattern.lower() in url_lower for pattern in patterns):
            return platform
    
    return None


def get_scraper_class(platform: str):
    """Get appropriate scraper class for platform"""
    from scrapers.legistar import LegistarScraper
    from scrapers.granicus import GranicusScraper
    from scrapers.generic import GenericScraper
    
    scrapers = {
        'legistar': LegistarScraper,
        'granicus': GranicusScraper
    }
    
    return scrapers.get(platform, GenericScraper)
```
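
Substring matching on the whole URL can misfire when a vendor name appears in a query string or inside an unrelated domain. A stricter variant, sketched below, compares only the parsed hostname. The domain list is taken from the patterns above; `detect_platform_by_host` is a hypothetical name, and path-based patterns such as `/AgendaCenter/` would still need the substring check as a fallback.

```python
from typing import Optional
from urllib.parse import urlparse

# Vendor domains taken from PLATFORM_PATTERNS; extend as new platforms appear.
HOST_SUFFIXES = {
    "legistar.com": "legistar",
    "granicus.com": "granicus",
    "municode.com": "municode",
    "civicplus.com": "civicplus",
}


def detect_platform_by_host(url: str) -> Optional[str]:
    """Match on the URL's hostname only, ignoring path and query."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, platform in HOST_SUFFIXES.items():
        # Require an exact match or a true subdomain of the vendor domain
        if host == suffix or host.endswith("." + suffix):
            return platform
    return None
```

Running the hostname check first and falling back to `detect_platform` only when it returns `None` keeps the broad path patterns from overriding a clear vendor domain.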

### 2. Meeting Event Model
**File:** `models/meeting_event.py`
```python
"""
Standardized meeting event model.
Based on City Scrapers schema.
"""
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, List, Dict, Any

@dataclass
class Location:
    name: str
    address: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None

@dataclass
class Link:
    title: str  # "Agenda", "Minutes", "Video"
    href: str
    content_type: Optional[str] = None  # "application/pdf", "text/html"

@dataclass
class MeetingEvent:
    """
    Normalized representation of a government meeting.
    Compatible with City Scrapers format.
    """
    # Core identification
    id: str  # Hash of source_url + start_time
    title: str
    description: str
    classification: str  # "Board", "Commission", "Council", "Committee"
    
    # Temporal and spatial
    # (required fields must come before any field that has a default)
    start: datetime
    location: Location
    end: Optional[datetime] = None
    all_day: bool = False
    
    # Content
    links: List[Link] = field(default_factory=list)
    source: str = ""  # Original URL
    
    # Metadata
    jurisdiction_name: str = ""
    state_code: str = ""
    fips_code: Optional[str] = None
    # Timezone-aware UTC timestamp (datetime.utcnow is deprecated in Python 3.12)
    scraped_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    
    # Health policy relevance (your special sauce!)
    oral_health_relevant: bool = False
    keywords_found: List[str] = field(default_factory=list)
    confidence_score: float = 0.0
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for Delta Lake storage"""
        return {
            'id': self.id,
            'title': self.title,
            'description': self.description,
            'classification': self.classification,
            'start': self.start.isoformat(),
            'end': self.end.isoformat() if self.end else None,
            'all_day': self.all_day,
            'location_name': self.location.name,
            'location_address': self.location.address,
            'links': [{'title': l.title, 'href': l.href} for l in self.links],
            'source': self.source,
            'jurisdiction_name': self.jurisdiction_name,
            'state_code': self.state_code,
            'fips_code': self.fips_code,
            'scraped_at': self.scraped_at.isoformat(),
            'oral_health_relevant': self.oral_health_relevant,
            'keywords_found': self.keywords_found,
            'confidence_score': self.confidence_score
        }
```
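
The `id` field is described as a hash of the source URL and start time, but the model leaves the computation to the caller. One possible sketch, assuming SHA-256 truncated to 16 hex characters is an acceptable key length for this dataset; `make_event_id` is a hypothetical helper.

```python
import hashlib
from datetime import datetime


def make_event_id(source_url: str, start: datetime) -> str:
    """Derive a stable ID from the source URL and start time."""
    # The separator keeps the URL and timestamp from running together
    key = f"{source_url.strip()}|{start.isoformat()}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Because the ID depends only on its inputs, re-scraping the same meeting produces the same key, which makes upserts into the Silver layer idempotent.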

### 3. Enhanced Discovery Pipeline
**Add to:** `discovery/discovery_pipeline.py`
```python
    async def discover_platform_capabilities(self):
        """
        For each discovered URL, detect which platform it uses.
        This prepares optimal scraping strategies.
        """
        from discovery.platform_detector import detect_platform
        
        logger.info("Detecting platforms for discovered URLs...")
        
        silver_path = f"{settings.delta_lake_path}/silver/discovered_urls"
        urls_df = self.spark.read.format("delta").load(silver_path)
        
        enriched_urls = []
        for row in urls_df.collect():  # Small table, so collecting to the driver is fine
            row_dict = row.asDict()
            url = row_dict['url']
            
            # Detect platform
            platform = detect_platform(url)
            row_dict['platform'] = platform if platform else 'generic'
            row_dict['scraper_ready'] = platform is not None
            
            enriched_urls.append(row_dict)
        
        # Write back to Silver layer with platform info
        from pyspark.sql import Row
        enriched_df = self.spark.createDataFrame([Row(**u) for u in enriched_urls])
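        # overwriteSchema lets the new platform/scraper_ready columns replace the old schema
        enriched_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(silver_path)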
        
        logger.success(f"Platform detection complete - {len(enriched_urls)} URLs analyzed")
        
        return enriched_urls
```

---

## Next Steps

1. **Review Licenses** - Most of the projects above use permissive licenses (MIT/Apache 2.0), but Engagic's license is unconfirmed (possibly AGPL), so verify each one before reusing code
2. **Clone Repos Locally** - Study their code structure:
   ```bash
   cd /tmp
   git clone https://github.com/biglocalnews/civic-scraper
   git clone https://github.com/city-scrapers/city-scrapers
   ```
3. **Add Attribution** - In your `README.md`, credit these projects
4. **Start with Platform Detector** - Implement `discovery/platform_detector.py` first
5. **Test with Your 76 URLs** - Run platform detection on your discovered URLs

---

## Resources

- **Civic Scraper Docs**: https://github.com/biglocalnews/civic-scraper/wiki
- **City Scrapers Tutorial**: https://cityscrapers.org/docs/development/
- **CDP Architecture**: https://councildataproject.org/
- **Legistar API Docs**: https://webapi.legistar.com/Home/Examples

---

## Questions to Consider

1. **Do you want video transcript support?** (CDP pattern, requires AWS/GCP credits)
2. **How important is real-time tracking?** (vs batch processing)
3. **Will you expose a public API?** (Councilmatic patterns useful here)
4. **Need to track voting records?** (Councilmatic person/vote models)

Let me know which phase you want to implement first!