# Integration Guide: Reusing Open-Source Municipal Scraping Logic

## Overview

This guide shows how to integrate proven patterns from established open-source projects into the Oral Health Policy Pulse scraping pipeline.

## Current State

✅ **You already have:**

- Census Gazetteer data with 85,302 jurisdictions (names + FIPS codes)
- GSA .gov domain matching
- 76 discovered URLs ready for scraping
- Legistar platform references in codebase
- Base ScraperAgent class in `agents/scraper.py`

---

## 1. Civic Scraper Integration

**Repository:** `biglocalnews/civic-scraper`
**License:** Apache 2.0 (✅ Compatible)

### What to Adopt:

#### A. Platform Detection Logic

```python
# They have excellent platform detection
# Location: civic_scraper/platforms/__init__.py
# (illustrative sketch; verify names and paths against the repository)
from typing import Optional

# The scraper classes are provided by the upstream package
PLATFORMS = {
    'legistar': LegistarScraper,
    'granicus': GranicusScraper,
    'calagenda': CalAgendaScraper,
    'civicplus': CivicPlusScraper
}

def detect_platform(url: str) -> Optional[str]:
    """Auto-detect which platform a URL uses"""
    if 'legistar.com' in url or '/Legistar/' in url:
        return 'legistar'
    elif 'granicus.com' in url or '/Mediasite/' in url:
        return 'granicus'
    # ... more patterns
    return None  # unknown platform
```

**Your Action:** Add `discovery/platform_detector.py` using their patterns

#### B. Document Downloader with Retry Logic

```python
# civic_scraper/download.py has robust downloading
# Features:
# - Exponential backoff
# - Content-type validation
# - Duplicate detection via hash
# - Progress tracking
import asyncio

import httpx

async def download_document(url: str, session: httpx.AsyncClient) -> bytes:
    """Download with retries and validation"""
    max_attempts = 3
    for attempt in range(max_attempts):
        try:
            response = await session.get(url, timeout=30.0)
            response.raise_for_status()
        except httpx.HTTPError:
            # Retry network and HTTP status errors; re-raise on the last attempt
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # back off 1s, then 2s
            continue

        # Validate it's actually a document; retrying won't fix a wrong type
        content_type = response.headers.get('content-type', '').lower()
        if 'pdf' in content_type or 'html' in content_type:
            return response.content
        raise ValueError(f"Unexpected content type {content_type!r} for {url}")
```

**Your Action:** Enhance `agents/scraper.py` with their retry patterns

---

## 2. City Scrapers Integration

**Repository:** `city-scrapers/city-scrapers`
**License:** MIT (✅ Compatible)

### What to Adopt:

#### A. Standardized Event Schema

```python
# They normalize all meeting data to a common format
# city_scrapers/core/models.py
# (simplified sketch; verify field names against the repository)
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Optional

@dataclass
class Event:
    title: str
    description: str
    classification: str  # "Board", "Commission", "Council"
    start: datetime
    end: Optional[datetime]
    all_day: bool
    location: Dict[str, Any]
    links: List[Dict[str, str]]  # [{"title": "Agenda", "href": "..."}]
    source: str

# Classification types they use:
CLASSIFICATIONS = [
    "Board",
    "Commission",
    "Committee",
    "Council",
    "Town Hall",
    "Public Hearing"
]
```

**Your Action:** Create `models/meeting_event.py` with this schema for your Silver layer

#### B. Scraper Testing Framework

```python
# They have excellent test patterns
# tests/test_scrapers.py
def test_scraper():
    """Test with frozen HTML responses"""
    scraper = CityScraper()  # placeholder for the scraper under test

    # Use saved HTML files to avoid live requests during testing
    with open('tests/fixtures/sample_calendar.html') as f:
        results = scraper.parse(f.read())

    assert len(results) > 0
    assert results[0].title
    assert results[0].source
```

**Your Action:** Add `tests/fixtures/` directory with sample HTML from different platforms

---

## 3. Council Data Project (CDP) Integration

**Repository:** `CouncilDataProject/cdp-scrapers`
**License:** MIT (✅ Compatible)

### What to Adopt:

#### A. Generic Ingestion Pipeline

```python
# CDP has a beautiful generic scraper pipeline
# cdp_scrapers/scraper_utils.py
# (simplified sketch; upstream names and signatures may differ)
from dataclasses import dataclass
from datetime import datetime
from typing import Any, List, Optional

@dataclass
class Session:
    video_uri: Optional[str]
    session_datetime: datetime
    session_index: int
    caption_uri: Optional[str]

@dataclass
class MinutesItem:
    """Placeholder for the minutes item referenced below"""
    name: str
    description: Optional[str] = None

@dataclass
class EventMinutesItem:
    name: str
    minutes_item: MinutesItem

@dataclass
class IngestionModel:
    """Standard format for ingested data"""
    sessions: List[Session]  # Individual meetings

def reduced_list(items: List[Any], key_attr: str) -> List[Any]:
    """Deduplicate items by a key attribute"""
    seen = set()
    result = []
    for item in items:
        key = getattr(item, key_attr)
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result
```

**Your Action:** Create `models/ingestion.py` based on their schemas

#### B. Video Transcript Integration (Future)

```python
# CDP processes meeting videos into searchable transcripts
# This is advanced but incredibly valuable

# They use:
# - AWS Transcribe / Google Speech-to-Text
# - Sentence indexing with timestamps
# - Speaker diarization (who said what)

# You could add this in Phase 2 after document scraping works
```

**Your Action:** Document in `docs/ROADMAP.md` for future implementation

---

## 4. Engagic Integration

**Repository:** `Engagic/engagic`
**License:** Check repo (likely AGPL)

### What to Adopt:

#### A. "Matter" Tracking Across Meetings

```python
# Engagic tracks individual legislative items across meetings
# This is PERFECT for oral health policy tracking
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Matter:
    matter_id: str
    matter_number: str  # "Bill 2024-001"
    title: str
    type: str  # "Ordinance", "Resolution", "Motion"
    first_introduced: datetime
    status: str  # "Introduced", "Committee", "Passed", "Failed"
    votes: List["Vote"]  # Vote is defined in models/governance.py (section 5)
    related_documents: List[str]

# Track how a fluoridation ordinance evolves:
# Meeting 1: Introduced (just mentioned in minutes)
# Meeting 2: Committee review (document link added)
# Meeting 3: Public hearing (comments recorded)
# Meeting 4: Final vote (result captured)
```

**Your Action:** Create `models/matter.py` for tracking policy evolution

#### B. LLM-Powered Document Parsing

```python
# Engagic uses LLMs to extract structure from "blob" PDFs
# You already have OpenAI configured!
import json
from typing import Any, Dict, List

from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def extract_agenda_items(pdf_text: str) -> List[Dict[str, Any]]:
    """Use GPT to extract structured items from unstructured text"""
    prompt = """
    Extract agenda items from this meeting minutes text.
    For each item, identify:
    - Item number
    - Title
    - Description
    - Any votes or decisions
    - Keywords related to health, dental, fluoride, water, public health

    Return a JSON object of the form {"items": [...]}.
    """

    response = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You extract structured data from government documents"},
            {"role": "user", "content": f"{prompt}\n\n{pdf_text}"}
        ],
        response_format={"type": "json_object"}
    )

    # JSON mode returns a single object, so the items are wrapped under "items"
    data = json.loads(response.choices[0].message.content)
    return data.get("items", [])
```

**Your Action:** Add `extraction/llm_parser.py` using your existing OpenAI setup

---

## 5. Councilmatic Integration

**Repository:** `datamade/councilmatic-starter-template`
**License:** MIT (✅ Compatible)

### What to Adopt:

#### A. Person/Organization Tracking

```python
# Councilmatic tracks who voted on what
# Useful for understanding power dynamics around oral health policy
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Person:
    name: str
    role: str  # "Council Member", "Mayor", "Commissioner"
    district: Optional[str]
    party: Optional[str]

@dataclass
class Vote:
    motion: str
    option: str  # "yes", "no", "abstain"
    person: Person
    date: datetime
```

**Your Action:** Add to `models/governance.py`

#### B. Search Interface Patterns

```python
# They have excellent search UX
# filters.py shows what users want:
from datetime import date
from typing import List, Optional

from fastapi import FastAPI, Query

app = FastAPI()

SEARCH_FILTERS = [
    "date_range",
    "topic",          # ["health", "water", "budget"]
    "organization",   # Which board/commission
    "document_type",  # ["agenda", "minutes", "transcript"]
    "status",         # ["pending", "passed", "failed"]
]

# Your FastAPI endpoints could mirror this
@app.get("/api/search")
async def search_documents(
    query: str,
    topics: List[str] = Query(default=["oral_health", "fluoridation"]),
    date_from: Optional[date] = None,
    date_to: Optional[date] = None,
    state: Optional[str] = None
):
    """Search scraped documents with filters"""
    # Query your Delta Lake Gold layer
```

**Your Action:** Add to `api/routes/search.py` (create it if it doesn't exist)

---

## Implementation Priorities

### Phase 1: Foundation (Week 1)

- [ ] **Platform Detection** - Add `discovery/platform_detector.py` from Civic Scraper patterns
- [ ] **Standardized Schema** - Create `models/meeting_event.py` from City Scrapers
- [ ] **Enhanced Downloader** - Improve `agents/scraper.py` retry logic

### Phase 2: Scraping (Week 2-3)

- [ ] **Legistar Scraper** - Implement full Legistar support using Civic Scraper patterns
- [ ] **Generic HTML Parser** - Use BeautifulSoup patterns from City Scrapers
- [ ] **PDF Extraction** - Add pypdf (the successor to PyPDF2) or pdfplumber support

### Phase 3: Intelligence (Week 4)

- [ ] **LLM Parser** - Add `extraction/llm_parser.py` from Engagic patterns
- [ ] **Matter Tracking** - Create `models/matter.py` for policy evolution
- [ ] **Keyword Detection** - Oral health, fluoridation, dental policy detection

### Phase 4: Scale (Week 5+)

- [ ] **Test All 76 URLs** - Run full scraper on discovered targets
- [ ] **Expand to All Municipalities** - Process all 32,333 jurisdictions
- [ ] **Video Transcripts** - CDP-style video processing (future)

---

## Code Snippets to Add Now

### 1. Platform Detector

**File:** `discovery/platform_detector.py`

```python
"""
Platform detection for municipal websites.
Based on patterns from biglocalnews/civic-scraper.
"""
from typing import Optional

PLATFORM_PATTERNS = {
    'legistar': [
        'legistar.com',
        '/Legistar/',
        '/LegislationDetail.aspx',
        '/Calendar.aspx'
    ],
    'granicus': [
        'granicus.com',
        '/Mediasite/',
        '/ViewPublisher.php'
    ],
    'municode': [
        'municode.com',
        '/meeting_minutes'
    ],
    'civicplus': [
        'civicplus.com',
        '/AgendaCenter/',
        '/DocumentCenter/'
    ]
}

def detect_platform(url: str) -> Optional[str]:
    """
    Detect which platform a municipality website uses.

    Args:
        url: Municipality website URL

    Returns:
        Platform name or None if unknown
    """
    url_lower = url.lower()

    # Case-insensitive substring match; the first matching platform wins
    for platform, patterns in PLATFORM_PATTERNS.items():
        if any(pattern.lower() in url_lower for pattern in patterns):
            return platform

    return None

def get_scraper_class(platform: str):
    """Get appropriate scraper class for platform"""
    from scrapers.legistar import LegistarScraper
    from scrapers.granicus import GranicusScraper
    from scrapers.generic import GenericScraper

    scrapers = {
        'legistar': LegistarScraper,
        'granicus': GranicusScraper
    }

    # Platforms without a dedicated scraper fall back to the generic one
    return scrapers.get(platform, GenericScraper)
```

### 2. Meeting Event Model

**File:** `models/meeting_event.py`

```python
"""
Standardized meeting event model.
Based on City Scrapers schema.
"""
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, List, Dict, Any

@dataclass
class Location:
    name: str
    address: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None

@dataclass
class Link:
    title: str  # "Agenda", "Minutes", "Video"
    href: str
    content_type: Optional[str] = None  # "application/pdf", "text/html"

@dataclass
class MeetingEvent:
    """
    Normalized representation of a government meeting.
    Compatible with City Scrapers format.
    """
    # Required fields come first: dataclasses reject a field without a
    # default after one that has a default.

    # Core identification
    id: str  # Hash of source_url + start_time
    title: str
    description: str
    classification: str  # "Board", "Commission", "Council", "Committee"

    # Temporal and spatial (required)
    start: datetime
    location: Location

    # Temporal (optional)
    end: Optional[datetime] = None
    all_day: bool = False

    # Content
    links: List[Link] = field(default_factory=list)
    source: str = ""  # Original URL

    # Metadata
    jurisdiction_name: str = ""
    state_code: str = ""
    fips_code: Optional[str] = None
    scraped_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    # Health policy relevance (your special sauce!)
    oral_health_relevant: bool = False
    keywords_found: List[str] = field(default_factory=list)
    confidence_score: float = 0.0

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for Delta Lake storage"""
        return {
            'id': self.id,
            'title': self.title,
            'description': self.description,
            'classification': self.classification,
            'start': self.start.isoformat(),
            'end': self.end.isoformat() if self.end else None,
            'all_day': self.all_day,
            'location_name': self.location.name,
            'location_address': self.location.address,
            'links': [{'title': link.title, 'href': link.href} for link in self.links],
            'source': self.source,
            'jurisdiction_name': self.jurisdiction_name,
            'state_code': self.state_code,
            'fips_code': self.fips_code,
            'scraped_at': self.scraped_at.isoformat(),
            'oral_health_relevant': self.oral_health_relevant,
            'keywords_found': self.keywords_found,
            'confidence_score': self.confidence_score
        }
```

### 3. Enhanced Discovery Pipeline

**Add to:** `discovery/discovery_pipeline.py`

```python
async def discover_platform_capabilities(self):
    """
    For each discovered URL, detect which platform it uses.
    This prepares optimal scraping strategies.
    """
    from discovery.platform_detector import detect_platform

    logger.info("Detecting platforms for discovered URLs...")

    silver_path = f"{settings.delta_lake_path}/silver/discovered_urls"
    urls_df = self.spark.read.format("delta").load(silver_path)

    enriched_urls = []

    # collect() pulls every row to the driver in one pass
    for row in urls_df.collect():
        row_dict = row.asDict()
        url = row_dict['url']

        # Detect platform
        platform = detect_platform(url)
        row_dict['platform'] = platform if platform else 'generic'
        row_dict['scraper_ready'] = platform is not None

        enriched_urls.append(row_dict)

    # Write back to Silver layer with platform info
    from pyspark.sql import Row
    enriched_df = self.spark.createDataFrame([Row(**u) for u in enriched_urls])
    # The new columns change the table schema, so allow the overwrite to replace it
    (enriched_df.write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .save(silver_path))

    logger.success(f"Platform detection complete - {len(enriched_urls)} URLs analyzed")

    return enriched_urls
```

---

## Next Steps

1. **Review Licenses** - Most of the projects above use permissive licenses (MIT/Apache 2.0), but Engagic's license is unconfirmed (possibly AGPL, which is copyleft), so verify each one before copying code
2. **Clone Repos Locally** - Study their code structure:

   ```bash
   cd /tmp
   git clone https://github.com/biglocalnews/civic-scraper
   git clone https://github.com/city-scrapers/city-scrapers
   ```

3. **Add Attribution** - In your `README.md`, credit these projects
4. **Start with Platform Detector** - Implement `discovery/platform_detector.py` first
5. **Test with Your 76 URLs** - Run platform detection on your discovered URLs

---

## Resources

- **Civic Scraper Docs**: https://github.com/biglocalnews/civic-scraper/wiki
- **City Scrapers Tutorial**: https://cityscrapers.org/docs/development/
- **CDP Architecture**: https://councildataproject.org/
- **Legistar API Docs**: https://webapi.legistar.com/Home/Examples

---

## Questions to Consider

1. **Do you want video transcript support?** (CDP pattern, requires AWS/GCP credits)
2. **How important is real-time tracking?** (vs batch processing)
3. **Will you expose a public API?** (Councilmatic patterns useful here)
4. **Do you need to track voting records?** (Councilmatic person/vote models)

Let me know which phase you want to implement first!
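
If you start with Phase 3, the **Keyword Detection** item is the one piece above without a snippet. Here is a minimal sketch of how it might populate the `oral_health_relevant`, `keywords_found`, and `confidence_score` fields of `MeetingEvent`. The keyword list, the weights, the 0.5 threshold, and the names `KEYWORD_WEIGHTS`, `RelevanceResult`, and `score_relevance` are all assumptions for illustration, not taken from any of the projects above.

```python
"""Keyword-based relevance scoring for scraped meeting text (illustrative sketch)."""
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical keyword weights; tune these against real documents.
KEYWORD_WEIGHTS: Dict[str, float] = {
    "fluoridation": 1.0,
    "fluoride": 0.9,
    "oral health": 0.9,
    "dental": 0.7,
    "dentist": 0.6,
    "water system": 0.3,
}


@dataclass
class RelevanceResult:
    """Mirrors the three relevance fields on MeetingEvent."""
    oral_health_relevant: bool = False
    keywords_found: List[str] = field(default_factory=list)
    confidence_score: float = 0.0


def score_relevance(text: str, threshold: float = 0.5) -> RelevanceResult:
    """Flag text as relevant when its strongest keyword match meets the threshold."""
    found = []
    for keyword in KEYWORD_WEIGHTS:
        # Word-boundary match so "dental" does not match "accidental".
        if re.search(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE):
            found.append(keyword)

    # The score is the weight of the strongest keyword found (0.0 if none).
    score = max((KEYWORD_WEIGHTS[k] for k in found), default=0.0)
    return RelevanceResult(
        oral_health_relevant=score >= threshold,
        keywords_found=found,
        confidence_score=score,
    )
```

Taking the maximum weight rather than a sum keeps the score in the 0 to 1 range and avoids long documents scoring high on many weak matches. A cheap filter like this could also decide which documents are worth sending to the LLM parser.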