Spaces:

CommunityOne
/

open-navigator

Running on CPU Upgrade

App Files Files Community

open-navigator / docs /INTEGRATION_GUIDE.md

jcbowyer

Deploy: Consolidated gold tables, fixed nginx docs routing

896453f verified 28 days ago

preview code

raw

history blame contribute delete

16.3 kB

	# Integration Guide: Reusing Open-Source Municipal Scraping Logic

	## Overview
	This guide shows how to integrate proven patterns from established open-source projects into the Oral Health Policy Pulse scraping pipeline.

	## Current State
	✅ You already have:
	- Census Gazetteer data with 85,302 jurisdictions (names + FIPS codes)
	- GSA .gov domain matching
	- 76 discovered URLs ready for scraping
	- Legistar platform references in codebase
	- Base ScraperAgent class in `agents/scraper.py`

	---

	## 1. Civic Scraper Integration
	Repository: `biglocalnews/civic-scraper`
	License: Apache 2.0 (✅ Compatible)

	### What to Adopt:
	#### A. Platform Detection Logic
	```python
	# They have excellent platform detection
	# Location: civic_scraper/platforms/__init__.py

	PLATFORMS = {
	'legistar': LegistarScraper,
	'granicus': GranicusScraper,
	'calagenda': CalAgendaScraper,
	'civicplus': CivicPlusScraper
	}

	def detect_platform(url: str) -> Optional[str]:
	"""Auto-detect which platform a URL uses"""
	if 'legistar.com' in url or '/Legistar/' in url:
	return 'legistar'
	elif 'granicus.com' in url or '/Mediasite/' in url:
	return 'granicus'
	# ... more patterns
	```

	Your Action: Add `discovery/platform_detector.py` using their patterns

	#### B. Document Downloader with Retry Logic
	```python
	# civic_scraper/download.py has robust downloading
	# Features:
	# - Exponential backoff
	# - Content-type validation
	# - Duplicate detection via hash
	# - Progress tracking

	async def download_document(url: str, session: httpx.AsyncClient) -> bytes:
	"""Download with retries and validation"""
	for attempt in range(3):
	try:
	response = await session.get(url, timeout=30.0)
	response.raise_for_status()

	# Validate it's actually a document
	content_type = response.headers.get('content-type', '')
	if 'pdf' in content_type or 'html' in content_type:
	return response.content
	except Exception as e:
	if attempt == 2:
	raise
	await asyncio.sleep(2 ** attempt)
	```

	Your Action: Enhance `agents/scraper.py` with their retry patterns

	---

	## 2. City Scrapers Integration
	Repository: `city-scrapers/city-scrapers`
	License: MIT (✅ Compatible)

	### What to Adopt:
	#### A. Standardized Event Schema
	```python
	# They normalize all meeting data to a common format
	# city_scrapers/core/models.py

	@dataclass
	class Event:
	title: str
	description: str
	classification: str # "Board", "Commission", "Council"
	start: datetime
	end: Optional[datetime]
	all_day: bool
	location: Dict[str, Any]
	links: List[Dict[str, str]] # [{"title": "Agenda", "href": "..."}]
	source: str

	# Classification types they use:
	CLASSIFICATIONS = [
	"Board",
	"Commission",
	"Committee",
	"Council",
	"Town Hall",
	"Public Hearing"
	]
	```

	Your Action: Create `models/meeting_event.py` with this schema for your Silver layer

	#### B. Scraper Testing Framework
	```python
	# They have excellent test patterns
	# tests/test_scrapers.py

	def test_scraper():
	"""Test with frozen HTML responses"""
	scraper = CityScraper()

	# Use saved HTML files to avoid live requests during testing
	with open('tests/fixtures/sample_calendar.html') as f:
	results = scraper.parse(f.read())

	assert len(results) > 0
	assert results[0].title
	assert results[0].source
	```

	Your Action: Add `tests/fixtures/` directory with sample HTML from different platforms

	---

	## 3. Council Data Project (CDP) Integration
	Repository: `CouncilDataProject/cdp-scrapers`
	License: MIT (✅ Compatible)

	### What to Adopt:
	#### A. Generic Ingestion Pipeline
	```python
	# CDP has a beautiful generic scraper pipeline
	# cdp_scrapers/scraper_utils.py

	class IngestionModel:
	"""Standard format for ingested data"""
	sessions: List[Session] # Individual meetings

	@dataclass
	class Session:
	video_uri: Optional[str]
	session_datetime: datetime
	session_index: int
	caption_uri: Optional[str]

	@dataclass
	class EventMinutesItem:
	name: str
	minutes_item: MinutesItem


	def reduced_list(items: List[Any], key_attr: str) -> List[Any]:
	"""Deduplicate items by a key attribute"""
	seen = set()
	result = []
	for item in items:
	key = getattr(item, key_attr)
	if key not in seen:
	seen.add(key)
	result.append(item)
	return result
	```

	Your Action: Create `models/ingestion.py` based on their schemas

	#### B. Video Transcript Integration (Future)
	```python
	# CDP processes meeting videos into searchable transcripts
	# This is advanced but incredibly valuable

	# They use:
	# - AWS Transcribe / Google Speech-to-Text
	# - Sentence indexing with timestamps
	# - Speaker diarization (who said what)

	# You could add this in Phase 2 after document scraping works
	```

	Your Action: Document in `docs/ROADMAP.md` for future implementation

	---

	## 4. Engagic Integration
	Repository: `Engagic/engagic`
	License: Check repo (likely AGPL)

	### What to Adopt:
	#### A. "Matter" Tracking Across Meetings
	```python
	# Engagic tracks individual legislative items across meetings
	# This is PERFECT for oral health policy tracking

	@dataclass
	class Matter:
	matter_id: str
	matter_number: str # "Bill 2024-001"
	title: str
	type: str # "Ordinance", "Resolution", "Motion"
	first_introduced: datetime
	status: str # "Introduced", "Committee", "Passed", "Failed"
	votes: List[Vote]
	related_documents: List[str]

	# Track how a fluoridation ordinance evolves:
	# Meeting 1: Introduced (just mentioned in minutes)
	# Meeting 2: Committee review (document link added)
	# Meeting 3: Public hearing (comments recorded)
	# Meeting 4: Final vote (result captured)
	```

	Your Action: Create `models/matter.py` for tracking policy evolution

	#### B. LLM-Powered Document Parsing
	```python
	# Engagic uses LLMs to extract structure from "blob" PDFs
	# You already have OpenAI configured!

	async def extract_agenda_items(pdf_text: str) -> List[AgendaItem]:
	"""Use GPT to extract structured items from unstructured text"""
	prompt = """
	Extract agenda items from this meeting minutes text.
	For each item, identify:
	- Item number
	- Title
	- Description
	- Any votes or decisions
	- Keywords related to health, dental, fluoride, water, public health

	Return JSON array.
	"""

	response = await openai_client.chat.completions.create(
	model="gpt-4o-mini",
	messages=[
	{"role": "system", "content": "You extract structured data from government documents"},
	{"role": "user", "content": f"{prompt}\n\n{pdf_text}"}
	],
	response_format={"type": "json_object"}
	)

	return json.loads(response.choices[0].message.content)
	```

	Your Action: Add `extraction/llm_parser.py` using your existing OpenAI setup

	---

	## 5. Councilmatic Integration
	Repository: `datamade/councilmatic-starter-template`
	License: MIT (✅ Compatible)

	### What to Adopt:
	#### A. Person/Organization Tracking
	```python
	# Councilmatic tracks who voted on what
	# Useful for understanding power dynamics around oral health policy

	@dataclass
	class Person:
	name: str
	role: str # "Council Member", "Mayor", "Commissioner"
	district: Optional[str]
	party: Optional[str]

	@dataclass
	class Vote:
	motion: str
	option: str # "yes", "no", "abstain"
	person: Person
	date: datetime
	```

	Your Action: Add to `models/governance.py`

	#### B. Search Interface Patterns
	```python
	# They have excellent search UX
	# filters.py shows what users want:

	SEARCH_FILTERS = [
	"date_range",
	"topic", # ["health", "water", "budget"]
	"organization", # Which board/commission
	"document_type", # ["agenda", "minutes", "transcript"]
	"status", # ["pending", "passed", "failed"]
	]

	# Your FastAPI endpoints could mirror this
	@app.get("/api/search")
	async def search_documents(
	query: str,
	topics: List[str] = Query(default=["oral_health", "fluoridation"]),
	date_from: Optional[date] = None,
	date_to: Optional[date] = None,
	state: Optional[str] = None
	):
	"""Search scraped documents with filters"""
	# Query your Delta Lake Gold layer
	```

	Your Action: Add to `api/routes/search.py` (create if doesn't exist)

	---

	## Implementation Priorities

	### Phase 1: Foundation (Week 1)
	- [ ] Platform Detection - Add `discovery/platform_detector.py` from Civic Scraper patterns
	- [ ] Standardized Schema - Create `models/meeting_event.py` from City Scrapers
	- [ ] Enhanced Downloader - Improve `agents/scraper.py` retry logic

	### Phase 2: Scraping (Week 2-3)
	- [ ] Legistar Scraper - Implement full Legistar support using Civic Scraper patterns
	- [ ] Generic HTML Parser - Use BeautifulSoup patterns from City Scrapers
	- [ ] PDF Extraction - Add PyPDF2/pdfplumber support

	### Phase 3: Intelligence (Week 4)
	- [ ] LLM Parser - Add `extraction/llm_parser.py` from Engagic patterns
	- [ ] Matter Tracking - Create `models/matter.py` for policy evolution
	- [ ] Keyword Detection - Oral health, fluoridation, dental policy detection

	### Phase 4: Scale (Week 5+)
	- [ ] Test All 76 URLs - Run full scraper on discovered targets
	- [ ] Expand to All Municipalities - Process all 32,333 jurisdictions
	- [ ] Video Transcripts - CDP-style video processing (future)

	---

	## Code Snippets to Add Now

	### 1. Platform Detector
	File: `discovery/platform_detector.py`
	```python
	"""
	Platform detection for municipal websites.
	Based on patterns from biglocalnews/civic-scraper.
	"""
	from typing import Optional
	from urllib.parse import urlparse

	PLATFORM_PATTERNS = {
	'legistar': [
	'legistar.com',
	'/Legistar/',
	'/LegislationDetail.aspx',
	'/Calendar.aspx'
	],
	'granicus': [
	'granicus.com',
	'/Mediasite/',
	'/ViewPublisher.php'
	],
	'municode': [
	'municode.com',
	'/meeting_minutes'
	],
	'civicplus': [
	'civicplus.com',
	'/AgendaCenter/',
	'/DocumentCenter/'
	]
	}

	def detect_platform(url: str) -> Optional[str]:
	"""
	Detect which platform a municipality website uses.

	Args:
	url: Municipality website URL

	Returns:
	Platform name or None if unknown
	"""
	url_lower = url.lower()

	for platform, patterns in PLATFORM_PATTERNS.items():
	if any(pattern.lower() in url_lower for pattern in patterns):
	return platform

	return None


	def get_scraper_class(platform: str):
	"""Get appropriate scraper class for platform"""
	from scrapers.legistar import LegistarScraper
	from scrapers.granicus import GranicusScraper
	from scrapers.generic import GenericScraper

	scrapers = {
	'legistar': LegistarScraper,
	'granicus': GranicusScraper
	}

	return scrapers.get(platform, GenericScraper)
	```

	### 2. Meeting Event Model
	File: `models/meeting_event.py`
	```python
	"""
	Standardized meeting event model.
	Based on City Scrapers schema.
	"""
	from dataclasses import dataclass, field
	from datetime import datetime
	from typing import Optional, List, Dict, Any

	@dataclass
	class Location:
	name: str
	address: Optional[str] = None
	city: Optional[str] = None
	state: Optional[str] = None

	@dataclass
	class Link:
	title: str # "Agenda", "Minutes", "Video"
	href: str
	content_type: Optional[str] = None # "application/pdf", "text/html"

	@dataclass
	class MeetingEvent:
	"""
	Normalized representation of a government meeting.
	Compatible with City Scrapers format.
	"""
	# Core identification
	id: str # Hash of source_url + start_time
	title: str
	description: str
	classification: str # "Board", "Commission", "Council", "Committee"

	# Temporal
	start: datetime
	end: Optional[datetime] = None
	all_day: bool = False

	# Spatial
	location: Location

	# Content
	links: List[Link] = field(default_factory=list)
	source: str = "" # Original URL

	# Metadata
	jurisdiction_name: str = ""
	state_code: str = ""
	fips_code: Optional[str] = None
	scraped_at: datetime = field(default_factory=datetime.utcnow)

	# Health policy relevance (your special sauce!)
	oral_health_relevant: bool = False
	keywords_found: List[str] = field(default_factory=list)
	confidence_score: float = 0.0

	def to_dict(self) -> Dict[str, Any]:
	"""Convert to dictionary for Delta Lake storage"""
	return {
	'id': self.id,
	'title': self.title,
	'description': self.description,
	'classification': self.classification,
	'start': self.start.isoformat(),
	'end': self.end.isoformat() if self.end else None,
	'all_day': self.all_day,
	'location_name': self.location.name,
	'location_address': self.location.address,
	'links': [{'title': l.title, 'href': l.href} for l in self.links],
	'source': self.source,
	'jurisdiction_name': self.jurisdiction_name,
	'state_code': self.state_code,
	'fips_code': self.fips_code,
	'scraped_at': self.scraped_at.isoformat(),
	'oral_health_relevant': self.oral_health_relevant,
	'keywords_found': self.keywords_found,
	'confidence_score': self.confidence_score
	}
	```

	### 3. Enhanced Discovery Pipeline
	Add to: `discovery/discovery_pipeline.py`
	```python
	async def discover_platform_capabilities(self):
	"""
	For each discovered URL, detect which platform it uses.
	This prepares optimal scraping strategies.
	"""
	from discovery.platform_detector import detect_platform

	logger.info("Detecting platforms for discovered URLs...")

	silver_path = f"{settings.delta_lake_path}/silver/discovered_urls"
	urls_df = self.spark.read.format("delta").load(silver_path)

	enriched_urls = []
	for row in urls_df.take(urls_df.count()):
	row_dict = row.asDict()
	url = row_dict['url']

	# Detect platform
	platform = detect_platform(url)
	row_dict['platform'] = platform if platform else 'generic'
	row_dict['scraper_ready'] = platform is not None

	enriched_urls.append(row_dict)

	# Write back to Silver layer with platform info
	from pyspark.sql import Row
	enriched_df = self.spark.createDataFrame([Row(**u) for u in enriched_urls])
	enriched_df.write.format("delta").mode("overwrite").save(silver_path)

	logger.success(f"Platform detection complete - {len(enriched_urls)} URLs analyzed")

	return enriched_urls
	```

	---

	## Next Steps

	1. Review Licenses - All mentioned projects use permissive licenses (MIT/Apache 2.0), but double-check
	2. Clone Repos Locally - Study their code structure:
	```bash
	cd /tmp
	git clone https://github.com/biglocalnews/civic-scraper
	git clone https://github.com/city-scrapers/city-scrapers
	```
	3. Add Attribution - In your `README.md`, credit these projects
	4. Start with Platform Detector - Implement `discovery/platform_detector.py` first
	5. Test with Your 76 URLs - Run platform detection on your discovered URLs

	---

	## Resources

	- Civic Scraper Docs: https://github.com/biglocalnews/civic-scraper/wiki
	- City Scrapers Tutorial: https://cityscrapers.org/docs/development/
	- CDP Architecture: https://councildataproject.org/
	- Legistar API Docs: https://webapi.legistar.com/Home/Examples

	---

	## Questions to Consider

	1. Do you want video transcript support? (CDP pattern, requires AWS/GCP credits)
	2. How important is real-time tracking? (vs batch processing)
	3. Will you expose a public API? (Councilmatic patterns useful here)
	4. Need to track voting records? (Councilmatic person/vote models)

	Let me know which phase you want to implement first!