Civic Tech Projects: URL Source Analysis
Quick Summary
| Project | URL Sources? | Quantity | Status | Priority |
|---|---|---|---|---|
| Civic Scraper | ❌ No | 0 | Library only | N/A |
| City Scrapers | ✅ YES | 100-500 | ✅ Integrated | DONE ✅ |
| Council Data Project | ✅ YES | 20 cities | ⏳ Pending | 🔥 HIGH |
| Engagic | ❌ No | 0 | Research project | N/A |
| Councilmatic | ⚠️ Maybe | ~6 | Not checked | 🟡 LOW |
| MeetingBank | ✅ YES | 1,366 | ✅ Integrated | DONE ✅ |
| Open States | ✅ YES | 50+ | ✅ Integrated | DONE ✅ |
1. Civic Scraper
What It Is:
Library for scraping government documents, not a deployment or URL database.
What We Use:
- ✅ Platform detection patterns (Legistar, Granicus, etc.)
- ✅ Document downloading logic
- ✅ Error handling patterns
URL Sources:
❌ NO URL LIST - It's a Python library/toolkit, not a data collection project.
Action:
✅ COMPLETE - We integrated their patterns into discovery/platform_detector.py
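To illustrate what a platform detection pattern can look like, here is a minimal sketch that classifies a URL by its hostname. The `PLATFORM_PATTERNS` table and the `detect_platform` function are hypothetical names for this example; the actual rules in discovery/platform_detector.py may be organized differently and cover more platforms.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hypothetical hostname patterns, assumed for illustration only.
PLATFORM_PATTERNS = {
    "legistar": re.compile(r"(^|\.)legistar\.com$"),
    "granicus": re.compile(r"(^|\.)granicus\.com$"),
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform whose hostname pattern matches, else None."""
    host = (urlparse(url).hostname or "").lower()
    for name, pattern in PLATFORM_PATTERNS.items():
        if pattern.search(host):
            return name
    return None
```

Matching on the hostname rather than the full URL keeps a path or query string that happens to contain a platform name from triggering a false match.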
2. City Scrapers
What It Is:
Active scraping project with 100+ validated agency URLs across 5 cities.
Deployments:
- Chicago (~100 agencies)
- Pittsburgh (~30 agencies)
- Detroit (~40 agencies)
- Cleveland (~30 agencies)
- Los Angeles (~50 agencies)
URL Sources:
✅ YES - 100-500 VALIDATED URLs
Each spider file contains start_urls with:
- Agency meeting pages
- Granicus video portals
- Legistar calendars
- PDF agendas/minutes
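One way to pull `start_urls` out of a spider file without importing or running it is to parse the source with Python's `ast` module. The sketch below is an assumption about how such extraction could work, not a description of discovery/city_scrapers_urls.py; `extract_start_urls` and the sample spider are made up for illustration.

```python
import ast
from typing import List


def extract_start_urls(source: str) -> List[str]:
    """Collect string literals assigned to `start_urls` anywhere in a module."""
    urls: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        names = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if "start_urls" not in names:
            continue
        # Only plain list/tuple literals are handled here.
        if isinstance(node.value, (ast.List, ast.Tuple)):
            for element in node.value.elts:
                if isinstance(element, ast.Constant) and isinstance(element.value, str):
                    urls.append(element.value)
    return urls


# Made-up spider source; not a real City Scrapers file.
EXAMPLE_SPIDER = '''
class ExampleBoardSpider:
    name = "example_board"
    start_urls = ["https://example.org/board/meetings"]
'''

print(extract_start_urls(EXAMPLE_SPIDER))
```

Static parsing misses URLs that a spider builds at runtime (for example, in a `start_requests` method or through string formatting), so those would need separate handling.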
Status:
✅ INTEGRATED - discovery/city_scrapers_urls.py
To Run:
cd /home/developer/projects/open-navigator
source venv/bin/activate
python discovery/city_scrapers_urls.py
Output: bronze/city_scrapers_urls table with 100-500 validated URLs
3. Council Data Project (CDP)
What It Is:
End-to-end platform with 20+ full deployments (transcripts, videos, search).
Verified Deployments:
- Seattle, WA
- King County, WA
- Portland, OR
- Denver, CO
- Boston, MA
- Oakland, CA
- Charlotte, NC
- San JosΓ©, CA
- Milwaukee, WI
- Louisville, KY
- Atlanta, GA
- Pittsburgh, PA
- Long Beach, CA
- Alameda, CA
- Los Angeles, CA
- San Diego, CA
- Austin, TX
- Houston, TX
- Richmond, CA
- Spokane, WA
URL Sources:
✅ YES - 20 PREMIUM CITIES
Each CDP deployment has:
- GitHub repo with configuration
- cdp-backend config with source URLs
- Video URLs (YouTube, Granicus, custom)
- Meeting pages (official city websites)
Where to Find URLs:
Each city has a repo like: CouncilDataProject/cdp-CITY-backend
Example for Seattle:
# Clone repo
git clone https://github.com/CouncilDataProject/cdp-seattle-backend
# Config file has source URLs
cat cdp_seattle_backend/cdp_seattle_backend_pipeline.py
Contains patterns like:
SCRAPER_CONFIG = {
    "source_url": "https://seattle.gov/city-council/calendar",
    "video_source": "https://www.seattlechannel.org/CouncilVideos",
    "granicus_site": "https://seattle.granicus.com/ViewPublisher.php?view_id=24",
}
Status:
⏳ PENDING - We have the list of 20 cities but haven't extracted URLs yet
Action Needed:
Create discovery/cdp_url_extraction.py to:
- Clone each CDP city's backend repo
- Extract source URLs from config files
- Write to bronze/cdp_source_urls
Priority: 🔥 HIGH - These are premium quality URLs with full pipelines
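The extraction step could be sketched as a regex pass over each config file's text. This is only a sketch under assumptions: `extract_urls` is a hypothetical helper, the sample text mirrors the Seattle pattern shown above, and cloning the repos and writing to bronze/cdp_source_urls are left out.

```python
import re
from typing import List

# Matches http(s) URLs up to whitespace, a quote, or a closing bracket.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>)]+""")


def extract_urls(text: str) -> List[str]:
    """Return unique http(s) URLs in order of first appearance."""
    seen = set()
    ordered: List[str] = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;")  # drop trailing punctuation from prose or comments
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


# Sample text for illustration, copied from the config pattern in this document.
SAMPLE_CONFIG = '''
SCRAPER_CONFIG = {
    "source_url": "https://seattle.gov/city-council/calendar",
    "video_source": "https://www.seattlechannel.org/CouncilVideos",
    "granicus_site": "https://seattle.granicus.com/ViewPublisher.php?view_id=24",
}
'''

for url in extract_urls(SAMPLE_CONFIG):
    print(url)
```

A regex pass is format-agnostic, so it works whether a repo keeps its URLs in Python, JSON, or YAML, at the cost of also picking up unrelated links such as documentation URLs in comments.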
4. Engagic
What It Is:
Research project for LLM-based legislative text parsing.
What We Use:
- ✅ Matter tracking model (legislative items)
- ✅ LLM parsing patterns for PDFs
URL Sources:
❌ NO URL LIST - It's a research/prototype project, not a production scraper.
Status:
✅ COMPLETE - We created the Matter model in models/meeting_event.py
Action:
✅ DONE - The model is sufficient; there are no URLs to extract
5. Councilmatic
What It Is:
Django web app template for city council tracking (search, voting records).
Known Deployments:
- Chicago Councilmatic - https://chicago.councilmatic.org
- New York City Councilmatic - https://nyc.councilmatic.org
- Los Angeles Councilmatic - https://la.councilmatic.org
- Philadelphia Councilmatic - https://philly.councilmatic.org
- San Francisco Councilmatic - (archived)
- Metro Councilmatic (LA County) - https://metro.councilmatic.org
URL Sources:
⚠️ MAYBE - ~6 DEPLOYMENTS
Each deployment uses the Legistar API as its data source, so we'd get:
- Legistar API endpoints (already accessible)
- Meeting URLs (already in Legistar)
- Legislation URLs (already in Legistar)
Issue:
Redundant - Councilmatic scrapes Legistar, which we already have access to.
We can enumerate Legistar directly without going through Councilmatic:
# Already in our codebase
enumerate_legistar_subdomains() # Tests chicago.legistar.com, la.legistar.com, etc.
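For illustration, candidate hostnames for the {city}.legistar.com pattern could be generated from place names as below. This is a hypothetical sketch, not the body of `enumerate_legistar_subdomains()`; the name `legistar_candidates` is made up here, and each candidate would still need a network check before being trusted.

```python
import re
import unicodedata
from typing import List


def legistar_candidates(city: str) -> List[str]:
    """Build candidate {city}.legistar.com hostnames from a place name."""
    # Strip accents so that "San José" becomes "San Jose".
    ascii_name = unicodedata.normalize("NFKD", city).encode("ascii", "ignore").decode("ascii")
    words = re.findall(r"[a-z0-9]+", ascii_name.lower())
    if not words:
        return []
    slugs = ["".join(words)]
    if len(words) > 1:
        slugs.append("-".join(words))  # hyphenated variant for multi-word names
    return [f"{slug}.legistar.com" for slug in slugs]


print(legistar_candidates("Long Beach"))
```

Abbreviated subdomains such as la.legistar.com cannot be derived from the place name alone, so a lookup table of known aliases would be needed alongside this kind of generation.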
Status:
📋 PLANNED - Low priority; Legistar enumeration is more efficient
Action:
🟡 LOW PRIORITY - Skip for now, Legistar enumeration covers these cities
🎯 Recommended Next Steps
Immediate (This Week):
- ✅ DONE: City Scrapers URL extraction
- 🔥 DO NEXT: CDP URL extraction (20 premium cities)
- ⏳ PENDING: MeetingBank ingestion (if not run yet)
- ⏳ PENDING: Open States integration (if not run yet)
Near-Term (Next 2 Weeks):
- Legistar enumeration - Test {city}.legistar.com pattern against Census
- LocalView download - Manual download from Harvard Dataverse
- URL deduplication - Combine all sources, remove duplicates
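The deduplication step might look like the sketch below, which treats two URLs as duplicates when they differ only in host case, a trailing slash, or a fragment. The normalization rules and the names `normalize_url` and `dedupe_urls` are assumptions for this example; the project may need stricter or looser rules.

```python
from typing import Iterable, List
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and trailing slashes on the path."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe_urls(urls: Iterable[str]) -> List[str]:
    """Keep the first URL seen for each normalized form, preserving input order."""
    seen = set()
    unique: List[str] = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

The query string is kept in the comparison key because on portals such as Granicus it can select the content (for example, `view_id=24`), so dropping it would merge distinct pages.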
Long-Term (Next Month):
- Actual scrapers - Build Legistar/Granicus/CivicPlus scrapers
- Transcript extraction - YouTube captions, PDF parsing
- Oral health detection - Run keyword matching on transcripts
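As a rough sketch of the keyword-matching step, the function below counts case-insensitive whole-word hits per keyword in a transcript. The function name and the keyword list in the usage lines are placeholders chosen for illustration; the project's actual term list is not given here.

```python
import re
from typing import Dict, Iterable


def count_keyword_hits(transcript: str, keywords: Iterable[str]) -> Dict[str, int]:
    """Count case-insensitive whole-word matches for each keyword."""
    hits: Dict[str, int] = {}
    for keyword in keywords:
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        count = len(pattern.findall(transcript))
        if count:
            hits[keyword] = count
    return hits


# Placeholder keywords and text, assumed for this example only.
sample = "The council discussed dental clinics and Oral Health funding. Dental care matters."
print(count_keyword_hits(sample, ["dental", "oral health", "fluoride"]))
```

Word boundaries keep a short keyword from matching inside longer, unrelated words, though they also mean that inflected forms need their own entries.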
📊 Expected Coverage After All Integrations
| Source | URLs | Quality | Status |
|---|---|---|---|
| Census Discovery | 76 | Variable | ✅ Working |
| City Scrapers | 100-500 | Good | ✅ Integrated |
| CDP | 20 | Excellent | ⏳ Pending |
| MeetingBank | 1,366 | Excellent | ✅ Integrated |
| Open States | 50-100 | Excellent | ✅ Integrated |
| LocalView | 1,000-10,000 | Good | ⏳ Manual download |
| Legistar Enum | 1,000-3,000 | Good | 📋 Planned |
| TOTAL | ~3,600-15,000 (sum of rows, before deduplication) | High | In Progress |
💡 Why Some Projects Don't Have URLs
Civic Scraper:
It's a library/toolkit, like BeautifulSoup or Scrapy. You don't "extract URLs" from BeautifulSoup - you use it to build your own scrapers.
Engagic:
It's a research prototype showing how to use LLMs to parse legislative documents. No production deployment = no URL database.
Councilmatic:
It consumes Legistar data, doesn't produce new URLs. Going through Councilmatic to get Legistar URLs is like downloading a restaurant review site to find the restaurant's address - just go to the restaurant directly!
✅ Bottom Line
YES, City Scrapers has URLs - ✅ Already integrated!
YES, CDP has URLs - ⏳ Next priority to extract
Civic Scraper and Engagic are a library and a research prototype - No URLs to extract, but we use their patterns
Councilmatic only repackages Legistar data - Enumerate Legistar directly instead
See discovery/city_scrapers_urls.py for the newly implemented City Scrapers integration.