
open-navigator / docs /CIVIC_TECH_URL_SOURCES.md

jcbowyer

Deploy: Consolidated gold tables, fixed nginx docs routing

896453f verified 28 days ago


🔍 Civic Tech Projects: URL Source Analysis

Quick Summary

Project	URL Sources?	Quantity	Status	Priority
Civic Scraper	❌ No	0	Library only	N/A
City Scrapers	✅ YES	100-500	✅ Integrated	DONE ✅
Council Data Project	✅ YES	20 cities	⏳ Pending	🔥 HIGH
Engagic	❌ No	0	Research project	N/A
Councilmatic	⚠️ Maybe	~6	Not checked	🟡 LOW
MeetingBank	✅ YES	1,366	✅ Integrated	DONE ✅
Open States	✅ YES	50+	✅ Integrated	DONE ✅

1. Civic Scraper

What It Is:

Library for scraping government documents, not a deployment or URL database.

What We Use:

✅ Platform detection patterns (Legistar, Granicus, etc.)
✅ Document downloading logic
✅ Error handling patterns

URL Sources:

❌ NO URL LIST - It's a Python library/toolkit, not a data collection project.

Action:

✅ COMPLETE - We integrated their patterns into discovery/platform_detector.py
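As an illustration of what URL-based platform detection can look like, here is a minimal sketch. The marker table and the `detect_platform` name are hypothetical and are not taken from discovery/platform_detector.py or from Civic Scraper; the real rules may also inspect page content.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical host markers for illustration only; the project's actual
# rules live in discovery/platform_detector.py and may differ.
PLATFORM_MARKERS = {
    "legistar": ("legistar.com",),
    "granicus": ("granicus.com",),
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform whose marker domain matches the URL's host, else None."""
    host = (urlparse(url).hostname or "").lower()
    for platform, markers in PLATFORM_MARKERS.items():
        # Match the marker domain itself or any subdomain of it.
        if any(host == m or host.endswith("." + m) for m in markers):
            return platform
    return None
```

Matching on the parsed hostname rather than the raw string avoids false positives when a platform name only appears in a query string or path.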

2. City Scrapers

What It Is:

Active scraping project with validated URLs for roughly 250 agencies across 5 cities (per-city counts below).

Deployments:

Chicago (~100 agencies)
Pittsburgh (~30 agencies)
Detroit (~40 agencies)
Cleveland (~30 agencies)
Los Angeles (~50 agencies)

URL Sources:

✅ YES - 100-500 VALIDATED URLs

Each spider file contains start_urls with:

Agency meeting pages
Granicus video portals
Legistar calendars
PDF agendas/minutes
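One way to pull `start_urls` out of spider files without importing them is to parse the source with Python's `ast` module. This is a sketch under the assumption that `start_urls` is assigned a plain list or tuple of string literals; the function name and sample spider are hypothetical and may not match what discovery/city_scrapers_urls.py does.

```python
import ast
from typing import List


def extract_start_urls(source: str) -> List[str]:
    """Collect string literals assigned to `start_urls` anywhere in the source."""
    urls: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        names = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if "start_urls" in names and isinstance(node.value, (ast.List, ast.Tuple)):
            for element in node.value.elts:
                # Skip anything that is not a plain string literal (f-strings, calls).
                if isinstance(element, ast.Constant) and isinstance(element.value, str):
                    urls.append(element.value)
    return urls


# Hypothetical spider source, for illustration only.
SAMPLE_SPIDER = '''
class ExampleBoardSpider:
    name = "example_board"
    start_urls = ["https://example.org/board/meetings"]
'''

print(extract_start_urls(SAMPLE_SPIDER))
```

Spiders that build their URLs dynamically (for example in `start_requests`) would be missed by this approach and need manual review.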

Status:

✅ INTEGRATED - discovery/city_scrapers_urls.py

To Run:

# From the repository root, with the project virtualenv active
cd /home/developer/projects/open-navigator
source venv/bin/activate
python discovery/city_scrapers_urls.py

Output: bronze/city_scrapers_urls table with 100-500 validated URLs

3. Council Data Project (CDP)

What It Is:

End-to-end platform with 20+ full deployments (transcripts, videos, search).

Verified Deployments:

Seattle, WA
King County, WA
Portland, OR
Denver, CO
Boston, MA
Oakland, CA
Charlotte, NC
San José, CA
Milwaukee, WI
Louisville, KY
Atlanta, GA
Pittsburgh, PA
Long Beach, CA
Alameda, CA
Los Angeles, CA
San Diego, CA
Austin, TX
Houston, TX
Richmond, CA
Spokane, WA

URL Sources:

✅ YES - 20 PREMIUM CITIES

Each CDP deployment has:

GitHub repo with configuration
cdp-backend config with source URLs
Video URLs (YouTube, Granicus, custom)
Meeting pages (official city websites)

Where to Find URLs:

Each city has a repo like: CouncilDataProject/cdp-CITY-backend

Example for Seattle:

# Clone repo
git clone https://github.com/CouncilDataProject/cdp-seattle-backend

# Config file has source URLs
cat cdp_seattle_backend/cdp_seattle_backend_pipeline.py

Contains patterns like:

SCRAPER_CONFIG = {
    "source_url": "https://seattle.gov/city-council/calendar",
    "video_source": "https://www.seattlechannel.org/CouncilVideos",
    "granicus_site": "https://seattle.granicus.com/ViewPublisher.php?view_id=24"
}

Status:

⏳ PENDING - We have the list of 20 cities but haven't extracted URLs yet

Action Needed:

Create discovery/cdp_url_extraction.py to:

Clone each CDP city's backend repo
Extract source URLs from config files
Write to bronze/cdp_source_urls
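The extraction step could be sketched as a regex pass over a config file's text. This is only an assumption about how discovery/cdp_url_extraction.py might work: the function name is hypothetical, cloning the repos and writing to bronze/cdp_source_urls are left out, and real CDP configs may not store URLs as simple quoted strings.

```python
import re
from typing import List

# Matches http(s) URLs that appear inside single- or double-quoted strings.
URL_PATTERN = re.compile(r"""["'](https?://[^"'\s]+)["']""")


def extract_source_urls(config_text: str) -> List[str]:
    """Return quoted URLs found in the text, in order, without repeats."""
    found: List[str] = []
    for url in URL_PATTERN.findall(config_text):
        if url not in found:
            found.append(url)
    return found


# Sample text modeled on the config pattern shown above.
SAMPLE_CONFIG = '''
SCRAPER_CONFIG = {
    "source_url": "https://seattle.gov/city-council/calendar",
    "video_source": "https://www.seattlechannel.org/CouncilVideos",
}
'''

print(extract_source_urls(SAMPLE_CONFIG))
```

Treating the config as text avoids executing code from cloned repos, at the cost of missing URLs assembled at runtime.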

Priority: 🔥 HIGH - These are premium quality URLs with full pipelines

4. Engagic

What It Is:

Research project for LLM-based legislative text parsing.

What We Use:

✅ Matter tracking model (legislative items)
✅ LLM parsing patterns for PDFs

URL Sources:

❌ NO URL LIST - It's a research/prototype project, not a production scraper.

Status:

✅ COMPLETE - We created the Matter model in models/meeting_event.py

Action:

✅ DONE - Model sufficient, no URLs to extract

5. Councilmatic

What It Is:

Django web app template for city council tracking (search, voting records).

Known Deployments:

Chicago Councilmatic - https://chicago.councilmatic.org
New York City Councilmatic - https://nyc.councilmatic.org
Los Angeles Councilmatic - https://la.councilmatic.org
Philadelphia Councilmatic - https://philly.councilmatic.org
San Francisco Councilmatic - (archived)
Metro Councilmatic (LA County) - https://metro.councilmatic.org

URL Sources:

⚠️ MAYBE - ~6 DEPLOYMENTS

Each deployment uses Legistar API as their data source, so we'd get:

Legistar API endpoints (already accessible)
Meeting URLs (already in Legistar)
Legislation URLs (already in Legistar)

Issue:

Redundant - Councilmatic scrapes Legistar, which we already have access to.

We can enumerate Legistar directly without going through Councilmatic:

# Already in our codebase
enumerate_legistar_subdomains()  # Tests chicago.legistar.com, la.legistar.com, etc.
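
For readers without access to the codebase, the general idea of subdomain enumeration can be sketched as follows. This is not the project's `enumerate_legistar_subdomains()`; the helper names and the slug rule are assumptions, and the existence check is passed in as a callable so the sketch makes no network requests.

```python
import re
from typing import Callable, Iterable, List


def legistar_candidate(city: str) -> str:
    """Build a candidate host by lowercasing and dropping non-alphanumerics."""
    slug = re.sub(r"[^a-z0-9]", "", city.lower())
    return f"{slug}.legistar.com"


def enumerate_candidates(
    cities: Iterable[str], host_exists: Callable[[str], bool]
) -> List[str]:
    """Keep only candidate hosts that the supplied checker confirms."""
    return [host for host in map(legistar_candidate, cities) if host_exists(host)]


# Stub checker for illustration; a real one might attempt an HTTP request.
known = {"chicago.legistar.com"}
print(enumerate_candidates(["Chicago", "Long Beach"], lambda h: h in known))
```

A simple slug rule like this will miss hosts that use abbreviations (la.legistar.com, for instance), so a real enumeration would likely try several candidate slugs per city.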

Status:

📋 PLANNED - Low priority; direct Legistar enumeration is more efficient

Action:

🟡 LOW PRIORITY - Skip for now, Legistar enumeration covers these cities

🎯 Recommended Next Steps

Immediate (This Week):

✅ DONE: City Scrapers URL extraction
🔥 DO NEXT: CDP URL extraction (20 premium cities)
⏳ PENDING: Run MeetingBank ingestion (integration is in place; skip if already run)
⏳ PENDING: Run Open States ingestion (integration is in place; skip if already run)

Near-Term (Next 2 Weeks):

Legistar enumeration - Test the {city}.legistar.com pattern against the Census-derived city list
LocalView download - Manual download from Harvard Dataverse
URL deduplication - Combine all sources, remove duplicates
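
The deduplication step could be sketched as normalizing each URL and keeping the first occurrence. The normalization rules here (lowercase host, drop fragment, trim trailing slash) are assumptions for illustration; the project may need different ones, such as treating http and https as equivalent.

```python
from typing import Iterable, List
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, and trim a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def deduplicate(urls: Iterable[str]) -> List[str]:
    """Keep the first URL seen for each normalized form, preserving input order."""
    seen = set()
    unique: List[str] = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


print(deduplicate([
    "https://Chicago.Legistar.com/Calendar.aspx",
    "https://chicago.legistar.com/Calendar.aspx/",
    "https://chicago.legistar.com/Calendar.aspx#top",
]))
```

The query string is kept as part of the key because portal URLs such as Granicus `view_id` links differ only by query.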

Long-Term (Next Month):

Actual scrapers - Build Legistar/Granicus/CivicPlus scrapers
Transcript extraction - YouTube captions, PDF parsing
Oral health detection - Run keyword matching on transcripts
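
Keyword matching on transcripts could start as simply as the sketch below. The keyword list and function name are hypothetical placeholders, not the project's actual vocabulary; matching is case-insensitive and restricted to whole words.

```python
import re
from typing import Dict, Iterable

# Hypothetical keyword list for illustration; the real list is not given here.
ORAL_HEALTH_KEYWORDS = ["dental", "fluoride", "oral health"]


def find_keyword_hits(transcript: str, keywords: Iterable[str]) -> Dict[str, int]:
    """Count whole-word, case-insensitive occurrences of each keyword."""
    hits: Dict[str, int] = {}
    for keyword in keywords:
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        count = len(pattern.findall(transcript))
        if count:
            hits[keyword] = count
    return hits


sample = "The council discussed fluoride levels and a dental clinic. Fluoride costs rose."
print(find_keyword_hits(sample, ORAL_HEALTH_KEYWORDS))
```

Whole-word matching keeps "dental" from firing on words like "accidental", though it also misses inflected forms such as "dentist" unless they are listed separately.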

📊 Expected Coverage After All Integrations

Source	URLs	Quality	Status
Census Discovery	76	Variable	✅ Working
City Scrapers	100-500	Good	✅ Integrated
CDP	20	Excellent	⏳ Pending
MeetingBank	1,366	Excellent	✅ Integrated
Open States	50-100	Excellent	✅ Integrated
LocalView	1,000-10,000	Good	⏳ Manual download
Legistar Enum	1,000-3,000	Good	📋 Planned
TOTAL	~3,600-15,000 (sum of rows, before deduplication)	High	In Progress
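
The TOTAL row can be sanity-checked by summing the per-source ranges from the table. The sketch below copies the low and high bounds by hand; because sources overlap, the sum is a bound before deduplication rather than a count of unique URLs.

```python
from typing import Dict, Tuple

# (low, high) URL counts per source, copied from the table above.
SOURCE_RANGES: Dict[str, Tuple[int, int]] = {
    "Census Discovery": (76, 76),
    "City Scrapers": (100, 500),
    "CDP": (20, 20),
    "MeetingBank": (1366, 1366),
    "Open States": (50, 100),
    "LocalView": (1000, 10000),
    "Legistar Enum": (1000, 3000),
}


def total_range(ranges: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the low bounds and the high bounds across all sources."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


print(total_range(SOURCE_RANGES))
```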

💡 Why Some Projects Don't Have URLs

Civic Scraper:

It's a library/toolkit, like BeautifulSoup or Scrapy. You don't "extract URLs" from BeautifulSoup - you use it to build your own scrapers.

Engagic:

It's a research prototype showing how to use LLMs to parse legislative documents. No production deployment = no URL database.

Councilmatic:

It consumes Legistar data; it doesn't produce new URLs. Going through Councilmatic to get Legistar URLs is like downloading a restaurant review site to find the restaurant's address - just go to the restaurant directly!

✅ Bottom Line

YES, City Scrapers has URLs - ✅ Already integrated!

YES, CDP has URLs - ⏳ Next priority to extract

Others are libraries/research - No URLs to extract, but we use their patterns

See discovery/city_scrapers_urls.py for the City Scrapers integration that just got implemented! 🎉