open-navigator / docs /CIVIC_TECH_URL_SOURCES.md
jcbowyer's picture
Deploy: Consolidated gold tables, fixed nginx docs routing
896453f verified
# πŸ” Civic Tech Projects: URL Source Analysis
## Quick Summary
| Project | URL Sources? | Quantity | Status | Priority |
|---------|-------------|----------|--------|----------|
| **Civic Scraper** | ❌ No | 0 | Library only | N/A |
| **City Scrapers** | βœ… **YES** | 100-500 | βœ… **Integrated** | DONE βœ… |
| **Council Data Project** | βœ… **YES** | 20 cities | ⏳ Pending | πŸ”₯ HIGH |
| **Engagic** | ❌ No | 0 | Research project | N/A |
| **Councilmatic** | ⚠️ Maybe | ~6 | Not checked | 🟑 LOW |
| **MeetingBank** | βœ… **YES** | 1,366 | βœ… **Integrated** | DONE βœ… |
| **Open States** | βœ… **YES** | 50+ | βœ… **Integrated** | DONE βœ… |
---
## 1. Civic Scraper
### What It Is:
**Library** for scraping government documents, not a deployment or URL database.
### What We Use:
- βœ… Platform detection patterns (Legistar, Granicus, etc.)
- βœ… Document downloading logic
- βœ… Error handling patterns
### URL Sources:
❌ **NO URL LIST** - It's a Python library/toolkit, not a data collection project.
### Action:
βœ… **COMPLETE** - We integrated their patterns into [`discovery/platform_detector.py`](../discovery/platform_detector.py)
---
## 2. City Scrapers
### What It Is:
**Active scraping project** with 100+ validated agency URLs across 5 cities.
### Deployments:
1. **Chicago** (~100 agencies)
2. **Pittsburgh** (~30 agencies)
3. **Detroit** (~40 agencies)
4. **Cleveland** (~30 agencies)
5. **Los Angeles** (~50 agencies)
### URL Sources:
βœ… **YES - 100-500 VALIDATED URLs**
Each spider file contains `start_urls` with:
- Agency meeting pages
- Granicus video portals
- Legistar calendars
- PDF agendas/minutes
### Status:
βœ… **INTEGRATED** - [`discovery/city_scrapers_urls.py`](../discovery/city_scrapers_urls.py)
### To Run:
```bash
cd /home/developer/projects/open-navigator
source venv/bin/activate
python discovery/city_scrapers_urls.py
```
**Output**: `bronze/city_scrapers_urls` table with 100-500 validated URLs
---
## 3. Council Data Project (CDP)
### What It Is:
**End-to-end platform** with 20+ full deployments (transcripts, videos, search).
### Verified Deployments:
1. Seattle, WA
2. King County, WA
3. Portland, OR
4. Denver, CO
5. Boston, MA
6. Oakland, CA
7. Charlotte, NC
8. San JosΓ©, CA
9. Milwaukee, WI
10. Louisville, KY
11. Atlanta, GA
12. Pittsburgh, PA
13. Long Beach, CA
14. Alameda, CA
15. Los Angeles, CA
16. San Diego, CA
17. Austin, TX
18. Houston, TX
19. Richmond, CA
20. Spokane, WA
### URL Sources:
βœ… **YES - 20 PREMIUM CITIES**
Each CDP deployment has:
- **GitHub repo** with configuration
- **`cdp-backend` config** with source URLs
- **Video URLs** (YouTube, Granicus, custom)
- **Meeting pages** (official city websites)
### Where to Find URLs:
Each city has a repo like: `CouncilDataProject/cdp-CITY-backend`
Example for Seattle:
```bash
# Clone repo
git clone https://github.com/CouncilDataProject/cdp-seattle-backend
# Config file has source URLs
cat cdp_seattle_backend/cdp_seattle_backend_pipeline.py
```
Contains patterns like:
```python
SCRAPER_CONFIG = {
"source_url": "https://seattle.gov/city-council/calendar",
"video_source": "https://www.seattlechannel.org/CouncilVideos",
"granicus_site": "https://seattle.granicus.com/ViewPublisher.php?view_id=24"
}
```
### Status:
⏳ **PENDING** - We have the list of 20 cities but haven't extracted URLs yet
### Action Needed:
Create `discovery/cdp_url_extraction.py` to:
1. Clone each CDP city's backend repo
2. Extract source URLs from config files
3. Write to `bronze/cdp_source_urls`
**Priority**: πŸ”₯ **HIGH** - These are premium quality URLs with full pipelines
---
## 4. Engagic
### What It Is:
**Research project** for LLM-based legislative text parsing.
### What We Use:
- βœ… Matter tracking model (legislative items)
- βœ… LLM parsing patterns for PDFs
### URL Sources:
❌ **NO URL LIST** - It's a research/prototype project, not a production scraper.
### Status:
βœ… **COMPLETE** - We created the Matter model in [`models/meeting_event.py`](../models/meeting_event.py)
### Action:
βœ… **DONE** - Model sufficient, no URLs to extract
---
## 5. Councilmatic
### What It Is:
**Django web app template** for city council tracking (search, voting records).
### Known Deployments:
1. **Chicago Councilmatic** - https://chicago.councilmatic.org
2. **New York City Councilmatic** - https://nyc.councilmatic.org
3. **Los Angeles Councilmatic** - https://la.councilmatic.org
4. **Philadelphia Councilmatic** - https://philly.councilmatic.org
5. **San Francisco Councilmatic** - (archived)
6. **Metro Councilmatic** (LA County) - https://metro.councilmatic.org
### URL Sources:
⚠️ **MAYBE - ~6 DEPLOYMENTS**
Each deployment uses **Legistar API** as their data source, so we'd get:
- Legistar API endpoints (already accessible)
- Meeting URLs (already in Legistar)
- Legislation URLs (already in Legistar)
### Issue:
**Redundant** - Councilmatic scrapes Legistar, which we already have access to.
We can enumerate Legistar directly without going through Councilmatic:
```python
# Already in our codebase
enumerate_legistar_subdomains() # Tests chicago.legistar.com, la.legistar.com, etc.
```
### Status:
πŸ“‹ **PLANNED** - Low priority, Legistar enumeration more efficient
### Action:
🟑 **LOW PRIORITY** - Skip for now, Legistar enumeration covers these cities
---
## 🎯 Recommended Next Steps
### Immediate (This Week):
1. βœ… **DONE**: City Scrapers URL extraction
2. πŸ”₯ **DO NEXT**: CDP URL extraction (20 premium cities)
3. ⏳ **PENDING**: MeetingBank ingestion (if not run yet)
4. ⏳ **PENDING**: Open States integration (if not run yet)
### Near-Term (Next 2 Weeks):
5. **Legistar enumeration** - Test {city}.legistar.com pattern against Census
6. **LocalView download** - Manual download from Harvard Dataverse
7. **URL deduplication** - Combine all sources, remove duplicates
### Long-Term (Next Month):
8. **Actual scrapers** - Build Legistar/Granicus/CivicPlus scrapers
9. **Transcript extraction** - YouTube captions, PDF parsing
10. **Oral health detection** - Run keyword matching on transcripts
---
## πŸ“Š Expected Coverage After All Integrations
| Source | URLs | Quality | Status |
|--------|------|---------|--------|
| Census Discovery | 76 | Variable | βœ… Working |
| City Scrapers | 100-500 | Good | βœ… Integrated |
| CDP | 20 | Excellent | ⏳ Pending |
| MeetingBank | 1,366 | Excellent | βœ… Integrated |
| Open States | 50-100 | Excellent | βœ… Integrated |
| LocalView | 1,000-10,000 | Good | ⏳ Manual download |
| Legistar Enum | 1,000-3,000 | Good | πŸ“‹ Planned |
| **TOTAL** | **7,000-20,000** | **High** | **In Progress** |
---
## πŸ’‘ Why Some Projects Don't Have URLs
### Civic Scraper:
It's a **library/toolkit**, like BeautifulSoup or Scrapy. You don't "extract URLs" from BeautifulSoup - you use it to build your own scrapers.
### Engagic:
It's a **research prototype** showing how to use LLMs to parse legislative documents. No production deployment = no URL database.
### Councilmatic:
It **consumes** Legistar data, doesn't produce new URLs. Going through Councilmatic to get Legistar URLs is like downloading a restaurant review site to find the restaurant's address - just go to the restaurant directly!
---
## βœ… Bottom Line
**YES, City Scrapers has URLs** - βœ… **Already integrated!**
**YES, CDP has URLs** - ⏳ **Next priority to extract**
**Others are libraries/research** - No URLs to extract, but we use their patterns
See [`discovery/city_scrapers_urls.py`](../discovery/city_scrapers_urls.py) for the City Scrapers integration that just got implemented! πŸŽ‰