# Civic Tech Projects: URL Source Analysis
## Quick Summary
| Project | URL Sources? | Quantity | Status | Priority |
|---------|-------------|----------|--------|----------|
| **Civic Scraper** | ❌ No | 0 | Library only | N/A |
| **City Scrapers** | ✅ **YES** | 100-500 | ✅ **Integrated** | DONE ✅ |
| **Council Data Project** | ✅ **YES** | 20 cities | ⏳ Pending | 🔥 HIGH |
| **Engagic** | ❌ No | 0 | Research project | N/A |
| **Councilmatic** | ⚠️ Maybe | ~6 | Not checked | 🟡 LOW |
| **MeetingBank** | ✅ **YES** | 1,366 | ✅ **Integrated** | DONE ✅ |
| **Open States** | ✅ **YES** | 50+ | ✅ **Integrated** | DONE ✅ |
---
## 1. Civic Scraper
### What It Is:
**Library** for scraping government documents, not a deployment or URL database.
### What We Use:
- ✅ Platform detection patterns (Legistar, Granicus, etc.)
- ✅ Document downloading logic
- ✅ Error handling patterns
### URL Sources:
❌ **NO URL LIST** - It's a Python library/toolkit, not a data collection project.
### Action:
✅ **COMPLETE** - We integrated their patterns into [`discovery/platform_detector.py`](../discovery/platform_detector.py)
---
## 2. City Scrapers
### What It Is:
**Active scraping project** with 100+ validated agency URLs across 5 cities.
### Deployments:
1. **Chicago** (~100 agencies)
2. **Pittsburgh** (~30 agencies)
3. **Detroit** (~40 agencies)
4. **Cleveland** (~30 agencies)
5. **Los Angeles** (~50 agencies)
### URL Sources:
✅ **YES - 100-500 VALIDATED URLs**
Each spider file contains `start_urls` with:
- Agency meeting pages
- Granicus video portals
- Legistar calendars
- PDF agendas/minutes
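As a rough illustration of how `start_urls` could be pulled out of spider files without importing them, the sketch below parses the source with Python's `ast` module. It is not the code in `discovery/city_scrapers_urls.py`; the function name `extract_start_urls` and the example spider are hypothetical.

```python
import ast


def extract_start_urls(source: str) -> list[str]:
    """Return string literals assigned to `start_urls` anywhere in the source."""
    urls = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        # Only plain-name targets such as `start_urls = [...]` are considered.
        names = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if "start_urls" not in names:
            continue
        if isinstance(node.value, (ast.List, ast.Tuple)):
            for element in node.value.elts:
                if isinstance(element, ast.Constant) and isinstance(element.value, str):
                    urls.append(element.value)
    return urls


# Hypothetical spider source, for illustration only.
EXAMPLE_SPIDER = '''
class ExampleBoardSpider:
    name = "example_board"
    start_urls = ["https://example.org/board/meetings"]
'''

print(extract_start_urls(EXAMPLE_SPIDER))
```

Parsing instead of importing avoids running spider code and needing its dependencies, but it misses URLs that are built dynamically at runtime.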
### Status:
✅ **INTEGRATED** - [`discovery/city_scrapers_urls.py`](../discovery/city_scrapers_urls.py)
### To Run:
```bash
cd /home/developer/projects/open-navigator
source venv/bin/activate
python discovery/city_scrapers_urls.py
```
**Output**: `bronze/city_scrapers_urls` table with 100-500 validated URLs
---
## 3. Council Data Project (CDP)
### What It Is:
**End-to-end platform** with 20+ full deployments (transcripts, videos, search).
### Verified Deployments:
1. Seattle, WA
2. King County, WA
3. Portland, OR
4. Denver, CO
5. Boston, MA
6. Oakland, CA
7. Charlotte, NC
8. San José, CA
9. Milwaukee, WI
10. Louisville, KY
11. Atlanta, GA
12. Pittsburgh, PA
13. Long Beach, CA
14. Alameda, CA
15. Los Angeles, CA
16. San Diego, CA
17. Austin, TX
18. Houston, TX
19. Richmond, CA
20. Spokane, WA
### URL Sources:
✅ **YES - 20 PREMIUM CITIES**
Each CDP deployment has:
- **GitHub repo** with configuration
- **`cdp-backend` config** with source URLs
- **Video URLs** (YouTube, Granicus, custom)
- **Meeting pages** (official city websites)
### Where to Find URLs:
Each city has a repo like: `CouncilDataProject/cdp-CITY-backend`
Example for Seattle:
```bash
# Clone repo
git clone https://github.com/CouncilDataProject/cdp-seattle-backend
# Config file has source URLs
cat cdp_seattle_backend/cdp_seattle_backend_pipeline.py
```
Contains patterns like:
```python
SCRAPER_CONFIG = {
    "source_url": "https://seattle.gov/city-council/calendar",
    "video_source": "https://www.seattlechannel.org/CouncilVideos",
    "granicus_site": "https://seattle.granicus.com/ViewPublisher.php?view_id=24",
}
```
### Status:
⏳ **PENDING** - We have the list of 20 cities but haven't extracted URLs yet
### Action Needed:
Create `discovery/cdp_url_extraction.py` to:
1. Clone each CDP city's backend repo
2. Extract source URLs from config files
3. Write to `bronze/cdp_source_urls`
**Priority**: 🔥 **HIGH** - These are premium quality URLs with full pipelines
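A minimal sketch of steps 1 and 2 of the planned `discovery/cdp_url_extraction.py` follows. It assumes the source URLs appear as plain string literals in `.py` or `.json` files; the helper names (`clone_repo`, `extract_urls`, `collect_repo_urls`) are hypothetical. Step 3 is left out because the storage format of `bronze/cdp_source_urls` is not specified here.

```python
import re
import subprocess
from pathlib import Path

# Matches http(s) URLs up to the next whitespace, quote, or bracket.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>)]+""")


def clone_repo(repo_url: str, dest: Path) -> None:
    """Step 1: shallow-clone a backend repo (requires git on PATH)."""
    subprocess.run(["git", "clone", "--depth", "1", repo_url, str(dest)], check=True)


def extract_urls(text: str) -> list[str]:
    """Step 2: return unique URLs found in text, in first-seen order."""
    return list(dict.fromkeys(URL_PATTERN.findall(text)))


def collect_repo_urls(repo_dir: Path, suffixes=(".py", ".json")) -> list[str]:
    """Scan config-like files in a cloned repo and gather their URLs."""
    urls = []
    for path in sorted(repo_dir.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            urls.extend(extract_urls(path.read_text(encoding="utf-8", errors="ignore")))
    return list(dict.fromkeys(urls))


sample = '"source_url": "https://seattle.gov/city-council/calendar",'
print(extract_urls(sample))
```

A regex scan will also pick up unrelated links (documentation, badges), so the results would likely need filtering before they are treated as meeting sources.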
---
## 4. Engagic
### What It Is:
**Research project** for LLM-based legislative text parsing.
### What We Use:
- ✅ Matter tracking model (legislative items)
- ✅ LLM parsing patterns for PDFs
### URL Sources:
❌ **NO URL LIST** - It's a research/prototype project, not a production scraper.
### Status:
✅ **COMPLETE** - We created the Matter model in [`models/meeting_event.py`](../models/meeting_event.py)
### Action:
✅ **DONE** - Model sufficient, no URLs to extract
---
## 5. Councilmatic
### What It Is:
**Django web app template** for city council tracking (search, voting records).
### Known Deployments:
1. **Chicago Councilmatic** - https://chicago.councilmatic.org
2. **New York City Councilmatic** - https://nyc.councilmatic.org
3. **Los Angeles Councilmatic** - https://la.councilmatic.org
4. **Philadelphia Councilmatic** - https://philly.councilmatic.org
5. **San Francisco Councilmatic** - (archived)
6. **Metro Councilmatic** (LA County) - https://metro.councilmatic.org
### URL Sources:
⚠️ **MAYBE - ~6 DEPLOYMENTS**
Each deployment uses the **Legistar API** as its data source, so we'd get:
- Legistar API endpoints (already accessible)
- Meeting URLs (already in Legistar)
- Legislation URLs (already in Legistar)
### Issue:
**Redundant** - Councilmatic scrapes Legistar, which we already have access to.
We can enumerate Legistar directly without going through Councilmatic:
```python
# Already in our codebase
enumerate_legistar_subdomains() # Tests chicago.legistar.com, la.legistar.com, etc.
```
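The body of `enumerate_legistar_subdomains()` is not shown here, so the following is only a hypothetical sketch of the candidate-generation half of such an enumeration: turning a city name into `{city}.legistar.com` guesses. The function name `legistar_candidates` and the slug rules (joined words, plus initials for multi-word names) are assumptions.

```python
import re
import unicodedata


def legistar_candidates(city_name: str) -> list[str]:
    """Build candidate Legistar URLs for a city name (no network access)."""
    # Strip accents so "San José" becomes "San Jose".
    ascii_name = (
        unicodedata.normalize("NFKD", city_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    words = re.findall(r"[a-z0-9]+", ascii_name.lower())
    if not words:
        return []
    slugs = ["".join(words)]  # e.g. "losangeles"
    if len(words) > 1:
        slugs.append("".join(word[0] for word in words))  # initials, e.g. "la"
    return [f"https://{slug}.legistar.com" for slug in dict.fromkeys(slugs)]


print(legistar_candidates("Los Angeles"))
```

Each candidate would still need an HTTP check to confirm that the subdomain exists; that step is omitted because it depends on network access and rate limiting.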
### Status:
**PLANNED** - Low priority; Legistar enumeration is more efficient
### Action:
🟡 **LOW PRIORITY** - Skip for now, Legistar enumeration covers these cities
---
## 🎯 Recommended Next Steps
### Immediate (This Week):
1. ✅ **DONE**: City Scrapers URL extraction
2. 🔥 **DO NEXT**: CDP URL extraction (20 premium cities)
3. ⏳ **PENDING**: MeetingBank ingestion (if not run yet)
4. ⏳ **PENDING**: Open States integration (if not run yet)
### Near-Term (Next 2 Weeks):
5. **Legistar enumeration** - Test {city}.legistar.com pattern against Census
6. **LocalView download** - Manual download from Harvard Dataverse
7. **URL deduplication** - Combine all sources, remove duplicates
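For the URL deduplication step, one possible approach is to normalize each URL before comparing. The sketch below lowercases the scheme and host, drops fragments and trailing slashes, and keeps the first spelling seen; the function names are hypothetical and the normalization rules are assumptions, not a description of existing code.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a comparison key: lowercase scheme/host, no fragment, no trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def deduplicate(urls: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen: dict[str, str] = {}
    for url in urls:
        seen.setdefault(normalize_url(url), url)
    return list(seen.values())


combined = [
    "https://Chicago.legistar.com/Calendar.aspx",
    "https://chicago.legistar.com/Calendar.aspx/",
    "https://chicago.legistar.com/Calendar.aspx#top",
]
print(deduplicate(combined))
```

This sketch treats `http` and `https` variants as distinct and leaves query strings untouched; whether to merge those is a judgment call that depends on the sources being combined.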
### Long-Term (Next Month):
8. **Actual scrapers** - Build Legistar/Granicus/CivicPlus scrapers
9. **Transcript extraction** - YouTube captions, PDF parsing
10. **Oral health detection** - Run keyword matching on transcripts
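The keyword-matching step could start as simply as the sketch below, which counts whole-word, case-insensitive matches in a transcript. The keyword list is illustrative only, and `find_keyword_hits` is a hypothetical name.

```python
import re

# Illustrative keywords only; the real list would need domain review.
ORAL_HEALTH_KEYWORDS = ["oral health", "dental", "fluoride", "fluoridation"]


def find_keyword_hits(transcript: str, keywords=ORAL_HEALTH_KEYWORDS) -> dict[str, int]:
    """Count whole-word, case-insensitive occurrences of each keyword."""
    hits = {}
    for keyword in keywords:
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        count = len(pattern.findall(transcript))
        if count:
            hits[keyword] = count
    return hits


text = "The council discussed water fluoridation and a dental clinic. Dental access matters."
print(find_keyword_hits(text))
```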
---
## Expected Coverage After All Integrations
| Source | URLs | Quality | Status |
|--------|------|---------|--------|
| Census Discovery | 76 | Variable | ✅ Working |
| City Scrapers | 100-500 | Good | ✅ Integrated |
| CDP | 20 | Excellent | ⏳ Pending |
| MeetingBank | 1,366 | Excellent | ✅ Integrated |
| Open States | 50-100 | Excellent | ✅ Integrated |
| LocalView | 1,000-10,000 | Good | ⏳ Manual download |
| Legistar Enum | 1,000-3,000 | Good | Planned |
| **TOTAL** | **~3,600-15,000** | **High** | **In Progress** |
---
## 💡 Why Some Projects Don't Have URLs
### Civic Scraper:
It's a **library/toolkit**, like BeautifulSoup or Scrapy. You don't "extract URLs" from BeautifulSoup - you use it to build your own scrapers.
### Engagic:
It's a **research prototype** showing how to use LLMs to parse legislative documents. No production deployment = no URL database.
### Councilmatic:
It **consumes** Legistar data, doesn't produce new URLs. Going through Councilmatic to get Legistar URLs is like downloading a restaurant review site to find the restaurant's address - just go to the restaurant directly!
---
## ✅ Bottom Line
**YES, City Scrapers has URLs** - ✅ **Already integrated!**
**YES, CDP has URLs** - ⏳ **Next priority to extract**
**Others are libraries/research** - No URLs to extract, but we use their patterns
See [`discovery/city_scrapers_urls.py`](../discovery/city_scrapers_urls.py) for the City Scrapers integration that just got implemented!