# 🔍 Civic Tech Projects: URL Source Analysis

## Quick Summary

| Project | URL Sources? | Quantity | Status | Priority |
|---------|-------------|----------|--------|----------|
| **Civic Scraper** | ❌ No | 0 | Library only | N/A |
| **City Scrapers** | ✅ **YES** | 100-500 | ✅ **Integrated** | DONE ✅ |
| **Council Data Project** | ✅ **YES** | 20 cities | ⏳ Pending | 🔥 HIGH |
| **Engagic** | ❌ No | 0 | Research project | N/A |
| **Councilmatic** | ⚠️ Maybe | ~6 | Not checked | 🟡 LOW |
| **MeetingBank** | ✅ **YES** | 1,366 | ✅ **Integrated** | DONE ✅ |
| **Open States** | ✅ **YES** | 50+ | ✅ **Integrated** | DONE ✅ |

---

## 1. Civic Scraper

### What It Is:
**Library** for scraping government documents, not a deployment or URL database.

### What We Use:
- ✅ Platform detection patterns (Legistar, Granicus, etc.)
- ✅ Document downloading logic
- ✅ Error handling patterns

### URL Sources:
❌ **NO URL LIST** - It's a Python library/toolkit, not a data collection project.

### Action:
✅ **COMPLETE** - We integrated their patterns into [`discovery/platform_detector.py`](../discovery/platform_detector.py)

---

## 2. City Scrapers

### What It Is:
**Active scraping project** with 100+ validated agency URLs across 5 cities.

### Deployments:
1. **Chicago** (~100 agencies)
2. **Pittsburgh** (~30 agencies)
3. **Detroit** (~40 agencies)
4. **Cleveland** (~30 agencies)
5. **Los Angeles** (~50 agencies)

### URL Sources:
✅ **YES - 100-500 VALIDATED URLs**

Each spider file contains `start_urls` with:
- Agency meeting pages
- Granicus video portals
- Legistar calendars
- PDF agendas/minutes
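
As a rough illustration of how those `start_urls` values could be collected without importing or running any spider, the sketch below parses spider source text with Python's `ast` module. The function name `extract_start_urls` and the sample spider are hypothetical, and the actual logic in `discovery/city_scrapers_urls.py` may differ.

```python
import ast


def extract_start_urls(source: str) -> list[str]:
    """Return string literals assigned to `start_urls` anywhere in the source."""
    urls = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Look for plain assignments such as `start_urls = [...]`.
        if not isinstance(node, ast.Assign):
            continue
        targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if "start_urls" not in targets:
            continue
        if isinstance(node.value, (ast.List, ast.Tuple)):
            for element in node.value.elts:
                # Keep plain string literals only; skip computed values.
                if isinstance(element, ast.Constant) and isinstance(element.value, str):
                    urls.append(element.value)
    return urls


# Hypothetical spider source, for illustration only.
SAMPLE_SPIDER = '''
class ExampleBoardSpider:
    name = "example_board"
    start_urls = ["https://example.org/board/meetings"]
'''

print(extract_start_urls(SAMPLE_SPIDER))
```

Static parsing like this misses spiders that build their URLs at runtime, so treat the result as a lower bound.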

### Status:
✅ **INTEGRATED** - [`discovery/city_scrapers_urls.py`](../discovery/city_scrapers_urls.py)

### To Run:
```bash
cd /home/developer/projects/open-navigator
source venv/bin/activate
python discovery/city_scrapers_urls.py
```

**Output**: `bronze/city_scrapers_urls` table with 100-500 validated URLs

---

## 3. Council Data Project (CDP)

### What It Is:
**End-to-end platform** with 20+ full deployments (transcripts, videos, search).

### Verified Deployments:
1. Seattle, WA
2. King County, WA
3. Portland, OR
4. Denver, CO
5. Boston, MA
6. Oakland, CA
7. Charlotte, NC
8. San José, CA
9. Milwaukee, WI
10. Louisville, KY
11. Atlanta, GA
12. Pittsburgh, PA
13. Long Beach, CA
14. Alameda, CA
15. Los Angeles, CA
16. San Diego, CA
17. Austin, TX
18. Houston, TX
19. Richmond, CA
20. Spokane, WA

### URL Sources:
✅ **YES - 20 PREMIUM CITIES**

Each CDP deployment has:
- **GitHub repo** with configuration
- **`cdp-backend` config** with source URLs
- **Video URLs** (YouTube, Granicus, custom)
- **Meeting pages** (official city websites)

### Where to Find URLs:
Each city has a repo like: `CouncilDataProject/cdp-CITY-backend`

Example for Seattle:
```bash
# Clone repo
git clone https://github.com/CouncilDataProject/cdp-seattle-backend

# Config file has source URLs
cat cdp_seattle_backend/cdp_seattle_backend_pipeline.py
```

Contains patterns like:
```python
SCRAPER_CONFIG = {
    "source_url": "https://seattle.gov/city-council/calendar",
    "video_source": "https://www.seattlechannel.org/CouncilVideos",
    "granicus_site": "https://seattle.granicus.com/ViewPublisher.php?view_id=24"
}
```

### Status:
⏳ **PENDING** - We have the list of 20 cities but haven't extracted URLs yet

### Action Needed:
Create `discovery/cdp_url_extraction.py` to:
1. Clone each CDP city's backend repo
2. Extract source URLs from config files
3. Write to `bronze/cdp_source_urls`
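
Step 2 could be sketched as follows, assuming the repos are already cloned locally and that a plain regex scan over Python files is enough to surface the source URLs. Both function names are hypothetical, config layouts may vary between city repos, and cloning and writing to `bronze/cdp_source_urls` are left out.

```python
import re
from pathlib import Path

# Matches http(s) URLs up to the first quote, whitespace, or closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")


def extract_urls_from_text(text: str) -> list[str]:
    """Return unique URLs in order of first appearance."""
    seen = {}
    for match in URL_PATTERN.findall(text):
        # Trim punctuation that commonly trails a URL in comments.
        seen.setdefault(match.rstrip(".,;"), None)
    return list(seen)


def extract_urls_from_repo(repo_dir: str) -> dict[str, list[str]]:
    """Map each Python file under repo_dir (relative path) to the URLs it contains."""
    results = {}
    root = Path(repo_dir)
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        urls = extract_urls_from_text(text)
        if urls:
            results[str(path.relative_to(root))] = urls
    return results
```

A scan like this also picks up documentation and license links, so the output needs a manual or rule-based filter before it is treated as a list of meeting sources.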

**Priority**: 🔥 **HIGH** - These are premium quality URLs with full pipelines

---

## 4. Engagic

### What It Is:
**Research project** for LLM-based legislative text parsing.

### What We Use:
- ✅ Matter tracking model (legislative items)
- ✅ LLM parsing patterns for PDFs

### URL Sources:
❌ **NO URL LIST** - It's a research/prototype project, not a production scraper.

### Status:
✅ **COMPLETE** - We created the Matter model in [`models/meeting_event.py`](../models/meeting_event.py)

### Action:
✅ **DONE** - Model sufficient, no URLs to extract

---

## 5. Councilmatic

### What It Is:
**Django web app template** for city council tracking (search, voting records).

### Known Deployments:
1. **Chicago Councilmatic** - https://chicago.councilmatic.org
2. **New York City Councilmatic** - https://nyc.councilmatic.org
3. **Los Angeles Councilmatic** - https://la.councilmatic.org
4. **Philadelphia Councilmatic** - https://philly.councilmatic.org
5. **San Francisco Councilmatic** - (archived)
6. **Metro Councilmatic** (LA County) - https://metro.councilmatic.org

### URL Sources:
⚠️ **MAYBE - ~6 DEPLOYMENTS**

Each deployment uses the **Legistar API** as its data source, so we'd get:
- Legistar API endpoints (already accessible)
- Meeting URLs (already in Legistar)
- Legislation URLs (already in Legistar)

### Issue:
**Redundant** - Councilmatic scrapes Legistar, which we already have access to.

We can enumerate Legistar directly without going through Councilmatic:
```python
# Already in our codebase
enumerate_legistar_subdomains()  # Tests chicago.legistar.com, la.legistar.com, etc.
```
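
How candidate hostnames are generated is not shown here. One possible approach, sketched under assumptions, is to derive a few slug variants from each city name and test them against the `{city}.legistar.com` pattern. The helper `legistar_candidates` is hypothetical and is not the implementation of `enumerate_legistar_subdomains()`; the hyphenated and initials variants are guesses about naming, and no network check is performed.

```python
import re
import unicodedata


def legistar_candidates(city_name: str) -> list[str]:
    """Build candidate {city}.legistar.com URLs from a city name."""
    # Strip accents so that "San José" becomes "San Jose".
    ascii_name = (
        unicodedata.normalize("NFKD", city_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase and keep only letters, digits, and spaces.
    cleaned = re.sub(r"[^a-z0-9 ]", "", ascii_name.lower()).strip()
    words = cleaned.split()
    if not words:
        return []
    slugs = ["".join(words)]
    if len(words) > 1:
        # Assumed variants for multi-word names: hyphenated and initials.
        slugs.append("-".join(words))
        slugs.append("".join(word[0] for word in words))
    return [f"https://{slug}.legistar.com" for slug in slugs]


print(legistar_candidates("Los Angeles"))
```

Each candidate still has to be requested and validated, since most generated hostnames will not correspond to a real Legistar site.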

### Status:
📋 **PLANNED** - Low priority; Legistar enumeration is more efficient

### Action:
🟡 **LOW PRIORITY** - Skip for now, Legistar enumeration covers these cities

---

## 🎯 Recommended Next Steps

### Immediate (This Week):
1. ✅ **DONE**: City Scrapers URL extraction
2. 🔥 **DO NEXT**: CDP URL extraction (20 premium cities)
3. ⏳ **PENDING**: MeetingBank ingestion (if not run yet)
4. ⏳ **PENDING**: Open States integration (if not run yet)

### Near-Term (Next 2 Weeks):
5. **Legistar enumeration** - Test the `{city}.legistar.com` pattern against Census data
6. **LocalView download** - Manual download from Harvard Dataverse
7. **URL deduplication** - Combine all sources, remove duplicates
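
For the deduplication step, a minimal sketch is to normalize each URL and then merge by the normalized form while remembering which sources reported it. The names `normalize_url` and `deduplicate` and the normalization rules (lowercased host, `www.` stripped, trailing slash and fragment dropped) are assumptions for illustration, not rules taken from the codebase.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Treat "www." as equivalent to the bare host (an assumption; not always true).
    if netloc.startswith("www."):
        netloc = netloc[4:]
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query because it often selects the content.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def deduplicate(urls_by_source: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each normalized URL to the list of sources that reported it."""
    merged: dict[str, list[str]] = {}
    for source, urls in urls_by_source.items():
        for url in urls:
            sources = merged.setdefault(normalize_url(url), [])
            if source not in sources:
                sources.append(source)
    return merged
```

Keeping the list of sources per URL makes it possible to prefer the higher-quality source when the same page appears more than once.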

### Long-Term (Next Month):
8. **Actual scrapers** - Build Legistar/Granicus/CivicPlus scrapers
9. **Transcript extraction** - YouTube captions, PDF parsing
10. **Oral health detection** - Run keyword matching on transcripts
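
The keyword matching in step 10 could look roughly like the sketch below. The keyword list is hypothetical (the real list is not given in this document), and `find_keyword_hits` is an illustrative name rather than an existing function.

```python
import re

# Hypothetical keyword list, for illustration only.
ORAL_HEALTH_KEYWORDS = ["oral health", "dental", "fluoride", "fluoridation"]


def find_keyword_hits(transcript: str, keywords: list[str]) -> dict[str, int]:
    """Count case-insensitive whole-word matches for each keyword."""
    hits = {}
    for keyword in keywords:
        # \b anchors keep "dental" from matching inside longer words.
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        count = len(pattern.findall(transcript))
        if count:
            hits[keyword] = count
    return hits


sample = (
    "The board discussed water fluoridation and a new Dental clinic. "
    "Dental care funding was approved."
)
print(find_keyword_hits(sample, ORAL_HEALTH_KEYWORDS))
```

Whole-word matching is deliberately strict: plural or inflected forms need their own entries in the keyword list.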

---

## 📊 Expected Coverage After All Integrations

| Source | URLs | Quality | Status |
|--------|------|---------|--------|
| Census Discovery | 76 | Variable | ✅ Working |
| City Scrapers | 100-500 | Good | ✅ Integrated |
| CDP | 20 | Excellent | ⏳ Pending |
| MeetingBank | 1,366 | Excellent | ✅ Integrated |
| Open States | 50-100 | Excellent | ✅ Integrated |
| LocalView | 1,000-10,000 | Good | ⏳ Manual download |
| Legistar Enum | 1,000-3,000 | Good | 📋 Planned |
| **TOTAL** | **~3,600-15,000** | **High** | **In Progress** |

---

## 💡 Why Some Projects Don't Have URLs

### Civic Scraper:
It's a **library/toolkit**, like BeautifulSoup or Scrapy. You don't "extract URLs" from BeautifulSoup - you use it to build your own scrapers.

### Engagic:
It's a **research prototype** showing how to use LLMs to parse legislative documents. No production deployment = no URL database.

### Councilmatic:
It **consumes** Legistar data rather than producing new URLs. Going through Councilmatic to get Legistar URLs is like downloading a restaurant review site to find a restaurant's address: just go to the restaurant directly!

---

## ✅ Bottom Line

**YES, City Scrapers has URLs** - ✅ **Already integrated!**

**YES, CDP has URLs** - ⏳ **Next priority to extract**

**Others are libraries/research** - No URLs to extract, but we use their patterns

See [`discovery/city_scrapers_urls.py`](../discovery/city_scrapers_urls.py) for the City Scrapers integration that just got implemented! 🎉