# URL Dataset Investigation: What Already Exists
## 🔍 Current Situation
**What we're doing now**:
- Matching Census jurisdiction names to CISA .gov domains
- Result: 76 URLs from 500 jurisdictions tested (15% match rate)
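As a rough illustration of the current approach, the sketch below matches names on a normalized `(name, state)` key and computes a match rate. The row layouts, the `normalize` rules, and the sample rows are all assumptions made for illustration; they are not the project's actual matching code or real Census/CISA records.

```python
import re

# Words that describe the government type rather than the place name.
# This list is an assumption; extend it for the data actually in use.
TYPE_WORDS = r"\b(city|town|village|borough|township)( of)?\b"

def normalize(name: str) -> str:
    """Reduce 'City of Springfield' and 'Springfield city' to 'springfield'."""
    name = re.sub(TYPE_WORDS, " ", name.lower())
    name = re.sub(r"[^a-z0-9 ]", "", name)
    return " ".join(name.split())

def match_jurisdictions(census_rows, cisa_rows):
    """Match (name, state) Census rows to (organization, state, domain) rows."""
    index = {(normalize(org), state): domain for org, state, domain in cisa_rows}
    matches = {}
    for name, state in census_rows:
        domain = index.get((normalize(name), state))
        if domain:
            matches[(name, state)] = domain
    return matches

# Made-up placeholder rows, not real Census or CISA records.
census = [("Springfield city", "IL"), ("Boulder city", "CO"), ("Smallville town", "KS")]
cisa = [
    ("City of Springfield", "IL", "springfield.example.gov"),
    ("City of Boulder", "CO", "boulder.example.gov"),
]

matches = match_jurisdictions(census, cisa)
match_rate = len(matches) / len(census)  # 2 of the 3 placeholder rows match
reported_rate = 76 / 500                 # 0.152, the ~15% figure above
```

Exact-key matching like this is crude: it drops type words even when they are part of the place name, and it cannot match a domain whose registered organization name differs from the Census name.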
**What we should check**:
- Do these civic tech projects already have URL datasets?
- Are there HuggingFace datasets with government URLs?
- Can we download their jurisdiction lists instead of discovering ourselves?
---
## 📊 Known URL Datasets to Investigate
### 1. LocalView (Harvard)
**Website**: https://www.localview.net
**Claim**: "Largest known public dataset of local government meetings"
**To Check**:
- [ ] Do they publish their jurisdiction URL list?
- [ ] Is their dataset on HuggingFace?
- [ ] Can we download their meeting database?
- [ ] What's their coverage? (They claim nationwide)
**Expected**: URL list for hundreds/thousands of jurisdictions with videos
---
### 2. CivicBand
**Website**: https://civic.band
**Claim**: "1,000+ municipalities & counties"
**To Check**:
- [ ] Do they publish their URL list?
- [ ] Can we access their database?
- [ ] What jurisdictions do they cover?
**Expected**: 1,000+ verified government website URLs
---
### 3. Council Data Project (CDP)
**Website**: https://councildataproject.org
**GitHub**: https://github.com/CouncilDataProject
**To Check**:
- [ ] List of deployed instances (each = 1 jurisdiction URL)
- [ ] cdp-data repository (may have jurisdiction lists)
- [ ] Individual deployment configs (contain URLs)
**Expected**: 10-20 cities with full deployment
**Known CDP Cities** (from their website):
- Seattle, WA
- King County, WA
- Portland, OR
- Denver, CO
- Boston, MA
- Oakland, CA
- Alameda County, CA
- Charlotte, NC
- Louisville, KY
---
### 4. City Scrapers
**Website**: https://cityscrapers.org
**GitHub**: https://github.com/city-scrapers
**To Check**:
- [ ] City Scrapers Chicago (has ~100 agencies)
- [ ] Other City Scrapers deployments (Pittsburgh, Detroit, etc.)
- [ ] Scraper list = jurisdiction URLs
**Expected**: 5-10 cities, each with dozens of agencies/boards
**Known City Scrapers Cities**:
- Chicago (100+ agencies)
- Pittsburgh
- Detroit
- Cleveland
- Los Angeles
---
### 5. Civic Scraper (Big Local News)
**GitHub**: https://github.com/biglocalnews/civic-scraper
**To Check**:
- [ ] Example jurisdiction list in repo
- [ ] Big Local News datasets
- [ ] Stanford Journalism data releases
**Expected**: Curated list of news-worthy jurisdictions
---
### 6. Councilmatic (DataMade)
**Website**: Various (councilmatic.org redirects)
**GitHub**: https://github.com/datamade
**To Check**:
- [ ] List of Councilmatic deployments
- [ ] Each deployment = 1 city URL
**Known Councilmatic Cities**:
- Chicago, IL (chicagocouncilmatic.org)
- New York, NY (nyc.councilmatic.org)
- Philadelphia, PA (philly.councilmatic.org)
- Los Angeles, CA
- Miami, FL
- Denver, CO
---
### 7. MeetingBank Dataset
**Website**: https://meetingbank.github.io
**To Check**:
- [ ] Download their dataset
- [ ] Extract jurisdiction URLs
- [ ] See what 6 cities they cover
**Known Cities**: 6 large cities (need to check which)
---
### 8. OpenTowns
**Website**: https://opentowns.org
**To Check**:
- [ ] List of covered towns
- [ ] URL database
- [ ] Focus: small towns/cities
**Expected**: Dozens of small municipalities
---
### 9. OpenCouncil (Greece)
**GitHub**: https://github.com/schemalabz/opencouncil
**To Check**:
- [ ] Greek municipality URLs (international example)
- [ ] May have patterns for finding URLs
**Expected**: Greek councils (not directly useful, but patterns may be)
---
## 🗂️ HuggingFace Datasets to Check
### Search Terms:
- "local government"
- "city council"
- "municipal meetings"
- "government websites"
- "civic data"
- "legistar"
- "granicus"
### Potential Datasets:
- [ ] Search: huggingface.co/datasets?search=government
- [ ] Search: huggingface.co/datasets?search=municipal
- [ ] Search: huggingface.co/datasets?search=council
- [ ] Check if LocalView uploaded their data
- [ ] Check if City Scrapers uploaded data
- [ ] Check Big Local News org
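To make the searches repeatable, the search terms listed above can be turned into dataset search URLs of the same form used in this checklist. This is a minimal sketch using only the standard library; `search_urls` is a hypothetical helper name, and it only builds URLs, it does not fetch anything.

```python
from urllib.parse import urlencode

# Search terms copied from the list above.
SEARCH_TERMS = [
    "local government",
    "city council",
    "municipal meetings",
    "government websites",
    "civic data",
    "legistar",
    "granicus",
]

def search_urls(terms, base="https://huggingface.co/datasets"):
    """Build one dataset-search URL per term (spaces are encoded as '+')."""
    return [f"{base}?{urlencode({'search': term})}" for term in terms]

for url in search_urls(SEARCH_TERMS):
    print(url)
```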
---
## 📋 Other Data Sources
### 1. Legistar Client List
**Why**: Legistar is a widely used legislative management system for city councils
**To Check**:
- [ ] Granicus website (they own Legistar)
- [ ] Public list of Legistar clients
- [ ] Scrape legistar.com subdomain list
**Expected**: 1,000+ city council URLs
### 2. Granicus Client List
**Why**: Major civic engagement platform
**To Check**:
- [ ] Granicus website client showcase
- [ ] govdelivery.com (owned by Granicus)
**Expected**: Hundreds of jurisdictions
### 3. CivicPlus Client List
**Why**: Popular municipal website platform
**To Check**:
- [ ] CivicPlus website
- [ ] Public client list
### 4. Municode Client List
**Why**: Online code/ordinance hosting
**To Check**:
- [ ] municode.com directory
- [ ] List of hosted municipalities
---
## 🎯 OPTIMAL STRATEGY
Instead of trying to match jurisdiction names to domains, we should:
### Phase 1: Download Existing Datasets (1-2 days)
1. **LocalView dataset** → Likely has 100s-1000s of URLs with videos
2. **CivicBand database** → 1,000+ municipalities
3. **CDP deployments** → 10-20 cities (high quality)
4. **City Scrapers instances** → 5+ cities, 100s of agencies
5. **Councilmatic deployments** → 6+ major cities
6. **MeetingBank** → 6 cities with transcripts
**Expected total**: 2,000-5,000 high-quality URLs
### Phase 2: Platform Client Lists (1 week)
1. **Legistar subdomain enumeration** → city.legistar.com
2. **Granicus client list** → From their website
3. **CivicPlus client list**
4. **Municode directory**
**Expected total**: 5,000-10,000 URLs
### Phase 3: Census + CISA Matching (current approach)
Keep our current system as fallback for uncovered jurisdictions
**Expected additional**: 1,000-2,000 URLs
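One way to combine the three phases is to merge the URL lists in priority order and deduplicate by hostname, so a URL from an earlier phase wins over a later duplicate. The sketch below assumes each source is just a list of URL strings; the source labels and sample URLs are placeholders, and deduplicating by host is a simplification that treats every page on one site as the same jurisdiction.

```python
from urllib.parse import urlsplit

def host_key(url: str) -> str:
    """Return the lowercase hostname without a leading 'www.'."""
    if "://" not in url:
        url = "https://" + url  # allow bare domains such as 'example.gov'
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def merge_sources(sources):
    """Merge (label, urls) pairs given in priority order; first occurrence wins."""
    merged = {}
    for label, urls in sources:
        for url in urls:
            key = host_key(url)
            if key and key not in merged:
                merged[key] = {"url": url, "source": label}
    return merged

# Placeholder inputs, listed in phase order.
sources = [
    ("existing_datasets", ["https://www.example.gov/council"]),
    ("platform_lists", ["https://example.legistar.com", "http://example.gov"]),
    ("census_cisa", ["example.gov", "other-example.gov"]),
]

merged = merge_sources(sources)
```

Here `example.gov` is kept once, attributed to the Phase 1 source, while the Census + CISA list only contributes the host that no earlier source covered.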
---
## 🔥 IMMEDIATE ACTIONS
### 1. Check LocalView Dataset
```bash
# Visit: https://www.localview.net
# Look for: "Download" or "Data" link
# Check: GitHub repo for dataset links
```
### 2. Check HuggingFace
```bash
# Search: https://huggingface.co/datasets?search=local+government
# Search: https://huggingface.co/datasets?search=council+meetings
```
### 3. Scrape Legistar Subdomains
```python
# Try common patterns:
# {city}.legistar.com
# {city}-{state}.legistar.com
# {county}.legistar.com
```
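The patterns above could be expanded into candidate hostnames along these lines. This is a sketch under assumptions: the slug rule (lowercase, letters and digits only) is a guess at how subdomains are formed, and `responds` is a hypothetical helper that is defined but not called here. A successful response would not by itself prove that a jurisdiction is a Legistar client, so candidates still need manual or content-based verification.

```python
import re
import urllib.request

def slug(name: str) -> str:
    """Lowercase and keep only letters and digits: 'King County' -> 'kingcounty'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def legistar_candidates(city: str, state: str, county: str = "") -> list[str]:
    """Expand the three patterns above into candidate hostnames."""
    hosts = [
        f"{slug(city)}.legistar.com",               # {city}.legistar.com
        f"{slug(city)}-{slug(state)}.legistar.com",  # {city}-{state}.legistar.com
    ]
    if county:
        hosts.append(f"{slug(county)}.legistar.com")  # {county}.legistar.com
    return hosts

def responds(host: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTPS request to the host succeeds (network required)."""
    try:
        with urllib.request.urlopen(f"https://{host}", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and timeouts are OSError subclasses
        return False

candidates = legistar_candidates("Seattle", "WA", county="King County")
```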
### 4. CDP Deployment List
```bash
# Check: https://councildataproject.org
# Each deployment has a URL in the config
```
### 5. City Scrapers Jurisdiction List
```bash
# Check: https://github.com/city-scrapers/city-scrapers
# Each scraper file = 1 or more agencies
```
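If the scraper files follow the usual Scrapy layout, agency names and URLs can be pulled out without running any scraper. The sketch below assumes each spider is a class with literal `agency` and `start_urls` attributes; that layout and the sample source are assumptions to confirm against the actual repository.

```python
import ast

def spider_urls(source: str):
    """Return (agency, start_urls) for each class that defines start_urls."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        attrs = {}
        for stmt in node.body:
            # Only simple class-level assignments such as `agency = "..."`.
            if (
                isinstance(stmt, ast.Assign)
                and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)
                and stmt.targets[0].id in ("agency", "start_urls")
            ):
                try:
                    attrs[stmt.targets[0].id] = ast.literal_eval(stmt.value)
                except ValueError:
                    pass  # value is computed at runtime, skip it
        if "start_urls" in attrs:
            results.append((attrs.get("agency"), list(attrs["start_urls"])))
    return results

# Hypothetical spider source, not copied from any real repository.
SAMPLE = '''
class ExampleBoardSpider:
    name = "example_board"
    agency = "Example Board of Trustees"
    start_urls = ["https://www.example.org/meetings"]
'''

found = spider_urls(SAMPLE)
```

Parsing with `ast` instead of importing the modules avoids installing each project's dependencies just to read a few constants.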
---
## 💡 WHY THIS MATTERS
**Current approach**: 76 URLs from 500 jurisdictions = 15% match rate
**Using existing datasets**: Could yield an estimated 5,000-10,000 URLs within the first two phases
**ROI**:
- LocalView alone might have 1,000+ URLs ready to use
- Legistar enumeration could yield 3,000+ URLs
- Combined: 10x more coverage with less work
---
## 📝 RECOMMENDATION
**DO THIS FIRST**:
1. ✅ Investigate LocalView dataset (highest potential)
2. ✅ Check HuggingFace for government data
3. ✅ Enumerate Legistar subdomains (*.legistar.com)
4. ✅ Get CDP deployment URLs
5. ✅ Extract City Scrapers jurisdiction lists
**THEN**:
Keep our Census + CISA matching as a fallback for smaller jurisdictions not covered by the sources above.
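Deciding which jurisdictions still need the fallback amounts to a set difference between the full jurisdiction list and those already covered. A minimal sketch, assuming every jurisdiction carries some stable identifier (the `id` field and sample records here are hypothetical):

```python
def uncovered(jurisdictions, covered_ids):
    """Return the jurisdictions whose id is not in covered_ids."""
    covered = set(covered_ids)
    return [j for j in jurisdictions if j["id"] not in covered]

# Placeholder records; real ones would come from the Census jurisdiction list.
jurisdictions = [
    {"id": "A1", "name": "Example City"},
    {"id": "B2", "name": "Example County"},
    {"id": "C3", "name": "Example Township"},
]

# Only these would be sent through Census + CISA matching.
fallback_queue = uncovered(jurisdictions, covered_ids={"A1", "B2"})
```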
---
## ⚠️ KEY INSIGHT
We've been trying to **discover** URLs when we should be **downloading** existing URL lists from projects that already did this work!
The civic tech community has likely already mapped thousands of URLs. We should:
1. Download their datasets
2. Extract their URL lists
3. Add our discoveries to fill gaps
This is the "don't reinvent the wheel" principle applied to URL discovery.