🎯 ANSWER: Yes, You Should Look at Those Datasets!
Short Answer
NO - we have NOT looked at all those projects' actual URL datasets yet.
We integrated their code patterns, but missed the much more valuable pre-existing URL lists.
What We Found
✅ What EXISTS (and you should use):
LocalView Dataset (Harvard Dataverse)
- URL: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM
- "Largest known database of local government meetings"
- Publicly downloadable
- Estimated: 1,000-10,000 jurisdiction URLs
- ⚠️ We should download this FIRST
Council Data Project Deployments
- 20+ confirmed cities with full data pipelines
- Seattle, Portland, Denver, Boston, Oakland, Charlotte, etc.
- Each has verified URLs with transcripts + videos
- These are premium jurisdictions (large cities, high-value for advocacy)
City Scrapers Spider Lists
- Chicago: ~100 agencies
- Pittsburgh, Detroit, Cleveland, LA: dozens more
- Each spider file contains validated URLs
- Estimated: 100-500 agency URLs
Legistar Subdomain Pattern
- Test pattern: {city}.legistar.com
- Can enumerate against our 32,333 municipalities
- Estimated: 1,000-3,000 matches
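The Legistar pattern above can be turned into candidate hostnames before any network checks. This is a minimal sketch: the function name `legistar_candidates` and the list of stripped prefixes ("City of", "Town of", and so on) are assumptions for illustration, not part of the existing codebase.

```python
import re
import unicodedata

# Generic leading words to drop from names like "City of Seattle" (assumed convention).
GENERIC_PREFIXES = {"city", "town", "village", "county", "of"}

def legistar_candidates(name: str) -> list[str]:
    """Return candidate {city}.legistar.com hostnames for a municipality name."""
    # Strip accents and lowercase so "San José" becomes "san jose".
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    words = re.findall(r"[a-z0-9]+", ascii_name.lower())
    while words and words[0] in GENERIC_PREFIXES:
        words.pop(0)
    if not words:
        return []
    # Try both the joined and the hyphenated form; dict.fromkeys keeps order and drops duplicates.
    slugs = dict.fromkeys(["".join(words), "-".join(words)])
    return [f"{slug}.legistar.com" for slug in slugs]

print(legistar_candidates("City of San José"))
```

Each candidate still needs an HTTP check before it counts as a match, since a plausible hostname does not prove that a live meeting calendar exists there.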
❌ What DOESN'T exist:
- HuggingFace: No US local government datasets found
- CivicBand: Website exists but dataset not publicly downloadable
- OpenTowns: No bulk dataset available
The Big Insight
Current Approach (What We're Doing):
Census jurisdictions (85,302)
↓
Match to CISA .gov domains (15,672)
↓
Result: 76 URLs from 500 tested = 15% success rate
↓
Projected: ~5,000 URLs if we test all municipalities
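The projection above can be checked with a few lines of arithmetic. This sketch assumes the 500 tested entries are representative of all 32,333 municipalities (the count quoted for the Legistar enumeration), which the document does not confirm.

```python
# Figures quoted in this document.
urls_found = 76
entries_tested = 500
municipalities = 32_333  # assumed to be the population behind "all municipalities"

success_rate = urls_found / entries_tested
projected_urls = success_rate * municipalities

print(f"success rate: {success_rate:.1%}")       # 15.2%
print(f"projected URLs: {projected_urls:,.0f}")  # 4,915
```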
Better Approach (What We Should Do):
1. Download LocalView dataset
→ 1,000-10,000 URLs (already discovered!)
2. Extract CDP deployment URLs
→ 20 premium jurisdictions (already configured!)
3. Clone City Scrapers repos
→ 100-500 agency URLs (already validated!)
4. Enumerate Legistar subdomains
→ 1,000-3,000 URLs (roughly 3-9% of the 32,333 municipalities)
5. THEN use our Census matching as fallback
→ Fill remaining gaps
TOTAL: 7,000-20,000 URLs vs. our current 76
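As a sanity check on the total, the per-source estimates can be summed directly. The sketch below uses only the ranges quoted in this plan, and treats Census matching as the projected ~5,000.

```python
# (low, high) URL estimates per source, as quoted in the plan.
estimates = {
    "LocalView": (1_000, 10_000),
    "CDP deployments": (20, 20),
    "City Scrapers": (100, 500),
    "Legistar": (1_000, 3_000),
    "Census matching": (5_000, 5_000),
}

low = sum(lo for lo, _ in estimates.values())
high = sum(hi for _, hi in estimates.values())

print(f"{low:,} to {high:,} URLs before deduplication")  # 7,120 to 18,520
```

The sum lands close to the rounded 7,000-20,000 figure. Any overlap between sources (for example, a CDP city that also appears in LocalView) would lower the unique count.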
Why This Matters
ROI Comparison:
| Source | Time | URLs | Quality | Priority |
|---|---|---|---|---|
| LocalView | 1 day | 1,000-10,000 | Unknown | 🥇 DO FIRST |
| CDP | 2 hours | 20 | Excellent | 🥈 DO SECOND |
| City Scrapers | 4 hours | 100-500 | Good | 🥉 DO THIRD |
| Legistar | 1 week | 1,000-3,000 | Good | 🟡 Medium |
| Census Matching | Done | 5,000 | Unknown | 🟢 Fallback |
Bottom Line: Downloading existing datasets is 10-100x more efficient than trying to discover URLs ourselves.
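To make the table comparable across rows, the time and URL columns can be converted to URLs per hour. This is a rough sketch: the 8-hour day and 40-hour week are assumptions, and the helper `urls_per_hour` is hypothetical.

```python
HOURS_PER_DAY = 8    # assumption: "1 day" means one working day
HOURS_PER_WEEK = 40  # assumption: "1 week" means one working week

# (source, hours of effort, low URL estimate, high URL estimate) from the table.
sources = [
    ("LocalView", 1 * HOURS_PER_DAY, 1_000, 10_000),
    ("CDP", 2, 20, 20),
    ("City Scrapers", 4, 100, 500),
    ("Legistar", 1 * HOURS_PER_WEEK, 1_000, 3_000),
]

def urls_per_hour(hours: float, low: int, high: int) -> tuple[float, float]:
    """Return the (low, high) URL yield per hour of effort."""
    return low / hours, high / hours

for name, hours, low, high in sources:
    lo_rate, hi_rate = urls_per_hour(hours, low, high)
    print(f"{name:<14} {lo_rate:>7.1f} to {hi_rate:>7.1f} URLs/hour")
```

Under these assumptions CDP has the lowest yield per hour, so its second-place priority rests on data quality rather than volume.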
What You Should Do NOW
Priority 1: Download LocalView (HIGHEST VALUE)
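# Visit Harvard Dataverse (`open` is the macOS command; use `xdg-open` on Linux)
# Quote the URL so the shell does not treat `?` as a glob character
open "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM"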
# Download all files (likely CSV/JSON with jurisdiction URLs)
# Save to: data/cache/localview/
# Then load to Bronze layer
python discovery/external_url_datasets.py
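If the manual browser download becomes a bottleneck, the file list may also be reachable through the Dataverse native API. The sketch below assumes the dataset's files are publicly accessible and that the metadata response has the usual `data.latestVersion.files[].dataFile` shape. The helper names are hypothetical, and restricted files or terms-of-use prompts would still require the browser.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://dataverse.harvard.edu"
DOI = "doi:10.7910/DVN/NJTBEM"

def metadata_url(doi: str = DOI) -> str:
    """Build the native API URL that returns dataset metadata as JSON."""
    query = urllib.parse.urlencode({"persistentId": doi})
    return f"{BASE}/api/datasets/:persistentId/?{query}"

def list_files(metadata: dict) -> list[tuple[str, str]]:
    """Return (filename, download URL) pairs from a metadata response."""
    files = metadata["data"]["latestVersion"]["files"]
    return [
        (f["dataFile"]["filename"], f"{BASE}/api/access/datafile/{f['dataFile']['id']}")
        for f in files
    ]

def fetch_metadata(doi: str = DOI) -> dict:
    """Fetch dataset metadata over the network (not called in this sketch)."""
    with urllib.request.urlopen(metadata_url(doi), timeout=30) as resp:
        return json.load(resp)
```

Calling `list_files(fetch_metadata())` would give the download URLs to save under data/cache/localview/.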
Priority 2: Use CDP Deployments (HIGHEST QUALITY)
# Already coded! Just run:
python -c "
from discovery.external_url_datasets import integrate_external_url_datasets
integrate_external_url_datasets()
"
# This adds 20 premium jurisdictions with full pipelines
Priority 3: Extract City Scrapers URLs
# Clone the repo
git clone https://github.com/city-scrapers/city-scrapers.git
# Extract URLs from spider files
grep -r "start_urls" city-scrapers/city_scrapers/spiders/*.py
# Add to Bronze layer
Priority 4: Continue Your Current Approach
Your Census + CISA matching is good as a fallback, but use it after exhausting the above sources.
The Key Mistake We Made
We asked: "How can we integrate their code patterns?"
We should have asked: "What URL datasets have they already created?"
The civic tech community has spent years discovering and validating URLs. We should reuse their datasets, not just their code!
Updated Architecture
┌───────────────────────────────────────────────────────────┐
│                       BRONZE LAYER                        │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  ✅ census_jurisdictions      85,302 records              │
│  ✅ gsa_domains               15,672 records              │
│  ✅ cdp_deployments           20 records           🆕     │
│  🆕 localview_jurisdictions   1,000-10,000 records 🆕     │
│  🆕 city_scrapers_agencies    100-500 records      🆕     │
│  🆕 legistar_urls             1,000-3,000 records  🆕     │
│                                                           │
└───────────────────────────────────────────────────────────┘
                              ↓
┌───────────────────────────────────────────────────────────┐
│                       SILVER LAYER                        │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  Merge all URL sources:                                   │
│    • CDP (highest priority - excellent quality)           │
│    • LocalView (high volume)                              │
│    • City Scrapers (validated)                            │
│    • Legistar (standardized platform)                     │
│    • Census matching (fallback)                           │
│                                                           │
│  Deduplicate by jurisdiction + URL                        │
│  Add platform detection                                   │
│  Score by priority                                        │
│                                                           │
│  Result: 7,000-20,000 unique URLs                         │
│                                                           │
└───────────────────────────────────────────────────────────┘
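The Silver-layer merge described in the diagram could look roughly like the sketch below. The record fields (`jurisdiction`, `url`, `source`), the source keys, and the single-rule platform detector are all assumptions for illustration; the real schema in discovery/external_url_datasets.py may differ.

```python
from urllib.parse import urlsplit

# Lower number = higher priority, following the merge order in the diagram.
SOURCE_PRIORITY = {
    "cdp": 0,
    "localview": 1,
    "city_scrapers": 2,
    "legistar": 3,
    "census_matching": 4,
}

def normalize_url(url: str) -> str:
    """Reduce a URL to host + path so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")  # requires Python 3.9+
    return f"{host}{parts.path.rstrip('/')}"

def detect_platform(url: str) -> str:
    """Very small platform detector; only Legistar is recognized here."""
    host = urlsplit(url).netloc.lower()
    return "legistar" if host.endswith(".legistar.com") else "unknown"

def merge_sources(records: list[dict]) -> list[dict]:
    """Deduplicate by (jurisdiction, URL), keeping the highest-priority source."""
    best: dict[tuple[str, str], dict] = {}
    for rec in records:
        key = (rec["jurisdiction"].strip().lower(), normalize_url(rec["url"]))
        current = best.get(key)
        if current is None or SOURCE_PRIORITY[rec["source"]] < SOURCE_PRIORITY[current["source"]]:
            best[key] = {**rec, "platform": detect_platform(rec["url"])}
    # Highest-priority sources come first in the output.
    return sorted(best.values(), key=lambda r: SOURCE_PRIORITY[r["source"]])
```

Normalizing on host and path ignores the scheme and a leading `www.`, so `http://www.seattle.legistar.com` and `https://seattle.legistar.com/` collapse into one record. Whether query strings should also be compared is a design choice left open here.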
Summary
What You Asked:
"Have I looked at all of those projects and data sources, including data sources on HuggingFace, to determine the optimal set of URLs to scrape?"
Answer:
No, but you should! Specifically:
- ✅ Do download: LocalView dataset (1,000-10,000 URLs)
- ✅ Do extract: CDP deployment URLs (20 cities)
- ✅ Do clone: City Scrapers for agency URLs (100-500)
- ✅ Do enumerate: Legistar subdomains (1,000-3,000)
- ❌ Skip: HuggingFace (no relevant datasets found)
- ⚠️ Keep: Your Census matching as fallback
Expected Outcome:
- Before: 76 URLs (from Census matching, 500 tested)
- After: 7,000-20,000 URLs (from existing datasets + matching)
- Improvement: roughly 90-260x more coverage
Implementation Status
✅ Created: discovery/external_url_datasets.py - Integration code
✅ Documented: docs/URL_DATASETS_CONFIRMED.md - Full analysis
⚠️ TODO: Download LocalView dataset (manual, requires browser)
⚠️ TODO: Run integration script to load CDP URLs
You were absolutely right to ask this question. Using existing datasets is the smart approach! 🎯