---
displayed_sidebar: policyMakersSidebar
---
# Complete Video Channel Discovery Sources
**Comprehensive guide to all data sources for discovering local government video channels**
---
## Summary Table
| Source | Type | Coverage | Quality | Status | Priority |
|--------|------|----------|---------|--------|----------|
| **ELGL Top Channels** 🌟 | Curated List | 50-100 channels | ⭐⭐⭐⭐⭐ Highest | ✅ Ready | 🔥 CRITICAL |
| **NACo County Database** 🌟 | Official Database | 3,143 counties | ⭐⭐⭐⭐⭐ Highest | ✅ Ready | 🔥 CRITICAL |
| **MeetingBank** | Dataset | 6 cities, 1,366 meetings | ⭐⭐⭐⭐ High | ✅ Integrated | DONE |
| **Open States** | API | 50+ state legislatures | ⭐⭐⭐⭐ High | ✅ Integrated | DONE |
| **Social Media Scraping** | Web Scraping | 3,000-5,000 cities | ⭐⭐⭐ Medium | ✅ Implemented | In Progress |
| **USA.gov Directory** | Federal Registry | All cities/counties | ⭐⭐⭐⭐⭐ Highest | 📋 Planned | 🔥 HIGH |
| **City Scrapers** | GitHub Repos | 100-500 agencies | ⭐⭐⭐ Medium | ⚠️ Partial | MEDIUM |
| **Council Data Project** | Platform | 20+ cities | ⭐⭐⭐⭐ High | 📋 Planned | HIGH |
| **Federal Agencies** | Curated | 50+ state health depts | ⭐⭐⭐ Medium | 📋 Planned | LOW |
---
## 🌟 NEW: Curated Sources (Your Suggestions!)
### 1. ELGL (Engaging Local Government Leaders)
**What They Provide:**
- **"Top Local Government YouTube Channels"** annual lists
- Curated by experts in local government innovation
- Highlights the MOST ACTIVE channels nationwide
- Focus on quality over quantity
**Why This is CRITICAL:**
```
✅ Expert-curated, not automated
✅ Top tier quality - channels with best content
✅ Most active local governments
✅ Innovation leaders in digital communication
✅ Saves time - don't scrape 10,000 cities, get the top 100!
```
**Sources:**
- ELGL Blog: https://elgl.org/
- Annual articles: "Top Local Government YouTube Channels 2024", 2023, etc.
- ELGL Conference presentations
- Digital innovation showcases
**Expected Coverage:**
- **50-100 channels** (most active)
- Major cities: Seattle, Austin, Denver, etc.
- Innovative smaller cities
- County governments
- Regional districts
**Example Channels (Likely in ELGL Lists):**
- City of Seattle: https://www.youtube.com/@cityofseattle
- City of Austin: https://www.youtube.com/austintexasgov
- Denver: https://www.youtube.com/DenverGov
- King County, WA: https://www.youtube.com/KingCountyTV
**Implementation:** ✅ `discovery/curated_sources.py` - `ELGLYouTubeDiscovery` class
**How to Use:**
```python
import asyncio
from discovery.curated_sources import ELGLYouTubeDiscovery

async def main():
    # `async with` must run inside a coroutine
    async with ELGLYouTubeDiscovery() as elgl:
        top_channels = await elgl.scrape_elgl_top_channels()
        # Results: 50-100 top-tier YouTube channels with metadata

asyncio.run(main())
```
---
### 2. NACo (National Association of Counties)
**What They Provide:**
- **County Explorer Database** - all 3,143 U.S. counties
- Official county website URLs
- **Digital Counties Survey** - innovation leaders
- County communications/media awards
**Why This is CRITICAL:**
```
✅ COMPREHENSIVE - ALL 3,143 counties covered
✅ Official database maintained by NACo
✅ Digital innovation showcase (video/media leaders)
✅ Authoritative URLs (verified by county association)
✅ Partnership opportunities for data access
```
**Sources:**
- NACo County Explorer: https://ce.naco.org/
- Digital Counties Survey: https://www.naco.org/resources/featured/digital-counties-survey
- NACo Achievement Awards: https://www.naco.org/resources/programs-and-services/naco-achievement-awards
- Communications & Media Awards
**Expected Coverage:**
- **3,143 counties** with official websites
- **100+ counties** highlighted for digital innovation
- County media hubs and communication portals
- Video streaming platforms
**County Categories:**
- Large counties (500k+ population): ~100 counties - most have video
- Medium counties (100k-500k): ~400 counties - many have video
- Small counties (\<100k): ~2,600 counties - fewer with video
- **Digital Innovation Leaders:** ~100 counties with advanced media
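The population bands above can be expressed as a small classifier, which may help when deciding which counties to check for video first. This is a minimal sketch: the function name `county_size_band` is hypothetical and is not part of `discovery/curated_sources.py`; only the thresholds come from the list above.

```python
def county_size_band(population: int) -> str:
    """Map a county population to the size bands listed above."""
    if population < 0:
        raise ValueError("population must be non-negative")
    if population >= 500_000:
        return "large"   # ~100 counties, most have video
    if population >= 100_000:
        return "medium"  # ~400 counties, many have video
    return "small"       # ~2,600 counties, fewer with video
```

Sorting counties by band (large first) is one way to spend scraping time where video is most likely to exist.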
**Implementation:** ✅ `discovery/curated_sources.py` - `NACoCountyDiscovery` class
**How to Use:**
```python
import asyncio
from discovery.curated_sources import NACoCountyDiscovery

async def main():
    # `async with` must run inside a coroutine
    async with NACoCountyDiscovery() as naco:
        # Get all county websites
        counties = await naco.get_naco_county_websites()
        # Get digital innovation showcase
        innovations = await naco.scrape_naco_digital_innovation()
        # Results: 3,143 county websites + digital innovation leaders
        return counties, innovations

asyncio.run(main())
```
**Partnership Opportunity:**
NACo may provide:
- Bulk data export of county websites
- API access to County Explorer
- Research collaboration for public benefit
- Validation/verification partnership
---
## 📊 Existing Dataset Sources
### 3. MeetingBank (HuggingFace)
**Status:** ✅ INTEGRATED
**Coverage:**
- 1,366 meetings from 6 cities
- Alameda, Boston, Denver, King County, Long Beach, Seattle
**Video URLs:**
- YouTube IDs → YouTube URLs
- Vimeo IDs → Vimeo URLs
- Archive.org collections
**Implementation:** `discovery/meetingbank_ingestion.py`
**Quality:** ⭐⭐⭐⭐ Very high - academic benchmark dataset
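The ID-to-URL step listed under "Video URLs" can be sketched as follows. This is an illustration only: the function name `video_url` and the platform labels `"youtube"` and `"vimeo"` are assumptions, and the actual logic in `discovery/meetingbank_ingestion.py` may differ. Archive.org collections are left out because their identifier layout is not described here.

```python
def video_url(platform: str, video_id: str) -> str:
    """Build a watch URL from a platform label and a video ID."""
    # Hypothetical platform labels; adjust to whatever the dataset uses.
    templates = {
        "youtube": "https://www.youtube.com/watch?v={id}",
        "vimeo": "https://vimeo.com/{id}",
    }
    key = platform.strip().lower()
    if key not in templates:
        raise ValueError(f"unsupported platform: {platform!r}")
    return templates[key].format(id=video_id)
```

Raising on an unknown platform, rather than returning an empty string, keeps malformed rows from silently entering the channel list.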
---
### 4. Open States (API)
**Status:** ✅ INTEGRATED
**Coverage:**
- 50+ state legislatures
- State-level YouTube channels
- Vimeo accounts
- Granicus portals
**Implementation:** `discovery/openstates_sources.py`
**Quality:** ⭐⭐⭐⭐ High - official API data
---
### 5. City Scrapers (GitHub)
**Status:** ⚠️ PARTIAL
**Coverage:**
- 100-500 agency URLs
- Chicago (~100), Pittsburgh, Detroit, Cleveland, LA
**What's Missing:**
- Video URL extraction from Granicus pages
- YouTube embedded video scraping
**Implementation:** `discovery/city_scrapers_urls.py`
**Quality:** ⭐⭐⭐ Good - validated URLs but needs video extraction
---
## 🌐 Web Discovery Sources
### 6. Social Media Footer Scraping
**Status:** ✅ IMPLEMENTED (NEW!)
**How it Works:**
- Takes government homepage URLs
- Scrapes footer sections for social links
- Checks contact/about pages
- Extracts YouTube, Facebook, Twitter, Vimeo
**Coverage:**
- 3,000-5,000 cities with social media
- Most cities link YouTube in footer
**Implementation:** `discovery/social_media_discovery.py`
**Quality:** ⭐⭐⭐ Good - automated discovery
**Test Results:**
```
βœ“ Seattle: Found 8 social links (2 YouTube, 3 Facebook, 3 Twitter)
```
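The footer-scraping steps described in this section can be sketched with the standard library alone. This is a minimal sketch, not the code in `discovery/social_media_discovery.py`: the names `FooterLinkParser`, `extract_social_links`, and `SOCIAL_HOSTS` are hypothetical, and it covers only links inside a `<footer>` element, not the contact/about page check.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical mapping from hostname to platform label.
SOCIAL_HOSTS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "facebook.com": "facebook",
    "twitter.com": "twitter",
    "x.com": "twitter",
    "vimeo.com": "vimeo",
}


class FooterLinkParser(HTMLParser):
    """Collect href values from <a> tags that sit inside a <footer>."""

    def __init__(self):
        super().__init__()
        self.footer_depth = 0
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "footer":
            self.footer_depth += 1
        elif tag == "a" and self.footer_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "footer" and self.footer_depth > 0:
            self.footer_depth -= 1


def extract_social_links(html: str, base_url: str) -> dict:
    """Return {platform: sorted list of URLs} for social links in the footer."""
    parser = FooterLinkParser()
    parser.feed(html)
    found = {}
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        platform = SOCIAL_HOSTS.get(host)
        if platform:
            found.setdefault(platform, set()).add(url)
    return {name: sorted(urls) for name, urls in found.items()}
```

Sites that render their footer with JavaScript or do not use a `<footer>` element would need a different approach, such as a headless browser or a fallback that scans the whole page.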
---
### 7. USA.gov Local Directory
**Status:** 📋 PLANNED (HIGH PRIORITY)
**Why This Matters:**
- Federal verification of official websites
- Most authoritative homepage URLs
- Can cross-reference with NACo/ELGL
**Coverage:**
- All cities/counties in U.S.
- Official .gov verification
**Quality:** ⭐⭐⭐⭐⭐ Highest - federal stamp of authority
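Since this source is still planned, the cross-referencing idea can only be sketched. The example below assumes the directory yields a plain list of official homepage URLs; the helper names `normalized_host`, `cross_reference`, and `is_dot_gov` are hypothetical. Note that a non-.gov host is not necessarily unofficial, since some local governments use other domains.

```python
from urllib.parse import urlparse


def normalized_host(url: str) -> str:
    """Lower-cased hostname without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_dot_gov(url: str) -> bool:
    """True when the URL's hostname ends in .gov."""
    return normalized_host(url).endswith(".gov")


def cross_reference(candidates, verified):
    """Split candidate URLs by whether their host appears in the verified list."""
    verified_hosts = {normalized_host(u) for u in verified}
    matched, unmatched = [], []
    for url in candidates:
        target = matched if normalized_host(url) in verified_hosts else unmatched
        target.append(url)
    return matched, unmatched
```

Unmatched URLs from NACo or ELGL would then be candidates for manual review rather than automatic rejection.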
---
### 8. Council Data Project
**Status:** 📋 PLANNED
**Coverage:**
- 20+ cities with full pipelines
- Seattle, Portland, Boston, Denver, etc.
**What They Have:**
- Official meeting video URLs
- YouTube channels
- Granicus portals
**Quality:** ⭐⭐⭐⭐ High - production deployments
---
## 🏛️ Federal & State Sources
### 9. Federal Agency Channels
**Status:** 📋 PLANNED
**Coverage:**
- CDC, HRSA, CMS (federal)
- 50 state health departments
- State oral health programs
**Use Case:**
- State-level policy
- Federal program tracking
**Quality:** ⭐⭐⭐ Medium - supplementary
---
## 🎯 Recommended Implementation Strategy
### Phase 1: Curated Sources (HIGHEST ROI) 🔥
**Why Start Here:**
- Get 50-100 TOP channels immediately (ELGL)
- Get 3,143 county websites (NACo)
- Highest quality, verified data
- Fast implementation
**Steps:**
1. ✅ Scrape ELGL "Top YouTube Channels" articles
2. ✅ Contact NACo for County Explorer data export
3. Flag these as "Tier 1 - Curated" in database
4. Prioritize for content analysis
**Timeline:** 1-2 weeks
**Expected Results:** 50-100 top channels + 3,143 county websites
---
### Phase 2: Dataset Extraction
**Why Second:**
- Already have datasets downloaded
- Known good quality
- Fill gaps from curated sources
**Steps:**
1. ✅ MeetingBank video URLs (DONE)
2. ✅ Open States channels (DONE)
3. Extract City Scrapers Granicus videos
4. Integrate Council Data Project URLs
**Timeline:** 1-2 weeks
**Expected Results:** +1,500 meeting videos
---
### Phase 3: Website Scraping (Scale)
**Why Third:**
- After curated sources, find remaining channels
- Automated discovery for comprehensive coverage
- Ongoing monitoring for new channels
**Steps:**
1. ✅ Social media footer scraping (DONE)
2. USA.gov directory integration
3. Batch process 3,000+ cities
4. Validate discovered channels
**Timeline:** 2-4 weeks
**Expected Results:** +3,000-5,000 channels
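Batch scraping across thousands of sites will likely surface the same channel under several URL variants, so the validation step probably needs a de-duplication pass. The sketch below is one possible approach; `channel_key` and `dedupe_channels` are hypothetical names. It keeps the path case as-is, on the assumption that some channel identifiers are case-sensitive.

```python
from urllib.parse import urlparse


def channel_key(url: str) -> str:
    """Normalize a channel URL to host + path for duplicate detection."""
    parts = urlparse(url.strip())
    host = (parts.hostname or "").lower()
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    # Scheme, query string, and trailing slash are ignored.
    return host + parts.path.rstrip("/")


def dedupe_channels(urls):
    """Keep the first URL seen for each normalized key, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = channel_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

This does not detect that a handle URL and a channel-ID URL point to the same channel; resolving that would require querying the platform itself.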
---
## 📈 Expected Outcomes
### Coverage by Tier
**Tier 1: Curated (ELGL + NACo Digital Innovation)**
- 50-100 most active YouTube channels
- ~100 digital innovation leader counties
- ⭐⭐⭐⭐⭐ Quality: Highest
- 🎯 Priority: CRITICAL for analysis
**Tier 2: Dataset Verified (MeetingBank, Open States, CDP)**
- 1,366 meetings with videos (MeetingBank)
- 50+ state legislature channels
- 20+ CDP cities
- ⭐⭐⭐⭐ Quality: High
- ✅ Status: Mostly integrated
**Tier 3: Discovered (Website Scraping)**
- 3,000-5,000 cities with YouTube
- 3,143 county websites (NACo base)
- 10,000+ social media accounts
- ⭐⭐⭐ Quality: Medium
- 📊 Use: Comprehensive coverage
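The tiers above could be recorded per channel with a simple lookup. This is a sketch under assumptions: the source labels in `SOURCE_TIERS` and the names `Tier` and `best_tier` are hypothetical, and the rule that a channel found by several sources takes its best tier is an illustrative choice, not something this guide specifies.

```python
from enum import IntEnum


class Tier(IntEnum):
    CURATED = 1     # ELGL + NACo digital innovation
    DATASET = 2     # MeetingBank, Open States, CDP
    DISCOVERED = 3  # website scraping


# Hypothetical source labels mapped to the tiers described above.
SOURCE_TIERS = {
    "elgl": Tier.CURATED,
    "naco_digital_innovation": Tier.CURATED,
    "meetingbank": Tier.DATASET,
    "open_states": Tier.DATASET,
    "council_data_project": Tier.DATASET,
    "social_media_scraping": Tier.DISCOVERED,
    "naco_county_websites": Tier.DISCOVERED,
}


def best_tier(sources):
    """Return the highest-priority (lowest-numbered) tier among known sources."""
    tiers = [SOURCE_TIERS[s] for s in sources if s in SOURCE_TIERS]
    return min(tiers) if tiers else None
```

For example, a channel found by footer scraping that also appears in an ELGL list would be stored as Tier 1 and picked up first for content analysis.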
### Total Potential
| Metric | Count | Source |
|--------|-------|--------|
| **YouTube Channels** | 3,000-5,000 | Combined |
| **Top-Tier Channels** | 50-100 | ELGL ⭐ |
| **County Websites** | 3,143 | NACo ⭐ |
| **Digital Leaders** | ~200 | ELGL + NACo ⭐ |
| **Meeting Videos** | 1,366+ | MeetingBank |
| **State Legislatures** | 50+ | Open States |
| **Granicus Portals** | 1,000+ | Various |
| **Facebook Pages** | 10,000+ | Scraping |
---
## 🚀 Next Steps
### This Week
1. **Test ELGL Scraper** ✅ READY
   ```bash
   python discovery/curated_sources.py
   ```
2. **Contact NACo**
   - Request County Explorer data export
   - Discuss research partnership
   - Get digital innovation list
3. **Integrate ELGL Channels**
   - Parse "Top Channels" articles
   - Save to Bronze layer: `bronze/elgl_top_channels`
   - Flag as Tier 1 priority
### Next 2 Weeks
1. **NACo Integration**
   - Implement County Explorer data import
   - Scrape digital innovation showcase
   - Cross-reference with GSA .gov domains
2. **USA.gov Directory**
   - Scrape local directory
   - Use for homepage verification
   - Supplement NACo county URLs
3. **Quality Tiers**
   - Tier 1: ELGL + NACo innovation
   - Tier 2: Dataset channels
   - Tier 3: Web discovered
### Next Month
1. **Scale to 1,000+ Cities**
2. **Automated Validation**
3. **Content Analysis** (focus on Tier 1 first!)
---
## 📞 Contact Information
### Data Partnerships
**ELGL (Engaging Local Government Leaders)**
- Website: https://elgl.org/
- Contact: research@elgl.org
- Opportunity: Collaborate on local gov digital innovation research
**NACo (National Association of Counties)**
- Website: https://www.naco.org/
- County Explorer: https://ce.naco.org/
- Contact: research@naco.org
- Opportunity: County data partnership for public health research
---
## Conclusion
**ELGL and NACo** are the strongest starting points for channel discovery. These curated sources provide:
✅ **Quality over Quantity** - Get the 50-100 best channels first
✅ **Authoritative Data** - NACo's database covers all 3,143 counties
✅ **Expert Curation** - ELGL highlights innovation leaders
✅ **Fast Implementation** - Scrape a few curated lists instead of 10,000 websites
✅ **Partnership Opportunities** - Collaborate with ELGL/NACo
These should be **PRIORITY 1** for implementation - they provide the highest quality data with the least effort!