displayed_sidebar: policyMakersSidebar

Complete Video Channel Discovery Sources

Comprehensive guide to all data sources for discovering local government video channels


Summary Table

| Source | Type | Coverage | Quality | Status | Priority |
|---|---|---|---|---|---|
| ELGL Top Channels 🌟 | Curated List | 50-100 channels | ⭐⭐⭐⭐⭐ Highest | ✅ Ready | 🔥 CRITICAL |
| NACo County Database 🌟 | Official Database | 3,143 counties | ⭐⭐⭐⭐⭐ Highest | ✅ Ready | 🔥 CRITICAL |
| MeetingBank | Dataset | 6 cities, 1,366 meetings | ⭐⭐⭐⭐ High | ✅ Integrated | DONE |
| Open States | API | 50+ state legislatures | ⭐⭐⭐⭐ High | ✅ Integrated | DONE |
| Social Media Scraping | Web Scraping | 3,000-5,000 cities | ⭐⭐⭐ Medium | ✅ Implemented | In Progress |
| USA.gov Directory | Federal Registry | All cities/counties | ⭐⭐⭐⭐⭐ Highest | 📋 Planned | 🔥 HIGH |
| City Scrapers | GitHub Repos | 100-500 agencies | ⭐⭐⭐ Medium | ⚠️ Partial | MEDIUM |
| Council Data Project | Platform | 20 cities | ⭐⭐⭐⭐ High | 📋 Planned | HIGH |
| Federal Agencies | Curated | 50+ state health depts | ⭐⭐⭐ Medium | 📋 Planned | LOW |

🌟 NEW: Curated Sources (Your Suggestions!)

1. ELGL (Engaging Local Government Leaders)

What They Provide:

  • "Top Local Government YouTube Channels" annual lists
  • Curated by experts in local government innovation
  • Highlights the MOST ACTIVE channels nationwide
  • Focus on quality over quantity

Why This is CRITICAL:

✅ Expert-curated, not automated
✅ Top-tier quality: channels with the best content
✅ Most active local governments
✅ Innovation leaders in digital communication
✅ Saves time: don't scrape 10,000 cities, get the top 100!

Sources:

  • ELGL Blog: https://elgl.org/
  • Annual articles: "Top Local Government YouTube Channels 2024", 2023, etc.
  • ELGL Conference presentations
  • Digital innovation showcases

Expected Coverage:

  • 50-100 channels (most active)
  • Major cities: Seattle, Austin, Denver, etc.
  • Innovative smaller cities
  • County governments
  • Regional districts

Example Channels (Likely in ELGL Lists):

Implementation: ✅ discovery/curated_sources.py - ELGLYouTubeDiscovery class

How to Use:

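import asyncio

from discovery.curated_sources import ELGLYouTubeDiscovery

async def main():
    # `async with` is only valid inside a coroutine, so wrap the call in main()
    async with ELGLYouTubeDiscovery() as elgl:
        top_channels = await elgl.scrape_elgl_top_channels()
    # Results: 50-100 top-tier YouTube channels with metadata
    return top_channels

top_channels = asyncio.run(main())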

2. NACo (National Association of Counties)

What They Provide:

  • County Explorer Database - all 3,143 U.S. counties
  • Official county website URLs
  • Digital Counties Survey - innovation leaders
  • County communications/media awards

Why This is CRITICAL:

✅ COMPREHENSIVE - ALL 3,143 counties covered
✅ Official database maintained by NACo
✅ Digital innovation showcase (video/media leaders)
✅ Authoritative URLs (verified by county association)
✅ Partnership opportunities for data access

Sources:

Expected Coverage:

  • 3,143 counties with official websites
  • 100+ counties highlighted for digital innovation
  • County media hubs and communication portals
  • Video streaming platforms

County Categories:

  • Large counties (500k+ population): ~100 counties - most have video
  • Medium counties (100k-500k): ~400 counties - many have video
  • Small counties (<100k): ~2,600 counties - fewer with video
  • Digital Innovation Leaders: ~100 counties with advanced media
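
The population bands above could be encoded as a small helper when ordering counties for processing. This is only a sketch: `county_size_category` is a hypothetical name, not part of `discovery/curated_sources.py`, and placing exactly 500,000 in the large band and exactly 100,000 in the medium band is an assumption about where the boundaries fall.

```python
def county_size_category(population: int) -> str:
    """Map a county population to the size bands listed above (hypothetical helper)."""
    if population < 0:
        raise ValueError("population must be non-negative")
    if population >= 500_000:
        return "large"   # 500k+: most have video
    if population >= 100_000:
        return "medium"  # 100k-500k: many have video
    return "small"       # <100k: fewer with video
```

Processing the large band first would follow the same logic as the rest of this guide: start where video is most likely to exist.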

Implementation: ✅ discovery/curated_sources.py - NACoCountyDiscovery class

How to Use:

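import asyncio

from discovery.curated_sources import NACoCountyDiscovery

async def main():
    # `async with` is only valid inside a coroutine, so wrap the calls in main()
    async with NACoCountyDiscovery() as naco:
        # Get all county websites
        counties = await naco.get_naco_county_websites()

        # Get digital innovation showcase
        innovations = await naco.scrape_naco_digital_innovation()
    # Results: 3,143 county websites + digital innovation leaders
    return counties, innovations

counties, innovations = asyncio.run(main())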

Partnership Opportunity: NACo may provide:

  • Bulk data export of county websites
  • API access to County Explorer
  • Research collaboration for public benefit
  • Validation/verification partnership

📊 Existing Dataset Sources

3. MeetingBank (HuggingFace)

Status: ✅ INTEGRATED

Coverage:

  • 1,366 meetings from 6 cities
  • Alameda, Boston, Denver, King County, Long Beach, Seattle

Video URLs:

  • YouTube IDs → YouTube URLs
  • Vimeo IDs → Vimeo URLs
  • Archive.org collections
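
The ID-to-URL step for YouTube and Vimeo can be sketched as below. The function name `video_url` and the platform labels are hypothetical; how `discovery/meetingbank_ingestion.py` actually stores or converts IDs is not shown in this guide. The two URL templates are the standard public watch-page forms for each platform.

```python
def video_url(platform: str, video_id: str) -> str:
    """Build a public watch URL from a platform label and a video ID."""
    templates = {
        "youtube": "https://www.youtube.com/watch?v={id}",
        "vimeo": "https://vimeo.com/{id}",
    }
    try:
        template = templates[platform.lower()]
    except KeyError:
        # Archive.org and other hosts are deliberately not handled here
        raise ValueError(f"unsupported platform: {platform!r}") from None
    return template.format(id=video_id)
```

Archive.org collections are left out of the sketch because this guide does not describe their identifier format.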

Implementation: discovery/meetingbank_ingestion.py

Quality: ⭐⭐⭐⭐ Very high - academic benchmark dataset


4. Open States (API)

Status: ✅ INTEGRATED

Coverage:

  • 50+ state legislatures
  • State-level YouTube channels
  • Vimeo accounts
  • Granicus portals

Implementation: discovery/openstates_sources.py

Quality: ⭐⭐⭐⭐ High - official API data


5. City Scrapers (GitHub)

Status: ⚠️ PARTIAL

Coverage:

  • 100-500 agency URLs
  • Chicago (~100), Pittsburgh, Detroit, Cleveland, LA

What's Missing:

  • Video URL extraction from Granicus pages
  • YouTube embedded video scraping
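
For the second gap, one possible approach is a regular expression over the page HTML. This is a sketch, not the project's implementation: `extract_youtube_ids` is a hypothetical name, the pattern assumes today's 11-character video IDs, and it only recognizes `watch` URLs where `v=` is the first query parameter. Granicus pages are not covered, since their structure is not described here.

```python
import re

# Matches the video ID in /embed/<id>, /watch?v=<id>, and youtu.be/<id> URLs.
# YouTube video IDs are currently 11 characters from [A-Za-z0-9_-].
_YOUTUBE_ID = re.compile(
    r"(?:youtube(?:-nocookie)?\.com/(?:embed/|watch\?v=)|youtu\.be/)"
    r"([A-Za-z0-9_-]{11})(?![A-Za-z0-9_-])"
)

# Embed path segments that happen to be 11 characters long but are not video IDs.
_NOT_VIDEO_IDS = {"videoseries", "live_stream"}


def extract_youtube_ids(html: str) -> list[str]:
    """Return unique YouTube video IDs found in a page, in order of appearance."""
    found: list[str] = []
    for match in _YOUTUBE_ID.finditer(html):
        video_id = match.group(1)
        if video_id not in _NOT_VIDEO_IDS and video_id not in found:
            found.append(video_id)
    return found
```

The exclusion set matters because playlist embeds (`/embed/videoseries?list=...`) and live embeds (`/embed/live_stream?channel=...`) would otherwise be mistaken for video IDs.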

Implementation: discovery/city_scrapers_urls.py

Quality: ⭐⭐⭐ Good - validated URLs but needs video extraction


🌐 Web Discovery Sources

6. Social Media Footer Scraping

Status: ✅ IMPLEMENTED (NEW!)

How it Works:

  • Takes government homepage URLs
  • Scrapes footer sections for social links
  • Checks contact/about pages
  • Extracts YouTube, Facebook, Twitter, Vimeo
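
The footer step can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not the code in `discovery/social_media_discovery.py`: the class and function names are hypothetical, the host list is an assumption, and it only looks inside `<footer>` elements (no contact/about page crawl, no network access).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hostname -> platform label. The set of hosts is an assumption for this sketch.
PLATFORM_HOSTS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "facebook.com": "facebook",
    "twitter.com": "twitter",
    "x.com": "twitter",
    "vimeo.com": "vimeo",
}


class FooterLinkParser(HTMLParser):
    """Collect social media links that appear inside <footer> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.footer_depth = 0
        self.links: dict[str, list[str]] = {}

    def handle_starttag(self, tag, attrs):
        if tag == "footer":
            self.footer_depth += 1
        elif tag == "a" and self.footer_depth > 0:
            href = dict(attrs).get("href") or ""
            host = (urlparse(href).hostname or "").removeprefix("www.")
            platform = PLATFORM_HOSTS.get(host)
            # Record each link once per platform
            if platform and href not in self.links.setdefault(platform, []):
                self.links[platform].append(href)

    def handle_endtag(self, tag):
        if tag == "footer" and self.footer_depth > 0:
            self.footer_depth -= 1


def find_footer_social_links(html: str) -> dict[str, list[str]]:
    """Return {platform: [urls]} for social links found in the page footer."""
    parser = FooterLinkParser()
    parser.feed(html)
    return parser.links
```

Sites that build their footer with `<div>` elements instead of `<footer>` would need a different heuristic, such as matching on class or id names.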

Coverage:

  • 3,000-5,000 cities with social media
  • Most cities link YouTube in footer

Implementation: discovery/social_media_discovery.py

Quality: ⭐⭐⭐ Good - automated discovery

Test Results:

✓ Seattle: Found 8 social links (2 YouTube, 3 Facebook, 3 Twitter)

7. USA.gov Local Directory

Status: 📋 PLANNED (HIGH PRIORITY)

Why This Matters:

  • Federal verification of official websites
  • Most authoritative homepage URLs
  • Can cross-reference with NACo/ELGL

Coverage:

  • All cities/counties in U.S.
  • Official .gov verification
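
Since this source is still planned, the following is only a sketch of what the `.gov` check and the cross-reference against NACo/ELGL could look like. All three function names are hypothetical, and URLs are assumed to include a scheme (`https://...`); a bare `seattle.gov` has no parseable hostname and is skipped.

```python
from urllib.parse import urlparse


def homepage_host(url: str) -> str:
    """Return the lowercase hostname without a leading 'www.', or '' if absent."""
    host = urlparse(url).hostname or ""
    return host.removeprefix("www.")


def is_gov_domain(url: str) -> bool:
    """True if the URL's hostname is under the .gov top-level domain."""
    return homepage_host(url).endswith(".gov")


def confirmed_by_both(directory_urls: list[str], curated_urls: list[str]) -> set[str]:
    """Hostnames that appear in both the directory list and a curated list."""
    directory = {h for u in directory_urls if (h := homepage_host(u))}
    curated = {h for u in curated_urls if (h := homepage_host(u))}
    return directory & curated
```

Many official local government sites use `.org`, `.us`, or `.com` domains, so a failed `.gov` check should lower confidence rather than reject a site outright.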

Quality: ⭐⭐⭐⭐⭐ Highest - federal stamp of authority


8. Council Data Project

Status: 📋 PLANNED

Coverage:

  • 20+ cities with full pipelines
  • Seattle, Portland, Boston, Denver, etc.

What They Have:

  • Official meeting video URLs
  • YouTube channels
  • Granicus portals

Quality: ⭐⭐⭐⭐ High - production deployments


🏛️ Federal & State Sources

9. Federal Agency Channels

Status: 📋 PLANNED

Coverage:

  • CDC, HRSA, CMS (federal)
  • 50 state health departments
  • State oral health programs

Use Case:

  • State-level policy
  • Federal program tracking

Quality: ⭐⭐⭐ Medium - supplementary


🎯 Recommended Implementation Strategy

Phase 1: Curated Sources (HIGHEST ROI) 🔥

Why Start Here:

  • Get 50-100 TOP channels immediately (ELGL)
  • Get 3,143 county websites (NACo)
  • Highest quality, verified data
  • Fast implementation

Steps:

  1. ✅ Scrape ELGL "Top YouTube Channels" articles
  2. ✅ Contact NACo for County Explorer data export
  3. Flag these as "Tier 1 - Curated" in database
  4. Prioritize for content analysis

Timeline: 1-2 weeks
Expected Results: 50-100 top channels + 3,143 county websites


Phase 2: Dataset Extraction

Why Second:

  • Already have datasets downloaded
  • Known good quality
  • Fill gaps from curated sources

Steps:

  1. ✅ MeetingBank video URLs (DONE)
  2. ✅ Open States channels (DONE)
  3. Extract City Scrapers Granicus videos
  4. Integrate Council Data Project URLs

Timeline: 1-2 weeks
Expected Results: +1,500 meeting videos


Phase 3: Website Scraping (Scale)

Why Third:

  • After curated sources, find remaining channels
  • Automated discovery for comprehensive coverage
  • Ongoing monitoring for new channels

Steps:

  1. ✅ Social media footer scraping (DONE)
  2. USA.gov directory integration
  3. Batch process 3,000+ cities
  4. Validate discovered channels

Timeline: 2-4 weeks
Expected Results: +3,000-5,000 channels
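
Batch processing thousands of city sites usually needs a cap on concurrent requests and tolerance for individual failures. The sketch below shows one common pattern with `asyncio.Semaphore`. It is an illustration only: `process_in_batches` and `fake_worker` are hypothetical, and the real scraper coroutine from `discovery/social_media_discovery.py` would take the place of the stand-in worker.

```python
import asyncio
from typing import Awaitable, Callable


async def process_in_batches(
    urls: list[str],
    worker: Callable[[str], Awaitable[dict]],
    max_concurrent: int = 10,
) -> list[dict]:
    """Run `worker` over all URLs with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(url: str) -> dict:
        async with semaphore:
            try:
                return await worker(url)
            except Exception as exc:
                # Record the failure and keep going; one bad site should not stop the batch
                return {"url": url, "error": repr(exc)}

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(run_one(u) for u in urls))


async def fake_worker(url: str) -> dict:
    """Stand-in for a real scraper call (hypothetical)."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise RuntimeError("simulated failure")
    return {"url": url, "youtube": []}
```

Entries that carry an `error` key can be retried later or routed to the validation step.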


📈 Expected Outcomes

Coverage by Tier

Tier 1: Curated (ELGL + NACo Digital Innovation)

  • 50-100 most active YouTube channels
  • ~100 digital innovation leader counties
  • ⭐⭐⭐⭐⭐ Quality: Highest
  • 🎯 Priority: CRITICAL for analysis

Tier 2: Dataset Verified (MeetingBank, Open States, CDP)

  • 1,366 meetings with videos (MeetingBank)
  • 50+ state legislature channels
  • 20+ CDP cities
  • ⭐⭐⭐⭐ Quality: High
  • ✅ Status: Mostly integrated

Tier 3: Discovered (Website Scraping)

  • 3,000-5,000 cities with YouTube
  • 3,143 county websites (NACo base)
  • 10,000+ social media accounts
  • ⭐⭐⭐ Quality: Medium
  • 📊 Use: Comprehensive coverage

Total Potential

| Metric | Count | Source |
|---|---|---|
| YouTube Channels | 3,000-5,000 | Combined |
| Top-Tier Channels | 50-100 | ELGL ⭐ |
| County Websites | 3,143 | NACo ⭐ |
| Digital Leaders | ~200 | ELGL + NACo ⭐ |
| Meeting Videos | 1,366+ | MeetingBank |
| State Legislatures | 50+ | Open States |
| Granicus Portals | 1,000+ | Various |
| Facebook Pages | 10,000+ | Scraping |

🚀 Next Steps

This Week

  1. Test ELGL Scraper ✅ READY

    python discovery/curated_sources.py
    
  2. Contact NACo

    • Request County Explorer data export
    • Discuss research partnership
    • Get digital innovation list
  3. Integrate ELGL Channels

    • Parse "Top Channels" articles
    • Save to Bronze layer: bronze/elgl_top_channels
    • Flag as Tier 1 priority

Next 2 Weeks

  1. NACo Integration

    • Implement County Explorer data import
    • Scrape digital innovation showcase
    • Cross-reference with GSA .gov domains
  2. USA.gov Directory

    • Scrape local directory
    • Use for homepage verification
    • Supplement NACo county URLs
  3. Quality Tiers

    • Tier 1: ELGL + NACo innovation
    • Tier 2: Dataset channels
    • Tier 3: Web discovered
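
One way to apply these tiers when a channel is found by more than one source is sketched below. The source labels, the `assign_tier` name, and the rule that the best (lowest-numbered) tier wins are all assumptions for illustration; the guide does not define how overlaps are resolved.

```python
from typing import Iterable

# Hypothetical source labels mapped to the tiers described above.
SOURCE_TIERS = {
    "elgl": 1,
    "naco_innovation": 1,
    "meetingbank": 2,
    "openstates": 2,
    "council_data_project": 2,
    "web_scraping": 3,
}


def assign_tier(sources: Iterable[str]) -> int:
    """Return the best (lowest-numbered) tier among the sources that found a channel."""
    tiers = [SOURCE_TIERS[s] for s in sources if s in SOURCE_TIERS]
    if not tiers:
        raise ValueError("no recognized source for this channel")
    return min(tiers)
```

Under this rule, a channel found by both footer scraping and an ELGL list is treated as Tier 1.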

Next Month

  1. Scale to 1,000+ Cities
  2. Automated Validation
  3. Content Analysis (focus on Tier 1 first!)

📞 Contact Information

Data Partnerships

ELGL (Engaging Local Government Leaders)

NACo (National Association of Counties)


Conclusion

Your suggestions to use ELGL and NACo are EXCELLENT! These curated sources provide:

✅ Quality over Quantity - Get the 50-100 BEST channels first
✅ Authoritative Data - NACo maintains all 3,143 counties
✅ Expert Curation - ELGL highlights innovation leaders
✅ Fast Implementation - Scrape lists instead of 10,000 websites
✅ Partnership Opportunities - Collaborate with ELGL/NACo

These should be PRIORITY 1 for implementation - they provide the highest quality data with the least effort!