open-navigator / website /docs /data-sources /video-channels.md

jcbowyer

Clean HuggingFace deployment without binary files

61d29fc 28 days ago

metadata

displayed_sidebar: developersSidebar

Video Channel Discovery: Current State & Enhancement Plan

Executive Summary

Question: Does this repo look at local government websites and attempt to discover their YouTube, Facebook, or other video channels?

Answer:

❌ Currently NO - The repo does NOT scrape government websites for social media links
✅ Partially YES - It extracts video URLs from pre-existing datasets (MeetingBank, Open States)
✅ NEW - We've created social_media_discovery.py to implement your suggestion

Current State: What Exists

✅ Video Discovery from Datasets (Offline Sources)

1. MeetingBank (HuggingFace)

File: discovery/meetingbank_ingestion.py
Status: ✅ Working - Video URLs ARE being extracted
Coverage: 1,366 meetings from 6 cities
Video Sources:
- YouTube URLs (extracted from urls['youtube_id'])
- Vimeo URLs (extracted from urls['vimeo_id'])
- Archive.org collections (alameda, boston, denver, king-county, long-beach, seattle)
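
As a minimal sketch of the extraction described above, assuming each MeetingBank record carries a `urls` dictionary with optional `youtube_id` and `vimeo_id` keys. The helper name `video_urls_from_record` and the exact record shape are illustrative and not taken from `meetingbank_ingestion.py`:

```python
from typing import Dict, List, Optional


def video_urls_from_record(urls: Dict[str, Optional[str]]) -> List[str]:
    """Build watch-page URLs from the ID fields of a record's `urls` dict."""
    found = []
    # Missing keys and None values are both treated as "no video".
    youtube_id = str(urls.get("youtube_id") or "").strip()
    if youtube_id:
        found.append(f"https://www.youtube.com/watch?v={youtube_id}")
    vimeo_id = str(urls.get("vimeo_id") or "").strip()
    if vimeo_id:
        found.append(f"https://vimeo.com/{vimeo_id}")
    return found
```

Records without either ID yield an empty list, so callers can filter on truthiness.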

2. Open States (API)

File: discovery/openstates_sources.py
Status: ✅ Working
Coverage: 50+ state legislatures
Extracts: YouTube channels, Vimeo accounts, Granicus portals from jurisdiction metadata

3. City Scrapers (GitHub)

File: discovery/city_scrapers_urls.py
Status: ⚠️ Partial - extracts start_urls but not video links yet
Coverage: 100-500 agencies from Chicago, Pittsburgh, Detroit, Cleveland, LA
Note: Granicus video pages embed YouTube players, but extracting those links is not fully implemented yet

❌ What's Missing: Website Scraping for Social Media

Current Gap: The repo discovers government homepage URLs but does NOT:

❌ Scrape those websites for social media links
❌ Extract YouTube/Facebook channels from footers
❌ Check "Contact Us" or "About" pages
❌ Use USA.gov or federal aggregators

Existing Homepage Discovery:

File: discovery/url_discovery_agent.py
What it does:
- ✅ Finds government homepage URLs using GSA .gov registry
- ✅ Tests URL patterns (cityname.gov, etc.)
- ✅ Crawls to find minutes/agenda pages
- ❌ But does NOT look for social media links

Your Suggestion: Federal Aggregators & Website Scraping

Excellent Ideas!

1. USA.gov Local Directory

✅ Most accurate way to verify channels are legitimate
✅ Provides official website for every city/county
✅ Most governments link social media in footer/contact sections

Implementation: See section below on how to integrate.

2. USA.gov/archive (Federal Videos)

⚠️ Good for federal agencies
⚠️ Limited local government coverage
✅ Could supplement state-level sources

Use Case: State agencies and federal programs that touch local policy.

NEW Implementation: Social Media Discovery

We've created a new module to implement your suggestion!

File: `discovery/social_media_discovery.py`

What it does:

✅ Takes government homepage URLs (from existing discovery)
✅ Scrapes footer sections for social media links
✅ Checks common contact/about pages
✅ Extracts YouTube, Facebook, Vimeo, Archive.org, Granicus
✅ Validates and cleans URLs
✅ Batch processing for hundreds of jurisdictions

Example Usage:

import asyncio

from discovery.social_media_discovery import SocialMediaDiscovery

jurisdictions = [
    {
        'jurisdiction_id': 'seattle-wa',
        'homepage_url': 'https://www.seattle.gov',
        'jurisdiction_name': 'Seattle',
        'state': 'WA'
    }
]

async def main():
    # "async with" and "await" are only valid inside a coroutine
    async with SocialMediaDiscovery() as discovery:
        return await discovery.discover_batch(jurisdictions)

results = asyncio.run(main())

# Results:
# [
#   {
#     'jurisdiction_name': 'Seattle',
#     'social_media': {
#       'youtube': ['https://www.youtube.com/@cityofseattle'],
#       'facebook': ['https://www.facebook.com/CityofSeattle'],
#       'twitter': ['https://twitter.com/CityofSeattle']
#     },
#     'platform_count': 3,
#     'total_urls': 3
#   }
# ]

Detection Strategy:

Focus on footer sections (most reliable)
Common CSS selectors: footer, [class*="footer"], [class*="social"]
Pattern matching for platform URLs
Validates against known domain patterns
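
The strategy above can be sketched roughly as follows. This is not the code in `social_media_discovery.py`: it swaps BeautifulSoup for the standard-library `html.parser` to stay dependency-free, handles only the plain `<footer>` tag (not the `[class*="footer"]` or `[class*="social"]` selectors), and the `PLATFORM_DOMAINS` map and function names are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical platform map; the real module may recognise more domains.
PLATFORM_DOMAINS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "facebook.com": "facebook",
    "vimeo.com": "vimeo",
    "archive.org": "archive_org",
    "granicus.com": "granicus",
}


def classify(url):
    """Return the platform name for a URL, or None if the host is unknown."""
    host = (urlparse(url).hostname or "").lower()
    for domain, platform in PLATFORM_DOMAINS.items():
        # Match the domain itself or any subdomain, but not look-alikes.
        if host == domain or host.endswith("." + domain):
            return platform
    return None


class FooterLinkParser(HTMLParser):
    """Collect hrefs of <a> tags nested inside a <footer> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than zero while inside <footer>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "footer":
            self.depth += 1
        elif tag == "a" and self.depth:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "footer" and self.depth:
            self.depth -= 1


def footer_social_links(html):
    """Group footer links by platform, dropping unrecognised hosts."""
    parser = FooterLinkParser()
    parser.feed(html)
    grouped = {}
    for href in parser.links:
        platform = classify(href)
        if platform and href not in grouped.setdefault(platform, []):
            grouped[platform].append(href)
    return grouped
```

Matching on the parsed hostname rather than a substring keeps a domain such as `notyoutube.com` from being classified as YouTube.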

Integration with Existing Pipeline

How Social Media Discovery Fits In

1. URL Discovery Agent (EXISTING)
   └─> Finds government homepage URLs
        └─> Uses GSA .gov registry
        └─> Pattern matching (cityname.gov)
        └─> Validates URLs

2. Social Media Discovery (NEW) ⬅️ Add this step
   └─> Scrapes homepages for social links
        └─> Checks footer sections
        └─> Checks contact/about pages
        └─> Extracts YouTube, Facebook, etc.

3. Meeting Scraper (EXISTING)
   └─> Uses discovered URLs to scrape meetings

Integration Code Example

import asyncio

from discovery.url_discovery_agent import URLDiscoveryAgent
from discovery.social_media_discovery import SocialMediaDiscovery
from discovery.gsa_domains import load_gsa_domains

async def discover_channels(jurisdictions):
    # Step 1: Discover government websites
    gsa_domains = load_gsa_domains()
    url_agent = URLDiscoveryAgent(gsa_domains)
    discovered_urls = await url_agent.discover_batch(jurisdictions)

    # Step 2: NEW - Discover social media from those websites
    async with SocialMediaDiscovery() as social_discovery:
        social_results = await social_discovery.discover_batch(discovered_urls)

    # Step 3: Save to Bronze layer
    # (save_to_bronze_layer is assumed to be provided by the pipeline; it is not defined here)
    save_to_bronze_layer(social_results, "social_media_channels")
    return social_results

jurisdictions = [...]  # From Census data
asyncio.run(discover_channels(jurisdictions))

USA.gov Integration Guide

Approach 1: Use USA.gov Local Directory as Homepage Source

What USA.gov Provides:

Official .gov website for every city/county
Most authoritative source for homepage URLs
Can replace/supplement GSA domain matching

Implementation:

# discovery/usa_gov_directory.py (NEW FILE TO CREATE)

import httpx
from bs4 import BeautifulSoup

async def get_usa_gov_local_directory():
    """
    Scrape USA.gov local directory for official city/county websites.
    
    USA.gov maintains a directory at:
    https://www.usa.gov/local-governments
    
    Each state page lists cities/counties with official websites.
    """
    base_url = "https://www.usa.gov/local-governments"
    
    # 1. Get list of states
    # 2. For each state, get list of cities/counties
    # 3. Extract official website URLs
    # 4. Return structured data
    
    pass  # Implementation details


# Then use in url_discovery_agent.py:
def _match_usa_gov_directory(jurisdiction_name, state):
    """
    Match jurisdiction to USA.gov directory entry.
    
    Higher confidence than pattern matching because it's
    verified by federal government.
    """
    usa_gov_url = lookup_usa_gov_directory(jurisdiction_name, state)
    if usa_gov_url:
        return (usa_gov_url, 0.98)  # Very high confidence
    return None
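
The stub above calls `lookup_usa_gov_directory`, which is not defined anywhere in this document. One possible shape for it, assuming the scraped directory is held in memory as a dictionary keyed by normalized name and state (the `DIRECTORY` contents and the `normalize_name` helper are hypothetical):

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical in-memory directory keyed by (normalized name, state).
# In practice it would be filled from get_usa_gov_local_directory().
DIRECTORY: Dict[Tuple[str, str], str] = {
    ("seattle", "WA"): "https://www.seattle.gov",
}

# Only leading qualifiers are stripped; trailing words such as the
# "City" in "Kansas City" are part of the name and must be kept.
_PREFIXES = re.compile(r"^(city|town|village) of ")


def normalize_name(name: str) -> str:
    """Lowercase, collapse whitespace, and drop a leading 'City of' qualifier."""
    cleaned = re.sub(r"\s+", " ", name.strip().lower())
    return _PREFIXES.sub("", cleaned)


def lookup_usa_gov_directory(jurisdiction_name: str, state: str) -> Optional[str]:
    """Return the directory URL for a jurisdiction, or None when unlisted."""
    return DIRECTORY.get((normalize_name(jurisdiction_name), state.upper()))
```

A `None` result is the signal to fall back to GSA domain or pattern matching.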

Approach 2: USA.gov/archive for Federal Video Content

Use Case: State health departments, federal programs

# discovery/federal_video_sources.py (NEW FILE TO CREATE)

FEDERAL_VIDEO_CHANNELS = {
    "cdc": {
        "youtube": "https://www.youtube.com/@CDCgov",
        "topics": ["public_health", "oral_health"]
    },
    "hrsa": {
        "youtube": "https://www.youtube.com/@HRSAgov",
        "topics": ["health_centers", "dental_programs"]
    },
    "state_health_depts": {
        # Each state's health department
        "CA": "https://www.youtube.com/@CAPublicHealth",
        "TX": "https://www.youtube.com/@TXHealthHumanServices",
        # ... all 50 states
    }
}

def get_federal_video_sources():
    """
    Get federal agency video channels relevant to oral health policy.
    
    Sources:
    - usa.gov/archive featured channels
    - State health departments
    - CDC, HRSA, CMS channels
    """
    pass

Approach 3: ELGL Top YouTube Channels (NEW - HIGHLY RECOMMENDED!)

What: ELGL (Engaging Local Government Leaders) publishes curated "Top Local Government YouTube Channels" lists

Why This is Excellent:

✅ Curated by experts (not automated scraping)
✅ Highlights MOST ACTIVE channels
✅ Quality > Quantity approach
✅ Updated annually
✅ Covers innovative local governments nationwide

Sources:

ELGL Blog: https://elgl.org/
Annual "Top Local Gov YouTube Channels" articles
Digital innovation showcases

Expected Coverage: 50-100 top-tier channels (most active, highest quality)

Implementation: See discovery/curated_sources.py

Approach 4: NACo County Database (NEW - COMPREHENSIVE!)

What: The National Association of Counties (NACo) maintains a database of all 3,143 U.S. counties

Why This is Excellent:

✅ Complete county coverage (all 3,143 counties)
✅ Official website URLs verified by NACo
✅ Digital innovation showcase
✅ Authoritative source for county data
✅ Partnership opportunities

Sources:

NACo County Explorer: https://ce.naco.org/
Digital Counties Survey
NACo Communications Awards

Expected Coverage: 3,143 counties with official websites

Implementation: See discovery/curated_sources.py

Complete Implementation Plan

Phase 1: Enhance Existing Dataset Extraction (✅ MOSTLY DONE)

MeetingBank video URLs (already working)
Open States channels (already working)
City Scrapers Granicus video extraction (TODO)

Phase 2: Website Social Media Discovery (✅ NEW MODULE CREATED)

Implementation:

Create social_media_discovery.py module
Test on sample cities (Seattle, Chicago, Austin)
Integrate with URL discovery pipeline
Write to Bronze layer: bronze/social_media_channels

Tasks:

# Test the new module
cd discovery
python social_media_discovery.py

# Expected output: YouTube, Facebook, Vimeo URLs for test cities

Phase 3: USA.gov Integration (RECOMMENDED)

Priority: HIGH - Most authoritative source

Tasks:

Create discovery/usa_gov_directory.py
Scrape USA.gov local directory for official URLs
Use as primary source (confidence 0.98)
Fallback to pattern matching for missing entries

Estimated URLs:

~3,000 cities/counties with verified .gov URLs
~10,000+ municipalities (including .org, .us domains)

Phase 4: ELGL Curated Channels (NEW - HIGH PRIORITY!)

Priority: HIGH - Quality over quantity

Tasks:

Create discovery/curated_sources.py ✅
Scrape ELGL "Top YouTube Channels" articles
Parse channel URLs and metadata
Flag as "top-ranked" in database

Expected Results:

50-100 most active local government channels
High-quality, verified content
Innovative digital communication examples

Why This Matters: These are the channels with the MOST meeting videos and BEST production quality!

Phase 5: NACo County Database (NEW - HIGH PRIORITY!)

Priority: HIGH - Comprehensive county coverage

Tasks:

Create discovery/curated_sources.py ✅
Contact NACo for data partnership/export
Integrate NACo County Explorer data
Scrape digital innovation showcase
Cross-reference with GSA .gov domains

Expected Results:

All 3,143 U.S. counties with official websites
Digital innovation leaders identified
County media hub URLs

Partnership Opportunity: NACo may provide bulk data export or API access for research/public benefit projects.

Phase 6: Federal Video Aggregators (OPTIONAL)

Priority: MEDIUM - Supplementary source

Tasks:

Create discovery/federal_video_sources.py
Compile federal agency channels (CDC, HRSA, etc.)
Compile state health department channels (all 50 states)
Add to video sources table

Use Case: State-level policy analysis, federal program tracking

Testing & Validation

Test the New Social Media Discovery

# 1. Install dependencies
pip install httpx beautifulsoup4

# 2. Run standalone test
cd open-navigator  # repository root; adjust to your checkout path
python discovery/social_media_discovery.py

# Expected output:
# ✓ Found 3 social media links for Seattle
#   youtube: 1 URLs
#   facebook: 1 URLs
#   twitter: 1 URLs
# ✓ Found 2 social media links for Chicago
#   youtube: 1 URLs
#   facebook: 1 URLs

Integration Test

# Full pipeline test
import asyncio

from discovery.discovery_pipeline import DiscoveryPipeline

async def main():
    pipeline = DiscoveryPipeline()

    # 1. Discover jurisdictions (Census data)
    # 2. Discover homepage URLs (GSA + patterns)
    # 3. NEW: Discover social media (footer scraping)
    # 4. Write all to Bronze layer
    return await pipeline.run_full_pipeline(
        limit=100,
        include_social_media=True  # NEW FLAG
    )

results = asyncio.run(main())

Performance & Scalability

Current Approach (Dataset-Only)

✅ Fast (no web requests)
✅ Reliable (static datasets)
❌ Limited coverage (only cities in datasets)
❌ Stale data (datasets not updated frequently)

New Approach (Website Scraping)

⚠️ Slower (requires web requests)
⚠️ Less reliable (websites change)
✅ Comprehensive coverage (all cities with websites)
✅ Fresh data (real-time discovery)

Hybrid Strategy (RECOMMENDED)

Start with datasets (MeetingBank, Open States)
- Get 1,366 meetings with videos immediately
- High confidence, validated data
Supplement with website scraping
- Fill gaps for cities not in datasets
- Discover newly created channels
- Verify dataset URLs are still valid
Use USA.gov for verification
- Highest confidence for homepage URLs

Curated Lists (NEW - YOUR SUGGESTIONS!):

ELGL Top Channels: 50-100 most active channels 🔥
NACo Counties: 3,143 counties with official websites 🔥
NACo Digital Innovation: ~100 innovative counties

Website Scraping Discovery (NEW):

Major cities (100k+ population): ~300 cities with YouTube
Medium cities (50k-100k): ~500 cities with social media
All municipalities: ~3,000-5,000 with public video channels

Total Potential:

3,000-5,000 YouTube channels for meeting videos
50-100 TOP-TIER channels (ELGL curated) 🌟
3,143 county websites (NACo database) 🌟
1,000+ Granicus portals with embedded videos
500+ Vimeo accounts
10,000+ Facebook pages (may have video links)

Quality Tiers:

Tier 1 (Highest): ELGL Top Channels - most active, best quality
Tier 2 (High): NACo Digital Innovation - county leaders
Tier 3 (Good): MeetingBank/Dataset channels - verified content
Tier 4 (Discovery): Website scraping - newly discovered channels
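
When the same channel turns up in more than one source, the tiers above suggest keeping the record from the best tier. A small sketch of that rule, where the `SOURCE_TIER` keys and the record fields `url` and `source` are assumptions made for illustration:

```python
# Lower number = higher-quality tier, following the list above.
SOURCE_TIER = {
    "elgl": 1,
    "naco_innovation": 2,
    "dataset": 3,
    "website_scrape": 4,
}


def dedupe_by_tier(records):
    """Keep one record per channel URL, preferring the best (lowest) tier."""
    best = {}
    for record in records:
        # Lowercasing suits channel/handle URLs; video IDs are case-sensitive
        # and would need a different key.
        key = record["url"].rstrip("/").lower()
        tier = SOURCE_TIER[record["source"]]
        if key not in best or tier < best[key]["tier"]:
            best[key] = {**record, "tier": tier}
    return sorted(best.values(), key=lambda r: (r["tier"], r["url"]))
```

The surviving record carries its `tier`, so downstream code can rank or filter without recomputing it.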

# Priority 2: USA.gov verified (high confidence)
usa_gov_cities = [...]  # ~3,000 verified .gov sites

# Priority 3: Website scraping (for gaps)
remaining_cities = [...]  # ~87,000 jurisdictions

# Parallel processing
async def process_batch(cities, batch_size=50):
    for i in range(0, len(cities), batch_size):
        batch = cities[i:i+batch_size]
        results = await social_discovery.discover_batch(batch)
        save_to_bronze(results)
        await asyncio.sleep(5)  # Rate limiting
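
The batch loop above depends on `social_discovery` and `save_to_bronze`, which are defined elsewhere. A self-contained sketch of the same pattern, with a stand-in worker and an added concurrency cap, might look like this. The function names, the `max_concurrent` parameter, and `fake_discover` are illustrative assumptions, not part of the project:

```python
import asyncio


async def process_in_batches(items, worker, batch_size=50, pause=5.0, max_concurrent=10):
    """Run `worker` over `items` batch by batch, pausing between batches."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        # Cap how many workers run at once within a batch.
        async with semaphore:
            return await worker(item)

    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # return_exceptions=True keeps one failing site from aborting the batch.
        results.extend(
            await asyncio.gather(*(guarded(i) for i in batch), return_exceptions=True)
        )
        if start + batch_size < len(items):
            await asyncio.sleep(pause)  # rate limiting between batches
    return results


async def fake_discover(city):
    """Stand-in for a real scraper call; returns an empty result per city."""
    await asyncio.sleep(0)
    return {"city": city, "social_media": {}}
```

For example, `asyncio.run(process_in_batches(["a", "b", "c"], fake_discover, batch_size=2, pause=0))` processes two batches and returns results in input order; failed items appear as exception objects in the result list rather than raising.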


Expected Outcomes

Coverage Estimates

Dataset-Based Discovery:

- MeetingBank: 6 cities ✅
- Open States: 50+ state legislatures ✅
- City Scrapers: 100-500 agencies ⚠️ (need to extract video links)

Website Scraping Discovery (NEW):

- Major cities (100k+ population): ~300 cities with YouTube
- Medium cities (50k-100k): ~500 cities with social media
- All municipalities: ~3,000-5,000 with public video channels

Total Potential:

- 3,000-5,000 YouTube channels for meeting videos
- 1,000+ Granicus portals with embedded videos
- 500+ Vimeo accounts
- 10,000+ Facebook pages (may have video links)

Next Steps

Immediate Actions (This Week)

Test Social Media Discovery ✅ READY TO RUN
- Run python discovery/social_media_discovery.py

Integrate with Pipeline
- Add to discovery_pipeline.py
- Write results to Bronze layer
- Create bronze/social_media_channels table
Document Integration
- Update README with social media discovery
- Add examples to documentation

Short-term (Next 2 Weeks)

USA.gov Integration
- Create usa_gov_directory.py
- Scrape local directory
- Use as primary URL source
Enhanced MeetingBank Extraction
- Extract all video URLs from urls dictionary
- Test on all 1,366 meetings
- Validate YouTube links are still active
City Scrapers Video Links
- Update city_scrapers_urls.py
- Extract Granicus video URLs
- Crawl Granicus pages for embedded YouTube

Long-term (Next Month)

Federal Aggregators
- USA.gov/archive integration
- State health department channels
- CDC/HRSA video collections
Automated Validation
- Check if discovered channels still exist
- Verify channels have meeting content
- Score channels by video count and relevance
Scale to 1,000+ Cities
- Batch processing framework
- Parallel scraping with rate limiting
- Delta Lake storage for discovered channels

Conclusion

Summary

Current State:

✅ Video URLs extracted from datasets (1,366 meetings)
❌ No website scraping for social media links
❌ No USA.gov integration

Your Suggestion:

✅ Excellent idea! Website scraping is the missing piece
✅ USA.gov provides most authoritative homepage URLs
✅ Footer/contact page scraping will find channels

Implementation:

✅ Created social_media_discovery.py module
✅ Ready to test and integrate
✅ USA.gov integration guide provided
✅ Full roadmap for 1,000+ city coverage

Impact: Going from 6 cities with video URLs to 3,000-5,000 cities with YouTube channels will dramatically increase the reach of the Oral Health Policy Pulse system!
