---
displayed_sidebar: developersSidebar
---

# Video Channel Discovery: Current State & Enhancement Plan

## Executive Summary

**Question:** Does this repo look at local government websites and attempt to discover their YouTube, Facebook, or other video channels?

**Answer:** 
- ❌ **Currently NO** - The repo does NOT scrape government websites for social media links
- ✅ **Partially YES** - It extracts video URLs from pre-existing datasets (MeetingBank, Open States)
- ✅ **NEW** - We've created `social_media_discovery.py` to implement your suggestion

---

## Current State: What Exists

### ✅ Video Discovery from Datasets (Offline Sources)

**1. MeetingBank (HuggingFace)**
- **File:** [`discovery/meetingbank_ingestion.py`](../discovery/meetingbank_ingestion.py)
- **Status:** ✅ **Working** - Video URLs ARE being extracted
- **Coverage:** 1,366 meetings from 6 cities
- **Video Sources:**
  - YouTube URLs (extracted from `urls['youtube_id']`)
  - Vimeo URLs (extracted from `urls['vimeo_id']`)
  - Archive.org collections (alameda, boston, denver, king-county, long-beach, seattle)

**2. Open States (API)**
- **File:** [`discovery/openstates_sources.py`](../discovery/openstates_sources.py)
- **Status:** ✅ Working
- **Coverage:** 50+ state legislatures
- **Extracts:** YouTube channels, Vimeo accounts, Granicus portals from jurisdiction metadata

**3. City Scrapers (GitHub)**
- **File:** [`discovery/city_scrapers_urls.py`](../discovery/city_scrapers_urls.py)
- **Status:** ⚠️ Partial - extracts start_urls but not video links yet
- **Coverage:** 100-500 agencies from Chicago, Pittsburgh, Detroit, Cleveland, LA
- **Note:** Some agencies publish Granicus video pages with embedded YouTube players, but extraction of those links is not fully implemented
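
The MeetingBank extraction described above amounts to turning stored IDs into watch URLs. A minimal sketch, assuming each record carries a `urls` dict with optional `youtube_id` and `vimeo_id` keys (the key names come from the list above; the function name `video_urls_from_record` is hypothetical and not taken from `meetingbank_ingestion.py`):

```python
from typing import Dict, List, Optional

# URL templates keyed by the ID field they are built from.
URL_TEMPLATES = {
    "youtube_id": "https://www.youtube.com/watch?v={}",
    "vimeo_id": "https://vimeo.com/{}",
}


def video_urls_from_record(urls: Dict[str, Optional[str]]) -> List[str]:
    """Build watch URLs from the ID fields of one MeetingBank-style record."""
    found = []
    for key, template in URL_TEMPLATES.items():
        value = urls.get(key)
        if value:  # skip missing or empty IDs
            found.append(template.format(str(value).strip()))
    return found


print(video_urls_from_record({"youtube_id": "AbCdEfGhIjK", "vimeo_id": None}))
```

Records with neither ID simply yield an empty list, which keeps the caller free of special cases.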

### ❌ What's Missing: Website Scraping for Social Media

**Current Gap:**
The repo discovers government homepage URLs but does NOT:
1. ❌ Scrape those websites for social media links
2. ❌ Extract YouTube/Facebook channels from footers
3. ❌ Check "Contact Us" or "About" pages
4. ❌ Use USA.gov or federal aggregators

**Existing Homepage Discovery:**
- **File:** [`discovery/url_discovery_agent.py`](../discovery/url_discovery_agent.py)
- **What it does:**
  - ✅ Finds government homepage URLs using GSA .gov registry
  - ✅ Tests URL patterns (cityname.gov, etc.)
  - ✅ Crawls to find minutes/agenda pages
  - ❌ **But does NOT look for social media links**

---

## Your Suggestion: Federal Aggregators & Website Scraping

### Excellent Ideas!

#### 1. USA.gov Local Directory
```
✅ Most accurate way to verify channels are legitimate
✅ Provides official website for every city/county
✅ Most governments link social media in footer/contact sections
```

**Implementation:** See the "USA.gov Integration Guide" section below.

#### 2. USA.gov/archive (Federal Videos)
```
⚠️ Good for federal agencies
⚠️ Limited local government coverage
✅ Could supplement state-level sources
```

**Use Case:** State agencies and federal programs that touch local policy.

---

## NEW Implementation: Social Media Discovery

We've created a new module to implement your suggestion!

### File: `discovery/social_media_discovery.py`

**What it does:**
1. ✅ Takes government homepage URLs (from existing discovery)
2. ✅ Scrapes footer sections for social media links
3. ✅ Checks common contact/about pages
4. ✅ Extracts YouTube, Facebook, Vimeo, Archive.org, Granicus
5. ✅ Validates and cleans URLs
6. ✅ Batch processing for hundreds of jurisdictions
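
The "validates and cleans URLs" step could look roughly like the sketch below. It is an illustration rather than the module's actual code: the function name `clean_social_urls` and the list of share-widget prefixes are assumptions, and it targets profile or channel links, where query strings are usually tracking parameters that can be dropped.

```python
from typing import Iterable, List
from urllib.parse import urlsplit, urlunsplit

# Path prefixes that indicate share widgets rather than an official profile (assumed list).
SHARE_PREFIXES = ("/sharer/", "/share/", "/intent/")


def clean_social_urls(urls: Iterable[str]) -> List[str]:
    """Normalize scheme and host, drop query strings and share links, de-duplicate."""
    seen = set()
    cleaned = []
    for raw in urls:
        raw = raw.strip()
        if raw.startswith("//"):  # protocol-relative link
            raw = "https:" + raw
        parts = urlsplit(raw)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # mailto:, tel:, relative paths, etc.
        path = parts.path.rstrip("/")
        if not path or (path.lower() + "/").startswith(SHARE_PREFIXES):
            continue  # bare domain or share button
        url = urlunsplit(("https", parts.netloc.lower(), path, "", ""))
        key = url.lower()  # case-insensitive de-duplication key
        if key not in seen:
            seen.add(key)
            cleaned.append(url)
    return cleaned
```

The first spelling of a URL wins, so `facebook.com/CityofSeattle` and `facebook.com/cityofseattle` collapse into one entry.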

**Example Usage:**
```python
import asyncio

from discovery.social_media_discovery import SocialMediaDiscovery

jurisdictions = [
    {
        'jurisdiction_id': 'seattle-wa',
        'homepage_url': 'https://www.seattle.gov',
        'jurisdiction_name': 'Seattle',
        'state': 'WA'
    }
]


async def main():
    # `async with` is only valid inside a coroutine
    async with SocialMediaDiscovery() as discovery:
        return await discovery.discover_batch(jurisdictions)


results = asyncio.run(main())

# Results:
# [
#   {
#     'jurisdiction_name': 'Seattle',
#     'social_media': {
#       'youtube': ['https://www.youtube.com/@cityofseattle'],
#       'facebook': ['https://www.facebook.com/CityofSeattle'],
#       'twitter': ['https://twitter.com/CityofSeattle']
#     },
#     'platform_count': 3,
#     'total_urls': 3
#   }
# ]
```

**Detection Strategy:**
- Focus on footer sections (most reliable)
- Common CSS selectors: `footer`, `[class*="footer"]`, `[class*="social"]`
- Pattern matching for platform URLs
- Validates against known domain patterns
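
The detection strategy above can be sketched with the standard library alone. This is not the code in `social_media_discovery.py` (which the dependency list suggests uses BeautifulSoup); the class name `FooterLinkParser`, the function `find_social_links`, and the exact regular expressions are assumptions chosen for illustration.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Known platform URL shapes (illustrative, not exhaustive).
PLATFORM_PATTERNS = {
    "youtube": re.compile(r"https?://(www\.)?youtube\.com/(@|channel/|c/|user/)[\w.-]+", re.I),
    "facebook": re.compile(r"https?://(www\.)?facebook\.com/[\w.-]+", re.I),
    "vimeo": re.compile(r"https?://(www\.)?vimeo\.com/[\w.-]+", re.I),
}
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # never closed, so never a region


class FooterLinkParser(HTMLParser):
    """Collect hrefs found inside <footer> or elements whose class mentions footer/social."""

    def __init__(self) -> None:
        super().__init__()
        self._region_tag: Optional[str] = None  # tag name that opened the current region
        self._depth = 0                         # nesting depth of that tag name
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        css = (attr.get("class") or "").lower()
        if self._region_tag is None:
            is_region = tag == "footer" or "footer" in css or "social" in css
            if is_region and tag not in VOID_TAGS:
                self._region_tag, self._depth = tag, 1
        elif tag == self._region_tag:
            self._depth += 1
        if self._region_tag and tag == "a" and attr.get("href"):
            self.links.append(attr["href"])

    def handle_endtag(self, tag):
        if tag == self._region_tag:
            self._depth -= 1
            if self._depth == 0:
                self._region_tag = None


def find_social_links(html: str) -> Dict[str, List[str]]:
    """Return platform -> matched URLs for links inside footer/social regions."""
    parser = FooterLinkParser()
    parser.feed(html)
    found: Dict[str, List[str]] = {}
    for href in parser.links:
        for platform, pattern in PLATFORM_PATTERNS.items():
            match = pattern.match(href)
            if match and match.group(0) not in found.get(platform, []):
                found.setdefault(platform, []).append(match.group(0))
    return found
```

Restricting matches to footer and social regions trades some recall for precision: a YouTube link in the page body (for example, an embedded news video) is ignored, while the channel link in the footer is kept.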

---

## Integration with Existing Pipeline

### How Social Media Discovery Fits In

```
1. URL Discovery Agent (EXISTING)
   └─> Finds government homepage URLs
        └─> Uses GSA .gov registry
        └─> Pattern matching (cityname.gov)
        └─> Validates URLs

2. Social Media Discovery (NEW) ⬅️ Add this step
   └─> Scrapes homepages for social links
        └─> Checks footer sections
        └─> Checks contact/about pages
        └─> Extracts YouTube, Facebook, etc.

3. Meeting Scraper (EXISTING)
   └─> Uses discovered URLs to scrape meetings
```

### Integration Code Example

```python
import asyncio

from discovery.url_discovery_agent import URLDiscoveryAgent
from discovery.social_media_discovery import SocialMediaDiscovery
from discovery.gsa_domains import load_gsa_domains
# save_to_bronze_layer is the project's Bronze-layer writer (import path not shown here)


async def discover_channels(jurisdictions):
    # Step 1: Discover government websites
    gsa_domains = load_gsa_domains()
    url_agent = URLDiscoveryAgent(gsa_domains)
    discovered_urls = await url_agent.discover_batch(jurisdictions)

    # Step 2: NEW - Discover social media from those websites
    async with SocialMediaDiscovery() as social_discovery:
        social_results = await social_discovery.discover_batch(discovered_urls)

    # Step 3: Save to Bronze layer
    save_to_bronze_layer(social_results, "social_media_channels")
    return social_results


jurisdictions = [...]  # From Census data
asyncio.run(discover_channels(jurisdictions))
```
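
Before writing to the Bronze layer, the nested results are easier to store as one row per URL. A minimal sketch, assuming the result shape shown in the example usage above plus a `jurisdiction_id` field carried over from the input; the function name `flatten_social_results` and the `discovered_at` column are hypothetical:

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def flatten_social_results(results: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Turn nested discovery results into one row per (jurisdiction, platform, URL)."""
    discovered_at = datetime.now(timezone.utc).isoformat()
    rows = []
    for result in results:
        for platform, urls in result.get("social_media", {}).items():
            for url in urls:
                rows.append({
                    "jurisdiction_id": result.get("jurisdiction_id", ""),
                    "jurisdiction_name": result.get("jurisdiction_name", ""),
                    "platform": platform,
                    "url": url,
                    "discovered_at": discovered_at,
                })
    return rows
```

A flat layout like this makes it straightforward to count channels per platform or to de-duplicate URLs across runs.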

---

## USA.gov Integration Guide

### Approach 1: Use USA.gov Local Directory as Homepage Source

**What USA.gov Provides:**
- Official .gov website for every city/county
- Most authoritative source for homepage URLs
- Can replace/supplement GSA domain matching

**Implementation:**
```python
# discovery/usa_gov_directory.py (NEW FILE TO CREATE)

import httpx
from bs4 import BeautifulSoup

async def get_usa_gov_local_directory():
    """
    Scrape USA.gov local directory for official city/county websites.
    
    USA.gov maintains a directory at:
    https://www.usa.gov/local-governments
    
    Each state page lists cities/counties with official websites.
    """
    base_url = "https://www.usa.gov/local-governments"
    
    # 1. Get list of states
    # 2. For each state, get list of cities/counties
    # 3. Extract official website URLs
    # 4. Return structured data
    
    pass  # Implementation details


# Then use in url_discovery_agent.py:
def _match_usa_gov_directory(jurisdiction_name, state):
    """
    Match jurisdiction to USA.gov directory entry.
    
    Higher confidence than pattern matching because it's
    verified by federal government.
    """
    usa_gov_url = lookup_usa_gov_directory(jurisdiction_name, state)
    if usa_gov_url:
        return (usa_gov_url, 0.98)  # Very high confidence
    return None
```

### Approach 2: USA.gov/archive for Federal Video Content

**Use Case:** State health departments, federal programs

```python
# discovery/federal_video_sources.py (NEW FILE TO CREATE)

FEDERAL_VIDEO_CHANNELS = {
    "cdc": {
        "youtube": "https://www.youtube.com/@CDCgov",
        "topics": ["public_health", "oral_health"]
    },
    "hrsa": {
        "youtube": "https://www.youtube.com/@HRSAgov",
        "topics": ["health_centers", "dental_programs"]
    },
    "state_health_depts": {
        # Each state's health department
        "CA": "https://www.youtube.com/@CAPublicHealth",
        "TX": "https://www.youtube.com/@TXHealthHumanServices",
        # ... all 50 states
    }
}

def get_federal_video_sources():
    """
    Get federal agency video channels relevant to oral health policy.
    
    Sources:
    - usa.gov/archive featured channels
    - State health departments
    - CDC, HRSA, CMS channels
    """
    pass
```

### Approach 3: ELGL Top YouTube Channels (NEW - HIGHLY RECOMMENDED!)

**What:** ELGL (Engaging Local Government Leaders) publishes curated "Top Local Government YouTube Channels" lists

**Why This is Excellent:**
```
✅ Curated by experts (not automated scraping)
✅ Highlights MOST ACTIVE channels
✅ Quality > Quantity approach
✅ Updated annually
✅ Covers innovative local governments nationwide
```

**Sources:**
- ELGL Blog: https://elgl.org/
- Annual "Top Local Gov YouTube Channels" articles
- Digital innovation showcases

**Expected Coverage:** 50-100 top-tier channels (most active, highest quality)

**Implementation:** See `discovery/curated_sources.py`

### Approach 4: NACo County Database (NEW - COMPREHENSIVE!)

**What:** National Association of Counties maintains database of all 3,143 U.S. counties

**Why This is Excellent:**
```
✅ Complete county coverage (all 3,143 counties)
✅ Official website URLs verified by NACo
✅ Digital innovation showcase
✅ Authoritative source for county data
✅ Partnership opportunities
```

**Sources:**
- NACo County Explorer: https://ce.naco.org/
- Digital Counties Survey
- NACo Communications Awards

**Expected Coverage:** 3,143 counties with official websites

**Implementation:** See `discovery/curated_sources.py`

---

## Complete Implementation Plan

### Phase 1: Enhance Existing Dataset Extraction (✅ MOSTLY DONE)

- [x] MeetingBank video URLs (already working)
- [x] Open States channels (already working)
- [ ] City Scrapers Granicus video extraction (TODO)

### Phase 2: Website Social Media Discovery (✅ NEW MODULE CREATED)

**Implementation:**
1. [x] Create `social_media_discovery.py` module
2. [ ] Test on sample cities (Seattle, Chicago, Austin)
3. [ ] Integrate with URL discovery pipeline
4. [ ] Write to Bronze layer: `bronze/social_media_channels`

**Tasks:**
```bash
# Test the new module
cd discovery
python social_media_discovery.py

# Expected output: YouTube, Facebook, Vimeo URLs for test cities
```

### Phase 3: USA.gov Integration (RECOMMENDED)

**Priority: HIGH** - Most authoritative source

**Tasks:**
1. [ ] Create `discovery/usa_gov_directory.py`
2. [ ] Scrape USA.gov local directory for official URLs
3. [ ] Use as primary source (confidence 0.98)
4. [ ] Fallback to pattern matching for missing entries

**Estimated URLs:** 
- ~3,000 cities/counties with verified .gov URLs
- ~10,000+ municipalities (including .org, .us domains)
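
The "primary source, then fallback" logic in the tasks above can be expressed as an ordered list of matchers. This is a sketch under assumptions: `resolve_homepage`, `directory_matcher`, and `pattern_matcher` are hypothetical names, the in-memory `DIRECTORY` stands in for scraped USA.gov data, and the 0.6 confidence for pattern matches is an invented placeholder (only the 0.98 figure comes from this plan).

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, float]  # (url, confidence)
Matcher = Callable[[str, str], Optional[Candidate]]

# Stand-in for directory data; a real run would load scraped entries.
DIRECTORY = {("seattle", "WA"): "https://www.seattle.gov"}


def directory_matcher(name: str, state: str) -> Optional[Candidate]:
    """Look the jurisdiction up in the verified directory."""
    url = DIRECTORY.get((name.lower(), state))
    return (url, 0.98) if url else None


def pattern_matcher(name: str, state: str) -> Optional[Candidate]:
    """Guess a cityname.gov URL; unverified, so the score is lower (assumed 0.6)."""
    slug = "".join(ch for ch in name.lower() if ch.isalnum())
    return (f"https://www.{slug}.gov", 0.6)


def resolve_homepage(name: str, state: str, matchers: List[Matcher],
                     min_confidence: float = 0.5) -> Optional[Candidate]:
    """Try matchers in priority order and return the first confident hit."""
    for matcher in matchers:
        candidate = matcher(name, state)
        if candidate and candidate[1] >= min_confidence:
            return candidate
    return None


print(resolve_homepage("Seattle", "WA", [directory_matcher, pattern_matcher]))
```

Keeping the confidence alongside the URL lets later stages decide whether a guessed URL needs an extra validation request before it is scraped.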

### Phase 4: ELGL Curated Channels (NEW - HIGH PRIORITY!)

**Priority: HIGH** - Quality over quantity

**Tasks:**
1. [x] Create `discovery/curated_sources.py` ✅
2. [ ] Scrape ELGL "Top YouTube Channels" articles
3. [ ] Parse channel URLs and metadata
4. [ ] Flag as "top-ranked" in database

**Expected Results:**
- 50-100 most active local government channels
- High-quality, verified content
- Innovative digital communication examples

**Why This Matters:**
These are the channels with the MOST meeting videos and BEST production quality!

### Phase 5: NACo County Database (NEW - HIGH PRIORITY!)

**Priority: HIGH** - Comprehensive county coverage

**Tasks:**
1. [x] Create `discovery/curated_sources.py` ✅
2. [ ] Contact NACo for data partnership/export
3. [ ] Integrate NACo County Explorer data
4. [ ] Scrape digital innovation showcase
5. [ ] Cross-reference with GSA .gov domains

**Expected Results:**
- All 3,143 U.S. counties with official websites
- Digital innovation leaders identified
- County media hub URLs

**Partnership Opportunity:**
NACo may provide bulk data export or API access for research/public benefit projects.

### Phase 6: Federal Video Aggregators (OPTIONAL)

**Priority: MEDIUM** - Supplementary source

**Tasks:**
1. [ ] Create `discovery/federal_video_sources.py`
2. [ ] Compile federal agency channels (CDC, HRSA, etc.)
3. [ ] Compile state health department channels (all 50 states)
4. [ ] Add to video sources table

**Use Case:** State-level policy analysis, federal program tracking

---

## Testing & Validation

### Test the New Social Media Discovery

```bash
# 1. Install dependencies
pip install httpx beautifulsoup4

# 2. Run standalone test
cd /home/developer/projects/open-navigator
python discovery/social_media_discovery.py

# Expected output:
# ✓ Found 3 social media links for Seattle
#   youtube: 1 URLs
#   facebook: 1 URLs
#   twitter: 1 URLs
# ✓ Found 2 social media links for Chicago
#   youtube: 1 URLs
#   facebook: 1 URLs
```

### Integration Test

```python
# Full pipeline test
import asyncio

from discovery.discovery_pipeline import DiscoveryPipeline


async def main():
    pipeline = DiscoveryPipeline()

    # 1. Discover jurisdictions (Census data)
    # 2. Discover homepage URLs (GSA + patterns)
    # 3. NEW: Discover social media (footer scraping)
    # 4. Write all to Bronze layer
    return await pipeline.run_full_pipeline(
        limit=100,
        include_social_media=True  # NEW FLAG
    )


results = asyncio.run(main())
```

---

## Performance & Scalability

### Current Approach (Dataset-Only)
- ✅ Fast (no web requests)
- ✅ Reliable (static datasets)
- ❌ Limited coverage (only cities in datasets)
- ❌ Stale data (datasets not updated frequently)

### New Approach (Website Scraping)
- ⚠️ Slower (requires web requests)
- ⚠️ Less reliable (websites change)
- ✅ Comprehensive coverage (all cities with websites)
- ✅ Fresh data (real-time discovery)

### Hybrid Strategy (RECOMMENDED)
1. **Start with datasets** (MeetingBank, Open States)
   - Get 1,366 meetings with videos immediately
   - High confidence, validated data
   
2. **Supplement with website scraping**
   - Fill gaps for cities not in datasets
   - Discover newly created channels
   - Verify dataset URLs are still valid

3. **Use USA.gov for verification**
   - Highest confidence for homepage URLs

**Curated Lists (NEW - YOUR SUGGESTIONS!):**
- ELGL Top Channels: 50-100 most active channels 🔥
- NACo Counties: 3,143 counties with official websites 🔥
- NACo Digital Innovation: ~100 innovative counties

**Website Scraping Discovery (NEW):**
- Major cities (100k+ population): ~300 cities with YouTube
- Medium cities (50k-100k): ~500 cities with social media
- All municipalities: ~3,000-5,000 with public video channels

**Total Potential:**
- **3,000-5,000 YouTube channels** for meeting videos
- **50-100 TOP-TIER channels** (ELGL curated) 🌟
- **3,143 county websites** (NACo database) 🌟
- **1,000+ Granicus portals** with embedded videos
- **500+ Vimeo accounts**
- **10,000+ Facebook pages** (may have video links)

**Quality Tiers:**
1. **Tier 1 (Highest):** ELGL Top Channels - most active, best quality
2. **Tier 2 (High):** NACo Digital Innovation - county leaders
3. **Tier 3 (Good):** MeetingBank/Dataset channels - verified content
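4. **Tier 4 (Discovery):** Website scraping - newly discovered

Sources are processed in priority order, in batches:

```python
import asyncio

# Priority 1: Datasets (already include video URLs)

# Priority 2: USA.gov verified (high confidence)
usa_gov_cities = [...]  # ~3,000 verified .gov sites

# Priority 3: Website scraping (for gaps)
remaining_cities = [...]  # ~87,000 jurisdictions

# Batch processing with a pause between batches.
# social_discovery: an open SocialMediaDiscovery instance (see the integration example);
# save_to_bronze_layer: the project's Bronze-layer writer.
async def process_batch(cities, batch_size=50):
    for i in range(0, len(cities), batch_size):
        batch = cities[i:i+batch_size]
        results = await social_discovery.discover_batch(batch)
        save_to_bronze_layer(results, "social_media_channels")
        await asyncio.sleep(5)  # Rate limiting
```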
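
When the same channel turns up in several sources, the quality tiers listed above suggest keeping the best one. A minimal sketch, assuming each discovered channel arrives as a `(url, source)` pair; the source labels in `TIER_BY_SOURCE` and the function name `merge_by_tier` are hypothetical:

```python
from typing import Dict, Iterable, Tuple

# Lower number = higher quality tier, mirroring the tier list above.
TIER_BY_SOURCE = {"elgl": 1, "naco_innovation": 2, "dataset": 3, "scrape": 4}


def merge_by_tier(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Map each channel URL to the best (lowest) tier among its sources."""
    best: Dict[str, int] = {}
    for url, source in records:
        tier = TIER_BY_SOURCE[source]
        key = url.rstrip("/")  # treat trailing-slash variants as the same channel
        best[key] = min(tier, best.get(key, tier))
    return best
```

A channel found by scraping and also listed by ELGL ends up in Tier 1, so curated sources always win over automated discovery.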

---

## Expected Outcomes

### Coverage Estimates

**Dataset-Based Discovery:**
- MeetingBank: 6 cities ✅
- Open States: 50+ state legislatures ✅
- City Scrapers: 100-500 agencies ⚠️ (need to extract video links)

**Website Scraping Discovery (NEW):**
- Major cities (100k+ population): ~300 cities with YouTube
- Medium cities (50k-100k): ~500 cities with social media
- All municipalities: ~3,000-5,000 with public video channels

**Total Potential:**
- **3,000-5,000 YouTube channels** for meeting videos
- **1,000+ Granicus portals** with embedded videos
- **500+ Vimeo accounts**
- **10,000+ Facebook pages** (may have video links)

---

## Next Steps

### Immediate Actions (This Week)

1. **Test Social Media Discovery** ✅ READY TO RUN
   ```bash
   python discovery/social_media_discovery.py
   ```

2. **Integrate with Pipeline**
   - Add to `discovery_pipeline.py`
   - Write results to Bronze layer
   - Create `bronze/social_media_channels` table

3. **Document Integration**
   - Update README with social media discovery
   - Add examples to documentation

### Short-term (Next 2 Weeks)

1. **USA.gov Integration**
   - Create `usa_gov_directory.py`
   - Scrape local directory
   - Use as primary URL source

2. **Enhanced MeetingBank Extraction**
   - Extract all video URLs from `urls` dictionary
   - Test on all 1,366 meetings
   - Validate YouTube links are still active

3. **City Scrapers Video Links**
   - Update `city_scrapers_urls.py`
   - Extract Granicus video URLs
   - Crawl Granicus pages for embedded YouTube
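
Finding embedded YouTube players on a Granicus page mostly comes down to spotting video IDs in the HTML. A rough sketch, assuming the page has already been downloaded and that video IDs are the usual 11-character strings; `EMBED_RE` and `embedded_youtube_ids` are hypothetical names, not part of `city_scrapers_urls.py`:

```python
import re
from typing import List

# Matches embed, short-link, and watch URLs and captures the 11-character video ID.
EMBED_RE = re.compile(
    r"(?:youtube(?:-nocookie)?\.com/embed/|youtu\.be/|youtube\.com/watch\?v=)"
    r"([A-Za-z0-9_-]{11})"
)


def embedded_youtube_ids(html: str) -> List[str]:
    """Return unique YouTube video IDs referenced in a page, in order of appearance."""
    ids: List[str] = []
    for video_id in EMBED_RE.findall(html):
        if video_id not in ids:
            ids.append(video_id)
    return ids
```

A regex pass like this misses players injected by JavaScript after page load, so it is best treated as a cheap first filter.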

### Long-term (Next Month)

1. **Federal Aggregators**
   - USA.gov/archive integration
   - State health department channels
   - CDC/HRSA video collections

2. **Automated Validation**
   - Check if discovered channels still exist
   - Verify channels have meeting content
   - Score channels by video count and relevance

3. **Scale to 1,000+ Cities**
   - Batch processing framework
   - Parallel scraping with rate limiting
   - Delta Lake storage for discovered channels
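
Parallel scraping with rate limiting could be built on `asyncio.Semaphore`. The sketch below is one possible shape, not the planned framework: `scrape_all` is a hypothetical name, `fetch` stands for whatever coroutine scrapes a single jurisdiction, and the default limits are placeholders to tune.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def scrape_all(jurisdictions: List[Dict],
                     fetch: Callable[[Dict], Awaitable[Dict]],
                     max_concurrent: int = 10,
                     delay: float = 0.5) -> List[Dict]:
    """Run `fetch` over all jurisdictions with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(item: Dict) -> Dict:
        async with semaphore:  # at most max_concurrent fetches in flight
            try:
                result = await fetch(item)
            except Exception as exc:  # keep one failure from aborting the batch
                result = {"jurisdiction_id": item.get("jurisdiction_id"),
                          "error": str(exc)}
            await asyncio.sleep(delay)  # pause before releasing the slot
            return result

    # gather preserves input order, so results line up with jurisdictions
    return await asyncio.gather(*(worker(j) for j in jurisdictions))
```

Compared with fixed-size batches, a semaphore keeps the pipeline busy even when a few slow sites hold up the rest, and failed sites are recorded instead of raising.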

---

## Conclusion

### Summary

**Current State:**
- ✅ Video URLs extracted from datasets (1,366 meetings)
- ❌ No website scraping for social media links
- ❌ No USA.gov integration

**Your Suggestion:**
- ✅ **Excellent idea!** Website scraping is the missing piece
- ✅ USA.gov provides most authoritative homepage URLs
- ✅ Footer/contact page scraping will find channels

**Implementation:**
- ✅ Created `social_media_discovery.py` module
- ✅ Ready to test and integrate
- ✅ USA.gov integration guide provided
- ✅ Full roadmap for 1,000+ city coverage

**Impact:**
Going from 6 cities with video URLs to **3,000-5,000 cities with YouTube channels** will dramatically increase the reach of the Oral Health Policy Pulse system!