open-navigator / website /docs /data-sources /video-channels.md

jcbowyer

Clean HuggingFace deployment without binary files

61d29fc 29 days ago

	---
	displayed_sidebar: developersSidebar
	---

	# Video Channel Discovery: Current State & Enhancement Plan

	## Executive Summary

	Question: Does this repo look at local government websites and attempt to discover their YouTube, Facebook, or other video channels?

	Answer:
	- ❌ Currently NO - The repo does NOT scrape government websites for social media links
	- ✅ Partially YES - It extracts video URLs from pre-existing datasets (MeetingBank, Open States)
	- ✅ NEW - We've created `social_media_discovery.py` to implement your suggestion

	---

	## Current State: What Exists

	### ✅ Video Discovery from Datasets (Offline Sources)

	1. MeetingBank (HuggingFace)
	- File: [`discovery/meetingbank_ingestion.py`](../discovery/meetingbank_ingestion.py)
	- Status: ✅ Working - Video URLs ARE being extracted
	- Coverage: 1,366 meetings from 6 cities
	- Video Sources:
	  - YouTube URLs (extracted from `urls['youtube_id']`)
	  - Vimeo URLs (extracted from `urls['vimeo_id']`)
	  - Archive.org collections (alameda, boston, denver, king-county, long-beach, seattle)

	2. Open States (API)
	- File: [`discovery/openstates_sources.py`](../discovery/openstates_sources.py)
	- Status: ✅ Working
	- Coverage: 50+ state legislatures
	- Extracts: YouTube channels, Vimeo accounts, Granicus portals from jurisdiction metadata

	3. City Scrapers (GitHub)
	- File: [`discovery/city_scrapers_urls.py`](../discovery/city_scrapers_urls.py)
	- Status: ⚠️ Partial - extracts start_urls but not video links yet
	- Coverage: 100-500 agencies from Chicago, Pittsburgh, Detroit, Cleveland, LA
	- Note: Some sources are Granicus video pages with embedded YouTube, but extraction of those links is not fully implemented
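
As a rough illustration of the MeetingBank entry above, the ID fields can be turned into watchable URLs along these lines. This is a minimal sketch, not the code in `meetingbank_ingestion.py`: the function name `build_video_urls` is hypothetical, and it assumes each record carries a `urls` dict with optional `youtube_id` and `vimeo_id` keys.

```python
from typing import Dict, List, Optional


def build_video_urls(urls: Dict[str, Optional[str]]) -> List[str]:
    """Turn the ID fields of a MeetingBank-style `urls` dict into video URLs.

    Assumption: missing or empty IDs are simply skipped.
    """
    video_urls: List[str] = []

    youtube_id = urls.get("youtube_id")
    if youtube_id:
        video_urls.append(f"https://www.youtube.com/watch?v={youtube_id}")

    vimeo_id = urls.get("vimeo_id")
    if vimeo_id:
        video_urls.append(f"https://vimeo.com/{vimeo_id}")

    return video_urls


# Example record with a YouTube ID only (the ID is a placeholder)
print(build_video_urls({"youtube_id": "abc123", "vimeo_id": None}))
```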

	### ❌ What's Missing: Website Scraping for Social Media

	Current Gap:
	The repo discovers government homepage URLs but does NOT:
	1. ❌ Scrape those websites for social media links
	2. ❌ Extract YouTube/Facebook channels from footers
	3. ❌ Check "Contact Us" or "About" pages
	4. ❌ Use USA.gov or federal aggregators

	Existing Homepage Discovery:
	- File: [`discovery/url_discovery_agent.py`](../discovery/url_discovery_agent.py)
	- What it does:
	  - ✅ Finds government homepage URLs using GSA .gov registry
	  - ✅ Tests URL patterns (cityname.gov, etc.)
	  - ✅ Crawls to find minutes/agenda pages
	  - ❌ But does NOT look for social media links
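
To make the "tests URL patterns" step concrete, here is a small sketch of how candidate homepages might be generated before each one is checked. Only the `cityname.gov` pattern comes from this document; the other patterns and the function name `candidate_homepages` are illustrative assumptions, not what `url_discovery_agent.py` actually does.

```python
import re
from typing import List


def candidate_homepages(name: str, state: str) -> List[str]:
    """Generate candidate homepage URLs for a jurisdiction.

    The first pattern (cityname.gov) is the one named in the docs;
    the remaining patterns are hypothetical examples.
    """
    slug = re.sub(r"[^a-z0-9]", "", name.lower())  # "Long Beach" -> "longbeach"
    st = state.lower()
    hosts = [
        f"{slug}.gov",        # cityname.gov
        f"{slug}{st}.gov",    # citynamest.gov (assumed pattern)
        f"cityof{slug}.gov",  # cityofcityname.gov (assumed pattern)
    ]
    return [f"https://www.{host}" for host in hosts]


print(candidate_homepages("Long Beach", "CA"))
```

Each candidate would still need to be requested and validated before it is trusted.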

	---

	## Your Suggestion: Federal Aggregators & Website Scraping

	### Excellent Ideas!

	#### 1. USA.gov Local Directory
	```
	✅ Most accurate way to verify channels are legitimate
	✅ Provides official website for every city/county
	✅ Most governments link social media in footer/contact sections
	```

	Implementation: See section below on how to integrate.

	#### 2. USA.gov/archive (Federal Videos)
	```
	⚠️ Good for federal agencies
	⚠️ Limited local government coverage
	✅ Could supplement state-level sources
	```

	Use Case: State agencies and federal programs that touch local policy.

	---

	## NEW Implementation: Social Media Discovery

	We've created a new module to implement your suggestion!

	### File: `discovery/social_media_discovery.py`

	What it does:
	1. ✅ Takes government homepage URLs (from existing discovery)
	2. ✅ Scrapes footer sections for social media links
	3. ✅ Checks common contact/about pages
	4. ✅ Extracts YouTube, Facebook, Vimeo, Archive.org, Granicus
	5. ✅ Validates and cleans URLs
	6. ✅ Batch processing for hundreds of jurisdictions

	Example Usage:
	```python
	import asyncio

	from discovery.social_media_discovery import SocialMediaDiscovery

	jurisdictions = [
	    {
	        'jurisdiction_id': 'seattle-wa',
	        'homepage_url': 'https://www.seattle.gov',
	        'jurisdiction_name': 'Seattle',
	        'state': 'WA'
	    }
	]

	# `async with` must run inside a coroutine
	async def main():
	    async with SocialMediaDiscovery() as discovery:
	        return await discovery.discover_batch(jurisdictions)

	results = asyncio.run(main())

	# Results:
	# [
	#     {
	#         'jurisdiction_name': 'Seattle',
	#         'social_media': {
	#             'youtube': ['https://www.youtube.com/@cityofseattle'],
	#             'facebook': ['https://www.facebook.com/CityofSeattle'],
	#             'twitter': ['https://twitter.com/CityofSeattle']
	#         },
	#         'platform_count': 3,
	#         'total_urls': 3
	#     }
	# ]
	```

	Detection Strategy:
	- Focus on footer sections (most reliable)
	- Common CSS selectors: `footer`, `[class*="footer"]`, `[class*="social"]`
	- Pattern matching for platform URLs
	- Validates against known domain patterns
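
The detection strategy above can be sketched with the standard library alone. This is a simplified illustration, not the module's implementation: it only looks inside `<footer>` elements (the class-based selectors are left out), and `FooterLinkParser`, `classify_footer_links`, and the domain patterns are assumed names and rules.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List

# Assumed domain patterns; the real module may validate more strictly.
PLATFORM_PATTERNS = {
    "youtube": re.compile(r"^https?://(www\.)?youtube\.com/", re.IGNORECASE),
    "facebook": re.compile(r"^https?://(www\.)?facebook\.com/", re.IGNORECASE),
    "vimeo": re.compile(r"^https?://(www\.)?vimeo\.com/", re.IGNORECASE),
}


class FooterLinkParser(HTMLParser):
    """Collect href values of <a> tags that appear inside <footer>."""

    def __init__(self) -> None:
        super().__init__()
        self.footer_depth = 0
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "footer":
            self.footer_depth += 1
        elif tag == "a" and self.footer_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "footer" and self.footer_depth > 0:
            self.footer_depth -= 1


def classify_footer_links(html: str) -> Dict[str, List[str]]:
    """Group footer links by platform using the domain patterns above."""
    parser = FooterLinkParser()
    parser.feed(html)
    found: Dict[str, List[str]] = {}
    for link in parser.links:
        for platform, pattern in PLATFORM_PATTERNS.items():
            if pattern.match(link):
                found.setdefault(platform, []).append(link)
    return found
```

Restricting the search to the footer trades recall for precision: links in page bodies (for example, a news post that mentions another city's channel) are ignored.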

	---

	## Integration with Existing Pipeline

	### How Social Media Discovery Fits In

	```
	1. URL Discovery Agent (EXISTING)
	   └─> Finds government homepage URLs
	   └─> Uses GSA .gov registry
	   └─> Pattern matching (cityname.gov)
	   └─> Validates URLs

	2. Social Media Discovery (NEW) ⬅️ Add this step
	   └─> Scrapes homepages for social links
	   └─> Checks footer sections
	   └─> Checks contact/about pages
	   └─> Extracts YouTube, Facebook, etc.

	3. Meeting Scraper (EXISTING)
	   └─> Uses discovered URLs to scrape meetings
	```

	### Integration Code Example

	```python
	import asyncio

	from discovery.url_discovery_agent import URLDiscoveryAgent
	from discovery.social_media_discovery import SocialMediaDiscovery
	from discovery.gsa_domains import load_gsa_domains


	async def main():
	    # Step 1: Discover government websites
	    gsa_domains = load_gsa_domains()
	    url_agent = URLDiscoveryAgent(gsa_domains)

	    jurisdictions = [...] # From Census data
	    discovered_urls = await url_agent.discover_batch(jurisdictions)

	    # Step 2: NEW - Discover social media from those websites
	    async with SocialMediaDiscovery() as social_discovery:
	        social_results = await social_discovery.discover_batch(discovered_urls)

	    # Step 3: Save to Bronze layer
	    # (save_to_bronze_layer is assumed to be provided by the storage code)
	    save_to_bronze_layer(social_results, "social_media_channels")


	asyncio.run(main())
	```

	---

	## USA.gov Integration Guide

	### Approach 1: Use USA.gov Local Directory as Homepage Source

	What USA.gov Provides:
	- Official .gov website for every city/county
	- Most authoritative source for homepage URLs
	- Can replace/supplement GSA domain matching

	Implementation:
	```python
	# discovery/usa_gov_directory.py (NEW FILE TO CREATE)

	import httpx
	from bs4 import BeautifulSoup

	async def get_usa_gov_local_directory():
	    """
	    Scrape USA.gov local directory for official city/county websites.

	    USA.gov maintains a directory at:
	    https://www.usa.gov/local-governments

	    Each state page lists cities/counties with official websites.
	    """
	    base_url = "https://www.usa.gov/local-governments"

	    # 1. Get list of states
	    # 2. For each state, get list of cities/counties
	    # 3. Extract official website URLs
	    # 4. Return structured data

	    pass # Implementation details


	# Then use in url_discovery_agent.py:
	def _match_usa_gov_directory(jurisdiction_name, state):
	    """
	    Match jurisdiction to USA.gov directory entry.

	    Higher confidence than pattern matching because it's
	    verified by federal government.
	    """
	    # lookup_usa_gov_directory is a helper still to be written on top of
	    # the scraped directory data
	    usa_gov_url = lookup_usa_gov_directory(jurisdiction_name, state)
	    if usa_gov_url:
	        return (usa_gov_url, 0.98) # Very high confidence
	    return None
	```

	### Approach 2: USA.gov/archive for Federal Video Content

	Use Case: State health departments, federal programs

	```python
	# discovery/federal_video_sources.py (NEW FILE TO CREATE)

	FEDERAL_VIDEO_CHANNELS = {
	    "cdc": {
	        "youtube": "https://www.youtube.com/@CDCgov",
	        "topics": ["public_health", "oral_health"]
	    },
	    "hrsa": {
	        "youtube": "https://www.youtube.com/@HRSAgov",
	        "topics": ["health_centers", "dental_programs"]
	    },
	    "state_health_depts": {
	        # Each state's health department
	        "CA": "https://www.youtube.com/@CAPublicHealth",
	        "TX": "https://www.youtube.com/@TXHealthHumanServices",
	        # ... all 50 states
	    }
	}

	def get_federal_video_sources():
	    """
	    Get federal agency video channels relevant to oral health policy.

	    Sources:
	    - usa.gov/archive featured channels
	    - State health departments
	    - CDC, HRSA, CMS channels
	    """
	    pass
	```

	### Approach 3: ELGL Top YouTube Channels (NEW - HIGHLY RECOMMENDED!)

	What: ELGL (Engaging Local Government Leaders) publishes curated "Top Local Government YouTube Channels" lists

	Why This is Excellent:
	```
	✅ Curated by experts (not automated scraping)
	✅ Highlights MOST ACTIVE channels
	✅ Quality > Quantity approach
	✅ Updated annually
	✅ Covers innovative local governments nationwide
	```

	Sources:
	- ELGL Blog: https://elgl.org/
	- Annual "Top Local Gov YouTube Channels" articles
	- Digital innovation showcases

	Expected Coverage: 50-100 top-tier channels (most active, highest quality)

	Implementation: See `discovery/curated_sources.py`

	### Approach 4: NACo County Database (NEW - COMPREHENSIVE!)

	What: National Association of Counties maintains database of all 3,143 U.S. counties

	Why This is Excellent:
	```
	✅ Complete county coverage (all 3,143 counties)
	✅ Official website URLs verified by NACo
	✅ Digital innovation showcase
	✅ Authoritative source for county data
	✅ Partnership opportunities
	```

	Sources:
	- NACo County Explorer: https://ce.naco.org/
	- Digital Counties Survey
	- NACo Communications Awards

	Expected Coverage: 3,143 counties with official websites

	Implementation: See `discovery/curated_sources.py`

	---

	## Complete Implementation Plan

	### Phase 1: Enhance Existing Dataset Extraction (✅ MOSTLY DONE)

	- [x] MeetingBank video URLs (already working)
	- [x] Open States channels (already working)
	- [ ] City Scrapers Granicus video extraction (TODO)

	### Phase 2: Website Social Media Discovery (✅ NEW MODULE CREATED)

	Implementation:
	1. [x] Create `social_media_discovery.py` module
	2. [ ] Test on sample cities (Seattle, Chicago, Austin)
	3. [ ] Integrate with URL discovery pipeline
	4. [ ] Write to Bronze layer: `bronze/social_media_channels`
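
For step 4, one possible row layout for `bronze/social_media_channels` is one row per discovered URL. The sketch below assumes the result shape shown in the example usage (`jurisdiction_name` plus a `social_media` dict of platform to URL list); the function name `flatten_results` and the column names are hypothetical.

```python
from typing import Any, Dict, Iterator, List


def flatten_results(results: List[Dict[str, Any]]) -> Iterator[Dict[str, str]]:
    """Yield one flat row per (jurisdiction, platform, url) combination."""
    for result in results:
        for platform, urls in result.get("social_media", {}).items():
            for url in urls:
                yield {
                    "jurisdiction_name": result["jurisdiction_name"],
                    "platform": platform,
                    "url": url,
                }


sample = [
    {
        "jurisdiction_name": "Seattle",
        "social_media": {
            "youtube": ["https://www.youtube.com/@cityofseattle"],
            "facebook": ["https://www.facebook.com/CityofSeattle"],
        },
    }
]
for row in flatten_results(sample):
    print(row)
```

A flat layout like this keeps the table easy to filter by platform, at the cost of repeating the jurisdiction fields on every row.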

	Tasks:
	```bash
	# Test the new module
	cd discovery
	python social_media_discovery.py

	# Expected output: YouTube, Facebook, Vimeo URLs for test cities
	```

	### Phase 3: USA.gov Integration (RECOMMENDED)

	Priority: HIGH - Most authoritative source

	Tasks:
	1. [ ] Create `discovery/usa_gov_directory.py`
	2. [ ] Scrape USA.gov local directory for official URLs
	3. [ ] Use as primary source (confidence 0.98)
	4. [ ] Fallback to pattern matching for missing entries
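
Tasks 3 and 4 describe a primary source with a fallback, which might look roughly like this. The 0.98 confidence comes from this plan; the pattern-matching confidence of 0.60, the directory key format, and the name `resolve_homepage` are assumptions made for the sketch.

```python
from typing import Dict, List, Optional, Tuple

DIRECTORY_CONFIDENCE = 0.98  # from the plan above
PATTERN_CONFIDENCE = 0.60    # hypothetical value for pattern-matched URLs


def resolve_homepage(
    name: str,
    state: str,
    directory: Dict[Tuple[str, str], str],
    pattern_matches: List[str],
) -> Optional[Tuple[str, float]]:
    """Prefer a USA.gov directory entry; fall back to pattern matching."""
    key = (name.strip().lower(), state.strip().upper())
    url = directory.get(key)
    if url:
        return (url, DIRECTORY_CONFIDENCE)
    if pattern_matches:
        # Assumes the caller has already validated and ranked these URLs.
        return (pattern_matches[0], PATTERN_CONFIDENCE)
    return None


directory = {("seattle", "WA"): "https://www.seattle.gov"}
print(resolve_homepage("Seattle", "wa", directory, []))
print(resolve_homepage("Austin", "TX", directory, ["https://www.austintexas.gov"]))
```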

	Estimated URLs:
	- ~3,000 cities/counties with verified .gov URLs
	- ~10,000+ municipalities (including .org, .us domains)

	### Phase 4: ELGL Curated Channels (NEW - HIGH PRIORITY!)

	Priority: HIGH - Quality over quantity

	Tasks:
	1. [x] Create `discovery/curated_sources.py` ✅
	2. [ ] Scrape ELGL "Top YouTube Channels" articles
	3. [ ] Parse channel URLs and metadata
	4. [ ] Flag as "top-ranked" in database
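
Tasks 3 and 4 could be approached by pulling channel URLs out of the article text and tagging them. This sketch is not taken from `curated_sources.py`: the regular expression covers the common YouTube channel URL shapes (`/@handle`, `/channel/...`, `/c/...`, `/user/...`), and the field names `source` and `top_ranked` are assumptions.

```python
import re
from typing import Dict, List

CHANNEL_RE = re.compile(
    r"https?://(?:www\.)?youtube\.com/"
    r"(?:@[\w.-]+|channel/[\w-]+|c/[\w.-]+|user/[\w.-]+)",
    re.IGNORECASE,
)


def extract_top_channels(article_text: str, source: str = "elgl") -> List[Dict[str, object]]:
    """Find YouTube channel URLs in text, de-duplicate, and flag them."""
    seen = set()
    channels: List[Dict[str, object]] = []
    for match in CHANNEL_RE.finditer(article_text):
        url = match.group(0).rstrip(".")  # drop a sentence-ending period
        key = url.lower()
        if key in seen:
            continue
        seen.add(key)
        channels.append({"url": url, "source": source, "top_ranked": True})
    return channels


text = (
    "See https://www.youtube.com/@CityA and https://youtube.com/channel/UC123abc, "
    "plus https://www.youtube.com/@CityA again. "
    "A single video: https://www.youtube.com/watch?v=xyz"
)
for channel in extract_top_channels(text):
    print(channel)
```

Individual video links (`/watch?v=...`) are deliberately not matched, since the goal here is channels rather than single recordings.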

	Expected Results:
	- 50-100 most active local government channels
	- High-quality, verified content
	- Innovative digital communication examples

	Why This Matters:
	ELGL-listed channels are expected to be among the most active, so they are likely to have the most meeting videos and the best production quality.

	### Phase 5: NACo County Database (NEW - HIGH PRIORITY!)

	Priority: HIGH - Comprehensive county coverage

	Tasks:
	1. [x] Create `discovery/curated_sources.py` ✅
	2. [ ] Contact NACo for data partnership/export
	3. [ ] Integrate NACo County Explorer data
	4. [ ] Scrape digital innovation showcase
	5. [ ] Cross-reference with GSA .gov domains
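
The cross-referencing in task 5 can be sketched as a hostname comparison. The record field `website`, the output fields, and the function names below are hypothetical, and the GSA registry is assumed to be available as a set of bare domain names.

```python
from typing import Dict, Iterable, List, Set
from urllib.parse import urlparse


def bare_host(url: str) -> str:
    """Return the lowercased hostname without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def cross_reference(
    counties: Iterable[Dict[str, str]], gsa_domains: Set[str]
) -> List[Dict[str, object]]:
    """Annotate each county record with whether its domain is in the registry."""
    normalized = {domain.lower() for domain in gsa_domains}
    annotated: List[Dict[str, object]] = []
    for county in counties:
        domain = bare_host(county["website"])
        annotated.append(
            {**county, "domain": domain, "in_gsa_registry": domain in normalized}
        )
    return annotated


counties = [
    {"name": "King County", "website": "https://www.kingcounty.gov/depts"},
    {"name": "Example County", "website": "http://examplecounty.org"},
]
for row in cross_reference(counties, {"KINGCOUNTY.GOV"}):
    print(row)
```

Counties whose sites are hosted on a subdomain would need extra handling, since this comparison is an exact hostname match.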

	Expected Results:
	- All 3,143 U.S. counties with official websites
	- Digital innovation leaders identified
	- County media hub URLs

	Partnership Opportunity:
	NACo may provide bulk data export or API access for research/public benefit projects.

	### Phase 6: Federal Video Aggregators (OPTIONAL)

	Priority: MEDIUM - Supplementary source

	Tasks:
	1. [ ] Create `discovery/federal_video_sources.py`
	2. [ ] Compile federal agency channels (CDC, HRSA, etc.)
	3. [ ] Compile state health department channels (all 50 states)
	4. [ ] Add to video sources table

	Use Case: State-level policy analysis, federal program tracking

	---

	## Testing & Validation

	### Test the New Social Media Discovery

	```bash
	# 1. Install dependencies
	pip install httpx beautifulsoup4

	# 2. Run standalone test
	cd /home/developer/projects/open-navigator
	python discovery/social_media_discovery.py

	# Expected output:
	# ✓ Found 3 social media links for Seattle
	# youtube: 1 URLs
	# facebook: 1 URLs
	# twitter: 1 URLs
	# ✓ Found 2 social media links for Chicago
	# youtube: 1 URLs
	# facebook: 1 URLs
	```

	### Integration Test

	```python
	# Full pipeline test
	from discovery.discovery_pipeline import DiscoveryPipeline

	pipeline = DiscoveryPipeline()

	# 1. Discover jurisdictions (Census data)
	# 2. Discover homepage URLs (GSA + patterns)
	# 3. NEW: Discover social media (footer scraping)
	# 4. Write all to Bronze layer

	results = await pipeline.run_full_pipeline(
	limit=100,
	include_social_media=True # NEW FLAG
	)
	```

	---

	## Performance & Scalability

	### Current Approach (Dataset-Only)
	- ✅ Fast (no web requests)
	- ✅ Reliable (static datasets)
	- ❌ Limited coverage (only cities in datasets)
	- ❌ Stale data (datasets not updated frequently)

	### New Approach (Website Scraping)
	- ⚠️ Slower (requires web requests)
	- ⚠️ Less reliable (websites change)
	- ✅ Comprehensive coverage (all cities with websites)
	- ✅ Fresh data (real-time discovery)

	### Hybrid Strategy (RECOMMENDED)
	1. Start with datasets (MeetingBank, Open States)
	   - Get 1,366 meetings with videos immediately
	   - High confidence, validated data

	2. Supplement with website scraping
	   - Fill gaps for cities not in datasets
	   - Discover newly created channels
	   - Verify dataset URLs are still valid

	3. Use USA.gov for verification
	   - Highest confidence for homepage URLs
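
The first two steps of the hybrid strategy amount to a merge in which dataset entries win and scraped entries only fill gaps. A minimal sketch, assuming both inputs are dicts keyed by a jurisdiction ID (the function name `merge_sources` and the `source` labels are hypothetical):

```python
from typing import Dict


def merge_sources(
    dataset: Dict[str, str], scraped: Dict[str, str]
) -> Dict[str, Dict[str, str]]:
    """Keep dataset channels; add scraped channels only where none exists."""
    merged = {
        jid: {"url": url, "source": "dataset"} for jid, url in dataset.items()
    }
    for jid, url in scraped.items():
        # setdefault leaves existing dataset entries untouched
        merged.setdefault(jid, {"url": url, "source": "scrape"})
    return merged


dataset = {"seattle-wa": "https://www.youtube.com/@dataset-seattle"}
scraped = {
    "seattle-wa": "https://www.youtube.com/@scraped-seattle",
    "austin-tx": "https://www.youtube.com/@scraped-austin",
}
print(merge_sources(dataset, scraped))
```

Recording the `source` alongside each URL keeps it possible to re-verify dataset URLs against scraped ones later.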

	Curated Lists (NEW - YOUR SUGGESTIONS!):
	- ELGL Top Channels: 50-100 most active channels 🔥
	- NACo Counties: 3,143 counties with official websites 🔥
	- NACo Digital Innovation: ~100 innovative counties

	Total Potential:
	- 3,000-5,000 YouTube channels for meeting videos
	- 50-100 TOP-TIER channels (ELGL curated) 🌟
	- 3,143 county websites (NACo database) 🌟
	- 1,000+ Granicus portals with embedded videos
	- 500+ Vimeo accounts
	- 10,000+ Facebook pages (may have video links)

	Quality Tiers:
	1. Tier 1 (Highest): ELGL Top Channels - most active, best quality
	2. Tier 2 (High): NACo Digital Innovation - county leaders
	3. Tier 3 (Good): MeetingBank/Dataset channels - verified content
	4. Tier 4 (Discovery): Website scraping - newly discovered channels

	```python
	import asyncio

	# Priority 1: Datasets (MeetingBank, Open States) with existing video URLs

	# Priority 2: USA.gov verified (high confidence)
	usa_gov_cities = [...] # ~3,000 verified .gov sites

	# Priority 3: Website scraping (for gaps)
	remaining_cities = [...] # ~87,000 jurisdictions

	# Process in batches, pausing between batches
	# (social_discovery and save_to_bronze are assumed to be defined elsewhere)
	async def process_batch(cities, batch_size=50):
	    for i in range(0, len(cities), batch_size):
	        batch = cities[i:i+batch_size]
	        results = await social_discovery.discover_batch(batch)
	        save_to_bronze(results)
	        await asyncio.sleep(5) # Rate limiting
	```

	---

	## Expected Outcomes

	### Coverage Estimates

	Dataset-Based Discovery:
	- MeetingBank: 6 cities ✅
	- Open States: 50+ state legislatures ✅
	- City Scrapers: 100-500 agencies ⚠️ (need to extract video links)

	Website Scraping Discovery (NEW):
	- Major cities (100k+ population): ~300 cities with YouTube
	- Medium cities (50k-100k): ~500 cities with social media
	- All municipalities: ~3,000-5,000 with public video channels

	Total Potential:
	- 3,000-5,000 YouTube channels for meeting videos
	- 1,000+ Granicus portals with embedded videos
	- 500+ Vimeo accounts
	- 10,000+ Facebook pages (may have video links)

	---

	## Next Steps

	### Immediate Actions (This Week)

	1. Test Social Media Discovery ✅ READY TO RUN
	```bash
	python discovery/social_media_discovery.py
	```

	2. Integrate with Pipeline
	- Add to `discovery_pipeline.py`
	- Write results to Bronze layer
	- Create `bronze/social_media_channels` table

	3. Document Integration
	- Update README with social media discovery
	- Add examples to documentation

	### Short-term (Next 2 Weeks)

	1. USA.gov Integration
	- Create `usa_gov_directory.py`
	- Scrape local directory
	- Use as primary URL source

	2. Enhanced MeetingBank Extraction
	- Extract all video URLs from `urls` dictionary
	- Test on all 1,366 meetings
	- Validate YouTube links are still active

	3. City Scrapers Video Links
	- Update `city_scrapers_urls.py`
	- Extract Granicus video URLs
	- Crawl Granicus pages for embedded YouTube

	### Long-term (Next Month)

	1. Federal Aggregators
	- USA.gov/archive integration
	- State health department channels
	- CDC/HRSA video collections

	2. Automated Validation
	- Check if discovered channels still exist
	- Verify channels have meeting content
	- Score channels by video count and relevance

	3. Scale to 1,000+ Cities
	- Batch processing framework
	- Parallel scraping with rate limiting
	- Delta Lake storage for discovered channels
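
For the "parallel scraping with rate limiting" item, one common approach is to cap the number of in-flight requests with a semaphore. This is a sketch of that technique rather than the planned framework: `discover_concurrently` and `fake_discover` are hypothetical names, and the stand-in coroutine performs no network access.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def discover_concurrently(
    cities: List[str],
    discover_one: Callable[[str], Awaitable[Dict[str, Any]]],
    max_concurrency: int = 10,
) -> List[Dict[str, Any]]:
    """Run discover_one for every city, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(city: str) -> Dict[str, Any]:
        async with semaphore:
            return await discover_one(city)

    # gather returns results in the same order as the input list
    return await asyncio.gather(*(guarded(city) for city in cities))


async def fake_discover(city: str) -> Dict[str, Any]:
    """Stand-in for a real scraper call."""
    await asyncio.sleep(0)
    return {"city": city, "total_urls": 0}


results = asyncio.run(
    discover_concurrently(["seattle", "chicago", "austin"], fake_discover, 2)
)
print(results)
```

A semaphore bounds concurrency but not request rate, so a per-host delay may still be needed to stay polite toward small government sites.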

	---

	## Conclusion

	### Summary

	Current State:
	- ✅ Video URLs extracted from datasets (1,366 meetings)
	- ❌ No website scraping for social media links
	- ❌ No USA.gov integration

	Your Suggestion:
	- ✅ Excellent idea! Website scraping is the missing piece
	- ✅ USA.gov provides most authoritative homepage URLs
	- ✅ Footer/contact page scraping will find channels

	Implementation:
	- ✅ Created `social_media_discovery.py` module
	- ✅ Ready to test and integrate
	- ✅ USA.gov integration guide provided
	- ✅ Full roadmap for 1,000+ city coverage

	Impact:
	Going from 6 cities with video URLs to 3,000-5,000 cities with YouTube channels will dramatically increase the reach of the Oral Health Policy Pulse system!