| displayed_sidebar: developersSidebar | |
| # Video Channel Discovery: Current State & Enhancement Plan | |
| ## Executive Summary | |
| **Question:** Does this repo look at local government websites and attempt to discover their YouTube, Facebook, or other video channels? | |
| **Answer:** | |
| - ❌ **Currently NO** - The repo does NOT scrape government websites for social media links | |
| - ✅ **Partially YES** - It extracts video URLs from pre-existing datasets (MeetingBank, Open States) | |
| - ✅ **NEW** - We've created `social_media_discovery.py` to implement your suggestion | |
| --- | |
| ## Current State: What Exists | |
| ### ✅ Video Discovery from Datasets (Offline Sources) | |
| **1. MeetingBank (HuggingFace)** | |
| - **File:** [`discovery/meetingbank_ingestion.py`](../discovery/meetingbank_ingestion.py) | |
| - **Status:** ✅ **Working** - Video URLs ARE being extracted | |
| - **Coverage:** 1,366 meetings from 6 cities | |
| - **Video Sources:** | |
|   - YouTube URLs (extracted from `urls['youtube_id']`) | |
|   - Vimeo URLs (extracted from `urls['vimeo_id']`) | |
|   - Archive.org collections (alameda, boston, denver, king-county, long-beach, seattle) | |
| **2. Open States (API)** | |
| - **File:** [`discovery/openstates_sources.py`](../discovery/openstates_sources.py) | |
| - **Status:** ✅ Working | |
| - **Coverage:** 50+ state legislatures | |
| - **Extracts:** YouTube channels, Vimeo accounts, Granicus portals from jurisdiction metadata | |
| **3. City Scrapers (GitHub)** | |
| - **File:** [`discovery/city_scrapers_urls.py`](../discovery/city_scrapers_urls.py) | |
| - **Status:** ⚠️ Partial - extracts `start_urls` but not video links yet | |
| - **Coverage:** 100-500 agencies from Chicago, Pittsburgh, Detroit, Cleveland, LA | |
| - **Note:** Granicus video pages contain embedded YouTube videos, but extracting them is not fully implemented | |
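As a rough illustration of the MeetingBank extraction described above, the sketch below turns the `urls['youtube_id']` and `urls['vimeo_id']` fields into watchable links. The record shape and the `video_urls_from_record` helper are assumptions for illustration, not the code in `meetingbank_ingestion.py`.

```python
from typing import Dict, List


def video_urls_from_record(record: Dict) -> List[str]:
    """Build video URLs from a MeetingBank-style record (assumed shape)."""
    ids = record.get("urls") or {}
    urls: List[str] = []
    if ids.get("youtube_id"):
        urls.append(f"https://www.youtube.com/watch?v={ids['youtube_id']}")
    if ids.get("vimeo_id"):
        urls.append(f"https://vimeo.com/{ids['vimeo_id']}")
    return urls


# Hypothetical record: only a YouTube ID is present
sample = {"meeting_id": "example-001", "urls": {"youtube_id": "abc123XYZ_0", "vimeo_id": None}}
print(video_urls_from_record(sample))
```

Records with neither ID simply yield an empty list, so they can be counted as "no video" without special handling.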
| ### ❌ What's Missing: Website Scraping for Social Media | |
| **Current Gap:** | |
| The repo discovers government homepage URLs but does NOT: | |
| 1. ❌ Scrape those websites for social media links | |
| 2. ❌ Extract YouTube/Facebook channels from footers | |
| 3. ❌ Check "Contact Us" or "About" pages | |
| 4. ❌ Use USA.gov or federal aggregators | |
| **Existing Homepage Discovery:** | |
| - **File:** [`discovery/url_discovery_agent.py`](../discovery/url_discovery_agent.py) | |
| - **What it does:** | |
|   - ✅ Finds government homepage URLs using GSA .gov registry | |
|   - ✅ Tests URL patterns (cityname.gov, etc.) | |
|   - ✅ Crawls to find minutes/agenda pages | |
|   - ❌ **But does NOT look for social media links** | |
| --- | |
| ## Your Suggestion: Federal Aggregators & Website Scraping | |
| ### Excellent Ideas! | |
| #### 1. USA.gov Local Directory | |
| ``` | |
| ✅ Most accurate way to verify channels are legitimate | |
| ✅ Provides official website for every city/county | |
| ✅ Most governments link social media in footer/contact sections | |
| ``` | |
| **Implementation:** See section below on how to integrate. | |
| #### 2. USA.gov/archive (Federal Videos) | |
| ``` | |
| ⚠️ Good for federal agencies | |
| ⚠️ Limited local government coverage | |
| ✅ Could supplement state-level sources | |
| ``` | |
| **Use Case:** State agencies and federal programs that touch local policy. | |
| --- | |
| ## NEW Implementation: Social Media Discovery | |
| We've created a new module to implement your suggestion! | |
| ### File: `discovery/social_media_discovery.py` | |
| **What it does:** | |
| 1. ✅ Takes government homepage URLs (from existing discovery) | |
| 2. ✅ Scrapes footer sections for social media links | |
| 3. ✅ Checks common contact/about pages | |
| 4. ✅ Extracts YouTube, Facebook, Vimeo, Archive.org, Granicus | |
| 5. ✅ Validates and cleans URLs | |
| 6. ✅ Batch processing for hundreds of jurisdictions | |
| **Example Usage:** | |
| ```python | |
| import asyncio | |
| from discovery.social_media_discovery import SocialMediaDiscovery | |
| jurisdictions = [ | |
|     { | |
|         'jurisdiction_id': 'seattle-wa', | |
|         'homepage_url': 'https://www.seattle.gov', | |
|         'jurisdiction_name': 'Seattle', | |
|         'state': 'WA' | |
|     } | |
| ] | |
| async def main(): | |
|     # `async with` and `await` are only valid inside a coroutine | |
|     async with SocialMediaDiscovery() as discovery: | |
|         return await discovery.discover_batch(jurisdictions) | |
| results = asyncio.run(main()) | |
| # Results: | |
| # [ | |
| # { | |
| # 'jurisdiction_name': 'Seattle', | |
| # 'social_media': { | |
| # 'youtube': ['https://www.youtube.com/@cityofseattle'], | |
| # 'facebook': ['https://www.facebook.com/CityofSeattle'], | |
| # 'twitter': ['https://twitter.com/CityofSeattle'] | |
| # }, | |
| # 'platform_count': 3, | |
| # 'total_urls': 3 | |
| # } | |
| # ] | |
| ``` | |
| **Detection Strategy:** | |
| - Focus on footer sections (most reliable) | |
| - Common CSS selectors: `footer`, `[class*="footer"]`, `[class*="social"]` | |
| - Pattern matching for platform URLs | |
| - Validates against known domain patterns | |
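The detection strategy above can be sketched with the standard library alone. This is a minimal illustration, not the logic in `social_media_discovery.py`: it uses `html.parser` rather than BeautifulSoup, and the `FooterLinkParser` class, `classify_links` function, and the three regex patterns are assumptions chosen for the example.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List

# Assumed platform patterns; a real module would likely cover more platforms
PLATFORM_PATTERNS = {
    "youtube": re.compile(r"^https?://(www\.)?youtube\.com/(@|channel/|c/|user/)", re.I),
    "facebook": re.compile(r"^https?://(www\.)?facebook\.com/[^/?#]+", re.I),
    "vimeo": re.compile(r"^https?://(www\.)?vimeo\.com/[^/?#]+", re.I),
}
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class FooterLinkParser(HTMLParser):
    """Collect hrefs found inside <footer> or elements whose class mentions footer/social."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # nesting depth inside a matching region (0 = outside)
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").lower()
        in_region = tag == "footer" or "footer" in classes or "social" in classes
        if self.depth or in_region:
            self.depth += 1
            if tag == "a" and attr_map.get("href"):
                self.links.append(attr_map["href"])

    def handle_endtag(self, tag):
        # Assumes reasonably well-formed HTML; unclosed tags would skew the depth
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1


def classify_links(html: str) -> Dict[str, List[str]]:
    """Group footer/social links by platform, dropping duplicates."""
    parser = FooterLinkParser()
    parser.feed(html)
    found: Dict[str, List[str]] = {}
    for href in parser.links:
        for platform, pattern in PLATFORM_PATTERNS.items():
            if pattern.match(href) and href not in found.setdefault(platform, []):
                found[platform].append(href)
    return found


html = """
<body>
  <a href="https://www.youtube.com/@unrelated">Header link</a>
  <footer>
    <div class="social-icons">
      <a href="https://www.youtube.com/@examplecity">YouTube</a>
      <a href="https://www.facebook.com/ExampleCity">Facebook</a>
      <a href="/contact">Contact</a>
    </div>
  </footer>
</body>
"""
print(classify_links(html))
```

Restricting the search to footer and "social" regions is what keeps the unrelated header link out of the result; the trade-off is that sites placing their icons elsewhere would need the contact/about page fallback.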
| --- | |
| ## Integration with Existing Pipeline | |
| ### How Social Media Discovery Fits In | |
| ``` | |
| 1. URL Discovery Agent (EXISTING) | |
|    ├─> Finds government homepage URLs | |
|    ├─> Uses GSA .gov registry | |
|    ├─> Pattern matching (cityname.gov) | |
|    └─> Validates URLs | |
| 2. Social Media Discovery (NEW) ⬅️ Add this step | |
|    ├─> Scrapes homepages for social links | |
|    ├─> Checks footer sections | |
|    ├─> Checks contact/about pages | |
|    └─> Extracts YouTube, Facebook, etc. | |
| 3. Meeting Scraper (EXISTING) | |
|    └─> Uses discovered URLs to scrape meetings | |
| ### Integration Code Example | |
| ```python | |
| import asyncio | |
| from discovery.url_discovery_agent import URLDiscoveryAgent | |
| from discovery.social_media_discovery import SocialMediaDiscovery | |
| from discovery.gsa_domains import load_gsa_domains | |
| async def main(): | |
|     # Step 1: Discover government websites | |
|     gsa_domains = load_gsa_domains() | |
|     url_agent = URLDiscoveryAgent(gsa_domains) | |
|     jurisdictions = [...]  # From Census data | |
|     discovered_urls = await url_agent.discover_batch(jurisdictions) | |
|     # Step 2: NEW - Discover social media from those websites | |
|     async with SocialMediaDiscovery() as social_discovery: | |
|         social_results = await social_discovery.discover_batch(discovered_urls) | |
|     # Step 3: Save to Bronze layer | |
|     # (save_to_bronze_layer is a placeholder for the project's storage helper) | |
|     save_to_bronze_layer(social_results, "social_media_channels") | |
| asyncio.run(main()) | |
| ``` | |
| --- | |
| ## USA.gov Integration Guide | |
| ### Approach 1: Use USA.gov Local Directory as Homepage Source | |
| **What USA.gov Provides:** | |
| - Official .gov website for every city/county | |
| - Most authoritative source for homepage URLs | |
| - Can replace/supplement GSA domain matching | |
| **Implementation:** | |
| ```python | |
| # discovery/usa_gov_directory.py (NEW FILE TO CREATE) | |
| import httpx | |
| from bs4 import BeautifulSoup | |
| async def get_usa_gov_local_directory(): | |
|     """ | |
|     Scrape USA.gov local directory for official city/county websites. | |
|     USA.gov maintains a directory at: | |
|     https://www.usa.gov/local-governments | |
|     Each state page lists cities/counties with official websites. | |
|     """ | |
|     base_url = "https://www.usa.gov/local-governments" | |
|     # 1. Get list of states | |
|     # 2. For each state, get list of cities/counties | |
|     # 3. Extract official website URLs | |
|     # 4. Return structured data | |
|     raise NotImplementedError  # Implementation details | |
| # Then use in url_discovery_agent.py: | |
| def _match_usa_gov_directory(jurisdiction_name, state): | |
|     """ | |
|     Match jurisdiction to USA.gov directory entry. | |
|     Higher confidence than pattern matching because it's | |
|     verified by the federal government. | |
|     """ | |
|     # lookup_usa_gov_directory is a helper still to be written on top of | |
|     # the scraped directory data | |
|     usa_gov_url = lookup_usa_gov_directory(jurisdiction_name, state) | |
|     if usa_gov_url: | |
|         return (usa_gov_url, 0.98)  # Very high confidence | |
|     return None | |
| ``` | |
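To show how a `(url, confidence)` result like the one above could be combined with other sources, here is a small sketch that picks the highest-confidence homepage candidate. The `pick_best_homepage` helper, the source labels, and every confidence value other than 0.98 are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, float, str]  # (url, confidence, source)


def pick_best_homepage(
    candidates: List[Candidate], min_confidence: float = 0.5
) -> Optional[Candidate]:
    """Return the highest-confidence candidate at or above the threshold, if any."""
    eligible = [c for c in candidates if c[1] >= min_confidence]
    return max(eligible, key=lambda c: c[1]) if eligible else None


# Hypothetical candidates for one jurisdiction
candidates = [
    ("https://www.examplecity.gov", 0.98, "usa_gov_directory"),
    ("https://examplecity.gov", 0.80, "gsa_registry"),
    ("https://cityofexample.org", 0.40, "pattern_match"),
]
best = pick_best_homepage(candidates)
print(best)
```

Keeping the source label next to the score makes it easy to audit later which discovery method supplied each homepage.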
| ### Approach 2: USA.gov/archive for Federal Video Content | |
| **Use Case:** State health departments, federal programs | |
| ```python | |
| # discovery/federal_video_sources.py (NEW FILE TO CREATE) | |
| # NOTE: the channel handles below are unverified placeholders; | |
| # confirm each one against the agency's own website before use. | |
| FEDERAL_VIDEO_CHANNELS = { | |
|     "cdc": { | |
|         "youtube": "https://www.youtube.com/@CDCgov", | |
|         "topics": ["public_health", "oral_health"] | |
|     }, | |
|     "hrsa": { | |
|         "youtube": "https://www.youtube.com/@HRSAgov", | |
|         "topics": ["health_centers", "dental_programs"] | |
|     }, | |
|     "state_health_depts": { | |
|         # Each state's health department | |
|         "CA": "https://www.youtube.com/@CAPublicHealth", | |
|         "TX": "https://www.youtube.com/@TXHealthHumanServices", | |
|         # ... all 50 states | |
|     } | |
| } | |
| def get_federal_video_sources(): | |
|     """ | |
|     Get federal agency video channels relevant to oral health policy. | |
|     Sources: | |
|     - usa.gov/archive featured channels | |
|     - State health departments | |
|     - CDC, HRSA, CMS channels | |
|     """ | |
|     raise NotImplementedError | |
| ``` | |
| ### Approach 3: ELGL Top YouTube Channels (NEW - HIGHLY RECOMMENDED!) | |
| **What:** ELGL (Engaging Local Government Leaders) publishes curated "Top Local Government YouTube Channels" lists | |
| **Why This is Excellent:** | |
| ``` | |
| ✅ Curated by experts (not automated scraping) | |
| ✅ Highlights MOST ACTIVE channels | |
| ✅ Quality > Quantity approach | |
| ✅ Updated annually | |
| ✅ Covers innovative local governments nationwide | |
| ``` | |
| **Sources:** | |
| - ELGL Blog: https://elgl.org/ | |
| - Annual "Top Local Gov YouTube Channels" articles | |
| - Digital innovation showcases | |
| **Expected Coverage:** 50-100 top-tier channels (most active, highest quality) | |
| **Implementation:** See `discovery/curated_sources.py` | |
| ### Approach 4: NACo County Database (NEW - COMPREHENSIVE!) | |
| **What:** The National Association of Counties (NACo) maintains a database covering all 3,143 U.S. counties and county equivalents | |
| **Why This is Excellent:** | |
| ``` | |
| ✅ Complete county coverage (all 3,143 counties) | |
| ✅ Official website URLs verified by NACo | |
| ✅ Digital innovation showcase | |
| ✅ Authoritative source for county data | |
| ✅ Partnership opportunities | |
| ``` | |
| **Sources:** | |
| - NACo County Explorer: https://ce.naco.org/ | |
| - Digital Counties Survey | |
| - NACo Communications Awards | |
| **Expected Coverage:** 3,143 counties with official websites | |
| **Implementation:** See `discovery/curated_sources.py` | |
| --- | |
| ## Complete Implementation Plan | |
| ### Phase 1: Enhance Existing Dataset Extraction (✅ MOSTLY DONE) | |
| - [x] MeetingBank video URLs (already working) | |
| - [x] Open States channels (already working) | |
| - [ ] City Scrapers Granicus video extraction (TODO) | |
| ### Phase 2: Website Social Media Discovery (✅ NEW MODULE CREATED) | |
| **Implementation:** | |
| 1. [x] Create `social_media_discovery.py` module | |
| 2. [ ] Test on sample cities (Seattle, Chicago, Austin) | |
| 3. [ ] Integrate with URL discovery pipeline | |
| 4. [ ] Write to Bronze layer: `bronze/social_media_channels` | |
| **Tasks:** | |
| ```bash | |
| # Test the new module | |
| cd discovery | |
| python social_media_discovery.py | |
| # Expected output: YouTube, Facebook, Vimeo URLs for test cities | |
| ``` | |
| ### Phase 3: USA.gov Integration (RECOMMENDED) | |
| **Priority: HIGH** - Most authoritative source | |
| **Tasks:** | |
| 1. [ ] Create `discovery/usa_gov_directory.py` | |
| 2. [ ] Scrape USA.gov local directory for official URLs | |
| 3. [ ] Use as primary source (confidence 0.98) | |
| 4. [ ] Fallback to pattern matching for missing entries | |
| **Estimated URLs:** | |
| - ~3,000 cities/counties with verified .gov URLs | |
| - ~10,000+ municipalities (including .org, .us domains) | |
| ### Phase 4: ELGL Curated Channels (NEW - HIGH PRIORITY!) | |
| **Priority: HIGH** - Quality over quantity | |
| **Tasks:** | |
| 1. [x] Create `discovery/curated_sources.py` ✅ | |
| 2. [ ] Scrape ELGL "Top YouTube Channels" articles | |
| 3. [ ] Parse channel URLs and metadata | |
| 4. [ ] Flag as "top-ranked" in database | |
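For the "parse channel URLs" and "flag as top-ranked" tasks, a sketch along these lines may help. It normalizes the common YouTube channel URL forms (`@handle`, `/channel/UC...`, `/c/`, `/user/`) into a small record. The `parse_channel` function, the `CHANNEL_RE` pattern, and the record fields are illustrative assumptions, not the contents of `curated_sources.py`.

```python
import re
from typing import Dict, Optional

CHANNEL_RE = re.compile(
    r"https?://(?:www\.|m\.)?youtube\.com/"
    r"(?:(?P<handle>@[\w.-]+)"
    r"|channel/(?P<channel_id>UC[\w-]{22})"
    r"|(?P<kind>c|user)/(?P<name>[\w.-]+))",
    re.IGNORECASE,
)


def parse_channel(url: str, top_ranked: bool = False) -> Optional[Dict]:
    """Return a normalized channel record, or None if the URL is not a channel link."""
    match = CHANNEL_RE.match(url.strip())
    if not match:
        return None
    if match.group("handle"):
        kind, ident = "handle", match.group("handle")
    elif match.group("channel_id"):
        kind, ident = "channel_id", match.group("channel_id")
    else:
        kind, ident = match.group("kind").lower(), match.group("name")
    return {"kind": kind, "id": ident, "top_ranked": top_ranked}


print(parse_channel("https://www.youtube.com/@cityofseattle", top_ranked=True))
print(parse_channel("https://www.youtube.com/watch?v=abc"))  # a video link, not a channel
```

Links to individual videos or playlists return `None`, so a curated article can be scanned for every YouTube link and only the channel links kept.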
| **Expected Results:** | |
| - 50-100 most active local government channels | |
| - High-quality, verified content | |
| - Innovative digital communication examples | |
| **Why This Matters:** | |
| These are the channels with the MOST meeting videos and BEST production quality! | |
| ### Phase 5: NACo County Database (NEW - HIGH PRIORITY!) | |
| **Priority: HIGH** - Comprehensive county coverage | |
| **Tasks:** | |
| 1. [x] Create `discovery/curated_sources.py` ✅ | |
| 2. [ ] Contact NACo for data partnership/export | |
| 3. [ ] Integrate NACo County Explorer data | |
| 4. [ ] Scrape digital innovation showcase | |
| 5. [ ] Cross-reference with GSA .gov domains | |
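The cross-referencing task above could start from simple host normalization. In this sketch, county website URLs are compared against a list of registered .gov domains after stripping scheme, path, case, and a leading `www.`. The `normalize_host` and `cross_reference` helpers and the sample data are hypothetical.

```python
from typing import Dict, Iterable
from urllib.parse import urlparse


def normalize_host(url_or_domain: str) -> str:
    """Lowercase host without scheme, path, or a leading 'www.'."""
    value = url_or_domain.strip().lower()
    host = urlparse(value if "://" in value else f"//{value}").hostname or ""
    return host[4:] if host.startswith("www.") else host


def cross_reference(county_sites: Dict[str, str], gsa_domains: Iterable[str]) -> Dict[str, bool]:
    """Map each county to whether its website host appears in the .gov domain list."""
    known = {normalize_host(domain) for domain in gsa_domains}
    return {county: normalize_host(url) in known for county, url in county_sites.items()}


# Hypothetical inputs: county sites from a county directory, domains from a .gov registry export
county_sites = {
    "Example County, WA": "https://www.examplecounty.gov/about",
    "Sample County, TX": "http://samplecounty.org",
}
gsa_domains = ["EXAMPLECOUNTY.GOV", "othercity.gov"]
print(cross_reference(county_sites, gsa_domains))
```

Counties that come back `False` are not necessarily wrong; they may simply use `.org` or `.us` domains and would need a different verification path.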
| **Expected Results:** | |
| - All 3,143 U.S. counties with official websites | |
| - Digital innovation leaders identified | |
| - County media hub URLs | |
| **Partnership Opportunity:** | |
| NACo may provide bulk data export or API access for research/public benefit projects. | |
| ### Phase 6: Federal Video Aggregators (OPTIONAL) | |
| **Priority: MEDIUM** - Supplementary source | |
| **Tasks:** | |
| 1. [ ] Create `discovery/federal_video_sources.py` | |
| 2. [ ] Compile federal agency channels (CDC, HRSA, etc.) | |
| 3. [ ] Compile state health department channels (all 50 states) | |
| 4. [ ] Add to video sources table | |
| **Use Case:** State-level policy analysis, federal program tracking | |
| --- | |
| ## Testing & Validation | |
| ### Test the New Social Media Discovery | |
| ```bash | |
| # 1. Install dependencies | |
| pip install httpx beautifulsoup4 | |
| # 2. Run standalone test | |
| cd /home/developer/projects/open-navigator | |
| python discovery/social_media_discovery.py | |
| # Expected output: | |
| # ✅ Found 3 social media links for Seattle | |
| #    youtube: 1 URLs | |
| #    facebook: 1 URLs | |
| #    twitter: 1 URLs | |
| # ✅ Found 2 social media links for Chicago | |
| #    youtube: 1 URLs | |
| #    facebook: 1 URLs | |
| ``` | |
| ### Integration Test | |
| ```python | |
| # Full pipeline test | |
| import asyncio | |
| from discovery.discovery_pipeline import DiscoveryPipeline | |
| async def main(): | |
|     pipeline = DiscoveryPipeline() | |
|     # 1. Discover jurisdictions (Census data) | |
|     # 2. Discover homepage URLs (GSA + patterns) | |
|     # 3. NEW: Discover social media (footer scraping) | |
|     # 4. Write all to Bronze layer | |
|     return await pipeline.run_full_pipeline( | |
|         limit=100, | |
|         include_social_media=True  # NEW FLAG | |
|     ) | |
| results = asyncio.run(main()) | |
| ``` | |
| --- | |
| ## Performance & Scalability | |
| ### Current Approach (Dataset-Only) | |
| - ✅ Fast (no web requests) | |
| - ✅ Reliable (static datasets) | |
| - ❌ Limited coverage (only cities in datasets) | |
| - ❌ Stale data (datasets not updated frequently) | |
| ### New Approach (Website Scraping) | |
| - ⚠️ Slower (requires web requests) | |
| - ⚠️ Less reliable (websites change) | |
| - ✅ Comprehensive coverage (all cities with websites) | |
| - ✅ Fresh data (real-time discovery) | |
| ### Hybrid Strategy (RECOMMENDED) | |
| 1. **Start with datasets** (MeetingBank, Open States) | |
| - Get 1,366 meetings with videos immediately | |
| - High confidence, validated data | |
| 2. **Supplement with website scraping** | |
| - Fill gaps for cities not in datasets | |
| - Discover newly created channels | |
| - Verify dataset URLs are still valid | |
| 3. **Use USA.gov for verification** | |
| - Highest confidence for homepage URLs | |
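One way to picture the hybrid strategy is as a merge in which dataset results take precedence and scraping only fills gaps. The sketch below assumes each source yields a mapping of jurisdiction ID to channel URLs; the `merge_channels` helper, the source labels, and the sample data are illustrative only.

```python
from typing import Dict, List

# Earlier entries are treated as more trusted (assumed ordering)
SOURCE_PRIORITY = ["dataset", "scrape"]


def merge_channels(by_source: Dict[str, Dict[str, List[str]]]) -> Dict[str, Dict[str, str]]:
    """Return {jurisdiction_id: {channel_url: first_source_that_reported_it}}."""
    merged: Dict[str, Dict[str, str]] = {}
    for source in SOURCE_PRIORITY:
        for jurisdiction_id, urls in by_source.get(source, {}).items():
            seen = merged.setdefault(jurisdiction_id, {})
            for url in urls:
                # Trailing slashes are dropped so the same channel is not counted twice
                seen.setdefault(url.rstrip("/"), source)
    return merged


by_source = {
    "dataset": {"seattle-wa": ["https://www.youtube.com/@cityofseattle"]},
    "scrape": {
        "seattle-wa": [
            "https://www.youtube.com/@cityofseattle/",
            "https://www.facebook.com/CityofSeattle",
        ],
        "example-tx": ["https://www.youtube.com/@examplecity"],
    },
}
print(merge_channels(by_source))
```

Recording which source first reported each channel keeps the provenance available when results are later scored or verified.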
| **Curated Lists (NEW - YOUR SUGGESTIONS!):** | |
| - ELGL Top Channels: 50-100 most active channels 🔥 | |
| - NACo Counties: 3,143 counties with official websites 🔥 | |
| - NACo Digital Innovation: ~100 innovative counties | |
| **Website Scraping Discovery (NEW):** | |
| - Major cities (100k+ population): ~300 cities with YouTube | |
| - Medium cities (50k-100k): ~500 cities with social media | |
| - All municipalities: ~3,000-5,000 with public video channels | |
| **Total Potential:** | |
| - **3,000-5,000 YouTube channels** for meeting videos | |
| - **50-100 TOP-TIER channels** (ELGL curated) | |
| - **3,143 county websites** (NACo database) | |
| - **1,000+ Granicus portals** with embedded videos | |
| - **500+ Vimeo accounts** | |
| - **10,000+ Facebook pages** (may have video links) | |
| **Quality Tiers:** | |
| 1. **Tier 1 (Highest):** ELGL Top Channels - most active, best quality | |
| 2. **Tier 2 (High):** NACo Digital Innovation - county leaders | |
| 3. **Tier 3 (Good):** MeetingBank/Dataset channels - verified content | |
| 4. **Tier 4 (Discovery):** Website scraping - newly discovered channels | |
| ```python | |
| import asyncio | |
| # Priority 1: Cities already covered by datasets (video URLs) | |
| # Priority 2: USA.gov verified (high confidence) | |
| usa_gov_cities = [...]  # ~3,000 verified .gov sites | |
| # Priority 3: Website scraping (for gaps) | |
| remaining_cities = [...]  # ~87,000 jurisdictions | |
| # Batched processing with a pause between batches | |
| # (social_discovery and save_to_bronze are placeholders for the pipeline's objects) | |
| async def process_batch(cities, batch_size=50): | |
|     for i in range(0, len(cities), batch_size): | |
|         batch = cities[i:i+batch_size] | |
|         results = await social_discovery.discover_batch(batch) | |
|         save_to_bronze(results) | |
|         await asyncio.sleep(5)  # Rate limiting | |
| ``` | |
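As an alternative to fixed-size batches with a sleep, concurrency can be capped with a semaphore so that slow sites do not hold up a whole batch. This is a sketch under assumptions: `discover_all`, `discover_one`, and `fake_discover` are hypothetical names, and the fake coroutine stands in for a real HTTP fetch.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def discover_all(
    jurisdictions: List[Dict],
    discover_one: Callable[[Dict], Awaitable[Dict]],
    max_concurrent: int = 10,
) -> List[Dict]:
    """Run discover_one for every jurisdiction with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(jurisdiction: Dict) -> Dict:
        async with semaphore:
            try:
                return await discover_one(jurisdiction)
            except Exception as exc:  # one failing site should not abort the run
                return {"jurisdiction_id": jurisdiction.get("jurisdiction_id"), "error": str(exc)}

    return await asyncio.gather(*(guarded(j) for j in jurisdictions))


async def fake_discover(jurisdiction: Dict) -> Dict:
    """Stand-in for a real scraper call; fails for one ID to show error handling."""
    await asyncio.sleep(0.01)
    if jurisdiction["jurisdiction_id"] == "broken":
        raise RuntimeError("timeout")
    return {"jurisdiction_id": jurisdiction["jurisdiction_id"], "social_media": {}}


results = asyncio.run(
    discover_all(
        [{"jurisdiction_id": "a"}, {"jurisdiction_id": "broken"}],
        fake_discover,
        max_concurrent=2,
    )
)
print(results)
```

Results come back in input order, and failures are returned as records with an `error` field rather than raised, which suits long unattended runs.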
| --- | |
| ## Expected Outcomes | |
| ### Coverage Estimates | |
| **Dataset-Based Discovery:** | |
| - MeetingBank: 6 cities ✅ | |
| - Open States: 50+ state legislatures ✅ | |
| - City Scrapers: 100-500 agencies ⚠️ (need to extract video links) | |
| **Website Scraping Discovery (NEW):** | |
| - Major cities (100k+ population): ~300 cities with YouTube | |
| - Medium cities (50k-100k): ~500 cities with social media | |
| - All municipalities: ~3,000-5,000 with public video channels | |
| **Total Potential:** | |
| - **3,000-5,000 YouTube channels** for meeting videos | |
| - **1,000+ Granicus portals** with embedded videos | |
| - **500+ Vimeo accounts** | |
| - **10,000+ Facebook pages** (may have video links) | |
| --- | |
| ## Next Steps | |
| ### Immediate Actions (This Week) | |
| 1. **Test Social Media Discovery** ✅ READY TO RUN | |
| ```bash | |
| python discovery/social_media_discovery.py | |
| ``` | |
| 2. **Integrate with Pipeline** | |
| - Add to `discovery_pipeline.py` | |
| - Write results to Bronze layer | |
| - Create `bronze/social_media_channels` table | |
| 3. **Document Integration** | |
| - Update README with social media discovery | |
| - Add examples to documentation | |
| ### Short-term (Next 2 Weeks) | |
| 1. **USA.gov Integration** | |
| - Create `usa_gov_directory.py` | |
| - Scrape local directory | |
| - Use as primary URL source | |
| 2. **Enhanced MeetingBank Extraction** | |
| - Extract all video URLs from `urls` dictionary | |
| - Test on all 1,366 meetings | |
| - Validate YouTube links are still active | |
| 3. **City Scrapers Video Links** | |
| - Update `city_scrapers_urls.py` | |
| - Extract Granicus video URLs | |
| - Crawl Granicus pages for embedded YouTube | |
| ### Long-term (Next Month) | |
| 1. **Federal Aggregators** | |
| - USA.gov/archive integration | |
| - State health department channels | |
| - CDC/HRSA video collections | |
| 2. **Automated Validation** | |
| - Check if discovered channels still exist | |
| - Verify channels have meeting content | |
| - Score channels by video count and relevance | |
| 3. **Scale to 1,000+ Cities** | |
| - Batch processing framework | |
| - Parallel scraping with rate limiting | |
| - Delta Lake storage for discovered channels | |
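For the automated validation step, channel scoring "by video count and relevance" could begin with a keyword check over video titles. The keyword list, the `score_channel` function, and the output fields below are assumptions for illustration; a real scorer would probably also weigh recency and upload frequency.

```python
from typing import Dict, List

# Assumed keywords suggesting a video is a public meeting recording
MEETING_KEYWORDS = ("council", "commission", "board", "meeting", "hearing")


def score_channel(video_titles: List[str]) -> Dict[str, float]:
    """Summarize how many of a channel's videos look like meeting recordings."""
    total = len(video_titles)
    hits = sum(any(k in title.lower() for k in MEETING_KEYWORDS) for title in video_titles)
    relevance = hits / total if total else 0.0
    return {"video_count": total, "meeting_videos": hits, "relevance": round(relevance, 2)}


titles = [
    "City Council Meeting - Jan 9",
    "Parks Ribbon Cutting",
    "Planning Commission Hearing",
    "Holiday Parade",
]
print(score_channel(titles))
```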
| --- | |
| ## Conclusion | |
| ### Summary | |
| **Current State:** | |
| - ✅ Video URLs extracted from datasets (1,366 meetings) | |
| - ❌ No website scraping for social media links | |
| - ❌ No USA.gov integration | |
| **Your Suggestion:** | |
| - ✅ **Excellent idea!** Website scraping is the missing piece | |
| - ✅ USA.gov provides most authoritative homepage URLs | |
| - ✅ Footer/contact page scraping will find channels | |
| **Implementation:** | |
| - ✅ Created `social_media_discovery.py` module | |
| - ✅ Ready to test and integrate | |
| - ✅ USA.gov integration guide provided | |
| - ✅ Full roadmap for 1,000+ city coverage | |
| **Impact:** | |
| Going from 6 cities with video URLs to an estimated **3,000-5,000 cities with YouTube channels** could dramatically increase the reach of the Oral Health Policy Pulse system! | |