# 🎯 Integration Status: URL Sources with Video Links

## Summary: Partially Integrated, Need to Add Video URLs

| Source | Status | What We Have | What's Missing | Priority |
|--------|--------|--------------|----------------|----------|
| **MeetingBank** | ⚠️ Partial | Transcripts & summaries | **YouTube/Vimeo URLs** | 🔥 **HIGH** |
| **City Scrapers / Documenters** | ❌ Missing | Event schemas only | **Actual URL database** | 🔥 **HIGH** |
| **Open States** | ❌ Missing | Nothing | **State & local video sources** | 🟡 MEDIUM |

---

## 1. MeetingBank (PARTIALLY INTEGRATED)

### ✅ What We Already Have:

- **Dataset**: `huuuyeah/meetingbank` - 1,366 meetings with transcripts & summaries
- **Integration**: [`discovery/meetingbank_ingestion.py`](../discovery/meetingbank_ingestion.py)
- **Status**: Working, can download and ingest now

### ❌ What's Missing: VIDEO URLs!

MeetingBank has **YouTube and Vimeo URLs** that we're NOT extracting yet!

**Two MeetingBank datasets exist**:

1. `huuuyeah/meetingbank` - Main dataset (what we use now)
2. `lytang/MeetingBank-transcript` - 6,892 transcript segments

Both contain **URL dictionaries** with:

- YouTube video IDs
- Vimeo links
- Archive.org links

**Archive.org Video Collections**:

- https://archive.org/details/meetingbank-alameda
- https://archive.org/details/meetingbank-boston
- https://archive.org/details/meetingbank-denver
- https://archive.org/details/meetingbank-long-beach
- https://archive.org/details/meetingbank-king-county
- https://archive.org/details/meetingbank-seattle

### 🔥 ACTION NEEDED:

Update `meetingbank_ingestion.py` to extract video URLs:

```python
# Add to meetingbank_ingestion.py
from typing import Dict, List


def extract_video_urls_from_meetingbank(meetingbank: dict) -> List[Dict]:
    """
    Extract YouTube, Vimeo, and Archive.org URLs from the MeetingBank dataset.

    MeetingBank stores URLs in the 'urls' field of each meeting instance.
    """
    video_urls = []

    for split in ['train', 'validation', 'test']:
        if split not in meetingbank:
            continue  # Skip splits that are absent from this dataset
        for instance in meetingbank[split]:
            # Extract URL dictionary (treat a missing or null field as empty)
            urls = instance.get('urls') or {}
            if not urls:
                continue

            meeting_id = instance['id']
            # Look up the city once per meeting instead of once per field.
            # extract_city_from_id is assumed to be defined elsewhere in this module.
            city = extract_city_from_id(meeting_id)

            # Collect (platform, video_url) pairs for every URL that is present
            found = []
            if 'youtube_id' in urls:
                found.append(("youtube", f"https://www.youtube.com/watch?v={urls['youtube_id']}"))
            if 'vimeo_id' in urls:
                found.append(("vimeo", f"https://vimeo.com/{urls['vimeo_id']}"))
            if 'archive_url' in urls:
                found.append(("archive_org", urls['archive_url']))

            for platform, video_url in found:
                video_urls.append({
                    "meeting_id": meeting_id,
                    "video_url": video_url,
                    "platform": platform,
                    "city": city['name'],
                    "state": city['state'],
                })

    return video_urls
```

### Also Check: `lytang/MeetingBank-transcript`

This is a companion dataset with 6,892 transcript segments. Load it too:

```python
from datasets import load_dataset

# Load both datasets
meetingbank_main = load_dataset("huuuyeah/meetingbank")
meetingbank_transcripts = load_dataset("lytang/MeetingBank-transcript")

# MeetingBank-transcript has more detailed segment-level data
# Each row has: meeting_id, segment_id, transcript, summary, urls
```

---

## 2. City Scrapers / Documenters.org (NOT INTEGRATED)

### ❌ What We Have:

- Only their **code patterns** (event schema, testing framework)
- We have NOT integrated their **actual URL database**

### What They Have (That We Need):

**Documenters.org** maintains a **centralized database** of meeting URLs for dozens of cities.

### Where the Data Lives:

1. **City Scrapers GitHub Repos** (5 deployments):
   - https://github.com/city-scrapers/city-scrapers (Chicago ~100 agencies)
   - https://github.com/city-scrapers/city-scrapers-pitt (Pittsburgh)
   - https://github.com/city-scrapers/city-scrapers-detroit (Detroit)
   - https://github.com/city-scrapers/city-scrapers-cle (Cleveland)
   - https://github.com/city-scrapers/city-scrapers-la (Los Angeles)

2. **Each Spider File** contains:

```python
# Example: city_scrapers/spiders/chi_board_of_health.py
from city_scrapers_core.spiders import CityScrapersSpider


class ChiBoardOfHealthSpider(CityScrapersSpider):
    name = "chi_board_of_health"
    agency = "Chicago Board of Health"
    start_urls = ["https://www.chicago.gov/city/en/depts/cdph/provdrs/board_of_health.html"]

    # This spider extracts:
    # - Meeting URLs
    # - Video links (often Granicus ViewPublisher with YouTube embeds)
    # - Agenda PDFs
    # - Minutes PDFs
```

3. **Granicus "Video" Button Pattern**:

```python
# Many City Scrapers extract Granicus video pages
# Granicus embeds YouTube/Vimeo in their ViewPublisher interface
# Pattern: https://city.granicus.com/ViewPublisher.php?view_id=XXX
# This page contains