Spaces:
Running on CPU Upgrade
π― Integration Status: URL Sources with Video Links
Summary: Partially Integrated, Need to Add Video URLs
| Source | Status | What We Have | What's Missing | Priority |
|---|---|---|---|---|
| MeetingBank | β οΈ Partial | Transcripts & summaries | YouTube/Vimeo URLs | π₯ HIGH |
| City Scrapers / Documenters | β Missing | Event schemas only | Actual URL database | π₯ HIGH |
| Open States | β Missing | Nothing | State & local video sources | π‘ MEDIUM |
1. MeetingBank (PARTIALLY INTEGRATED)
β What We Already Have:
- Dataset:
huuuyeah/meetingbank- 1,366 meetings with transcripts & summaries - Integration:
discovery/meetingbank_ingestion.py - Status: Working, can download and ingest now
β What's Missing: VIDEO URLs!
The user is correct - MeetingBank has YouTube and Vimeo URLs that we're NOT extracting yet!
Two MeetingBank datasets exist:
huuuyeah/meetingbank- Main dataset (what we use now)lytang/MeetingBank-transcript- 6,892 transcript segments
Both contain URLs dictionaries with:
- YouTube video IDs
- Vimeo links
- Archive.org links
Archive.org Video Collections:
- https://archive.org/details/meetingbank-alameda
- https://archive.org/details/meetingbank-boston
- https://archive.org/details/meetingbank-denver
- https://archive.org/details/meetingbank-long-beach
- https://archive.org/details/meetingbank-king-county
- https://archive.org/details/meetingbank-seattle
π₯ ACTION NEEDED:
Update meetingbank_ingestion.py to extract video URLs:
# Add to meetingbank_ingestion.py
def extract_video_urls_from_meetingbank(meetingbank: dict) -> List[Dict]:
"""
Extract YouTube and Vimeo URLs from MeetingBank dataset.
MeetingBank stores URLs in the 'urls' field of each meeting instance.
"""
video_urls = []
for split in ['train', 'validation', 'test']:
for instance in meetingbank[split]:
# Extract URL dictionary
urls = instance.get('urls', {})
# YouTube URLs
if 'youtube_id' in urls:
youtube_url = f"https://www.youtube.com/watch?v={urls['youtube_id']}"
video_urls.append({
"meeting_id": instance['id'],
"video_url": youtube_url,
"platform": "youtube",
"city": extract_city_from_id(instance['id'])['name'],
"state": extract_city_from_id(instance['id'])['state']
})
# Vimeo URLs
if 'vimeo_id' in urls:
vimeo_url = f"https://vimeo.com/{urls['vimeo_id']}"
video_urls.append({
"meeting_id": instance['id'],
"video_url": vimeo_url,
"platform": "vimeo",
"city": extract_city_from_id(instance['id'])['name'],
"state": extract_city_from_id(instance['id'])['state']
})
# Archive.org URLs
if 'archive_url' in urls:
video_urls.append({
"meeting_id": instance['id'],
"video_url": urls['archive_url'],
"platform": "archive_org",
"city": extract_city_from_id(instance['id'])['name'],
"state": extract_city_from_id(instance['id'])['state']
})
return video_urls
Also Check: lytang/MeetingBank-transcript
This is a companion dataset with 6,892 transcript segments. Load it too:
from datasets import load_dataset
# Load both datasets
meetingbank_main = load_dataset("huuuyeah/meetingbank")
meetingbank_transcripts = load_dataset("lytang/MeetingBank-transcript")
# MeetingBank-transcript has more detailed segment-level data
# Each row has: meeting_id, segment_id, transcript, summary, urls
2. City Scrapers / Documenters.org (NOT INTEGRATED)
β What We Have:
- Only their code patterns (event schema, testing framework)
- We have NOT integrated their actual URL database
What They Have (That We Need):
Documenters.org maintains a centralized database of meeting URLs for dozens of cities.
Where the Data Lives:
City Scrapers GitHub Repos (5 deployments):
- https://github.com/city-scrapers/city-scrapers (Chicago ~100 agencies)
- https://github.com/city-scrapers/city-scrapers-pitt (Pittsburgh)
- https://github.com/city-scrapers/city-scrapers-detroit (Detroit)
- https://github.com/city-scrapers/city-scrapers-cle (Cleveland)
- https://github.com/city-scrapers/city-scrapers-la (Los Angeles)
Each Spider File contains:
# Example: city_scrapers/spiders/chi_board_of_health.py class ChiBoardOfHealthSpider(CityScrapersSpider): name = "chi_board_of_health" agency = "Chicago Board of Health" start_urls = ["https://www.chicago.gov/city/en/depts/cdph/provdrs/board_of_health.html"] # This spider extracts: # - Meeting URLs # - Video links (often Granicus ViewPublisher with YouTube embeds) # - Agenda PDFs # - Minutes PDFsGranicus "Video" Button Pattern:
# Many City Scrapers extract Granicus video pages # Granicus embeds YouTube/Vimeo in their ViewPublisher interface # Pattern: https://city.granicus.com/ViewPublisher.php?view_id=XXX # This page contains <iframe src="https://www.youtube.com/embed/VIDEO_ID">
π₯ ACTION NEEDED:
Create discovery/city_scrapers_urls.py:
"""
Extract URLs from City Scrapers spider files.
City Scrapers maintains 100-500 validated agency URLs across 5 cities.
Each spider file contains start_urls and scraping logic for meeting pages.
"""
import re
import requests
from pathlib import Path
from typing import List, Dict
CITY_SCRAPERS_REPOS = [
{
"city": "Chicago",
"state": "IL",
"repo": "https://github.com/city-scrapers/city-scrapers",
"spiders_path": "city_scrapers/spiders"
},
{
"city": "Pittsburgh",
"state": "PA",
"repo": "https://github.com/city-scrapers/city-scrapers-pitt",
"spiders_path": "city_scrapers_pitt/spiders"
},
{
"city": "Detroit",
"state": "MI",
"repo": "https://github.com/city-scrapers/city-scrapers-detroit",
"spiders_path": "city_scrapers_det/spiders"
},
{
"city": "Cleveland",
"state": "OH",
"repo": "https://github.com/city-scrapers/city-scrapers-cle",
"spiders_path": "city_scrapers_cle/spiders"
},
{
"city": "Los Angeles",
"state": "CA",
"repo": "https://github.com/city-scrapers/city-scrapers-la",
"spiders_path": "city_scrapers_la/spiders"
}
]
def extract_start_urls_from_spider_file(spider_file_content: str) -> List[str]:
"""
Extract start_urls from a City Scrapers spider file.
Pattern matches:
- start_urls = ["https://..."]
- start_urls = ['https://...']
"""
urls = []
# Match start_urls = [...]
pattern = r'start_urls\s*=\s*\[(.*?)\]'
matches = re.findall(pattern, spider_file_content, re.DOTALL)
for match in matches:
# Extract quoted strings
url_pattern = r'["\']([^"\']+)["\']'
found_urls = re.findall(url_pattern, match)
urls.extend(found_urls)
return urls
def clone_and_extract_city_scrapers_urls() -> List[Dict]:
"""
Clone all City Scrapers repos and extract URLs from spider files.
Returns list of dicts with:
- url: Meeting page URL
- city: City name
- state: State code
- agency: Agency name (from spider file)
- source: "city_scrapers"
"""
import subprocess
import tempfile
all_urls = []
with tempfile.TemporaryDirectory() as tmpdir:
for repo_info in CITY_SCRAPERS_REPOS:
# Clone repo
repo_path = Path(tmpdir) / repo_info['city']
subprocess.run([
"git", "clone", "--depth", "1",
repo_info['repo'], str(repo_path)
])
# Find spider files
spiders_path = repo_path / repo_info['spiders_path']
if not spiders_path.exists():
continue
for spider_file in spiders_path.glob("*.py"):
if spider_file.name.startswith("_"):
continue
# Read spider file
content = spider_file.read_text()
# Extract start_urls
urls = extract_start_urls_from_spider_file(content)
# Extract agency name from spider class
agency_pattern = r'agency\s*=\s*["\']([^"\']+)["\']'
agency_match = re.search(agency_pattern, content)
agency = agency_match.group(1) if agency_match else spider_file.stem
for url in urls:
all_urls.append({
"url": url,
"city": repo_info['city'],
"state": repo_info['state'],
"agency": agency,
"source": "city_scrapers"
})
return all_urls
Expected Results:
- 100-500 agency URLs with validated meeting pages
- Granicus video page URLs (many contain YouTube embeds)
- Legistar URLs (with API access)
- PDF agendas and minutes (publicly accessible)
3. Open States (NOT INTEGRATED)
What It Is:
Open States (now part of Plural) is the most comprehensive state legislative data project.
Website: https://openstates.org
API: https://openstates.org/api/
Data: https://data.openstates.org/
What They Have:
- State legislatures: All 50 states + DC + Puerto Rico
- Local jurisdictions: Expanding to city councils
- Sources field: Contains YouTube channel URLs, Vimeo profiles
- Video archives: Many states host videos on YouTube
API Example:
import requests
# Get jurisdiction info
response = requests.get(
"https://v3.openstates.org/jurisdictions",
headers={"X-API-KEY": "YOUR_API_KEY"} # Free tier: 50k requests/month
)
# Each jurisdiction has:
# - sources: [{"url": "https://youtube.com/@CALegislature"}]
# - legislative_sessions: with video URLs
# - people: legislators with social media
π₯ ACTION NEEDED:
Create discovery/openstates_sources.py:
"""
Extract video sources from Open States API.
Open States tracks video URLs in their 'sources' field for:
- State legislatures (50+ YouTube channels)
- City councils (expanding coverage)
- County boards (select jurisdictions)
"""
import requests
from typing import List, Dict
OPENSTATES_API = "https://v3.openstates.org"
def get_openstates_jurisdictions(api_key: str) -> List[Dict]:
"""
Fetch all jurisdictions from Open States API.
Returns list of jurisdictions with video sources.
"""
response = requests.get(
f"{OPENSTATES_API}/jurisdictions",
headers={"X-API-KEY": api_key}
)
jurisdictions = response.json()['results']
video_sources = []
for jurisdiction in jurisdictions:
# Extract sources field
sources = jurisdiction.get('sources', [])
for source in sources:
url = source.get('url', '')
# Check if it's a video platform
if any(platform in url for platform in ['youtube', 'vimeo', 'granicus']):
video_sources.append({
"jurisdiction_id": jurisdiction['id'],
"jurisdiction_name": jurisdiction['name'],
"classification": jurisdiction.get('classification', ''),
"video_url": url,
"platform": extract_platform(url),
"source": "openstates"
})
return video_sources
def extract_platform(url: str) -> str:
"""Extract platform from URL."""
if 'youtube.com' in url or 'youtu.be' in url:
return 'youtube'
elif 'vimeo.com' in url:
return 'vimeo'
elif 'granicus.com' in url:
return 'granicus'
elif 'archive.org' in url:
return 'archive_org'
else:
return 'other'
Expected Results:
- 50+ state YouTube channels (e.g., @CALegislature, @NYSenate)
- Local council channels (expanding)
- Committee hearing archives
- Free API: 50,000 requests/month (plenty for our needs)
π Combined Impact
Current Coverage (Without These):
- 85,302 Census jurisdictions
- 76 URLs discovered (15% match rate)
- 20 CDP cities
- 1,366 MeetingBank meetings (but no video URLs extracted)
After Integration:
| Source | URLs Added | Video Links | Quality |
|---|---|---|---|
| MeetingBank (videos) | 1,366 | β YouTube/Vimeo | Excellent |
| City Scrapers (URLs) | 100-500 | β Granicus β YouTube | Good |
| Open States (channels) | 50-100 | β YouTube channels | Excellent |
| TOTAL NEW | 1,500-2,000 | β All have videos | High |
Why This Matters:
π― Video URLs = Transcription Ready
- YouTube has auto-captions (free API)
- Vimeo has captions (often)
- Can use Whisper for transcription
- Archive.org has downloadable videos
π― Validated Sources
- All these URLs are already scraped/validated by other projects
- High success rate (80-100%)
- Active maintenance by civic tech community
π Implementation Priority
Week 1: Update MeetingBank Integration (2 hours)
# Update meetingbank_ingestion.py to extract video URLs
# Load lytang/MeetingBank-transcript dataset
# Extract YouTube IDs, Vimeo IDs, Archive.org links
# Write to bronze/meetingbank_video_urls table
Expected: 1,366 video URLs (100% success)
Week 2: City Scrapers URL Extraction (1 day)
# Clone 5 City Scrapers repos
# Extract start_urls from spider files
# Parse Granicus video pages for YouTube embeds
# Write to bronze/city_scrapers_urls table
Expected: 100-500 validated meeting URLs
Week 3: Open States Integration (4 hours)
# Sign up for Open States API (free)
# Fetch jurisdictions with video sources
# Extract YouTube channels and Vimeo profiles
# Write to bronze/openstates_sources table
Expected: 50-100 legislative video sources
β Summary
| Integration | Status | Action Needed | Time | Priority |
|---|---|---|---|---|
| MeetingBank videos | β οΈ Partial | Extract video URLs from existing integration | 2 hours | π₯ HIGH |
| City Scrapers URLs | β Missing | Clone repos, parse spider files | 1 day | π₯ HIGH |
| Open States | β Missing | API integration, extract sources | 4 hours | π‘ MEDIUM |
Bottom line: We have MeetingBank transcripts but NOT the video URLs yet. City Scrapers and Open States are completely missing. All three would add 1,500-2,000 verified video URLs - the highest quality sources possible! π―