# Confirmed: HuggingFace Datasets That WILL Help

## Quick Answer: YES, 2 of 4 will help significantly!

| Dataset | Status | Usefulness | Priority |
|---------|--------|------------|----------|
| **MeetingBank** | **READY TO USE** | **VERY HIGH** | **USE IMMEDIATELY** |
| **LocalView** | Already covered | HIGH | Download from Harvard |
| **Council Data Project** | Already covered | HIGH | Already integrated |
| **CivicBand** | Limited access | MEDIUM | Scrape municipality list |

---
## 1. MeetingBank (NEW! USE THIS!)

### What It Is:
**A benchmark dataset from 6 major U.S. cities specifically designed for meeting summarization**

### URLs:
- **HuggingFace (text)**: https://huggingface.co/datasets/huuuyeah/meetingbank
- **HuggingFace (audio)**: https://huggingface.co/datasets/huuuyeah/MeetingBank_Audio
- **Zenodo (all files)**: https://zenodo.org/record/7989108
- **Archive.org (videos)**:
  - https://archive.org/details/meetingbank-alameda
  - https://archive.org/details/meetingbank-boston
  - https://archive.org/details/meetingbank-denver
  - https://archive.org/details/meetingbank-long-beach
  - https://archive.org/details/meetingbank-king-county
  - https://archive.org/details/meetingbank-seattle

### What You Get:
- **1,366 city council meetings** from 6 cities:
  - Alameda, CA
  - Boston, MA
  - Denver, CO
  - King County, WA
  - Long Beach, CA
  - Seattle, WA
- **3,579 hours of video**
- **Full transcripts** (average 28,000 tokens per meeting)
- **PDF meeting minutes & agendas**
- **Human-written summaries** (ground truth for evaluation)
- **Machine-generated summaries** (from 6 different systems)
- **6,892 segment-level summarization instances** for training

### Why This Is PERFECT for Your Project:
1. **Immediate prototyping**: Download from HuggingFace in 5 minutes
   ```python
   from datasets import load_dataset

   meetingbank = load_dataset("huuuyeah/meetingbank")
   for instance in meetingbank['train']:
       print(instance['id'])
       print(instance['summary'])
       print(instance['transcript'])
   ```
2. **Quality validation**: Compare your AI summarization against human-written summaries
3. **URL discovery**: Each meeting has source URLs to city websites
4. **Benchmark your oral health keyword detection**: Test against 1,366 real transcripts
5. **Training data**: Useful if you want to fine-tune models for oral health policy

### Paper:
"MeetingBank: A Benchmark Dataset for Meeting Summarization"
ACL 2023 (Association for Computational Linguistics)
https://arxiv.org/abs/2305.17529

### ACTION PLAN:
```bash
# 1. Install HuggingFace datasets
pip install datasets

# 2. Download MeetingBank
#    (the inner quotes are escaped so the f-string parses on every Python 3 version)
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'Loaded {len(meetingbank[\"train\"])} training instances')
"

# 3. Create discovery/meetingbank_ingestion.py
#    - Parse meetings
#    - Extract URLs
#    - Load to Bronze layer
#    - Run keyword detection on transcripts
#    - Evaluate against human summaries
```

### Expected ROI:
- **Time**: 2 hours to integrate
- **Value**: 1,366 meetings with transcripts + summaries + URLs
- **Quality**: Academic benchmark (peer-reviewed, ACL published)
- **Coverage**: 6 major cities (all large, high-value for advocacy)

---

## 2. LocalView (Already Covered)
**Status**: Already identified in previous investigation
**Location**: Harvard Dataverse (doi:10.7910/DVN/NJTBEM)
**Coverage**: 1,000-10,000 jurisdictions
**Action**: Download from Harvard (already documented)

---

## 3. Council Data Project (Already Covered)
**Status**: Already integrated in [`external_url_datasets.py`](../discovery/external_url_datasets.py)
**Coverage**: 20+ cities with full pipelines
**Action**: Already coded, just run the script

---

## 4. CivicBand (Limited Usefulness)

### What It Is:
"Largest public collection of civic meeting and election finance data"
Website: https://civic.band/

### What Exists:
- **1,031 municipalities tracked**
- Millions of pages scraped (meeting minutes, agendas)
- Search interface available
- Publicly browsable

### The Problem:
**"Dataset access is via their platform; raw dumps require coordination"**
- Can't directly download a bulk URL list
- Would need to contact the founder (Philip James: hello@civic.band)
- Or scrape the municipality list from their website

### What You CAN Get:
The list of 1,031 municipalities is publicly visible on their site. You could:
1. **Scrape the municipality list** (city names + states)
2. **Match against your Census data** to get FIPS codes
3. **Use as verification** (these 1,031 are confirmed to have meeting data)
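
The matching step in item 2 can be sketched as follows. This is a minimal illustration, not the project's actual matcher: the function names, the row layout (`name`, `state`, `fips`), and the normalization rules are all assumptions, and the two sample GEOIDs should be confirmed against your own Census file. It assumes Census place names carry a legal designation suffix ("Seattle city") while scraped names may carry a "City of" prefix.

```python
import re

# Assumed conventions: scraped names may start with "City of ...",
# Census place names end with a designation such as "... city".
_LEADING = re.compile(r"^(city|town|village|borough) of ")
_CENSUS_SUFFIX = re.compile(r" (city|town|village|borough|cdp)$")


def clean(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    lowered = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return re.sub(r"\s+", " ", lowered).strip()


def scraped_key(name: str, state: str) -> tuple[str, str]:
    """Key for a scraped municipality: drop a leading 'city of' style prefix."""
    return _LEADING.sub("", clean(name)), state.strip().upper()


def census_key(name: str, state: str) -> tuple[str, str]:
    """Key for a Census row: drop the trailing legal designation only."""
    return _CENSUS_SUFFIX.sub("", clean(name)), state.strip().upper()


def match_to_fips(scraped, census_rows):
    """Return (matched, unmatched) for a list of (name, state) pairs."""
    index = {census_key(r["name"], r["state"]): r["fips"] for r in census_rows}
    matched, unmatched = {}, []
    for name, state in scraped:
        fips = index.get(scraped_key(name, state))
        if fips is None:
            unmatched.append((name, state))
        else:
            matched[(name, state)] = fips
    return matched, unmatched


# Illustrative rows only; load the real list from your Census data.
census_rows = [
    {"name": "Seattle city", "state": "WA", "fips": "5363000"},
    {"name": "Boston city", "state": "MA", "fips": "2507000"},
]
scraped = [("City of Seattle", "wa"), ("Boston", "MA"), ("Springfield", "IL")]
matched, unmatched = match_to_fips(scraped, census_rows)
```

Stripping the designation only on the Census side is deliberate: removing a trailing "city" from scraped names would turn "Kansas City" into "kansas" and break the match. Unmatched names are returned rather than dropped so they can be reviewed by hand.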

### Limited Value Because:
- Can't get direct URLs (need to coordinate with founder)
- Already have larger coverage from LocalView (1,000-10,000 jurisdictions)
- Already have premium coverage from CDP (20 cities)
- CivicBand's main value is their *content* (scraped minutes), not URLs

### Possible Action:
```python
# Scrape CivicBand's municipality list
import requests
from bs4 import BeautifulSoup

response = requests.get("https://civic.band/", timeout=30)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
# Parse the table of municipalities
# Match against Census data
# Use as validation list
```
**Estimated value**: MEDIUM (validation only, not bulk URLs)

---

## Revised Priority Ranking

### IMMEDIATE (Do This Week):
1. **Download MeetingBank** (2 hours)
   - HuggingFace dataset ready to use
   - 1,366 meetings with transcripts, summaries, URLs
   - Perfect for prototyping and evaluation

### HIGH PRIORITY (Do This Month):
2. **Download LocalView** (1 day)
   - Harvard Dataverse
   - 1,000-10,000 jurisdictions
3. **Run CDP integration** (2 hours)
   - Already coded
   - 20 premium cities

### MEDIUM PRIORITY (Optional):
4. **Scrape CivicBand list** (4 hours)
   - 1,031 municipality names
   - Use for validation
   - Or contact founder for bulk access

---

## Updated Integration Code

### Add MeetingBank to your pipeline:
```python
# discovery/meetingbank_ingestion.py
from datasets import load_dataset
from loguru import logger
from pyspark.sql import SparkSession

# NOTE: `settings` is the project's configuration object and must expose
# `delta_lake_path`; import it from wherever your project defines it.


def load_meetingbank_to_bronze(spark: SparkSession) -> dict:
    """
    Load MeetingBank dataset to Bronze layer.

    MeetingBank contains 1,366 city council meetings from 6 major cities
    with full transcripts, summaries, and source URLs.
    """
    logger.info("Loading MeetingBank dataset from HuggingFace")

    # Download from HuggingFace
    meetingbank = load_dataset("huuuyeah/meetingbank")

    meetings = []
    for split in ['train', 'validation', 'test']:
        for instance in meetingbank[split]:
            # Each row is one summarization instance. Rows may be meeting
            # segments rather than whole meetings, so check the row count
            # against the 1,366 figure. Optional fields use .get() because
            # they may be absent from the HuggingFace release.
            meetings.append({
                "meeting_id": instance['id'],
                "jurisdiction_name": instance.get('city', 'Unknown'),
                "state_code": instance.get('state', 'Unknown'),
                "transcript": instance['transcript'],
                "summary_human": instance['summary'],
                "source_url": instance.get('url', ''),
                "date": instance.get('date', ''),
                "has_transcript": True,
                "has_summary": True,
                "has_url": bool(instance.get('url')),
                "transcript_length": len(instance['transcript']),  # characters, not tokens
                "source": "meetingbank"
            })

    # Convert to DataFrame
    df = spark.createDataFrame(meetings)

    # Write to Bronze layer
    output_path = f"{settings.delta_lake_path}/bronze/meetingbank_meetings"
    df.write \
        .format("delta") \
        .mode("overwrite") \
        .save(output_path)

    logger.info(f"Loaded {len(meetings)} instances from MeetingBank")
    return {
        "total_meetings": len(meetings),
        "cities": 6,
        "source": "meetingbank"
    }
```
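
One possible refinement, sketched under assumptions: keeping the row mapping in a small pure function makes it testable without a Spark session or a download. The function name `to_bronze_record`, the extra `split` column, and the choice to skip rows with an empty transcript are illustrative and not part of the existing pipeline.

```python
from typing import Any, Optional


def to_bronze_record(instance: dict[str, Any], split: str) -> Optional[dict[str, Any]]:
    """Map one dataset row to a Bronze-layer record, or None if unusable.

    Hypothetical helper: field names mirror the ingestion code, and
    optional fields fall back to empty values when a row lacks them.
    """
    transcript = (instance.get("transcript") or "").strip()
    if not transcript:
        return None  # nothing to analyze, so skip the row

    summary = (instance.get("summary") or "").strip()
    url = instance.get("url") or ""
    return {
        "meeting_id": str(instance["id"]),
        "split": split,
        "transcript": transcript,
        "summary_human": summary,
        "source_url": url,
        "has_summary": bool(summary),
        "has_url": bool(url),
        "transcript_length": len(transcript),  # characters, not tokens
        "source": "meetingbank",
    }


# Usage with a hand-written row, so no download is needed:
example = to_bronze_record(
    {"id": 7, "transcript": " Call to order. ", "summary": "Opened."}, "train"
)
```

Inside the ingestion loop, the same function could build each record, with `None` results filtered out before the list is passed to `spark.createDataFrame`.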

### Test your keyword detection:
```python
# Test keyword detection on MeetingBank transcripts
from datasets import load_dataset
from alerts.keyword_monitor import KeywordAlertSystem

meetingbank = load_dataset("huuuyeah/meetingbank")
alert_system = KeywordAlertSystem()

# Test on the first 10 instances.
# Slicing a Dataset (`[:10]`) returns a dict of columns, not rows,
# so use select() to get an iterable of row dicts.
for instance in meetingbank['train'].select(range(10)):
    matches = alert_system._find_keywords_in_text(
        instance['transcript'],
        alert_system.KEYWORD_CATEGORIES
    )
    if matches:
        print(f"Meeting {instance['id']}: {len(matches)} oral health keywords found")
        for match in matches[:3]:  # Show first 3
            print(f"  - {match.keyword} ({match.category})")
```
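
If `KeywordAlertSystem` is not available yet, a stand-in like the one below can exercise the same loop. It is a minimal sketch, not the project's detector: the category names, keyword lists, and the `KeywordMatch` fields are assumptions chosen to mirror the `match.keyword` and `match.category` attributes used above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordMatch:
    keyword: str
    category: str
    position: int  # character offset of the hit in the text


# Hypothetical categories; replace with the project's real keyword lists.
KEYWORD_CATEGORIES = {
    "fluoridation": ["fluoride", "fluoridation", "water fluoridation"],
    "dental_access": ["dental clinic", "dentist", "oral health"],
}


def find_keywords(text: str, categories: dict[str, list[str]]) -> list[KeywordMatch]:
    """Case-insensitive whole-word search, returned in order of appearance."""
    matches = []
    for category, keywords in categories.items():
        for keyword in keywords:
            pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
            for hit in pattern.finditer(text):
                matches.append(KeywordMatch(keyword, category, hit.start()))
    return sorted(matches, key=lambda m: m.position)


sample = "The council discussed water fluoridation and funding for a dental clinic."
hits = find_keywords(sample, KEYWORD_CATEGORIES)
```

Word boundaries keep "fluoride" from matching inside "fluoridation", but overlapping phrases such as "water fluoridation" and "fluoridation" are both reported, so counts may need deduplication before they are compared across meetings.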

### Evaluate your AI summarization:
```python
# Compare your summaries against human-written ground truth
from datasets import load_dataset
from extraction.summarizer import MeetingSummarizer

summarizer = MeetingSummarizer()
meetingbank = load_dataset("huuuyeah/meetingbank")

# select() yields row dicts; slicing with [:10] would yield column names
for instance in meetingbank['test'].select(range(10)):
    # Generate your summary
    your_summary = summarizer.summarize(
        event=None,  # Create MeetingEvent from instance
        full_text=instance['transcript'],
        focus_on_health=False
    )
    # Compare against human summary
    human_summary = instance['summary']
    print(f"Meeting: {instance['id']}")
    print(f"Your summary: {your_summary.executive_summary}")
    print(f"Human summary: {human_summary}")
    print(f"Quality: {your_summary.confidence_score}")
    print()
```
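
The loop above prints both summaries side by side but does not score them. One rough way to quantify the comparison is unigram overlap in the style of ROUGE-1. The sketch below is a simplified stand-in, assuming plain lowercase tokenization with no stemming, so its numbers will not match an official ROUGE implementation exactly; the function names are illustrative.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; no stemming or stopword removal."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated and a reference summary."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "The council approved the budget",
    "The council approved the annual budget",
)
```

In the evaluation loop this could be called as `rouge1_f1(your_summary.executive_summary, human_summary)` and averaged over the test instances. Overlap metrics reward shared wording rather than factual accuracy, so they are best read alongside a manual spot check.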

---

## Expected Outcomes

### Before MeetingBank:
- 76 URLs discovered (15% match rate)
- No evaluation benchmark
- No ground truth for summarization

### After MeetingBank:
- **+1,366 meetings** with transcripts
- **+6 major cities** with verified URLs
- **Academic benchmark** for evaluation
- **Human summaries** for quality validation
- **Total meetings**: 1,366 ready to analyze immediately

---

## Final Recommendation

### DO THIS FIRST (2 hours):
```bash
# 1. Install HuggingFace datasets
pip install datasets

# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'Downloaded {len(meetingbank[\"train\"])} training instances')
"

# 3. Create integration script
#    See code example above
# 4. Test your keyword detection
#    See test code above
# 5. Evaluate your summarization
#    See evaluation code above
```

### Expected Result:
- **Immediate access** to 1,366 meetings
- **6 major cities** for prototyping
- **Academic quality** benchmark
- **Credibility**: Published at a top NLP conference (ACL 2023)

---

## Summary Table

| Dataset | Available? | Estimated Effort | Coverage | Usefulness |
|---------|-----------|------------------|----------|------------|
| **MeetingBank** | **YES** (HuggingFace) | **5 minutes** | **1,366 meetings** | **VERY HIGH** |
| **LocalView** | YES (Harvard) | 1 day | 1,000-10,000 jurisdictions | VERY HIGH |
| **CDP** | YES (already coded) | 2 hours | 20 cities | HIGH |
| **CivicBand** | PARTIAL (need coordination) | 4 hours | 1,031 municipalities (names only) | MEDIUM |

**Bottom line**: MeetingBank is the fastest win! Download it today and start testing your summarization and keyword detection on real city council meeting transcripts.