
✅ Confirmed: HuggingFace Datasets That WILL Help

Quick Answer: YES, 2 of 4 will help significantly!

| Dataset | Status | Usefulness | Priority |
|---|---|---|---|
| MeetingBank | ✅ READY TO USE | 🔥 VERY HIGH | USE IMMEDIATELY |
| LocalView | ✅ Already covered | HIGH | Download from Harvard |
| Council Data Project | ✅ Already covered | HIGH | Already integrated |
| CivicBand | ⚠️ Limited access | MEDIUM | Scrape municipality list |

1. MeetingBank 🔥 (NEW! USE THIS!)

What It Is:

A benchmark dataset built from the city council meetings of 6 major U.S. cities, designed specifically for meeting summarization.

URLs:

  • HuggingFace dataset ID: huuuyeah/meetingbank
  • Paper: https://arxiv.org/abs/2305.17529

What You Get:

✅ 1,366 city council meetings from 6 cities:

  • Alameda, CA
  • Boston, MA
  • Denver, CO
  • King County, WA
  • Long Beach, CA
  • Seattle, WA

✅ 3,579 hours of video

✅ Full transcripts (average 28,000 tokens per meeting)

✅ PDF meeting minutes & agendas

✅ Human-written summaries (ground truth for evaluation)

✅ Machine-generated summaries (from 6 different systems)

✅ 6,892 segment-level summarization instances for training
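The headline figures above can be turned into rough sizing estimates, which helps when budgeting storage or LLM token costs. The sketch below only does arithmetic on the numbers quoted in this list; the constant and function names are illustrative, and the total token count is an estimate derived from the stated average, not a measured value.

```python
# Headline figures quoted in the list above.
MEETINGS = 1_366
VIDEO_HOURS = 3_579
AVG_TOKENS_PER_TRANSCRIPT = 28_000
SEGMENT_INSTANCES = 6_892


def sizing_estimates() -> dict:
    """Derive rough per-meeting and corpus-level figures from the headline numbers."""
    return {
        "avg_hours_per_meeting": round(VIDEO_HOURS / MEETINGS, 2),
        "avg_segments_per_meeting": round(SEGMENT_INSTANCES / MEETINGS, 2),
        # Average tokens x meeting count: an estimate, not an exact total.
        "approx_total_tokens": MEETINGS * AVG_TOKENS_PER_TRANSCRIPT,
    }


if __name__ == "__main__":
    for name, value in sizing_estimates().items():
        print(f"{name}: {value:,}")
```

On these figures a meeting runs about 2.6 hours, splits into roughly 5 segment-level instances, and the full corpus is on the order of 38 million transcript tokens.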

Why This Is PERFECT for Your Project:

  1. Immediate prototyping: Download from HuggingFace in 5 minutes

    from datasets import load_dataset
    meetingbank = load_dataset("huuuyeah/meetingbank")
    
    for instance in meetingbank['train']:
        print(instance['id'])
        print(instance['summary'])
        print(instance['transcript'])
    
  2. Quality validation: Compare your AI summarization against human-written summaries

  3. URL discovery: Each meeting has source URLs to city websites

  4. Benchmark your oral health keyword detection: Test against 1,366 real transcripts

  5. Training data: Usable for fine-tuning models on oral health policy content

Paper:

"MeetingBank: A Benchmark Dataset for Meeting Summarization"
ACL 2023 (Association for Computational Linguistics)
https://arxiv.org/abs/2305.17529

🎯 ACTION PLAN:

# 1. Install HuggingFace datasets
pip install datasets

# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
# Escaped double quotes avoid reusing the f-string's own quote character
print(f'Loaded {len(meetingbank[\"train\"])} training instances')
"

# 3. Create discovery/meetingbank_ingestion.py
# - Parse meetings
# - Extract URLs
# - Load to Bronze layer
# - Run keyword detection on transcripts
# - Evaluate against human summaries

Expected ROI:

  • Time: 2 hours to integrate
  • Value: 1,366 meetings with transcripts + summaries + URLs
  • Quality: Academic benchmark (peer-reviewed, ACL published)
  • Coverage: 6 major cities (all large, high-value for advocacy)

2. LocalView ✅ (Already Covered)

Status: Already identified in previous investigation
Location: Harvard Dataverse (doi:10.7910/DVN/NJTBEM)
Coverage: 1,000-10,000 jurisdictions
Action: Download from Harvard (already documented)


3. Council Data Project ✅ (Already Covered)

Status: Already integrated in external_url_datasets.py
Coverage: 20+ cities with full pipelines
Action: Already coded, just run the script


4. CivicBand ⚠️ (Limited Usefulness)

What It Is:

"Largest public collection of civic meeting and election finance data"
Website: https://civic.band/

What Exists:

✅ 1,031 municipalities tracked
✅ Millions of pages scraped (meeting minutes, agendas)
✅ Search interface available
✅ Publicly browsable

The Problem:

❌ "Dataset access is via their platform; raw dumps require coordination"

  • Can't directly download bulk URL list
  • Would need to contact founder (Philip James: hello@civic.band)
  • Or scrape the municipality list from their website

What You CAN Get:

The list of 1,031 municipalities is publicly visible on their site. You could:

  1. Scrape the municipality list (city names + states)
  2. Match against your Census data to get FIPS codes
  3. Use as verification (these 1,031 are confirmed to have meeting data)
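The matching step in the list above could be sketched as follows. This is a minimal illustration, assuming the scraped list yields `(name, state)` pairs and the Census data is available as rows with `name`, `state`, and `fips` keys; those shapes, the function names, and the sample rows are assumptions, so check codes against your actual Census file.

```python
import re


def normalize_place(name: str, state: str) -> tuple[str, str]:
    """Reduce a place name to a comparable key.

    Census place names end in a lowercase legal descriptor ("Boston city").
    It is stripped case-sensitively so "Kansas City" keeps its capitalised "City".
    """
    base = re.sub(r"\s+(city|town|village|borough|township|CDP)$", "", name.strip())
    base = re.sub(r"[^a-z0-9 ]", "", base.lower())  # drop punctuation such as "St."
    base = re.sub(r"\s+", " ", base).strip()
    return base, state.strip().upper()


def match_to_fips(scraped, census_rows):
    """Split scraped (name, state) pairs into FIPS-matched and unmatched."""
    index = {normalize_place(r["name"], r["state"]): r["fips"] for r in census_rows}
    matched, unmatched = {}, []
    for name, state in scraped:
        key = normalize_place(name, state)
        if key in index:
            matched[(name, state)] = index[key]
        else:
            unmatched.append((name, state))
    return matched, unmatched


if __name__ == "__main__":
    # Illustrative rows only; load the real lookup from your Census data.
    census = [
        {"name": "Boston city", "state": "MA", "fips": "2507000"},
        {"name": "Seattle city", "state": "WA", "fips": "5363000"},
    ]
    scraped = [("Boston", "MA"), ("seattle", "wa"), ("Exampleville", "CA")]
    matched, unmatched = match_to_fips(scraped, census)
    print(matched)
    print(unmatched)
```

Exact matching on a normalized key is deliberately conservative: unmatched names are returned for manual review rather than guessed at with fuzzy matching.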

Limited Value Because:

  • Can't get direct URLs (need to coordinate with founder)
  • Already have larger coverage from LocalView (1,000-10,000 jurisdictions)
  • Already have premium coverage from CDP (20+ cities)
  • CivicBand's main value is their content (scraped minutes), not URLs

Possible Action:

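# Scrape CivicBand's municipality list
import requests
from bs4 import BeautifulSoup

# Fetch the landing page; fail fast on network or HTTP errors
response = requests.get("https://civic.band/", timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Next steps (selectors depend on the site's current markup, so inspect the page first):
# - Parse the table of municipalities
# - Match against Census data
# - Use as validation list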

Estimated value: MEDIUM (validation only, not bulk URLs)


📊 Revised Priority Ranking

IMMEDIATE (Do This Week):

  1. 🔥 Download MeetingBank (2 hours)
    • HuggingFace dataset ready to use
    • 1,366 meetings with transcripts, summaries, URLs
    • Perfect for prototyping and evaluation

HIGH PRIORITY (Do This Month):

  1. ✅ Download LocalView (1 day)
    • Harvard Dataverse
    • 1,000-10,000 jurisdictions

  2. ✅ Run CDP integration (2 hours)
    • Already coded
    • 20+ premium cities

MEDIUM PRIORITY (Optional):

  1. ⚠️ Scrape CivicBand list (4 hours)
    • 1,031 municipality names
    • Use for validation
    • Or contact founder for bulk access

🎯 Updated Integration Code

Add MeetingBank to your pipeline:

# discovery/meetingbank_ingestion.py

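from datasets import load_dataset
from pyspark.sql import SparkSession
from loguru import logger

# Assumption: the project exposes a settings object with delta_lake_path;
# adjust this import to wherever your configuration actually lives.
from config import settings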

def load_meetingbank_to_bronze(spark: SparkSession) -> dict:
    """
    Load MeetingBank dataset to Bronze layer.
    
    MeetingBank contains 1,366 city council meetings from 6 major cities
    with full transcripts, summaries, and source URLs.
    """
    logger.info("Loading MeetingBank dataset from HuggingFace")
    
    # Download from HuggingFace
    meetingbank = load_dataset("huuuyeah/meetingbank")
    
    meetings = []
    
    for split in ['train', 'validation', 'test']:
        for instance in meetingbank[split]:
            meetings.append({
                "meeting_id": instance['id'],
                "jurisdiction_name": instance.get('city', 'Unknown'),
                "state_code": instance.get('state', 'Unknown'),
                "transcript": instance['transcript'],
                "summary_human": instance['summary'],
                "source_url": instance.get('url', ''),
                "date": instance.get('date', ''),
                "has_transcript": True,
                "has_summary": True,
                "has_url": bool(instance.get('url')),
                "transcript_length": len(instance['transcript']),
                "source": "meetingbank"
            })
    
    # Convert to DataFrame
    df = spark.createDataFrame(meetings)
    
    # Write to Bronze layer
    output_path = f"{settings.delta_lake_path}/bronze/meetingbank_meetings"
    df.write \
        .format("delta") \
        .mode("overwrite") \
        .save(output_path)
    
    logger.info(f"✅ Loaded {len(meetings)} MeetingBank instances")
    
    return {
        "total_meetings": len(meetings),
        "cities": 6,
        "source": "meetingbank"
    }
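The row-mapping step inside the loader can also be factored into a pure function so it can be unit-tested without Spark or a network download. The sketch below mirrors the fields used above; `to_bronze_record` is a hypothetical helper name, and the `city`, `state`, `url`, and `date` keys are treated as optional because they may not be present in the HuggingFace rows.

```python
def to_bronze_record(instance: dict, source: str = "meetingbank") -> dict:
    """Map one dataset instance to a Bronze-layer row, tolerating missing fields."""
    transcript = instance.get("transcript") or ""
    summary = instance.get("summary") or ""
    url = instance.get("url") or ""
    return {
        "meeting_id": str(instance["id"]),
        "jurisdiction_name": instance.get("city") or "Unknown",
        "state_code": instance.get("state") or "Unknown",
        "transcript": transcript,
        "summary_human": summary,
        "source_url": url,
        "date": instance.get("date") or "",
        # Flags reflect the actual content instead of being hard-coded to True.
        "has_transcript": bool(transcript.strip()),
        "has_summary": bool(summary.strip()),
        "has_url": bool(url),
        "transcript_length": len(transcript),
        "source": source,
    }


if __name__ == "__main__":
    sample = {"id": "example_001", "transcript": "Call to order.", "summary": ""}
    print(to_bronze_record(sample))
```

Deriving `has_transcript` and `has_summary` from the text keeps the flags honest if a row ever arrives with an empty field.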

Test your keyword detection:

# Test keyword detection on MeetingBank transcripts
from datasets import load_dataset
from alerts.keyword_monitor import KeywordAlertSystem

meetingbank = load_dataset("huuuyeah/meetingbank")
alert_system = KeywordAlertSystem()

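# Test on the first 10 instances.
# Slicing a Dataset ([:10]) returns a dict of columns, so use select() to iterate rows.
for instance in meetingbank['train'].select(range(10)):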
    matches = alert_system._find_keywords_in_text(
        instance['transcript'],
        alert_system.KEYWORD_CATEGORIES
    )
    
    if matches:
        print(f"Meeting {instance['id']}: {len(matches)} oral health keywords found")
        for match in matches[:3]:  # Show first 3
            print(f"  - {match.keyword} ({match.category})")
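To benchmark detection across many transcripts rather than printing matches one by one, a hit-rate summary can help. The sketch below uses a stand-in matcher because `KeywordAlertSystem` is project-internal; `EXAMPLE_CATEGORIES`, `find_keywords`, and `hit_rate` are hypothetical names, and the keyword list is illustrative rather than the project's real one.

```python
import re
from collections import Counter

# Hypothetical categories; the real list lives in KeywordAlertSystem.KEYWORD_CATEGORIES.
EXAMPLE_CATEGORIES = {
    "fluoridation": ["fluoride", "fluoridation"],
    "dental_care": ["dental", "dentist", "oral health"],
}


def find_keywords(text: str, categories: dict) -> list:
    """Return (keyword, category) pairs found as whole words, case-insensitively."""
    hits = []
    for category, keywords in categories.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                hits.append((keyword, category))
    return hits


def hit_rate(transcripts: list, categories: dict) -> tuple:
    """Return (share of transcripts with any hit, transcripts-per-category counts)."""
    per_category = Counter()
    with_hits = 0
    for text in transcripts:
        found = {category for _, category in find_keywords(text, categories)}
        if found:
            with_hits += 1
        per_category.update(found)
    share = with_hits / len(transcripts) if transcripts else 0.0
    return share, dict(per_category)


if __name__ == "__main__":
    samples = [
        "The council discussed water fluoridation and the dental clinic.",
        "Budget hearing on road repairs.",
        "Accidental spending was reviewed.",  # "dental" inside a word must not match
    ]
    print(hit_rate(samples, EXAMPLE_CATEGORIES))
```

Whole-word matching is a deliberate choice here: it avoids false positives such as "dental" inside "accidental", at the cost of missing inflected forms that are not listed explicitly.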

Evaluate your AI summarization:

# Compare your summaries against human-written ground truth
from extraction.summarizer import MeetingSummarizer
from datasets import load_dataset

summarizer = MeetingSummarizer()
meetingbank = load_dataset("huuuyeah/meetingbank")

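# select() yields row dicts; slicing with [:10] would return a dict of columns instead
for instance in meetingbank['test'].select(range(10)):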
    # Generate your summary
    your_summary = summarizer.summarize(
        event=None,  # Create MeetingEvent from instance
        full_text=instance['transcript'],
        focus_on_health=False
    )
    
    # Compare against human summary
    human_summary = instance['summary']
    
    print(f"Meeting: {instance['id']}")
    print(f"Your summary: {your_summary.executive_summary}")
    print(f"Human summary: {human_summary}")
    print(f"Quality: {your_summary.confidence_score}")
    print()
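Printing both summaries side by side only supports eyeballing; a simple overlap score makes the comparison measurable. The sketch below computes a ROUGE-1-style unigram F1 in plain Python. It is not the official ROUGE implementation (no stemming or stopword handling), and the function names are illustrative.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: harmonic mean of unigram precision and recall."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "council approved budget"
    human = "The council approved the annual budget"
    print(f"Unigram F1: {unigram_f1(generated, human):.3f}")
```

In the evaluation loop, this would be called with the generated summary text and `instance['summary']`; averaging the score over the test rows gives a single number to track between summarizer versions.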

📈 Expected Outcomes

Before MeetingBank:

  • 76 URLs discovered (15% match rate)
  • No evaluation benchmark
  • No ground truth for summarization

After MeetingBank:

  • +1,366 meetings with transcripts
  • +6 major cities with verified URLs
  • Academic benchmark for evaluation
  • Human summaries for quality validation
  • Total meetings: 1,366 ready to analyze immediately

🚀 Final Recommendation

DO THIS FIRST (2 hours):

# 1. Install HuggingFace datasets
pip install datasets

# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'✅ Downloaded {len(meetingbank[\"train\"])} training instances')
"

# 3. Create integration script
# See code example above

# 4. Test your keyword detection
# See test code above

# 5. Evaluate your summarization
# See evaluation code above

Expected Result:

  • Immediate access to 1,366 meetings
  • 6 major cities for prototyping
  • Academic quality benchmark
  • Peer-reviewed source: published at a top NLP conference (ACL 2023)

Summary Table

| Dataset | Available? | Download Time | Coverage | Usefulness |
|---|---|---|---|---|
| MeetingBank | ✅ YES (HuggingFace) | 5 minutes | 1,366 meetings | 🔥 VERY HIGH |
| LocalView | ✅ YES (Harvard) | 1 day | 1,000-10,000 jurisdictions | 🔥 VERY HIGH |
| CDP | ✅ YES (already coded) | 2 hours | 20+ cities | 🔥 HIGH |
| CivicBand | ⚠️ PARTIAL (need coordination) | 4 hours | 1,031 municipalities (list only) | 🟡 MEDIUM |

Bottom line: MeetingBank is the fastest win! Download it today and start testing your summarization and keyword detection on real city council meeting transcripts.