✅ Confirmed: HuggingFace Datasets That WILL Help

Quick Answer: YES, 2 of 4 will help significantly!

Dataset	Status	Usefulness	Priority
MeetingBank	✅ READY TO USE	🔥 VERY HIGH	USE IMMEDIATELY
LocalView	✅ Already covered	HIGH	Download from Harvard
Council Data Project	✅ Already covered	HIGH	Already integrated
CivicBand	⚠️ Limited access	MEDIUM	Scrape municipality list

1. MeetingBank 🔥 (NEW! USE THIS!)

What It Is:

A benchmark dataset of council meetings from 6 major U.S. cities, designed specifically for meeting summarization.

URLs:

HuggingFace (text): https://huggingface.co/datasets/huuuyeah/meetingbank
HuggingFace (audio): https://huggingface.co/datasets/huuuyeah/MeetingBank_Audio
Zenodo (all files): https://zenodo.org/record/7989108
Archive.org (videos):

What You Get:

✅ 1,366 city council meetings from 6 cities:

Alameda, CA
Boston, MA
Denver, CO
King County, WA
Long Beach, CA
Seattle, WA

✅ 3,579 hours of video

✅ Full transcripts (average 28,000 tokens per meeting)

✅ PDF meeting minutes & agendas

✅ Human-written summaries (ground truth for evaluation)

✅ Machine-generated summaries (from 6 different systems)

✅ 6,892 segment-level summarization instances for training

Why This Is PERFECT for Your Project:

Immediate prototyping: Download from HuggingFace in 5 minutes

from datasets import load_dataset
meetingbank = load_dataset("huuuyeah/meetingbank")

for instance in meetingbank['train']:
    print(instance['id'])
    print(instance['summary'])
    print(instance['transcript'])

Quality validation: Compare your AI summarization against human-written summaries
URL discovery: Each meeting has source URLs to city websites
Benchmark your oral health keyword detection: Test against 1,366 real transcripts
Training data: Available if you want to fine-tune models for oral health policy
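For quick prototyping it is usually enough to look at a handful of instances rather than a whole split. The sketch below uses `itertools.islice`, which works on any iterable of dict-like rows. The helper name `preview` and the stand-in rows are illustrative assumptions, not part of MeetingBank or this project.

```python
from itertools import islice
from typing import Iterable, Iterator, Mapping


def preview(rows: Iterable[Mapping], n: int = 3, max_chars: int = 80) -> Iterator[dict]:
    """Yield the first n rows with long string fields truncated for display."""
    for row in islice(rows, n):
        yield {
            key: (value[:max_chars] + "...")
            if isinstance(value, str) and len(value) > max_chars
            else value
            for key, value in row.items()
        }


# Stand-in rows using the field names read in the loop above (id, summary, transcript).
sample_rows = [
    {"id": f"demo_{i}", "summary": "Short summary.", "transcript": "word " * 100}
    for i in range(5)
]

for row in preview(sample_rows, n=2):
    print(row["id"], row["transcript"])
```

Iterating a loaded split yields one dict per row, so `preview(meetingbank["train"], n=3)` should work the same way on the real data.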

Paper:

"MeetingBank: A Benchmark Dataset for Meeting Summarization"
ACL 2023 (Association for Computational Linguistics)
https://arxiv.org/abs/2305.17529

🎯 ACTION PLAN:

# 1. Install HuggingFace datasets
pip install datasets

# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'Loaded {len(meetingbank[\"train\"])} training instances')
"

# 3. Create discovery/meetingbank_ingestion.py
# - Parse meetings
# - Extract URLs
# - Load to Bronze layer
# - Run keyword detection on transcripts
# - Evaluate against human summaries

Expected ROI:

Time: 2 hours to integrate
Value: 1,366 meetings with transcripts + summaries + URLs
Quality: Academic benchmark (peer-reviewed, ACL published)
Coverage: 6 major cities (all large, high-value for advocacy)

2. LocalView ✅ (Already Covered)

Status: Already identified in previous investigation
Location: Harvard Dataverse (doi:10.7910/DVN/NJTBEM)
Coverage: 1,000-10,000 jurisdictions
Action: Download from Harvard (already documented)

3. Council Data Project ✅ (Already Covered)

Status: Already integrated in external_url_datasets.py
Coverage: 20+ cities with full pipelines
Action: Already coded, just run the script

4. CivicBand ⚠️ (Limited Usefulness)

What It Is:

"Largest public collection of civic meeting and election finance data"
Website: https://civic.band/

What Exists:

✅ 1,031 municipalities tracked
✅ Millions of pages scraped (meeting minutes, agendas)
✅ Search interface available
✅ Publicly browsable

The Problem:

❌ "Dataset access is via their platform; raw dumps require coordination"

Can't directly download bulk URL list
Would need to contact founder (Philip James: hello@civic.band)
Or scrape the municipality list from their website

What You CAN Get:

The list of 1,031 municipalities is publicly visible on their site. You could:

Scrape the municipality list (city names + states)
Match against your Census data to get FIPS codes
Use as verification (these 1,031 are confirmed to have meeting data)
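The matching step above can be sketched as a join on a normalized (name, state) key. The field names (`name`, `state`, `fips`), the suffix list, and the sample rows are assumptions for illustration; real Census place files use their own column names, and the FIPS values below are placeholders rather than real codes.

```python
import re
from typing import Iterable, Mapping

# Census-style suffixes to strip before comparing names (illustrative list).
_SUFFIXES = re.compile(r"\s+(city|town|village|borough|township|county)$", re.IGNORECASE)


def normalize(name: str, state: str) -> tuple[str, str]:
    """Build a case-insensitive join key from a place name and state code."""
    cleaned = _SUFFIXES.sub("", name.strip())
    cleaned = re.sub(r"[^a-z0-9 ]", "", cleaned.lower())
    return (re.sub(r"\s+", " ", cleaned), state.strip().upper())


def match_fips(scraped: Iterable[Mapping[str, str]], census: Iterable[Mapping[str, str]]):
    """Return (matched, unmatched) lists; matched rows gain a 'fips' field."""
    index = {normalize(row["name"], row["state"]): row["fips"] for row in census}
    matched, unmatched = [], []
    for row in scraped:
        fips = index.get(normalize(row["name"], row["state"]))
        if fips is None:
            unmatched.append(dict(row))
        else:
            matched.append({**row, "fips": fips})
    return matched, unmatched


# Hypothetical inputs with placeholder FIPS values.
census_rows = [
    {"name": "Alameda city", "state": "CA", "fips": "0000001"},
    {"name": "Long Beach city", "state": "CA", "fips": "0000002"},
]
scraped_rows = [
    {"name": "alameda", "state": "ca"},
    {"name": "Springfield", "state": "IL"},
]

matched, unmatched = match_fips(scraped_rows, census_rows)
print(len(matched), "matched;", len(unmatched), "unmatched")
```

Stripping suffixes can merge distinct places that share a base name within a state, so the unmatched and ambiguous rows are worth reviewing by hand before the list is used for validation.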

Limited Value Because:

Can't get direct URLs (need to coordinate with founder)
Already have larger coverage from LocalView (1,000-10,000 jurisdictions)
Already have premium coverage from CDP (20 cities)
CivicBand's main value is their content (scraped minutes), not URLs

Possible Action:

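# Scrape CivicBand's municipality list
import requests
from bs4 import BeautifulSoup

response = requests.get("https://civic.band/", timeout=30)
response.raise_for_status()  # Fail loudly on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')

# Parse the table of municipalities
# Assumption: the list is rendered as HTML table rows; adjust to the real markup
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in soup.find_all("tr")
]
municipalities = [row for row in rows if row]  # Drop header/empty rows

# Match against Census data
# Use as validation list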

Estimated value: MEDIUM (validation only, not bulk URLs)

📊 Revised Priority Ranking

IMMEDIATE (Do This Week):

🔥 Download MeetingBank (2 hours)
- HuggingFace dataset ready to use
- 1,366 meetings with transcripts, summaries, URLs
- Perfect for prototyping and evaluation

HIGH PRIORITY (Do This Month):

✅ Download LocalView (1 day)
- Harvard Dataverse
- 1,000-10,000 jurisdictions
✅ Run CDP integration (2 hours)
- Already coded
- 20 premium cities

MEDIUM PRIORITY (Optional):

⚠️ Scrape CivicBand list (4 hours)
- 1,031 municipality names
- Use for validation
- Or contact founder for bulk access

🎯 Updated Integration Code

Add MeetingBank to your pipeline:

# discovery/meetingbank_ingestion.py

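from datasets import load_dataset
from pyspark.sql import SparkSession
from loguru import logger

# Assumption: `settings` is the project's config object exposing `delta_lake_path`;
# adjust this import path to match the repo layout.
from config import settings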

def load_meetingbank_to_bronze(spark: SparkSession) -> dict:
    """
    Load MeetingBank dataset to Bronze layer.
    
    MeetingBank contains 1,366 city council meetings from 6 major cities
    with full transcripts, summaries, and source URLs.
    """
    logger.info("Loading MeetingBank dataset from HuggingFace")
    
    # Download from HuggingFace
    meetingbank = load_dataset("huuuyeah/meetingbank")
    
    meetings = []
    
    for split in ['train', 'validation', 'test']:
        for instance in meetingbank[split]:
            meetings.append({
                "meeting_id": instance['id'],
                "jurisdiction_name": instance.get('city', 'Unknown'),
                "state_code": instance.get('state', 'Unknown'),
                "transcript": instance['transcript'],
                "summary_human": instance['summary'],
                "source_url": instance.get('url', ''),
                "date": instance.get('date', ''),
                "has_transcript": True,
                "has_summary": True,
                "has_url": bool(instance.get('url')),
                "transcript_length": len(instance['transcript']),
                "source": "meetingbank"
            })
    
    # Convert to DataFrame
    df = spark.createDataFrame(meetings)
    
    # Write to Bronze layer
    output_path = f"{settings.delta_lake_path}/bronze/meetingbank_meetings"
    df.write \
        .format("delta") \
        .mode("overwrite") \
        .save(output_path)
    
    logger.info(f"✅ Loaded {len(meetings)} meetings from MeetingBank")
    
    return {
        "total_meetings": len(meetings),
        "cities": 6,
        "source": "meetingbank"
    }
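Keeping row construction separate from the Spark write makes the field mapping testable without a cluster. The sketch below reuses the output fields from the function above; the helper name `to_bronze_record` and the extra `split` column are hypothetical. It assumes only `id` is guaranteed on each instance and falls back to defaults for everything else.

```python
from typing import Any, Mapping


def to_bronze_record(instance: Mapping[str, Any], split: str) -> dict:
    """Map one MeetingBank-style instance to a flat Bronze-layer row."""
    transcript = instance.get("transcript") or ""
    summary = instance.get("summary") or ""
    url = instance.get("url") or ""
    return {
        "meeting_id": instance["id"],
        "jurisdiction_name": instance.get("city", "Unknown"),
        "state_code": instance.get("state", "Unknown"),
        "transcript": transcript,
        "summary_human": summary,
        "source_url": url,
        "date": instance.get("date", ""),
        # Derive the flags from the data instead of hard-coding True.
        "has_transcript": bool(transcript.strip()),
        "has_summary": bool(summary.strip()),
        "has_url": bool(url),
        "transcript_length": len(transcript),
        "split": split,  # hypothetical extra column recording the source split
        "source": "meetingbank",
    }


# Stand-in instance; real rows would come from iterating a dataset split.
record = to_bronze_record(
    {"id": "demo_1", "summary": "Council approved the budget.", "transcript": ""},
    split="train",
)
print(record["has_transcript"], record["has_summary"], record["has_url"])
```

Because every row gets the same keys with string, boolean, or integer values, the resulting list of dicts has a consistent shape to hand to `spark.createDataFrame`.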

Test your keyword detection:

# Test keyword detection on MeetingBank transcripts
from datasets import load_dataset
from alerts.keyword_monitor import KeywordAlertSystem

meetingbank = load_dataset("huuuyeah/meetingbank")
alert_system = KeywordAlertSystem()

# Test on the first 10 instances (select() yields row dicts; slicing with [:10] returns a dict of columns)
for instance in meetingbank['train'].select(range(10)):
    matches = alert_system._find_keywords_in_text(
        instance['transcript'],
        alert_system.KEYWORD_CATEGORIES
    )
    
    if matches:
        print(f"Meeting {instance['id']}: {len(matches)} oral health keywords found")
        for match in matches[:3]:  # Show first 3
            print(f"  - {match.keyword} ({match.category})")
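`KeywordAlertSystem` is internal to this project, so the test above cannot run on its own. As a stand-in, here is a minimal sketch of category-based keyword matching with word-boundary regular expressions. The categories and keywords are made-up examples, not the project's `KEYWORD_CATEGORIES`, and `KeywordMatch` mirrors only the `.keyword` and `.category` attributes used above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordMatch:
    keyword: str
    category: str
    position: int  # character offset of the match in the text


# Illustrative categories only; substitute the project's real keyword lists.
EXAMPLE_CATEGORIES = {
    "fluoridation": ["fluoride", "water fluoridation"],
    "dental_access": ["dental clinic", "oral health"],
}


def find_keywords(text: str, categories: dict[str, list[str]]) -> list[KeywordMatch]:
    """Return case-insensitive whole-word matches, ordered by position."""
    matches = []
    for category, keywords in categories.items():
        for keyword in keywords:
            pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
            for hit in pattern.finditer(text):
                matches.append(KeywordMatch(keyword, category, hit.start()))
    return sorted(matches, key=lambda m: m.position)


sample = "The council discussed water fluoridation and funding for a dental clinic."
for match in find_keywords(sample, EXAMPLE_CATEGORIES):
    print(f"  - {match.keyword} ({match.category})")
```

Whole-word matching avoids false positives inside longer words, but it also misses inflected forms such as "fluoridated", so keyword lists need to enumerate the variants that matter.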

Evaluate your AI summarization:

# Compare your summaries against human-written ground truth
from extraction.summarizer import MeetingSummarizer
from datasets import load_dataset

summarizer = MeetingSummarizer()
meetingbank = load_dataset("huuuyeah/meetingbank")

for instance in meetingbank['test'].select(range(10)):  # select() yields row dicts; [:10] would return columns
    # Generate your summary
    your_summary = summarizer.summarize(
        event=None,  # Create MeetingEvent from instance
        full_text=instance['transcript'],
        focus_on_health=False
    )
    
    # Compare against human summary
    human_summary = instance['summary']
    
    print(f"Meeting: {instance['id']}")
    print(f"Your summary: {your_summary.executive_summary}")
    print(f"Human summary: {human_summary}")
    print(f"Quality: {your_summary.confidence_score}")
    print()
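To score a generated summary against the human reference directly, a simple overlap metric is a reasonable starting point. The sketch below computes a ROUGE-1 style unigram F1 using only the standard library. It is deliberately simplified (regex tokenization, no stemming), so its numbers will not match official ROUGE implementations; the function name `rouge1_f1` and the sample sentences are illustrative.

```python
import re
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


human = "The council approved the annual budget."
generated = "Council approved the budget after debate."
print(f"ROUGE-1 F1: {rouge1_f1(generated, human):.2f}")
```

In the evaluation loop, calling `rouge1_f1(your_summary.executive_summary, instance['summary'])` and averaging over the test instances would give a single number to track across summarizer changes.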

📈 Expected Outcomes

Before MeetingBank:

76 URLs discovered (15% match rate)
No evaluation benchmark
No ground truth for summarization

After MeetingBank:

+1,366 meetings with transcripts
+6 major cities with verified URLs
Academic benchmark for evaluation
Human summaries for quality validation
Total meetings: 1,366 ready to analyze immediately

🚀 Final Recommendation

DO THIS FIRST (2 hours):

# 1. Install HuggingFace datasets
pip install datasets

# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'✅ Downloaded {len(meetingbank[\"train\"])} meetings')
"

# 3. Create integration script
# See code example above

# 4. Test your keyword detection
# See test code above

# 5. Evaluate your summarization
# See evaluation code above

Expected Result:

Immediate access to 1,366 meetings
6 major cities for prototyping
Academic quality benchmark
Credible source: Published at a top NLP conference (ACL 2023)

Summary Table

Dataset	Available?	Time to Get	Coverage	Usefulness
MeetingBank	✅ YES (HuggingFace)	5 minutes	1,366	🔥 VERY HIGH
LocalView	✅ YES (Harvard)	1 day	1,000-10,000	🔥 VERY HIGH
CDP	✅ YES (already coded)	2 hours	20 cities	🔥 HIGH
CivicBand	⚠️ PARTIAL (need coordination)	4 hours	1,031 list	🟡 MEDIUM

Bottom line: MeetingBank is the fastest win! Download it today and start testing your summarization and keyword detection on real city council meeting transcripts.