# ✅ Confirmed: HuggingFace Datasets That WILL Help
## Quick Answer: YES, 2 of 4 will help significantly!
| Dataset | Status | Usefulness | Priority |
|---------|--------|------------|----------|
| **MeetingBank** | ✅ **READY TO USE** | 🔥 **VERY HIGH** | **USE IMMEDIATELY** |
| **LocalView** | ✅ Already covered | HIGH | Download from Harvard |
| **Council Data Project** | ✅ Already covered | HIGH | Already integrated |
| **CivicBand** | ⚠️ Limited access | MEDIUM | Scrape municipality list |
---
## 1. MeetingBank 🔥 (NEW! USE THIS!)
### What It Is:
**A benchmark dataset from 6 major U.S. cities specifically designed for meeting summarization**
### URLs:
- **HuggingFace (text)**: https://huggingface.co/datasets/huuuyeah/meetingbank
- **HuggingFace (audio)**: https://huggingface.co/datasets/huuuyeah/MeetingBank_Audio
- **Zenodo (all files)**: https://zenodo.org/record/7989108
- **Archive.org (videos)**:
  - https://archive.org/details/meetingbank-alameda
  - https://archive.org/details/meetingbank-boston
  - https://archive.org/details/meetingbank-denver
  - https://archive.org/details/meetingbank-long-beach
  - https://archive.org/details/meetingbank-king-county
  - https://archive.org/details/meetingbank-seattle
### What You Get:
✅ **1,366 city council meetings** from 6 cities:
- Alameda, CA
- Boston, MA
- Denver, CO
- King County, WA
- Long Beach, CA
- Seattle, WA
✅ **3,579 hours of video**
✅ **Full transcripts** (average 28,000 tokens per meeting)
✅ **PDF meeting minutes & agendas**
✅ **Human-written summaries** (ground truth for evaluation)
✅ **Machine-generated summaries** (from 6 different systems)
✅ **6,892 segment-level summarization instances** for training and evaluation (split into train, validation, and test sets)
### Why This Is PERFECT for Your Project:
1. **Immediate prototyping**: Download from HuggingFace in 5 minutes
   ```python
   from datasets import load_dataset

   meetingbank = load_dataset("huuuyeah/meetingbank")
   for instance in meetingbank['train']:
       print(instance['id'])
       print(instance['summary'])
       print(instance['transcript'])
   ```
2. **Quality validation**: Compare your AI summarization against human-written summaries
3. **URL discovery**: Each meeting has source URLs to city websites
4. **Benchmark your oral health keyword detection**: Test against 1,366 real transcripts
5. **Training data**: If you want to fine-tune models for oral health policy
### Paper:
"MeetingBank: A Benchmark Dataset for Meeting Summarization"
ACL 2023 (Association for Computational Linguistics)
https://arxiv.org/abs/2305.17529
### 🎯 ACTION PLAN:
```bash
# 1. Install HuggingFace datasets
pip install datasets
# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'Loaded {len(meetingbank[\"train\"])} training instances')
"
# 3. Create discovery/meetingbank_ingestion.py
# - Parse meetings
# - Extract URLs
# - Load to Bronze layer
# - Run keyword detection on transcripts
# - Evaluate against human summaries
```
### Expected ROI:
- **Time**: 2 hours to integrate
- **Value**: 1,366 meetings with transcripts + summaries + URLs
- **Quality**: Academic benchmark (peer-reviewed, ACL published)
- **Coverage**: 6 major cities (all large, high-value for advocacy)
---
## 2. LocalView ✅ (Already Covered)
**Status**: Already identified in previous investigation
**Location**: Harvard Dataverse (doi:10.7910/DVN/NJTBEM)
**Coverage**: 1,000-10,000 jurisdictions
**Action**: Download from Harvard (already documented)
---
## 3. Council Data Project ✅ (Already Covered)
**Status**: Already integrated in [`external_url_datasets.py`](../discovery/external_url_datasets.py)
**Coverage**: 20+ cities with full pipelines
**Action**: Already coded, just run the script
---
## 4. CivicBand ⚠️ (Limited Usefulness)
### What It Is:
"Largest public collection of civic meeting and election finance data"
Website: https://civic.band/
### What Exists:
✅ **1,031 municipalities tracked**
✅ Millions of pages scraped (meeting minutes, agendas)
✅ Search interface available
✅ Publicly browsable
### The Problem:
❌ **"Dataset access is via their platform; raw dumps require coordination"**
- Can't directly download bulk URL list
- Would need to contact founder (Philip James: hello@civic.band)
- Or scrape the municipality list from their website
### What You CAN Get:
The list of 1,031 municipalities is publicly visible on their site. You could:
1. **Scrape the municipality list** (city names + states)
2. **Match against your Census data** to get FIPS codes
3. **Use as verification** (these 1,031 are confirmed to have meeting data)
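The matching step can be sketched roughly as follows. This is a minimal illustration, not the project's actual matcher: the function names (`census_key`, `scraped_key`, `match_to_fips`) are hypothetical, and it assumes Census place names carry a trailing legal type ("Seattle city") while scraped names may carry a leading one ("City of Seattle"). The rows in the usage note are samples only; the real lookup should come from your Census data.

```python
import re

# Census place names end with a legal type ("Seattle city"); scraped names may
# start with one ("City of Seattle"). Both patterns are assumptions.
_LEGAL_TYPE_SUFFIX = re.compile(r" (city|town|village|borough|township)$")
_LEADING_TYPE = re.compile(r"^(city|town|village|borough|township) of ")


def _clean(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    lowered = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return re.sub(r"\s+", " ", lowered).strip()


def census_key(place_name: str, state: str) -> tuple[str, str]:
    """Build a lookup key from a Census-style name such as 'Seattle city'."""
    return _LEGAL_TYPE_SUFFIX.sub("", _clean(place_name)), state.upper()


def scraped_key(place_name: str, state: str) -> tuple[str, str]:
    """Build a lookup key from a scraped name such as 'City of Seattle'."""
    return _LEADING_TYPE.sub("", _clean(place_name)), state.upper()


def match_to_fips(scraped, census_rows):
    """Return (matched, unmatched) for scraped (name, state) pairs.

    census_rows is an iterable of (place_name, state, fips) tuples.
    """
    lookup = {census_key(name, st): fips for name, st, fips in census_rows}
    matched, unmatched = {}, []
    for name, st in scraped:
        fips = lookup.get(scraped_key(name, st))
        if fips is None:
            unmatched.append((name, st))
        else:
            matched[(name, st)] = fips
    return matched, unmatched
```

For example, `match_to_fips([("City of Seattle", "WA")], [("Seattle city", "WA", "5363000")])` pairs the scraped entry with the sample Census row, and anything left in `unmatched` can be reviewed by hand.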
### Limited Value Because:
- Can't get direct URLs (need to coordinate with founder)
- Already have larger coverage from LocalView (1,000-10,000 jurisdictions)
- Already have premium coverage from CDP (20 cities)
- CivicBand's main value is their *content* (scraped minutes), not URLs
### Possible Action:
```python
# Scrape CivicBand's municipality list
import requests
from bs4 import BeautifulSoup

response = requests.get("https://civic.band/", timeout=30)
response.raise_for_status()  # Stop early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
# Parse the table of municipalities
# Match against Census data
# Use as validation list
```
**Estimated value**: MEDIUM (validation only, not bulk URLs)
---
## 📊 Revised Priority Ranking
### IMMEDIATE (Do This Week):
1. 🔥 **Download MeetingBank** (2 hours)
   - HuggingFace dataset ready to use
   - 1,366 meetings with transcripts, summaries, URLs
   - Perfect for prototyping and evaluation
### HIGH PRIORITY (Do This Month):
2. ✅ **Download LocalView** (1 day)
   - Harvard Dataverse
   - 1,000-10,000 jurisdictions
3. ✅ **Run CDP integration** (2 hours)
   - Already coded
   - 20 premium cities
### MEDIUM PRIORITY (Optional):
4. ⚠️ **Scrape CivicBand list** (4 hours)
   - 1,031 municipality names
   - Use for validation
   - Or contact founder for bulk access
---
## 🎯 Updated Integration Code
### Add MeetingBank to your pipeline:
```python
# discovery/meetingbank_ingestion.py
from datasets import load_dataset
from loguru import logger
from pyspark.sql import SparkSession

# NOTE: `settings` (with a `delta_lake_path` attribute) must be imported from
# the project's configuration module; the original snippet omits that import.


def load_meetingbank_to_bronze(spark: SparkSession) -> dict:
    """
    Load MeetingBank dataset to Bronze layer.

    MeetingBank contains 1,366 city council meetings from 6 major cities
    with full transcripts, summaries, and source URLs.
    """
    logger.info("Loading MeetingBank dataset from HuggingFace")

    # Download from HuggingFace
    meetingbank = load_dataset("huuuyeah/meetingbank")

    meetings = []
    for split in ['train', 'validation', 'test']:
        for instance in meetingbank[split]:
            # .get() is used for fields that may be absent from an instance
            meetings.append({
                "meeting_id": instance['id'],
                "jurisdiction_name": instance.get('city', 'Unknown'),
                "state_code": instance.get('state', 'Unknown'),
                "transcript": instance['transcript'],
                "summary_human": instance['summary'],
                "source_url": instance.get('url', ''),
                "date": instance.get('date', ''),
                "has_transcript": True,
                "has_summary": True,
                "has_url": bool(instance.get('url')),
                "transcript_length": len(instance['transcript']),  # characters
                "source": "meetingbank"
            })

    # Convert to DataFrame
    df = spark.createDataFrame(meetings)

    # Write to Bronze layer
    output_path = f"{settings.delta_lake_path}/bronze/meetingbank_meetings"
    df.write \
        .format("delta") \
        .mode("overwrite") \
        .save(output_path)

    logger.info(f"✅ Loaded {len(meetings)} meetings from MeetingBank")
    return {
        "total_meetings": len(meetings),
        "cities": 6,
        "source": "meetingbank"
    }
```
### Test your keyword detection:
```python
# Test keyword detection on MeetingBank transcripts
from datasets import load_dataset
from alerts.keyword_monitor import KeywordAlertSystem

meetingbank = load_dataset("huuuyeah/meetingbank")
alert_system = KeywordAlertSystem()

# Test on the first 10 instances. Slicing a Dataset (train[:10]) returns a
# dict of columns rather than rows, so use select() to iterate over rows.
for instance in meetingbank['train'].select(range(10)):
    matches = alert_system._find_keywords_in_text(
        instance['transcript'],
        alert_system.KEYWORD_CATEGORIES
    )
    if matches:
        print(f"Meeting {instance['id']}: {len(matches)} oral health keywords found")
        for match in matches[:3]:  # Show first 3
            print(f"  - {match.keyword} ({match.category})")
```
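If the project's `KeywordAlertSystem` is not importable yet, a standalone scan can stand in for quick experiments. The sketch below is only an approximation: `find_keywords` and `EXAMPLE_KEYWORDS` are hypothetical names, the keyword list is illustrative rather than the project's real categories, and matching is plain whole-word, case-insensitive regex.

```python
import re
from collections import Counter

# Hypothetical keyword categories for illustration only; the project's real
# list lives in KeywordAlertSystem.KEYWORD_CATEGORIES.
EXAMPLE_KEYWORDS = {
    "fluoridation": ["fluoride", "fluoridation"],
    "dental_care": ["dental", "dentist", "oral health"],
}


def find_keywords(text: str, categories: dict) -> Counter:
    """Count whole-word, case-insensitive hits per (category, keyword)."""
    hits = Counter()
    for category, keywords in categories.items():
        for keyword in keywords:
            # \b keeps "dentist" from matching inside "dentistry"
            pattern = r"\b" + re.escape(keyword) + r"\b"
            count = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if count:
                hits[(category, keyword)] = count
    return hits
```

Calling `find_keywords(instance['transcript'], EXAMPLE_KEYWORDS)` on each transcript returns a `Counter`, so per-meeting results can be summed to see which terms appear across the whole dataset.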
### Evaluate your AI summarization:
```python
# Compare your summaries against human-written ground truth
from extraction.summarizer import MeetingSummarizer
from datasets import load_dataset

summarizer = MeetingSummarizer()
meetingbank = load_dataset("huuuyeah/meetingbank")

# Slicing a Dataset (test[:10]) returns a dict of columns rather than rows,
# so use select() to iterate over the first 10 rows.
for instance in meetingbank['test'].select(range(10)):
    # Generate your summary
    your_summary = summarizer.summarize(
        event=None,  # Create MeetingEvent from instance
        full_text=instance['transcript'],
        focus_on_health=False
    )

    # Compare against human summary
    human_summary = instance['summary']
    print(f"Meeting: {instance['id']}")
    print(f"Your summary: {your_summary.executive_summary}")
    print(f"Human summary: {human_summary}")
    print(f"Quality: {your_summary.confidence_score}")
    print()
```
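Printing the two summaries side by side shows the difference but does not quantify it, and `confidence_score` is the summarizer's own estimate rather than a comparison with the reference. One rough option is a unigram-overlap F1 in the spirit of ROUGE-1. The sketch below is a simplified stand-in (no stemming or stopword handling, so it is not the official ROUGE implementation), and `unigram_f1` is a hypothetical helper name.

```python
import re
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))

    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Inside the loop above, `unigram_f1(your_summary.executive_summary, human_summary)` gives a score between 0 and 1 per instance; averaging it over the test split gives a single number to track as the summarizer changes.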
---
## 📈 Expected Outcomes
### Before MeetingBank:
- 76 URLs discovered (15% match rate)
- No evaluation benchmark
- No ground truth for summarization
### After MeetingBank:
- **+1,366 meetings** with transcripts
- **+6 major cities** with verified URLs
- **Academic benchmark** for evaluation
- **Human summaries** for quality validation
- **Total meetings**: 1,366 ready to analyze immediately
---
## 🚀 Final Recommendation
### DO THIS FIRST (2 hours):
```bash
# 1. Install HuggingFace datasets
pip install datasets
# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'✅ Downloaded {len(meetingbank[\"train\"])} training instances')
"
# 3. Create integration script
# See code example above
# 4. Test your keyword detection
# See test code above
# 5. Evaluate your summarization
# See evaluation code above
```
### Expected Result:
- **Immediate access** to 1,366 meetings
- **6 major cities** for prototyping
- **Academic quality** benchmark
- **Proven ROI**: Published in top NLP conference (ACL 2023)
---
## Summary Table
| Dataset | Available? | Download Time | Coverage | Usefulness |
|---------|-----------|---------------|----------|------------|
| **MeetingBank** | ✅ **YES** (HuggingFace) | **5 minutes** | **1,366 meetings** | 🔥 **VERY HIGH** |
| **LocalView** | ✅ YES (Harvard) | 1 day | 1,000-10,000 jurisdictions | 🔥 VERY HIGH |
| **CDP** | ✅ YES (already coded) | 2 hours | 20 cities | 🔥 HIGH |
| **CivicBand** | ⚠️ PARTIAL (need coordination) | 4 hours | 1,031 municipalities (list only) | 🟡 MEDIUM |
**Bottom line**: MeetingBank is the fastest win! Download it today and start testing your summarization and keyword detection on real city council meeting transcripts.