# ✅ Confirmed: HuggingFace Datasets That WILL Help

## Quick Answer: YES, 2 of 4 will help significantly!

| Dataset | Status | Usefulness | Priority |
|---------|--------|------------|----------|
| **MeetingBank** | ✅ **READY TO USE** | 🔥 **VERY HIGH** | **USE IMMEDIATELY** |
| **LocalView** | ✅ Already covered | HIGH | Download from Harvard |
| **Council Data Project** | ✅ Already covered | HIGH | Already integrated |
| **CivicBand** | ⚠️ Limited access | MEDIUM | Scrape municipality list |

---

## 1. MeetingBank 🔥 (NEW! USE THIS!)

### What It Is:
**A benchmark dataset from 6 major U.S. cities specifically designed for meeting summarization**

### URLs:
- **HuggingFace (text)**: https://huggingface.co/datasets/huuuyeah/meetingbank
- **HuggingFace (audio)**: https://huggingface.co/datasets/huuuyeah/MeetingBank_Audio
- **Zenodo (all files)**: https://zenodo.org/record/7989108
- **Archive.org (videos)**: 
  - https://archive.org/details/meetingbank-alameda
  - https://archive.org/details/meetingbank-boston
  - https://archive.org/details/meetingbank-denver
  - https://archive.org/details/meetingbank-long-beach
  - https://archive.org/details/meetingbank-king-county
  - https://archive.org/details/meetingbank-seattle

### What You Get:
✅ **1,366 city council meetings** from 6 cities:
   - Alameda, CA
   - Boston, MA
   - Denver, CO
   - King County, WA
   - Long Beach, CA
   - Seattle, WA

✅ **3,579 hours of video**

✅ **Full transcripts** (average 28,000 tokens per meeting)

✅ **PDF meeting minutes & agendas**

✅ **Human-written summaries** (ground truth for evaluation)

✅ **Machine-generated summaries** (from 6 different systems)

✅ **6,892 segment-level summarization instances** for training

### Why This Is PERFECT for Your Project:

1. **Immediate prototyping**: Download from HuggingFace in 5 minutes
   ```python
   from datasets import load_dataset
   meetingbank = load_dataset("huuuyeah/meetingbank")
   
   for instance in meetingbank['train']:
       print(instance['id'])
       print(instance['summary'])
       print(instance['transcript'])
   ```

2. **Quality validation**: Compare your AI summarization against human-written summaries

3. **URL discovery**: Each meeting has source URLs to city websites

4. **Benchmark your oral health keyword detection**: Test against 1,366 real transcripts

5. **Training data**: If you want to fine-tune models for oral health policy

### Paper:
"MeetingBank: A Benchmark Dataset for Meeting Summarization"  
ACL 2023 (Association for Computational Linguistics)  
https://arxiv.org/abs/2305.17529

### 🎯 ACTION PLAN:
```bash
# 1. Install HuggingFace datasets
pip install datasets

# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'Loaded {len(meetingbank['train'])} training instances')
"

# 3. Create discovery/meetingbank_ingestion.py
# - Parse meetings
# - Extract URLs
# - Load to Bronze layer
# - Run keyword detection on transcripts
# - Evaluate against human summaries
```

### Expected ROI:
- **Time**: 2 hours to integrate
- **Value**: 1,366 meetings with transcripts + summaries + URLs
- **Quality**: Academic benchmark (peer-reviewed, ACL published)
- **Coverage**: 6 major cities (all large, high-value for advocacy)

---

## 2. LocalView ✅ (Already Covered)

**Status**: Already identified in previous investigation  
**Location**: Harvard Dataverse (doi:10.7910/DVN/NJTBEM)  
**Coverage**: 1,000-10,000 jurisdictions  
**Action**: Download from Harvard (already documented)

---

## 3. Council Data Project ✅ (Already Covered)

**Status**: Already integrated in [`external_url_datasets.py`](../discovery/external_url_datasets.py)  
**Coverage**: 20+ cities with full pipelines  
**Action**: Already coded, just run the script

---

## 4. CivicBand ⚠️ (Limited Usefulness)

### What It Is:
"Largest public collection of civic meeting and election finance data"  
Website: https://civic.band/

### What Exists:
✅ **1,031 municipalities tracked**  
✅ Millions of pages scraped (meeting minutes, agendas)  
✅ Search interface available  
✅ Publicly browsable

### The Problem:
❌ **"Dataset access is via their platform; raw dumps require coordination"**
- Can't directly download bulk URL list
- Would need to contact founder (Philip James: hello@civic.band)
- Or scrape the municipality list from their website

### What You CAN Get:
The list of 1,031 municipalities is publicly visible on their site. You could:

1. **Scrape the municipality list** (city names + states)
2. **Match against your Census data** to get FIPS codes
3. **Use as verification** (these 1,031 are confirmed to have meeting data)

### Limited Value Because:
- Can't get direct URLs (need to coordinate with founder)
- Already have larger coverage from LocalView (1,000-10,000 jurisdictions)
- Already have premium coverage from CDP (20 cities)
- CivicBand's main value is their *content* (scraped minutes), not URLs

### Possible Action:
```python
# Scrape CivicBand's municipality list
import requests
from bs4 import BeautifulSoup

response = requests.get("https://civic.band/")
soup = BeautifulSoup(response.text, 'html.parser')

# Parse the table of municipalities
# Match against Census data
# Use as validation list
```

**Estimated value**: MEDIUM (validation only, not bulk URLs)

---

## 📊 Revised Priority Ranking

### IMMEDIATE (Do This Week):
1. 🔥 **Download MeetingBank** (2 hours)
   - HuggingFace dataset ready to use
   - 1,366 meetings with transcripts, summaries, URLs
   - Perfect for prototyping and evaluation

### HIGH PRIORITY (Do This Month):
2. ✅ **Download LocalView** (1 day)
   - Harvard Dataverse
   - 1,000-10,000 jurisdictions

3. ✅ **Run CDP integration** (2 hours)
   - Already coded
   - 20 premium cities

### MEDIUM PRIORITY (Optional):
4. ⚠️ **Scrape CivicBand list** (4 hours)
   - 1,031 municipality names
   - Use for validation
   - Or contact founder for bulk access

---

## 🎯 Updated Integration Code

### Add MeetingBank to your pipeline:

```python
# discovery/meetingbank_ingestion.py

from datasets import load_dataset
from pyspark.sql import SparkSession
from loguru import logger

def load_meetingbank_to_bronze(spark: SparkSession) -> dict:
    """
    Load MeetingBank dataset to Bronze layer.
    
    MeetingBank contains 1,366 city council meetings from 6 major cities
    with full transcripts, summaries, and source URLs.
    """
    logger.info("Loading MeetingBank dataset from HuggingFace")
    
    # Download from HuggingFace
    meetingbank = load_dataset("huuuyeah/meetingbank")
    
    meetings = []
    
    for split in ['train', 'validation', 'test']:
        for instance in meetingbank[split]:
            meetings.append({
                "meeting_id": instance['id'],
                "jurisdiction_name": instance.get('city', 'Unknown'),
                "state_code": instance.get('state', 'Unknown'),
                "transcript": instance['transcript'],
                "summary_human": instance['summary'],
                "source_url": instance.get('url', ''),
                "date": instance.get('date', ''),
                "has_transcript": True,
                "has_summary": True,
                "has_url": bool(instance.get('url')),
                "transcript_length": len(instance['transcript']),
                "source": "meetingbank"
            })
    
    # Convert to DataFrame
    df = spark.createDataFrame(meetings)
    
    # Write to Bronze layer
    output_path = f"{settings.delta_lake_path}/bronze/meetingbank_meetings"
    df.write \
        .format("delta") \
        .mode("overwrite") \
        .save(output_path)
    
    logger.info(f"✅ Loaded {len(meetings)} meetings from MeetingBank")
    
    return {
        "total_meetings": len(meetings),
        "cities": 6,
        "source": "meetingbank"
    }
```

### Test your keyword detection:

```python
# Test keyword detection on MeetingBank transcripts
from datasets import load_dataset
from alerts.keyword_monitor import KeywordAlertSystem

meetingbank = load_dataset("huuuyeah/meetingbank")
alert_system = KeywordAlertSystem()

# Test on first 10 meetings
for instance in meetingbank['train'][:10]:
    matches = alert_system._find_keywords_in_text(
        instance['transcript'],
        alert_system.KEYWORD_CATEGORIES
    )
    
    if matches:
        print(f"Meeting {instance['id']}: {len(matches)} oral health keywords found")
        for match in matches[:3]:  # Show first 3
            print(f"  - {match.keyword} ({match.category})")
```

### Evaluate your AI summarization:

```python
# Compare your summaries against human-written ground truth
from extraction.summarizer import MeetingSummarizer
from datasets import load_dataset

summarizer = MeetingSummarizer()
meetingbank = load_dataset("huuuyeah/meetingbank")

for instance in meetingbank['test'][:10]:
    # Generate your summary
    your_summary = summarizer.summarize(
        event=None,  # Create MeetingEvent from instance
        full_text=instance['transcript'],
        focus_on_health=False
    )
    
    # Compare against human summary
    human_summary = instance['summary']
    
    print(f"Meeting: {instance['id']}")
    print(f"Your summary: {your_summary.executive_summary}")
    print(f"Human summary: {human_summary}")
    print(f"Quality: {your_summary.confidence_score}")
    print()
```

---

## 📈 Expected Outcomes

### Before MeetingBank:
- 76 URLs discovered (15% match rate)
- No evaluation benchmark
- No ground truth for summarization

### After MeetingBank:
- **+1,366 meetings** with transcripts
- **+6 major cities** with verified URLs
- **Academic benchmark** for evaluation
- **Human summaries** for quality validation
- **Total meetings**: 1,366 ready to analyze immediately

---

## 🚀 Final Recommendation

### DO THIS FIRST (2 hours):
```bash
# 1. Install HuggingFace datasets
pip install datasets

# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'✅ Downloaded {len(meetingbank[\"train\"])} meetings')
"

# 3. Create integration script
# See code example above

# 4. Test your keyword detection
# See test code above

# 5. Evaluate your summarization
# See evaluation code above
```

### Expected Result:
- **Immediate access** to 1,366 meetings
- **6 major cities** for prototyping
- **Academic quality** benchmark
- **Proven ROI**: Published in top NLP conference (ACL 2023)

---

## Summary Table

| Dataset | Available? | Download Time | Meetings | Usefulness |
|---------|-----------|---------------|----------|------------|
| **MeetingBank** | ✅ **YES** (HuggingFace) | **5 minutes** | **1,366** | 🔥 **VERY HIGH** |
| **LocalView** | ✅ YES (Harvard) | 1 day | 1,000-10,000 | 🔥 VERY HIGH |
| **CDP** | ✅ YES (already coded) | 2 hours | 20 cities | 🔥 HIGH |
| **CivicBand** | ⚠️ PARTIAL (need coordination) | 4 hours | 1,031 list | 🟡 MEDIUM |

**Bottom line**: MeetingBank is the fastest win! Download it today and start testing your summarization and keyword detection on real city council meeting transcripts.