	# ✅ Confirmed: HuggingFace Datasets That WILL Help

	## Quick Answer: YES, 2 of 4 will help significantly!

	| Dataset | Status | Usefulness | Next Step |
	|---------|--------|------------|-----------|
	| MeetingBank | ✅ READY TO USE | 🔥 VERY HIGH | USE IMMEDIATELY |
	| LocalView | ✅ Already covered | HIGH | Download from Harvard |
	| Council Data Project | ✅ Already covered | HIGH | Already integrated |
	| CivicBand | ⚠️ Limited access | MEDIUM | Scrape municipality list |

	---

	## 1. MeetingBank 🔥 (NEW! USE THIS!)

	### What It Is:
	A benchmark dataset of council meetings from 6 major U.S. jurisdictions, designed specifically for meeting summarization.

	### URLs:
	- HuggingFace (text): https://huggingface.co/datasets/huuuyeah/meetingbank
	- HuggingFace (audio): https://huggingface.co/datasets/huuuyeah/MeetingBank_Audio
	- Zenodo (all files): https://zenodo.org/record/7989108
	- Archive.org (videos):
	  - https://archive.org/details/meetingbank-alameda
	  - https://archive.org/details/meetingbank-boston
	  - https://archive.org/details/meetingbank-denver
	  - https://archive.org/details/meetingbank-long-beach
	  - https://archive.org/details/meetingbank-king-county
	  - https://archive.org/details/meetingbank-seattle

	### What You Get:
	✅ 1,366 city council meetings from 6 cities:
	- Alameda, CA
	- Boston, MA
	- Denver, CO
	- King County, WA
	- Long Beach, CA
	- Seattle, WA

	✅ 3,579 hours of video

	✅ Full transcripts (average 28,000 tokens per meeting)

	✅ PDF meeting minutes & agendas

	✅ Human-written summaries (ground truth for evaluation)

	✅ Machine-generated summaries (from 6 different systems)

	✅ 6,892 segment-level summarization instances, split across train, validation, and test sets

	### Why This Is PERFECT for Your Project:

	1. Immediate prototyping: Download from HuggingFace in 5 minutes
	```python
	from datasets import load_dataset
	meetingbank = load_dataset("huuuyeah/meetingbank")

	# Each instance carries an id, a human-written summary, and a transcript
	for instance in meetingbank['train']:
	    print(instance['id'])
	    print(instance['summary'])
	    print(instance['transcript'])
	```

	2. Quality validation: Compare your AI summarization against human-written summaries

	3. URL discovery: Each meeting has source URLs to city websites

	4. Benchmark your oral health keyword detection: Test against 1,366 real transcripts

	5. Training data: If you want to fine-tune models for oral health policy

	### Paper:
	"MeetingBank: A Benchmark Dataset for Meeting Summarization"
	ACL 2023 (Association for Computational Linguistics)
	https://arxiv.org/abs/2305.17529

	### 🎯 ACTION PLAN:
	```bash
	# 1. Install HuggingFace datasets
	pip install datasets

	# 2. Download MeetingBank
	python -c "
	from datasets import load_dataset
	meetingbank = load_dataset('huuuyeah/meetingbank')
	print(f'Loaded {len(meetingbank[\"train\"])} training instances')
	"

	# 3. Create discovery/meetingbank_ingestion.py
	# - Parse meetings
	# - Extract URLs
	# - Load to Bronze layer
	# - Run keyword detection on transcripts
	# - Evaluate against human summaries
	```
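
The "Extract URLs" step in the plan could start from a plain pattern match. The sketch below assumes that source links appear as plain text inside a metadata or transcript field, which this document does not confirm; `extract_urls` is a hypothetical helper, not part of the project.

```python
import re

# Match http(s) URLs up to the next whitespace, quote, or closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> list[str]:
    """Return the unique URLs found in text, in order of first appearance."""
    seen: dict[str, None] = {}
    for raw in URL_PATTERN.findall(text):
        # Drop trailing punctuation that belongs to the sentence, not the URL.
        url = raw.rstrip(".,;:")
        seen.setdefault(url, None)
    return list(seen)


if __name__ == "__main__":
    sample = (
        "Agenda at https://example.gov/agenda.pdf, "
        "video: http://example.gov/video. "
        "See https://example.gov/agenda.pdf"
    )
    print(extract_urls(sample))
```

A dictionary is used instead of a set so that the first-seen order of the URLs is preserved.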

	### Expected ROI:
	- Time: 2 hours to integrate
	- Value: 1,366 meetings with transcripts + summaries + URLs
	- Quality: Academic benchmark (peer-reviewed, ACL published)
	- Coverage: 6 major cities (all large, high-value for advocacy)

	---

	## 2. LocalView ✅ (Already Covered)

	- Status: Already identified in previous investigation
	- Location: Harvard Dataverse (doi:10.7910/DVN/NJTBEM)
	- Coverage: 1,000-10,000 jurisdictions
	- Action: Download from Harvard (already documented)
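
One way to script the Harvard download is through the Dataverse native API, which returns a dataset's file list as JSON when given its DOI. The sketch below is an assumption about how that step might look, not code from this project. The helper names are hypothetical, and the response layout (`data.latestVersion.files[].dataFile`) should be checked against the live API before relying on it.

```python
import json
import urllib.parse
import urllib.request

DATAVERSE_BASE = "https://dataverse.harvard.edu"
LOCALVIEW_DOI = "doi:10.7910/DVN/NJTBEM"  # DOI quoted in this document


def dataset_metadata_url(doi: str, base: str = DATAVERSE_BASE) -> str:
    """Build the native-API URL that returns dataset metadata as JSON."""
    query = urllib.parse.urlencode({"persistentId": doi})
    return f"{base}/api/datasets/:persistentId/?{query}"


def list_dataset_files(metadata: dict) -> list[dict]:
    """Pull id, filename, and size for each file out of a metadata response."""
    files = metadata.get("data", {}).get("latestVersion", {}).get("files", [])
    return [
        {
            "id": entry["dataFile"]["id"],
            "filename": entry["dataFile"]["filename"],
            "size_bytes": entry["dataFile"].get("filesize"),
        }
        for entry in files
        if "dataFile" in entry
    ]


def file_download_url(file_id: int, base: str = DATAVERSE_BASE) -> str:
    """Build the access-API URL for downloading one file by its numeric id."""
    return f"{base}/api/access/datafile/{file_id}"


def fetch_metadata(doi: str = LOCALVIEW_DOI, timeout: float = 30.0) -> dict:
    """Fetch dataset metadata over the network (not called at import time)."""
    with urllib.request.urlopen(dataset_metadata_url(doi), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `list_dataset_files(fetch_metadata())` would give the file ids to pass to `file_download_url`. Listing files first makes it possible to download only the tables that are needed.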

	---

	## 3. Council Data Project ✅ (Already Covered)

	- Status: Already integrated in [`external_url_datasets.py`](../discovery/external_url_datasets.py)
	- Coverage: 20+ cities with full pipelines
	- Action: Already coded, just run the script

	---

	## 4. CivicBand ⚠️ (Limited Usefulness)

	### What It Is:
	"Largest public collection of civic meeting and election finance data"
	Website: https://civic.band/

	### What Exists:
	- ✅ 1,031 municipalities tracked
	- ✅ Millions of pages scraped (meeting minutes, agendas)
	- ✅ Search interface available
	- ✅ Publicly browsable

	### The Problem:
	❌ "Dataset access is via their platform; raw dumps require coordination"
	- Can't directly download bulk URL list
	- Would need to contact founder (Philip James: hello@civic.band)
	- Or scrape the municipality list from their website

	### What You CAN Get:
	The list of 1,031 municipalities is publicly visible on their site. You could:

	1. Scrape the municipality list (city names + states)
	2. Match against your Census data to get FIPS codes
	3. Use as verification (these 1,031 are confirmed to have meeting data)
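
The matching step above could be sketched as a lookup on normalized name and state. Everything here is an assumption for illustration: the helper names are hypothetical and the sample rows are hand-written, so the FIPS codes should be verified against the actual Census place file. Census place names carry a lowercase type suffix (for example "Seattle city"), which the normalizer strips before comparing.

```python
import re


def normalize_place(name: str, state: str) -> tuple[str, str]:
    """Reduce a place name to a comparable key: (lowercase name, state code)."""
    cleaned = name.strip()
    # Census-style suffix is lowercase, so "Oklahoma City" keeps its "City".
    cleaned = re.sub(r"\s+(city|town|village|borough|CDP)$", "", cleaned)
    # Website-style prefix such as "City of Long Beach".
    cleaned = re.sub(r"^(City|Town|Village) of\s+", "", cleaned)
    cleaned = re.sub(r"[^a-z0-9 ]", "", cleaned.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, state.strip().upper()


def match_to_fips(scraped: list[dict], census: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split scraped rows into those with a Census match and those without."""
    index = {normalize_place(r["name"], r["state"]): r["fips"] for r in census}
    matched, unmatched = [], []
    for row in scraped:
        key = normalize_place(row["name"], row["state"])
        if key in index:
            matched.append({**row, "fips": index[key]})
        else:
            unmatched.append(row)
    return matched, unmatched


# Illustrative rows only; not taken from CivicBand or a real Census extract.
CENSUS_SAMPLE = [
    {"name": "Seattle city", "state": "WA", "fips": "5363000"},
    {"name": "Boston city", "state": "MA", "fips": "2507000"},
]
SCRAPED_SAMPLE = [
    {"name": "City of Seattle", "state": "wa"},
    {"name": "Boston", "state": "MA"},
    {"name": "Springfield", "state": "IL"},
]

if __name__ == "__main__":
    matched, unmatched = match_to_fips(SCRAPED_SAMPLE, CENSUS_SAMPLE)
    print(f"{len(matched)} matched, {len(unmatched)} unmatched")
```

Exact matching on a normalized key is deliberately conservative. Unmatched rows can be reviewed by hand or passed to a fuzzy matcher later.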

	### Limited Value Because:
	- Can't get direct URLs (need to coordinate with founder)
	- Already have larger coverage from LocalView (1,000-10,000 jurisdictions)
	- Already have premium coverage from CDP (20 cities)
	- CivicBand's main value is their content (scraped minutes), not URLs

	### Possible Action:
	```python
	# Scrape CivicBand's municipality list
	import requests
	from bs4 import BeautifulSoup

	# Fail fast on network problems or non-2xx responses
	response = requests.get("https://civic.band/", timeout=30)
	response.raise_for_status()
	soup = BeautifulSoup(response.text, 'html.parser')

	# Parse the table of municipalities
	# Match against Census data
	# Use as validation list
	```
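
The parsing step left as a comment above depends on the site's markup, which this document does not describe. The sketch below therefore runs on a made-up HTML fragment and assumes the list is an HTML table with the name in the first cell and the state in the second. Both the structure and the sample rows are hypothetical. It uses only the standard library so it can be tried without installing anything.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_municipalities(html: str) -> list[dict]:
    """Return name/state pairs from rows that have at least two cells."""
    parser = TableRowParser()
    parser.feed(html)
    return [{"name": r[0], "state": r[1]} for r in parser.rows if len(r) >= 2]


# Made-up fragment; the real page layout must be inspected first.
SAMPLE_HTML = """
<table>
  <tr><th>Municipality</th><th>State</th></tr>
  <tr><td><a href="/exampleville">Exampleville</a></td><td>CA</td></tr>
  <tr><td>Sampleton</td><td>MA</td></tr>
</table>
"""

if __name__ == "__main__":
    print(parse_municipalities(SAMPLE_HTML))
```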

	Estimated value: MEDIUM (validation only, not bulk URLs)

	---

	## 📊 Revised Priority Ranking

	### IMMEDIATE (Do This Week):
	1. 🔥 Download MeetingBank (2 hours)
	   - HuggingFace dataset ready to use
	   - 1,366 meetings with transcripts, summaries, URLs
	   - Perfect for prototyping and evaluation

	### HIGH PRIORITY (Do This Month):
	2. ✅ Download LocalView (1 day)
	   - Harvard Dataverse
	   - 1,000-10,000 jurisdictions

	3. ✅ Run CDP integration (2 hours)
	   - Already coded
	   - 20 premium cities

	### MEDIUM PRIORITY (Optional):
	4. ⚠️ Scrape CivicBand list (4 hours)
	   - 1,031 municipality names
	   - Use for validation
	   - Or contact founder for bulk access

	---

	## 🎯 Updated Integration Code

	### Add MeetingBank to your pipeline:

	```python
	# discovery/meetingbank_ingestion.py

	from datasets import load_dataset
	from pyspark.sql import SparkSession
	from loguru import logger

	# NOTE: `settings` (which provides `delta_lake_path`) must be imported from the
	# project's configuration module; its import path is not shown in this document.

	def load_meetingbank_to_bronze(spark: SparkSession) -> dict:
	    """
	    Load MeetingBank dataset to Bronze layer.

	    MeetingBank contains 1,366 city council meetings from 6 major cities
	    with full transcripts, summaries, and source URLs.
	    """
	    logger.info("Loading MeetingBank dataset from HuggingFace")

	    # Download from HuggingFace
	    meetingbank = load_dataset("huuuyeah/meetingbank")

	    meetings = []

	    # Rows are segment-level instances, so several rows can belong to one meeting
	    for split in ['train', 'validation', 'test']:
	        for instance in meetingbank[split]:
	            # city/state/url/date may be absent from the HuggingFace release;
	            # .get() falls back to a default instead of raising KeyError
	            meetings.append({
	                "meeting_id": instance['id'],
	                "jurisdiction_name": instance.get('city', 'Unknown'),
	                "state_code": instance.get('state', 'Unknown'),
	                "transcript": instance['transcript'],
	                "summary_human": instance['summary'],
	                "source_url": instance.get('url', ''),
	                "date": instance.get('date', ''),
	                "has_transcript": True,
	                "has_summary": True,
	                "has_url": bool(instance.get('url')),
	                "transcript_length": len(instance['transcript']),  # characters, not tokens
	                "source": "meetingbank"
	            })

	    # Convert to DataFrame
	    df = spark.createDataFrame(meetings)

	    # Write to Bronze layer
	    output_path = f"{settings.delta_lake_path}/bronze/meetingbank_meetings"
	    df.write \
	        .format("delta") \
	        .mode("overwrite") \
	        .save(output_path)

	    logger.info(f"✅ Loaded {len(meetings)} MeetingBank instances")

	    return {
	        "total_meetings": len(meetings),  # counts segment-level rows
	        "cities": 6,
	        "source": "meetingbank"
	    }
	```

	### Test your keyword detection:

	```python
	# Test keyword detection on MeetingBank transcripts
	from datasets import load_dataset
	from alerts.keyword_monitor import KeywordAlertSystem

	meetingbank = load_dataset("huuuyeah/meetingbank")
	alert_system = KeywordAlertSystem()

	# Test on the first 10 instances. Slicing a Dataset with [:10] returns a dict
	# of columns, so select() is used to iterate over rows instead.
	for instance in meetingbank['train'].select(range(10)):
	    matches = alert_system._find_keywords_in_text(
	        instance['transcript'],
	        alert_system.KEYWORD_CATEGORIES
	    )

	    if matches:
	        print(f"Meeting {instance['id']}: {len(matches)} oral health keywords found")
	        for match in matches[:3]:  # Show first 3
	            print(f"  - {match.keyword} ({match.category})")
	```
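
`KeywordAlertSystem` is internal to the project and not shown here, so the following is only a stand-in sketch of category-based keyword matching for trying the idea without the project code. The category names and keyword lists are hypothetical. The result objects expose `keyword` and `category` attributes to mirror how matches are printed above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordMatch:
    keyword: str
    category: str
    start: int  # character offset of the match in the text


def find_keywords(text: str, categories: dict[str, list[str]]) -> list[KeywordMatch]:
    """Find whole-word, case-insensitive keyword hits, ordered by position."""
    matches = []
    for category, keywords in categories.items():
        for keyword in keywords:
            pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
            for hit in pattern.finditer(text):
                matches.append(KeywordMatch(keyword, category, hit.start()))
    return sorted(matches, key=lambda m: m.start)


# Hypothetical categories for illustration; the project's real list may differ.
SAMPLE_CATEGORIES = {
    "fluoridation": ["fluoridation", "fluoride"],
    "dental_access": ["dental clinic", "dentist"],
}

if __name__ == "__main__":
    text = "The council discussed water fluoridation and funding for a new dental clinic."
    for match in find_keywords(text, SAMPLE_CATEGORIES):
        print(f"{match.keyword} ({match.category}) at offset {match.start}")
```

Word boundaries keep "dentist" from matching inside "dentistry". Whether that is desirable depends on how the keyword lists are written.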

	### Evaluate your AI summarization:

	```python
	# Compare your summaries against human-written ground truth
	from extraction.summarizer import MeetingSummarizer
	from datasets import load_dataset

	summarizer = MeetingSummarizer()
	meetingbank = load_dataset("huuuyeah/meetingbank")

	# select() yields rows; slicing with [:10] would return a dict of columns
	for instance in meetingbank['test'].select(range(10)):
	    # Generate your summary
	    your_summary = summarizer.summarize(
	        event=None,  # Create MeetingEvent from instance
	        full_text=instance['transcript'],
	        focus_on_health=False
	    )

	    # Print side by side with the human summary for manual comparison
	    human_summary = instance['summary']

	    print(f"Meeting: {instance['id']}")
	    print(f"Your summary: {your_summary.executive_summary}")
	    print(f"Human summary: {human_summary}")
	    print(f"Quality: {your_summary.confidence_score}")
	    print()
	```
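
The loop above prints both summaries together with the summarizer's own confidence score, which does not measure agreement with the human reference. A simple overlap score can fill that gap. The sketch below computes a ROUGE-1-style unigram precision, recall, and F1 in plain Python. It is a simplified stand-in (no stemming or stopword handling), not the official ROUGE implementation, and `unigram_f1` is a hypothetical helper name.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(candidate: str, reference: str) -> dict[str, float]:
    """Unigram overlap between a generated summary and a reference summary."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = unigram_f1(
        "the council approved the budget",
        "council approved budget amendment",
    )
    print({name: round(value, 3) for name, value in scores.items()})
```

In the evaluation loop, `unigram_f1(your_summary.executive_summary, human_summary)` could be printed next to the confidence score to give a reference-based number for each instance.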

	---

	## 📈 Expected Outcomes

	### Before MeetingBank:
	- 76 URLs discovered (15% match rate)
	- No evaluation benchmark
	- No ground truth for summarization

	### After MeetingBank:
	- +1,366 meetings with transcripts
	- +6 major cities with verified URLs
	- Academic benchmark for evaluation
	- Human summaries for quality validation
	- Total meetings: 1,366 ready to analyze immediately

	---

	## 🚀 Final Recommendation

	### DO THIS FIRST (2 hours):
	```bash
	# 1. Install HuggingFace datasets
	pip install datasets

	# 2. Download MeetingBank
	python -c "
	from datasets import load_dataset
	meetingbank = load_dataset('huuuyeah/meetingbank')
	print(f'✅ Downloaded {len(meetingbank[\"train\"])} training instances')
	"

	# 3. Create integration script
	# See code example above

	# 4. Test your keyword detection
	# See test code above

	# 5. Evaluate your summarization
	# See evaluation code above
	```

	### Expected Result:
	- Immediate access to 1,366 meetings
	- 6 major cities for prototyping
	- Academic quality benchmark
	- Peer-reviewed source: Published at a top NLP conference (ACL 2023)

	---

	## Summary Table

	| Dataset | Available? | Estimated Time | Coverage | Usefulness |
	|---------|------------|----------------|----------|------------|
	| MeetingBank | ✅ YES (HuggingFace) | 5 minutes | 1,366 meetings | 🔥 VERY HIGH |
	| LocalView | ✅ YES (Harvard) | 1 day | 1,000-10,000 jurisdictions | 🔥 HIGH |
	| CDP | ✅ YES (already coded) | 2 hours | 20 cities | 🔥 HIGH |
	| CivicBand | ⚠️ PARTIAL (need coordination) | 4 hours | 1,031 municipalities (list only) | 🟡 MEDIUM |

	Bottom line: MeetingBank is the fastest win! Download it today and start testing your summarization and keyword detection on real city council meeting transcripts.