Spaces:

CommunityOne
/

open-navigator

Running on CPU Upgrade

App Files Files Community

open-navigator / docs /LOCALVIEW_INTEGRATION_GUIDE.md

jcbowyer

Deploy: Consolidated gold tables, fixed nginx docs routing

896453f verified 28 days ago

preview code

raw

history blame contribute delete

7.59 kB

	# 📚 LocalView Integration Guide

	## Overview

	LocalView is a Harvard University dataset containing 1,000-10,000 municipality URLs with meeting videos and transcripts. It's the largest known database of municipal meeting video archives.

	Challenge: The Harvard Dataverse requires JavaScript and may have CAPTCHA verification, so we need to download the files manually.

	---

	## Step-by-Step Download Instructions

	### 1. Visit the Harvard Dataverse Website

	URL: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM

	### 2. Navigate to the Files Section

	Once the page loads:
	1. Scroll down to the "Files" section
	2. Look for downloadable CSV/TAB files with names like:
	- `municipalities.csv` or `municipalities.tab`
	- `meetings.csv` or `meetings.tab`
	- `videos.csv` or `videos.tab`
	- Or similar naming patterns

	### 3. Download the Files

	Click the "Download" button for each file:
	- Download all CSV/TAB files related to municipalities, meetings, and videos
	- Save them to: `/home/developer/projects/open-navigator/data/cache/localview/`

	Expected files (names may vary):
	```
	data/cache/localview/
	├── municipalities.csv # List of municipalities with URLs
	├── meetings.csv # Meeting metadata
	├── videos.csv # Video URLs and metadata
	└── README.txt # Dataset documentation (if available)
	```

	### 4. Expected Data Structure

	The LocalView dataset typically includes:

	Municipalities file (municipalities.csv):
	- `municipality_name` - City/town name
	- `state` - Two-letter state code
	- `county` - County name
	- `population` - Population count
	- `website_url` - Official government website
	- `meeting_page_url` - Link to meetings/agendas page
	- `video_archive_url` - Link to video archive

	Meetings file (meetings.csv):
	- `meeting_id` - Unique identifier
	- `municipality_name` - City/town name
	- `meeting_date` - Date of meeting
	- `meeting_type` - Type (Council, Planning, etc.)
	- `video_url` - Direct link to video
	- `transcript_available` - Boolean flag
	- `transcript_url` - Link to transcript (if available)

	Videos file (videos.csv):
	- `video_id` - Unique identifier
	- `video_url` - Direct video link
	- `platform` - Platform (YouTube, Granicus, etc.)
	- `duration_minutes` - Video length
	- `has_captions` - Caption availability

	---

	## Integration Script Usage

	### After Downloading Files

	Once you've downloaded the files to `data/cache/localview/`, run:

	```bash
	cd /home/developer/projects/open-navigator
	source venv/bin/activate

	# Run the ingestion script
	python discovery/localview_ingestion.py
	```

	### What the Script Does

	1. Reads downloaded CSV files from cache directory
	2. Parses municipality data - Names, states, URLs
	3. Extracts video URLs - Direct links to meeting videos
	4. Identifies platforms - YouTube, Granicus, Vimeo, Archive.org
	5. Writes to Bronze layer - `bronze/localview_municipalities` and `bronze/localview_videos`

	### Expected Output

	```
	[INFO] Loading LocalView data from cache...
	[INFO] Found 1,247 municipalities
	[INFO] Found 8,453 meeting videos
	[INFO] Platforms detected:
	- YouTube: 3,421 videos
	- Granicus: 2,876 videos
	- Vimeo: 1,234 videos
	- Other: 922 videos
	[SUCCESS] ✓ Written 1,247 municipalities to bronze/localview_municipalities
	[SUCCESS] ✓ Written 8,453 videos to bronze/localview_videos
	```

	---

	## Alternative: API Access (If Available)

	Check if LocalView provides API access:

	Some Harvard Dataverse datasets offer API access. Try:

	```bash
	# Check for API availability
	curl -I "https://dataverse.harvard.edu/api/datasets/:persistentId/?persistentId=doi:10.7910/DVN/NJTBEM"
	```

	If successful, we can update the script to use the API instead of manual download.

	---

	## Troubleshooting

	### Problem: Can't Find CSV Files

	Solution: The files might be in TAB format. The ingestion script handles both CSV and TAB files automatically.

	### Problem: Files Have Different Names

	Solution: Edit the `EXPECTED_FILES` dictionary in `discovery/localview_ingestion.py` to match the actual filenames.

	### Problem: Data Format is Different

	Solution: Check the README.txt or dataset documentation on Harvard Dataverse. Update the column mappings in the script to match.

	### Problem: CAPTCHA Blocks Download

	Solution:
	1. Use a web browser (not curl/wget)
	2. Complete the CAPTCHA verification
	3. Download files manually through the browser
	4. Save to `data/cache/localview/`

	---

	## Data Quality & Coverage

	### Expected Coverage

	Based on the LocalView research paper:

	- 1,000-1,500 municipalities with verified meeting archives
	- 5,000-10,000 meeting videos with URLs
	- Coverage: Major cities + medium-sized municipalities
	- Time range: 2015-2024 (approximately)
	- Focus states: CA, MA, TX, NY, FL, IL (highest coverage)

	### Quality Indicators

	- ✅ Academic validation - Harvard research project
	- ✅ Human verification - URLs manually verified
	- ✅ Transcript availability - Many include automated transcripts
	- ✅ Continuous updates - Dataset may be updated periodically

	---

	## Next Steps After Integration

	### 1. Combine with Other Sources

	```bash
	# After running LocalView ingestion
	python discovery/meetingbank_ingestion.py # 1,366 meetings
	python discovery/city_scrapers_urls.py # 100-500 agencies
	python discovery/openstates_sources.py # 50+ legislatures

	# Total: 7,000-12,000 verified URLs!
	```

	### 2. Deduplicate URLs

	Create a deduplication script to merge URLs from all sources:

	```python
	# discovery/url_deduplication.py
	from pyspark.sql.functions import col, count, first

	# Read all source tables
	localview = spark.read.format("delta").load("bronze/localview_videos")
	meetingbank = spark.read.format("delta").load("bronze/meetingbank_meetings")
	city_scrapers = spark.read.format("delta").load("bronze/city_scrapers_urls")

	# Deduplicate by URL
	unique_urls = (
	localview.select("url", "platform", "municipality", "state")
	.union(meetingbank.select("url", "platform", "municipality", "state"))
	.union(city_scrapers.select("url", "platform", "municipality", "state"))
	.dropDuplicates(["url"])
	)

	print(f"Total unique URLs: {unique_urls.count()}")
	```

	### 3. Priority Scraping

	Use LocalView data to prioritize which municipalities to scrape first:

	```sql
	-- Find municipalities with the most videos
	SELECT municipality, state, COUNT(*) as video_count
	FROM bronze.localview_videos
	GROUP BY municipality, state
	ORDER BY video_count DESC
	LIMIT 100
	```

	---

	## Documentation Links

	- Harvard Dataverse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM
	- LocalView Research Paper: Search for "LocalView municipal meetings dataset" on Google Scholar
	- Harvard Mellon Urbanism Initiative: https://www.gsd.harvard.edu/project/localview/

	---

	## Expected Timeline

	\| Step \| Time Required \| Priority \|
	\|------\|---------------\|----------\|
	\| Download files from Harvard Dataverse \| 5-10 min \| 🔥 HIGH \|
	\| Run ingestion script \| 2-5 min \| 🔥 HIGH \|
	\| Verify data quality \| 5 min \| 🟡 MEDIUM \|
	\| Deduplication with other sources \| 10 min \| 🟡 MEDIUM \|

	Total time: ~30 minutes for complete integration

	---

	## Questions?

	If you encounter issues:

	1. Check the dataset documentation on Harvard Dataverse
	2. Look at example data in the first few rows
	3. Update column mappings in the script accordingly
	4. Run with `--sample` flag first to test: `python discovery/localview_ingestion.py --sample`