open-navigator / docs /LOCALVIEW_INTEGRATION_GUIDE.md
jcbowyer's picture
Deploy: Consolidated gold tables, fixed nginx docs routing
896453f verified
# πŸ“š LocalView Integration Guide
## Overview
**LocalView** is a Harvard University dataset containing **1,000-10,000 municipality URLs** with meeting videos and transcripts. It's the **largest known database of municipal meeting video archives**.
**Challenge**: The Harvard Dataverse requires JavaScript and may have CAPTCHA verification, so we need to download the files manually.
---
## Step-by-Step Download Instructions
### 1. Visit the Harvard Dataverse Website
**URL**: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM
### 2. Navigate to the Files Section
Once the page loads:
1. Scroll down to the **"Files"** section
2. Look for downloadable CSV/TAB files with names like:
- `municipalities.csv` or `municipalities.tab`
- `meetings.csv` or `meetings.tab`
- `videos.csv` or `videos.tab`
- Or similar naming patterns
### 3. Download the Files
Click the **"Download"** button for each file:
- Download all CSV/TAB files related to municipalities, meetings, and videos
- Save them to: `/home/developer/projects/open-navigator/data/cache/localview/`
**Expected files** (names may vary):
```
data/cache/localview/
β”œβ”€β”€ municipalities.csv # List of municipalities with URLs
β”œβ”€β”€ meetings.csv # Meeting metadata
β”œβ”€β”€ videos.csv # Video URLs and metadata
└── README.txt # Dataset documentation (if available)
```
### 4. Expected Data Structure
The LocalView dataset typically includes:
**Municipalities file** (municipalities.csv):
- `municipality_name` - City/town name
- `state` - Two-letter state code
- `county` - County name
- `population` - Population count
- `website_url` - Official government website
- `meeting_page_url` - Link to meetings/agendas page
- `video_archive_url` - Link to video archive
**Meetings file** (meetings.csv):
- `meeting_id` - Unique identifier
- `municipality_name` - City/town name
- `meeting_date` - Date of meeting
- `meeting_type` - Type (Council, Planning, etc.)
- `video_url` - Direct link to video
- `transcript_available` - Boolean flag
- `transcript_url` - Link to transcript (if available)
**Videos file** (videos.csv):
- `video_id` - Unique identifier
- `video_url` - Direct video link
- `platform` - Platform (YouTube, Granicus, etc.)
- `duration_minutes` - Video length
- `has_captions` - Caption availability
---
## Integration Script Usage
### After Downloading Files
Once you've downloaded the files to `data/cache/localview/`, run:
```bash
cd /home/developer/projects/open-navigator
source venv/bin/activate
# Run the ingestion script
python discovery/localview_ingestion.py
```
### What the Script Does
1. **Reads downloaded CSV files** from cache directory
2. **Parses municipality data** - Names, states, URLs
3. **Extracts video URLs** - Direct links to meeting videos
4. **Identifies platforms** - YouTube, Granicus, Vimeo, Archive.org
5. **Writes to Bronze layer** - `bronze/localview_municipalities` and `bronze/localview_videos`
### Expected Output
```
[INFO] Loading LocalView data from cache...
[INFO] Found 1,247 municipalities
[INFO] Found 8,453 meeting videos
[INFO] Platforms detected:
- YouTube: 3,421 videos
- Granicus: 2,876 videos
- Vimeo: 1,234 videos
- Other: 922 videos
[SUCCESS] βœ“ Written 1,247 municipalities to bronze/localview_municipalities
[SUCCESS] βœ“ Written 8,453 videos to bronze/localview_videos
```
---
## Alternative: API Access (If Available)
**Check if LocalView provides API access:**
Some Harvard Dataverse datasets offer API access. Try:
```bash
# Check for API availability
curl -I "https://dataverse.harvard.edu/api/datasets/:persistentId/?persistentId=doi:10.7910/DVN/NJTBEM"
```
If successful, we can update the script to use the API instead of manual download.
---
## Troubleshooting
### Problem: Can't Find CSV Files
**Solution**: The files might be in TAB format. The ingestion script handles both CSV and TAB files automatically.
### Problem: Files Have Different Names
**Solution**: Edit the `EXPECTED_FILES` dictionary in `discovery/localview_ingestion.py` to match the actual filenames.
### Problem: Data Format is Different
**Solution**: Check the README.txt or dataset documentation on Harvard Dataverse. Update the column mappings in the script to match.
### Problem: CAPTCHA Blocks Download
**Solution**:
1. Use a web browser (not curl/wget)
2. Complete the CAPTCHA verification
3. Download files manually through the browser
4. Save to `data/cache/localview/`
---
## Data Quality & Coverage
### Expected Coverage
Based on the LocalView research paper:
- **1,000-1,500 municipalities** with verified meeting archives
- **5,000-10,000 meeting videos** with URLs
- **Coverage**: Major cities + medium-sized municipalities
- **Time range**: 2015-2024 (approximately)
- **Focus states**: CA, MA, TX, NY, FL, IL (highest coverage)
### Quality Indicators
- βœ… **Academic validation** - Harvard research project
- βœ… **Human verification** - URLs manually verified
- βœ… **Transcript availability** - Many include automated transcripts
- βœ… **Continuous updates** - Dataset may be updated periodically
---
## Next Steps After Integration
### 1. Combine with Other Sources
```bash
# After running LocalView ingestion
python discovery/meetingbank_ingestion.py # 1,366 meetings
python discovery/city_scrapers_urls.py # 100-500 agencies
python discovery/openstates_sources.py # 50+ legislatures
# Total: 7,000-12,000 verified URLs!
```
### 2. Deduplicate URLs
Create a deduplication script to merge URLs from all sources:
```python
# discovery/url_deduplication.py
from pyspark.sql.functions import col, count, first
# Read all source tables
localview = spark.read.format("delta").load("bronze/localview_videos")
meetingbank = spark.read.format("delta").load("bronze/meetingbank_meetings")
city_scrapers = spark.read.format("delta").load("bronze/city_scrapers_urls")
# Deduplicate by URL
unique_urls = (
localview.select("url", "platform", "municipality", "state")
.union(meetingbank.select("url", "platform", "municipality", "state"))
.union(city_scrapers.select("url", "platform", "municipality", "state"))
.dropDuplicates(["url"])
)
print(f"Total unique URLs: {unique_urls.count()}")
```
### 3. Priority Scraping
Use LocalView data to prioritize which municipalities to scrape first:
```sql
-- Find municipalities with the most videos
SELECT municipality, state, COUNT(*) as video_count
FROM bronze.localview_videos
GROUP BY municipality, state
ORDER BY video_count DESC
LIMIT 100
```
---
## Documentation Links
- **Harvard Dataverse**: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM
- **LocalView Research Paper**: Search for "LocalView municipal meetings dataset" on Google Scholar
- **Harvard Mellon Urbanism Initiative**: https://www.gsd.harvard.edu/project/localview/
---
## Expected Timeline
| Step | Time Required | Priority |
|------|---------------|----------|
| Download files from Harvard Dataverse | 5-10 min | πŸ”₯ HIGH |
| Run ingestion script | 2-5 min | πŸ”₯ HIGH |
| Verify data quality | 5 min | 🟑 MEDIUM |
| Deduplication with other sources | 10 min | 🟑 MEDIUM |
**Total time**: ~30 minutes for complete integration
---
## Questions?
If you encounter issues:
1. Check the dataset documentation on Harvard Dataverse
2. Look at example data in the first few rows
3. Update column mappings in the script accordingly
4. Run with `--sample` flag first to test: `python discovery/localview_ingestion.py --sample`