| # LocalView Integration Guide | |
| ## Overview | |
| **LocalView** is a Harvard University dataset of municipal meeting videos and transcripts, expected to cover roughly **1,000-1,500 municipalities** and **5,000-10,000 meeting video URLs**. It is the **largest known database of municipal meeting video archives**. | |
| **Challenge**: The Harvard Dataverse site requires JavaScript and may present a CAPTCHA, so the files need to be downloaded manually through a browser. | |
| --- | |
| ## Step-by-Step Download Instructions | |
| ### 1. Visit the Harvard Dataverse Website | |
| **URL**: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM | |
| ### 2. Navigate to the Files Section | |
| Once the page loads: | |
| 1. Scroll down to the **"Files"** section | |
| 2. Look for downloadable CSV/TAB files with names like: | |
| - `municipalities.csv` or `municipalities.tab` | |
| - `meetings.csv` or `meetings.tab` | |
| - `videos.csv` or `videos.tab` | |
| - Or similar naming patterns | |
| ### 3. Download the Files | |
| Click the **"Download"** button for each file: | |
| - Download all CSV/TAB files related to municipalities, meetings, and videos | |
| - Save them to: `/home/developer/projects/open-navigator/data/cache/localview/` | |
| **Expected files** (names may vary): | |
| ``` | |
| data/cache/localview/ | |
| ├── municipalities.csv # List of municipalities with URLs | |
| ├── meetings.csv # Meeting metadata | |
| ├── videos.csv # Video URLs and metadata | |
| └── README.txt # Dataset documentation (if available) | |
| ``` | |
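Before running the ingestion script, it can help to confirm which of the expected files actually landed in the cache directory. The following is a minimal sketch, not part of the project: the base names come from the listing above (real names may differ), and `find_cached_files` is a hypothetical helper.

```python
from pathlib import Path

# Base names taken from the listing above; the real dataset may use other names.
EXPECTED_STEMS = ("municipalities", "meetings", "videos")
# Harvard Dataverse often serves tabular files as .tab, so both suffixes are checked.
ACCEPTED_SUFFIXES = (".csv", ".tab")


def find_cached_files(cache_dir):
    """Map each expected base name to the first matching file, or None if absent."""
    cache = Path(cache_dir)
    found = {}
    for stem in EXPECTED_STEMS:
        candidates = [cache / f"{stem}{suffix}" for suffix in ACCEPTED_SUFFIXES]
        found[stem] = next((p for p in candidates if p.is_file()), None)
    return found


if __name__ == "__main__":
    for stem, path in find_cached_files("data/cache/localview").items():
        print(f"{stem}: {path if path else 'MISSING'}")
```

Any entry reported as `MISSING` points to a file that was either not downloaded or saved under a different name.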
| ### 4. Expected Data Structure | |
| The LocalView dataset typically includes: | |
| **Municipalities file** (municipalities.csv): | |
| - `municipality_name` - City/town name | |
| - `state` - Two-letter state code | |
| - `county` - County name | |
| - `population` - Population count | |
| - `website_url` - Official government website | |
| - `meeting_page_url` - Link to meetings/agendas page | |
| - `video_archive_url` - Link to video archive | |
| **Meetings file** (meetings.csv): | |
| - `meeting_id` - Unique identifier | |
| - `municipality_name` - City/town name | |
| - `meeting_date` - Date of meeting | |
| - `meeting_type` - Type (Council, Planning, etc.) | |
| - `video_url` - Direct link to video | |
| - `transcript_available` - Boolean flag | |
| - `transcript_url` - Link to transcript (if available) | |
| **Videos file** (videos.csv): | |
| - `video_id` - Unique identifier | |
| - `video_url` - Direct video link | |
| - `platform` - Platform (YouTube, Granicus, etc.) | |
| - `duration_minutes` - Video length | |
| - `has_captions` - Caption availability | |
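Because the column names above are expectations rather than a confirmed schema, a quick header check can flag mismatches early. This sketch assumes the column lists exactly as written in this guide; `EXPECTED_COLUMNS`, `read_header`, and `missing_columns` are illustrative names, not part of the ingestion script.

```python
import csv
import io

# Column names copied from the lists above; they are expectations, not a verified schema.
EXPECTED_COLUMNS = {
    "municipalities": [
        "municipality_name", "state", "county", "population",
        "website_url", "meeting_page_url", "video_archive_url",
    ],
    "meetings": [
        "meeting_id", "municipality_name", "meeting_date", "meeting_type",
        "video_url", "transcript_available", "transcript_url",
    ],
    "videos": [
        "video_id", "video_url", "platform", "duration_minutes", "has_captions",
    ],
}


def read_header(text, delimiter=","):
    """Return the first row of delimited text as a list of column names."""
    return next(csv.reader(io.StringIO(text), delimiter=delimiter), [])


def missing_columns(kind, header):
    """List the expected columns for `kind` that are absent from `header`."""
    present = set(header)
    return [name for name in EXPECTED_COLUMNS[kind] if name not in present]
```

For example, `missing_columns("videos", read_header(first_line))` returns the names that would need a column mapping or are simply not provided.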
| --- | |
| ## Integration Script Usage | |
| ### After Downloading Files | |
| Once you've downloaded the files to `data/cache/localview/`, run: | |
| ```bash | |
| cd /home/developer/projects/open-navigator | |
| source venv/bin/activate | |
| # Run the ingestion script | |
| python discovery/localview_ingestion.py | |
| ``` | |
| ### What the Script Does | |
| 1. **Reads downloaded CSV files** from cache directory | |
| 2. **Parses municipality data** - Names, states, URLs | |
| 3. **Extracts video URLs** - Direct links to meeting videos | |
| 4. **Identifies platforms** - YouTube, Granicus, Vimeo, Archive.org | |
| 5. **Writes to Bronze layer** - `bronze/localview_municipalities` and `bronze/localview_videos` | |
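Step 4 (platform identification) can be approximated by matching the URL's host name against known domains. The sketch below is an assumption about how such a check might look, not the script's actual logic; the `PLATFORM_HOSTS` table and `detect_platform` are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical host-to-platform table covering the platforms named above.
PLATFORM_HOSTS = {
    "youtube.com": "YouTube",
    "youtu.be": "YouTube",
    "granicus.com": "Granicus",
    "vimeo.com": "Vimeo",
    "archive.org": "Archive.org",
}


def detect_platform(url):
    """Return the platform for a video URL, or "Other" when no domain matches."""
    host = (urlparse(url).hostname or "").lower()
    for domain, platform in PLATFORM_HOSTS.items():
        # Match the domain itself or any subdomain (e.g. city.granicus.com).
        if host == domain or host.endswith("." + domain):
            return platform
    return "Other"
```

Matching on the parsed host rather than a substring of the whole URL avoids false positives such as a government page whose query string happens to contain `youtube.com`.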
| ### Expected Output | |
| ``` | |
| [INFO] Loading LocalView data from cache... | |
| [INFO] Found 1,247 municipalities | |
| [INFO] Found 8,453 meeting videos | |
| [INFO] Platforms detected: | |
| - YouTube: 3,421 videos | |
| - Granicus: 2,876 videos | |
| - Vimeo: 1,234 videos | |
| - Other: 922 videos | |
| [SUCCESS] ✅ Written 1,247 municipalities to bronze/localview_municipalities | |
| [SUCCESS] ✅ Written 8,453 videos to bronze/localview_videos | |
| ``` | |
| --- | |
| ## Alternative: API Access (If Available) | |
| **Check whether the dataset can be fetched through the Dataverse API:** | |
| Harvard Dataverse exposes a REST API for dataset metadata. To check that it responds for this dataset, try: | |
| ```bash | |
| # Check for API availability | |
| curl -I "https://dataverse.harvard.edu/api/datasets/:persistentId/?persistentId=doi:10.7910/DVN/NJTBEM" | |
| ``` | |
| If successful, we can update the script to use the API instead of manual download. | |
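If the metadata endpoint responds, the file list could be read from its JSON. The sketch below assumes the response layout of the Dataverse native API as I understand it (`data.latestVersion.files[].dataFile` with `id` and `filename`) and the `/api/access/datafile/{id}` download path; both should be verified against the live response, and downloads may still require accepting terms or an API token. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://dataverse.harvard.edu"
PERSISTENT_ID = "doi:10.7910/DVN/NJTBEM"


def dataset_metadata_url(persistent_id=PERSISTENT_ID):
    """Build the dataset metadata URL used in the curl check above."""
    query = urlencode({"persistentId": persistent_id})
    return f"{BASE_URL}/api/datasets/:persistentId/?{query}"


def list_files(metadata):
    """Extract (filename, download_url) pairs from a parsed metadata response."""
    # Assumed layout: data -> latestVersion -> files -> dataFile {id, filename}.
    files = metadata.get("data", {}).get("latestVersion", {}).get("files", [])
    result = []
    for entry in files:
        data_file = entry.get("dataFile", {})
        if "id" in data_file:
            url = f"{BASE_URL}/api/access/datafile/{data_file['id']}"
            result.append((data_file.get("filename", ""), url))
    return result


def fetch_metadata(timeout=30):
    """Fetch and parse the dataset metadata (performs a network request)."""
    with urlopen(dataset_metadata_url(), timeout=timeout) as response:
        return json.load(response)
```

Calling `list_files(fetch_metadata())` would then show the real filenames, which is also a quick way to learn what to expect before a manual download.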
| --- | |
| ## Troubleshooting | |
| ### Problem: Can't Find CSV Files | |
| **Solution**: The files might be in TAB format. The ingestion script handles both CSV and TAB files automatically. | |
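To illustrate what handling both formats involves, here is a minimal sketch that picks the delimiter from the file extension. It is not the ingestion script's code; `read_rows` and the `DELIMITERS` table are assumptions, and it presumes UTF-8 files with a header row.

```python
import csv
from pathlib import Path

# Delimiter per file extension; .tab files are assumed to be tab-separated.
DELIMITERS = {".csv": ",", ".tab": "\t", ".tsv": "\t"}


def read_rows(path):
    """Read a CSV or TAB file into a list of dicts keyed by the header row."""
    path = Path(path)
    delimiter = DELIMITERS.get(path.suffix.lower())
    if delimiter is None:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

Printing `read_rows(path)[:3]` is a quick way to inspect the first few rows of a downloaded file.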
| ### Problem: Files Have Different Names | |
| **Solution**: Edit the `EXPECTED_FILES` dictionary in `discovery/localview_ingestion.py` to match the actual filenames. | |
| ### Problem: Data Format is Different | |
| **Solution**: Check the README.txt or dataset documentation on Harvard Dataverse. Update the column mappings in the script to match. | |
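A column mapping can be as simple as a dictionary from the names found in the download to the names this guide expects. The source names below (`place_name`, `st`, `vid_url`) are invented purely for illustration, and `rename_columns` is a hypothetical helper; how the real script stores its mappings may differ.

```python
# Hypothetical mapping: keys are made-up source names, values are the names this guide expects.
COLUMN_MAP = {
    "place_name": "municipality_name",
    "st": "state",
    "vid_url": "video_url",
}


def rename_columns(row, column_map=COLUMN_MAP):
    """Return a copy of `row` with mapped keys renamed; unmapped keys pass through."""
    return {column_map.get(key, key): value for key, value in row.items()}
```

Unmapped columns are kept unchanged, so the mapping only needs entries for names that actually differ.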
| ### Problem: CAPTCHA Blocks Download | |
| **Solution**: | |
| 1. Use a web browser (not curl/wget) | |
| 2. Complete the CAPTCHA verification | |
| 3. Download files manually through the browser | |
| 4. Save to `data/cache/localview/` | |
| --- | |
| ## Data Quality & Coverage | |
| ### Expected Coverage | |
| Based on the LocalView research paper: | |
| - **1,000-1,500 municipalities** with verified meeting archives | |
| - **5,000-10,000 meeting videos** with URLs | |
| - **Coverage**: Major cities + medium-sized municipalities | |
| - **Time range**: 2015-2024 (approximately) | |
| - **Focus states**: CA, MA, TX, NY, FL, IL (highest coverage) | |
| ### Quality Indicators | |
| - ✅ **Academic validation** - Harvard research project | |
| - ✅ **Human verification** - URLs manually verified | |
| - ✅ **Transcript availability** - Many include automated transcripts | |
| - ✅ **Continuous updates** - Dataset may be updated periodically | |
| --- | |
| ## Next Steps After Integration | |
| ### 1. Combine with Other Sources | |
| ```bash | |
| # After running LocalView ingestion | |
| python discovery/meetingbank_ingestion.py # 1,366 meetings | |
| python discovery/city_scrapers_urls.py # 100-500 agencies | |
| python discovery/openstates_sources.py # 50+ legislatures | |
| # Total: 7,000-12,000 verified URLs! | |
| ``` | |
| ### 2. Deduplicate URLs | |
| Create a deduplication script to merge URLs from all sources: | |
| ```python | |
| # discovery/url_deduplication.py | |
| from pyspark.sql import SparkSession | |
| # Assumes a Spark installation configured with Delta Lake support | |
| spark = SparkSession.builder.appName("url_deduplication").getOrCreate() | |
| # Read all source tables | |
| localview = spark.read.format("delta").load("bronze/localview_videos") | |
| meetingbank = spark.read.format("delta").load("bronze/meetingbank_meetings") | |
| city_scrapers = spark.read.format("delta").load("bronze/city_scrapers_urls") | |
| # Columns assumed to exist in every source table | |
| columns = ["url", "platform", "municipality", "state"] | |
| # Combine the sources by column name, then keep one row per URL | |
| unique_urls = ( | |
|     localview.select(*columns) | |
|     .unionByName(meetingbank.select(*columns)) | |
|     .unionByName(city_scrapers.select(*columns)) | |
|     .dropDuplicates(["url"]) | |
| ) | |
| print(f"Total unique URLs: {unique_urls.count()}") | |
| ``` | |
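Dropping duplicates on the raw `url` column treats `https://www.example.gov/meetings/` and `https://example.gov/meetings` as different entries. A normalization step before deduplication can reduce that. The sketch below is one possible set of rules (lowercase host, drop `www.`, trailing slash, and fragment) in plain Python; the rules and function names are assumptions, and some sites may serve different content under variants these rules merge.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form of `url` for duplicate detection."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Keep an explicit port if one was given; user info is dropped.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = parts.path.rstrip("/")
    # The fragment is discarded because it does not change the resource.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))


def dedupe(urls):
    """Keep the first original URL seen for each normalized form."""
    seen = {}
    for url in urls:
        seen.setdefault(normalize_url(url), url)
    return list(seen.values())
```

The query string is left intact on purpose, since video identifiers on platforms such as YouTube live there.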
| ### 3. Priority Scraping | |
| Use LocalView data to prioritize which municipalities to scrape first: | |
| ```sql | |
| -- Find municipalities with the most videos | |
| SELECT municipality, state, COUNT(*) as video_count | |
| FROM bronze.localview_videos | |
| GROUP BY municipality, state | |
| ORDER BY video_count DESC | |
| LIMIT 100 | |
| ``` | |
| --- | |
| ## Documentation Links | |
| - **Harvard Dataverse**: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM | |
| - **LocalView Research Paper**: Search for "LocalView municipal meetings dataset" on Google Scholar | |
| - **Harvard Mellon Urbanism Initiative**: https://www.gsd.harvard.edu/project/localview/ | |
| --- | |
| ## Expected Timeline | |
| | Step | Time Required | Priority | | |
| |------|---------------|----------| | |
| | Download files from Harvard Dataverse | 5-10 min | 🔥 HIGH | | |
| | Run ingestion script | 2-5 min | 🔥 HIGH | | |
| | Verify data quality | 5 min | 🟡 MEDIUM | | |
| | Deduplication with other sources | 10 min | 🟡 MEDIUM | | |
| **Total time**: ~30 minutes for complete integration | |
| --- | |
| ## Questions? | |
| If you encounter issues: | |
| 1. Check the dataset documentation on Harvard Dataverse | |
| 2. Look at example data in the first few rows | |
| 3. Update column mappings in the script accordingly | |
| 4. Run with `--sample` flag first to test: `python discovery/localview_ingestion.py --sample` | |