Spaces:
Running on CPU Upgrade
Running on CPU Upgrade
File size: 7,592 Bytes
896453f | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 | # π LocalView Integration Guide
## Overview
**LocalView** is a Harvard University dataset containing **1,000-10,000 municipality URLs** with meeting videos and transcripts. It's the **largest known database of municipal meeting video archives**.
**Challenge**: The Harvard Dataverse requires JavaScript and may have CAPTCHA verification, so we need to download the files manually.
---
## Step-by-Step Download Instructions
### 1. Visit the Harvard Dataverse Website
**URL**: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM
### 2. Navigate to the Files Section
Once the page loads:
1. Scroll down to the **"Files"** section
2. Look for downloadable CSV/TAB files with names like:
- `municipalities.csv` or `municipalities.tab`
- `meetings.csv` or `meetings.tab`
- `videos.csv` or `videos.tab`
- Or similar naming patterns
### 3. Download the Files
Click the **"Download"** button for each file:
- Download all CSV/TAB files related to municipalities, meetings, and videos
- Save them to: `/home/developer/projects/open-navigator/data/cache/localview/`
**Expected files** (names may vary):
```
data/cache/localview/
βββ municipalities.csv # List of municipalities with URLs
βββ meetings.csv # Meeting metadata
βββ videos.csv # Video URLs and metadata
βββ README.txt # Dataset documentation (if available)
```
### 4. Expected Data Structure
The LocalView dataset typically includes:
**Municipalities file** (municipalities.csv):
- `municipality_name` - City/town name
- `state` - Two-letter state code
- `county` - County name
- `population` - Population count
- `website_url` - Official government website
- `meeting_page_url` - Link to meetings/agendas page
- `video_archive_url` - Link to video archive
**Meetings file** (meetings.csv):
- `meeting_id` - Unique identifier
- `municipality_name` - City/town name
- `meeting_date` - Date of meeting
- `meeting_type` - Type (Council, Planning, etc.)
- `video_url` - Direct link to video
- `transcript_available` - Boolean flag
- `transcript_url` - Link to transcript (if available)
**Videos file** (videos.csv):
- `video_id` - Unique identifier
- `video_url` - Direct video link
- `platform` - Platform (YouTube, Granicus, etc.)
- `duration_minutes` - Video length
- `has_captions` - Caption availability
---
## Integration Script Usage
### After Downloading Files
Once you've downloaded the files to `data/cache/localview/`, run:
```bash
cd /home/developer/projects/open-navigator
source venv/bin/activate
# Run the ingestion script
python discovery/localview_ingestion.py
```
### What the Script Does
1. **Reads downloaded CSV files** from cache directory
2. **Parses municipality data** - Names, states, URLs
3. **Extracts video URLs** - Direct links to meeting videos
4. **Identifies platforms** - YouTube, Granicus, Vimeo, Archive.org
5. **Writes to Bronze layer** - `bronze/localview_municipalities` and `bronze/localview_videos`
### Expected Output
```
[INFO] Loading LocalView data from cache...
[INFO] Found 1,247 municipalities
[INFO] Found 8,453 meeting videos
[INFO] Platforms detected:
- YouTube: 3,421 videos
- Granicus: 2,876 videos
- Vimeo: 1,234 videos
- Other: 922 videos
[SUCCESS] β Written 1,247 municipalities to bronze/localview_municipalities
[SUCCESS] β Written 8,453 videos to bronze/localview_videos
```
---
## Alternative: API Access (If Available)
**Check if LocalView provides API access:**
Some Harvard Dataverse datasets offer API access. Try:
```bash
# Check for API availability
curl -I "https://dataverse.harvard.edu/api/datasets/:persistentId/?persistentId=doi:10.7910/DVN/NJTBEM"
```
If successful, we can update the script to use the API instead of manual download.
---
## Troubleshooting
### Problem: Can't Find CSV Files
**Solution**: The files might be in TAB format. The ingestion script handles both CSV and TAB files automatically.
### Problem: Files Have Different Names
**Solution**: Edit the `EXPECTED_FILES` dictionary in `discovery/localview_ingestion.py` to match the actual filenames.
### Problem: Data Format is Different
**Solution**: Check the README.txt or dataset documentation on Harvard Dataverse. Update the column mappings in the script to match.
### Problem: CAPTCHA Blocks Download
**Solution**:
1. Use a web browser (not curl/wget)
2. Complete the CAPTCHA verification
3. Download files manually through the browser
4. Save to `data/cache/localview/`
---
## Data Quality & Coverage
### Expected Coverage
Based on the LocalView research paper:
- **1,000-1,500 municipalities** with verified meeting archives
- **5,000-10,000 meeting videos** with URLs
- **Coverage**: Major cities + medium-sized municipalities
- **Time range**: 2015-2024 (approximately)
- **Focus states**: CA, MA, TX, NY, FL, IL (highest coverage)
### Quality Indicators
- β
**Academic validation** - Harvard research project
- β
**Human verification** - URLs manually verified
- β
**Transcript availability** - Many include automated transcripts
- β
**Continuous updates** - Dataset may be updated periodically
---
## Next Steps After Integration
### 1. Combine with Other Sources
```bash
# After running LocalView ingestion
python discovery/meetingbank_ingestion.py # 1,366 meetings
python discovery/city_scrapers_urls.py # 100-500 agencies
python discovery/openstates_sources.py # 50+ legislatures
# Total: 7,000-12,000 verified URLs!
```
### 2. Deduplicate URLs
Create a deduplication script to merge URLs from all sources:
```python
# discovery/url_deduplication.py
from pyspark.sql.functions import col, count, first
# Read all source tables
localview = spark.read.format("delta").load("bronze/localview_videos")
meetingbank = spark.read.format("delta").load("bronze/meetingbank_meetings")
city_scrapers = spark.read.format("delta").load("bronze/city_scrapers_urls")
# Deduplicate by URL
unique_urls = (
localview.select("url", "platform", "municipality", "state")
.union(meetingbank.select("url", "platform", "municipality", "state"))
.union(city_scrapers.select("url", "platform", "municipality", "state"))
.dropDuplicates(["url"])
)
print(f"Total unique URLs: {unique_urls.count()}")
```
### 3. Priority Scraping
Use LocalView data to prioritize which municipalities to scrape first:
```sql
-- Find municipalities with the most videos
SELECT municipality, state, COUNT(*) as video_count
FROM bronze.localview_videos
GROUP BY municipality, state
ORDER BY video_count DESC
LIMIT 100
```
---
## Documentation Links
- **Harvard Dataverse**: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM
- **LocalView Research Paper**: Search for "LocalView municipal meetings dataset" on Google Scholar
- **Harvard Mellon Urbanism Initiative**: https://www.gsd.harvard.edu/project/localview/
---
## Expected Timeline
| Step | Time Required | Priority |
|------|---------------|----------|
| Download files from Harvard Dataverse | 5-10 min | π₯ HIGH |
| Run ingestion script | 2-5 min | π₯ HIGH |
| Verify data quality | 5 min | π‘ MEDIUM |
| Deduplication with other sources | 10 min | π‘ MEDIUM |
**Total time**: ~30 minutes for complete integration
---
## Questions?
If you encounter issues:
1. Check the dataset documentation on Harvard Dataverse
2. Look at example data in the first few rows
3. Update column mappings in the script accordingly
4. Run with `--sample` flag first to test: `python discovery/localview_ingestion.py --sample`
|