# 🎉 Harvard Dataverse Integration - Complete!

## ✅ What Was Implemented

We've integrated a **production-ready Dataverse API client** that follows the best practices documented in [IQSS/dataverse](https://github.com/IQSS/dataverse).

### New Files Created

1. **[`discovery/dataverse_client.py`](../discovery/dataverse_client.py)** (600+ lines)
   - Full-featured Dataverse API client
   - API authentication
   - Rate limiting with exponential backoff
   - Checksum verification (MD5)
   - Version-aware caching
   - Comprehensive error handling
   - Pagination support

2. **[`docs/DATAVERSE_INTEGRATION.md`](DATAVERSE_INTEGRATION.md)**
   - Complete integration guide
   - API usage examples
   - Best practices documentation
   - Troubleshooting guide

### Updated Files

1. **[`config/settings.py`](../config/settings.py)**
   - Added `dataverse_api_key` setting
   - Added `openstates_api_key` setting

2. **[`.env.example`](../.env.example)**
   - Added DATAVERSE_API_KEY
   - Added OPENSTATES_API_KEY
   - Clarified that Legistar/Municode don't need keys

3. **[`discovery/localview_ingestion.py`](../discovery/localview_ingestion.py)**
   - Now tries API download first
   - Falls back to manual download
   - Better error messages
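
The API-first, manual-fallback behavior listed for `localview_ingestion.py` could look roughly like the sketch below. It is only an illustration: `find_cached_files`, `resolve_localview_files`, and the `download_via_api` callback are hypothetical names, and the real module may be organized differently.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def find_cached_files(cache_dir: Path, suffixes: Iterable[str] = (".csv", ".tab")) -> List[Path]:
    """Return data files already present in the cache directory."""
    if not cache_dir.is_dir():
        return []
    wanted = {s.lower() for s in suffixes}
    return sorted(p for p in cache_dir.iterdir() if p.suffix.lower() in wanted)


def resolve_localview_files(
    cache_dir: Path,
    api_key: Optional[str],
    download_via_api: Callable[[Path], None],  # hypothetical hook around the API client
) -> List[Path]:
    """Try the API first (when a key is set), then fall back to manually saved files."""
    if api_key:
        try:
            download_via_api(cache_dir)
        except Exception as exc:  # sketch only: a real client would catch narrower errors
            print(f"API download failed ({exc}); falling back to manual files")
    files = find_cached_files(cache_dir)
    if not files:
        raise FileNotFoundError(
            f"No CSV/TAB files in {cache_dir}. Download them manually from "
            "Harvard Dataverse and save them there."
        )
    return files
```

With this shape, a failed or skipped API download still succeeds as long as the files were saved by hand, and the error message points to the manual route only when both paths come up empty.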

---

## 🚀 How to Use

### Quick Start (with API key)

```bash
# 1. Get free API key (5 min)
open https://dataverse.harvard.edu/loginpage.xhtml

# 2. Add to .env
echo "DATAVERSE_API_KEY=your_key" >> .env

# 3. Download LocalView dataset
source venv/bin/activate
python discovery/localview_ingestion.py
```
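
Once the key is available as an environment variable, the client can attach it to each request through the `X-Dataverse-key` header. A minimal sketch, assuming the values from `.env` have already been loaded into the environment; `build_headers` is a hypothetical helper, not necessarily a function in `dataverse_client.py`:

```python
import os
from typing import Dict, Mapping, Optional


def build_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return request headers, adding X-Dataverse-key only when a key is set."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/json"}
    api_key = env.get("DATAVERSE_API_KEY", "").strip()
    if api_key:
        # Header name used by the Dataverse API for token authentication.
        headers["X-Dataverse-key"] = api_key
    return headers
```

Leaving the header out when no key is configured keeps anonymous access to public datasets working.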

### Without API Key (manual)

```bash
# 1. Download files from Harvard Dataverse (URL quoted so the shell does not expand "?")
open "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM"

# 2. Save CSV files to data/cache/localview/

# 3. Run ingestion
python discovery/localview_ingestion.py
```

---

## 📊 IQSS Best Practices Implemented

| Practice | Status | Implementation |
|----------|--------|----------------|
| **API Authentication** | ✅ | X-Dataverse-key header |
| **Rate Limiting** | ✅ | 100 req/min client-side throttling |
| **Error Handling** | ✅ | All status codes (401, 404, 429, 500+) |
| **Retry Logic** | ✅ | Exponential backoff |
| **Checksum Verification** | ✅ | MD5 validation |
| **Caching** | ✅ | Version-aware metadata & file caching |
| **Pagination** | ✅ | Handles large file lists |
| **Timeout Handling** | ✅ | Configurable with retries |
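
One way to picture the version-aware caching row: include the dataset version in the cache path, so a new dataset version never reuses stale files. The layout and the `cache_path` helper below are assumptions for illustration, not the client's actual scheme.

```python
import re
from pathlib import Path


def cache_path(cache_root: Path, persistent_id: str, version: str, filename: str) -> Path:
    """Build a cache path that changes whenever the dataset version changes."""
    # Replace characters that are awkward in directory names (such as ":" and "/").
    safe_id = re.sub(r"[^A-Za-z0-9._-]+", "_", persistent_id)
    return cache_root / safe_id / f"v{version}" / filename


def is_cached(cache_root: Path, persistent_id: str, version: str, filename: str) -> bool:
    """True when this exact file for this exact dataset version is already on disk."""
    return cache_path(cache_root, persistent_id, version, filename).is_file()
```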

---

## 🔍 What Makes This Production-Ready

### 1. **Follows Official IQSS Standards**
Based on official Dataverse API documentation and GitHub repo patterns.

### 2. **Comprehensive Error Handling**
```python
# Failure cases the client handles:
#   401 Unauthorized  → clear message telling the user to get an API key
#   404 Not Found     → dataset doesn't exist
#   429 Rate Limited  → auto-retry with backoff
#   500+ Server Error → exponential backoff retry
#   Timeout           → configurable retry logic
```
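
The retry policy above can be sketched as three small helpers. The names (`should_retry`, `backoff_delay`, `describe_error`) and the default base and cap values are illustrative assumptions rather than the client's actual settings.

```python
import random
from typing import Optional


def should_retry(status: int) -> bool:
    """Retry on rate limiting (429) and server errors (5xx); other codes fail fast."""
    return status == 429 or 500 <= status < 600


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, jitter: float = 0.0) -> float:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped at `cap`."""
    delay = min(cap, base * (2 ** attempt))
    # Optional random jitter spreads out retries from concurrent clients.
    return delay + random.uniform(0, jitter)


def describe_error(status: int) -> Optional[str]:
    """Map non-retryable status codes to a user-facing message."""
    if status == 401:
        return "Unauthorized: set DATAVERSE_API_KEY in .env"
    if status == 404:
        return "Not found: the dataset or file does not exist"
    return None
```

Capping the delay keeps a long outage from producing multi-minute sleeps, while 401 and 404 are reported immediately because retrying cannot fix them.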

### 3. **Data Integrity**
```python
import hashlib

# MD5 checksum verification
# file_info: the file's metadata entry; content: the downloaded bytes
expected = file_info["dataFile"]["md5"]
actual = hashlib.md5(content).hexdigest()
if expected != actual:
    logger.error("Checksum mismatch - file corrupted")
```
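
For large files it can be preferable to hash from disk in chunks instead of holding the whole download in memory. A minimal sketch using only the standard library; `md5_of_file` and `verify_md5` are hypothetical helper names:

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks by default."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: Path, expected: str) -> bool:
    """Compare the file's digest with the expected checksum, ignoring case and whitespace."""
    return md5_of_file(path) == expected.strip().lower()
```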

### 4. **Performance Optimization**
```python
# Client-side rate limiting prevents 429 errors
# Version-aware caching reduces API calls
# Efficient async downloads
```
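
Client-side throttling at 100 requests per minute could be implemented with a sliding window, as in the sketch below. The class name and the injectable `clock` parameter are assumptions made for illustration and testing; the actual client may use a different algorithm.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` requests within any `period`-second window."""

    def __init__(
        self,
        max_calls: int = 100,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._calls: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait before the next request is allowed (0.0 if allowed now)."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def record(self) -> None:
        """Register that a request is being sent now."""
        self._calls.append(self._clock())
```

In an async client, the caller would sleep for `wait_time()` seconds (for example with `asyncio.sleep`) and then call `record()` just before sending the request.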

### 5. **Developer Experience**
```python
# Simple async API (call from inside an async function)
client = DataverseClient(api_key="your-key")
result = await client.download_dataset("doi:10.7910/DVN/NJTBEM")

# Clear logging
logger.info("Downloading file 1/10...")
logger.success("✓ Download complete")
logger.error("✗ Checksum failed")
```

---

## 📈 Impact

### Before
- ❌ Basic API calls only
- ❌ No error handling
- ❌ No rate limiting
- ❌ No checksum verification
- ❌ Manual downloads required

### After
- ✅ Production-ready API client
- ✅ Comprehensive error handling
- ✅ Smart rate limiting
- ✅ Checksum verification
- ✅ Optional automatic downloads
- ✅ Falls back to manual gracefully

---

## 🎓 Learning Resources

### Official IQSS Documentation
- **Dataverse API**: https://guides.dataverse.org/en/latest/api/index.html
- **GitHub Repo**: https://github.com/IQSS/dataverse
- **Community**: https://groups.google.com/group/dataverse-community

### Our Documentation
- **Integration Guide**: [docs/DATAVERSE_INTEGRATION.md](DATAVERSE_INTEGRATION.md)
- **LocalView Guide**: [docs/LOCALVIEW_INTEGRATION_GUIDE.md](LOCALVIEW_INTEGRATION_GUIDE.md)
- **API Client Code**: [discovery/dataverse_client.py](../discovery/dataverse_client.py)

---

## 🔥 Next Steps

1. **Get API Key** (optional but recommended)
   - Sign up at https://dataverse.harvard.edu/loginpage.xhtml
   - Generate token in Account Settings
   - Add to `.env`: `DATAVERSE_API_KEY=your_key`

2. **Download LocalView**
   ```bash
   python discovery/localview_ingestion.py
   ```

3. **Verify Results**
   ```bash
   ls -lh data/cache/localview/
   # Should show CSV/TAB files
   ```

4. **Process Data**
   - Files automatically loaded into Delta Lake
   - Bronze layer: `bronze/localview/municipalities`
   - Bronze layer: `bronze/localview/videos`

---

## ✨ Summary

We now have:

1. ✅ **Production-ready Dataverse client** following all IQSS best practices
2. ✅ **Automatic downloads** with API key (optional)
3. ✅ **Manual download support** (fallback)
4. ✅ **Comprehensive error handling** (all status codes)
5. ✅ **Data integrity** (MD5 checksums)
6. ✅ **Smart caching** (version-aware)
7. ✅ **Rate limiting** (prevents 429 errors)
8. ✅ **Great documentation** (guides + examples)

This is the **same quality** you'd expect from official Harvard/IQSS integrations! 🎉

---

## 🙏 Credits

- **IQSS Team** - Official Dataverse API and best practices
- **Harvard Dataverse** - Hosting the LocalView dataset
- **Harvard Mellon Urbanism Initiative** - Creating LocalView

---

## 📁 Files Summary

| File | Lines | Purpose |
|------|-------|---------|
| discovery/dataverse_client.py | 600+ | Production Dataverse API client |
| docs/DATAVERSE_INTEGRATION.md | 400+ | Integration guide & examples |
| docs/DATAVERSE_INTEGRATION_SUMMARY.md | 200+ | Quick reference (this file) |
| config/settings.py | Updated | Add dataverse_api_key setting |
| .env.example | Updated | Add DATAVERSE_API_KEY example |
| discovery/localview_ingestion.py | Updated | Use API client + fallback |

**Total new code and documentation**: ~1,200 lines of production-ready integration! 🚀