# Bulk Downloads vs API: Which to Use?
## TL;DR
**Use Bulk Downloads** for:
- ✅ Historical analysis (analyzing past sessions)
- ✅ Map generation (need all states at once)
- ✅ Research projects (large datasets)
- ✅ Offline processing
- ✅ Multi-issue tracking across all states
**Use API** for:
- ✅ Real-time bill status (same-day updates)
- ✅ Search by specific keywords
- ✅ Individual bill lookups
- ✅ Automated alerts for bill changes
---
## Comparison Table
| Feature | Bulk Download | API |
|---------|--------------|-----|
| **Speed (50 states)** | ⚡ 5-10 minutes | 🐌 2-4 hours |
| **API Key Required** | ❌ No | ✅ Yes |
| **Rate Limits** | ❌ None | ⚠️ 50K/month |
| **Internet Required** | Download once | Always |
| **Data Freshness** | Monthly updates | Real-time |
| **Bill Text** | ✅ Full text (JSON) | ✅ Via API |
| **Complete Sessions** | ✅ All bills | Paginated |
| **Cost** | 💰 Free | 💰 Free (50K limit) |
| **Redistribution** | ✅ Allowed | ⚠️ Varies by state |
---
## Real-World Example
### Task: Create fluoridation legislation map for all 50 states (2024)
#### Method 1: Bulk Download
```bash
# Download all 50 states
python scripts/bulk_legislative_download.py --year 2024 --format csv --merge
# Time: ~5 minutes
# API calls: 0
# Result: 1 CSV file with ALL bills
```
**Result:** One 500MB file with ~100,000 bills from all states
#### Method 2: API
```bash
# Search each state individually
python scripts/legislative_tracker.py --issue fluoridation --year 2024
# Time: ~2-4 hours
# API calls: ~10,000 (search + pagination)
# Result: Filtered bills matching "fluoridation"
```
**Result:** Filtered dataset with ~500 matching bills
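The two results are not directly comparable: the bulk file is unfiltered, so matching the API's keyword search takes one extra local step. A minimal sketch of that step, assuming the merged CSV has a `title` column (the column name and the `filter_bills` helper are hypothetical; adjust them to the real file's header):

```python
import csv
import io


def filter_bills(csv_file, keyword, column="title"):
    """Return rows whose `column` contains `keyword` (case-insensitive).

    `column` is an assumed header name, not confirmed from the real CSV.
    """
    needle = keyword.lower()
    reader = csv.DictReader(csv_file)
    return [row for row in reader if needle in (row.get(column) or "").lower()]


# Tiny in-memory stand-in for the merged bulk file.
sample = io.StringIO(
    "state,bill_id,title\n"
    "AL,HB123,Community water fluoridation standards\n"
    "TX,SB9,Highway funding\n"
    "UT,HB81,Fluoridation of public water systems\n"
)
matches = filter_bills(sample, "fluoridation")
print([row["bill_id"] for row in matches])
```

For the real file, pass `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.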
---
## When API is Better
### Use Case 1: Real-Time Bill Tracking
**Need:** Alert when a specific bill status changes
```python
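# API can check the latest status on demand.
# Assumes `client` is an async HTTP client such as httpx.AsyncClient.
async def check_bill_status(client, base_url, bill_id):
    response = await client.get(f"{base_url}/bills/{bill_id}")
    response.raise_for_status()
    return response.json()["latest_action"]

# Bulk: would need to wait for the next monthly dump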
```
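Turning a status check into an alert means remembering the last status seen for each bill and comparing it with the current one. A rough sketch of that comparison, with the network call left out; `detect_changes` and the status strings are illustrative, not part of the project's scripts:

```python
def detect_changes(previous, current):
    """Return {bill_id: (old, new)} for bills whose latest action changed."""
    changes = {}
    for bill_id, new_action in current.items():
        old_action = previous.get(bill_id)
        # Bills seen for the first time are not reported as changes.
        if old_action is not None and old_action != new_action:
            changes[bill_id] = (old_action, new_action)
    return changes


# Statuses saved from the last poll vs. statuses just fetched (made-up values).
previous = {"HB123": "Introduced", "SB9": "Referred to committee"}
current = {"HB123": "Passed House", "SB9": "Referred to committee"}
alerts = detect_changes(previous, current)
print(alerts)
```

In practice `current` would be filled from the API on each poll and then saved as `previous` for the next run.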
### Use Case 2: Keyword Search
**Need:** Find all bills mentioning "oral health"
```python
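# API can search full text (client: an async HTTP client, e.g. httpx.AsyncClient)
async def search_bills(client, base_url, query, jurisdiction):
    params = {"q": query, "jurisdiction": jurisdiction}
    response = await client.get(f"{base_url}/bills", params=params)
    return response.json()

# Usage: await search_bills(client, base_url, "oral health", "AL")
# Bulk: would need to download all bills, then search locally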
```
### Use Case 3: Single Bill Lookup
**Need:** Get details for one specific bill
```python
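# API returns a single bill in one request
async def get_bill(client, base_url, state, year, bill_number):
    response = await client.get(f"{base_url}/bills/{state}/{year}/{bill_number}")
    return response.json()
# Usage: await get_bill(client, base_url, "AL", 2024, "HB123")
# Bulk: download an entire session just for one bill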
```
---
## When Bulk Downloads are Better
### Use Case 1: All-State Analysis
**Need:** Map legislation across all 50 states
**API Approach:**
```python
# 50 states × 100 requests per state = 5,000 API calls
# Time: ~2 hours (with rate limiting)
# Risk: Hit API quota limit
```
**Bulk Approach:**
```python
# Download all 50 state CSV files
# Time: ~5 minutes
# API calls: 0
# No quota concerns
```
**Winner:** Bulk (roughly 24x faster by these estimates: ~2 hours vs ~5 minutes)
### Use Case 2: Historical Trends
**Need:** Analyze fluoridation bills from 2010-2024
**API Approach:**
```python
# 50 states × 15 years × 100 requests = 75,000 API calls
# Quota: exceeds the 50,000 requests/month free tier
# Cost: Need paid plan
```
**Bulk Approach:**
```python
# Download 50 states × 15 years = 750 CSV files
# Time: ~30 minutes
# Cost: Free, no limits
```
**Winner:** Bulk (only viable option)
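The quota arithmetic above can be checked in a few lines. This sketch uses the document's own estimates (100 requests per state per year, 50,000 free requests per month); the function names are illustrative, and it assumes the quota resets monthly:

```python
import math

FREE_TIER_REQUESTS_PER_MONTH = 50_000  # figure quoted in this document


def estimate_api_calls(states, years, requests_per_state_year):
    """Total requests needed to page through every state and year."""
    return states * years * requests_per_state_year


def months_on_free_tier(total_calls, monthly_quota=FREE_TIER_REQUESTS_PER_MONTH):
    """Whole months of free quota needed to make `total_calls` requests."""
    return math.ceil(total_calls / monthly_quota)


total_calls = estimate_api_calls(states=50, years=15, requests_per_state_year=100)
print(total_calls, months_on_free_tier(total_calls))
```

By these numbers the historical pull needs 75,000 requests, which is more than one month of free quota.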
### Use Case 3: Offline Processing
**Need:** Process data without internet
**API Approach:**
```python
# Must cache all API responses locally
# Complex caching logic needed
# Cache invalidation issues
```
**Bulk Approach:**
```python
# Download once, process forever
# No internet needed after download
# Simple file-based workflow
```
**Winner:** Bulk (simpler)
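To make the API-side cost concrete, here is a deliberately small file-based cache. It is a sketch only: `cached_fetch` and `fake_api` are hypothetical names, the fetcher is a stand-in for a real API request, and there is no invalidation, which is exactly the part that gets complicated.

```python
import json
import tempfile
from pathlib import Path


def cached_fetch(key, fetch, cache_dir):
    """Return the cached JSON for `key`, calling `fetch(key)` only on a miss."""
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch(key)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data


api_calls = []


def fake_api(key):
    """Stand-in for a network request; records each call it receives."""
    api_calls.append(key)
    return {"bill": key, "status": "Introduced"}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("AL-HB123", fake_api, tmp)
    second = cached_fetch("AL-HB123", fake_api, tmp)  # served from disk

print(len(api_calls))
```

A cached entry never refreshes here, so a bill whose status changes would stay stale until its file is deleted.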
---
## Hybrid Approach (Best of Both Worlds)
### Strategy: Bulk for foundation, API for updates
```python
# Run in a Jupyter/IPython notebook: uses the `!` shell escape and top-level `await`.
from datetime import datetime, timedelta

import pandas as pd

# 1. Download complete 2024 session (bulk)
!python scripts/bulk_legislative_download.py --year 2024 --merge

# 2. Load bulk data
df = pd.read_csv('data/cache/legislation_bulk/all_states_2024.csv')
print(f"Loaded {len(df)} bills from bulk download")

# 3. Use API for recent updates (last 7 days)
recent_cutoff = datetime.now() - timedelta(days=7)

# API search for bills updated in the last week
# (`api_client` is assumed to be an already-configured async HTTP client)
async def get_recent_updates():
    params = {
        "updated_since": recent_cutoff.isoformat(),
        "jurisdiction": "all",
    }
    response = await api_client.get("/bills", params=params)
    return response.json()

recent = await get_recent_updates()

# 4. Merge bulk + recent updates
# Assumes the API returns a list of bill records with the same columns as the CSV.
combined = pd.concat([df, pd.DataFrame(recent)], ignore_index=True)
```
**Benefits:**
- Complete historical data (bulk)
- Real-time updates (API)
- Minimal API calls (only recent changes)
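One detail to watch in the merge step: a bill that appears in both the bulk file and the recent updates should end up as a single record, with the newer values taking precedence. A minimal sketch using plain dictionaries, assuming each record carries a unique `bill_id` field (the field name and sample records are hypothetical):

```python
def merge_updates(bulk_rows, recent_rows, key="bill_id"):
    """Overlay recent API records on bulk records, matching on `key`."""
    merged = {row[key]: row for row in bulk_rows}
    for row in recent_rows:
        # Newer fields overwrite older ones; unseen bills are appended.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())


bulk = [
    {"bill_id": "AL-HB123", "status": "Introduced"},
    {"bill_id": "TX-SB9", "status": "Introduced"},
]
recent = [
    {"bill_id": "AL-HB123", "status": "Passed House"},
    {"bill_id": "UT-HB81", "status": "Introduced"},
]
combined = merge_updates(bulk, recent)
print(combined)
```

With pandas, the equivalent is to concatenate and then call `drop_duplicates(subset="bill_id", keep="last")`.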
---
## Recommendations by Project Type
### Academic Research
**Use Bulk Downloads**
- Need complete datasets
- Historical analysis
- No real-time requirements
- May publish/redistribute
### News/Journalism
**Use API**
- Need latest bill status
- Breaking news coverage
- Specific bill tracking
- Real-time alerts
### Advocacy Campaigns
**Use Hybrid**
- Bulk for initial analysis
- API for monitoring active bills
- Alerts when bills advance
- Historical context + real-time
### Government Dashboards
**Use Hybrid**
- Bulk for historical trends
- API for current session
- Daily/weekly refresh
- Public redistribution
---
## Cost Analysis
### Free Tier Limits
**API:**
- 50,000 requests/month free
- ~100 bills per request (pagination)
- = ~5M bill records/month max
**Bulk:**
- Unlimited downloads
- ~100K bills per download
- = Unlimited bill records/month
### Time to Download All States (2024)
**API (50 states):**
```
50 states × 100 API calls = 5,000 requests
5,000 requests × 0.5s rate limit = 2,500 seconds = ~42 minutes
(Best case: excludes pagination of search results, retries, and processing time, so real runs take longer)
```
**Bulk (50 states):**
```
50 CSV downloads × 5s each = 250 seconds = ~4 minutes
(Includes all data, no processing needed)
```
**Time Saved:** ~38 minutes (10x faster)
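The same estimate, written out so the inputs can be changed. The per-request figures (0.5 s per API call, 5 s per CSV download) are the document's assumptions, not measurements:

```python
def sequential_seconds(request_count, seconds_per_request):
    """Wall-clock time for requests made one after another."""
    return request_count * seconds_per_request


api_seconds = sequential_seconds(50 * 100, 0.5)  # 50 states x 100 calls each
bulk_seconds = sequential_seconds(50, 5)         # one CSV download per state
speedup = api_seconds / bulk_seconds

print(f"API: {api_seconds / 60:.0f} min, bulk: {bulk_seconds / 60:.0f} min, {speedup:.0f}x")
```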
### Data Completeness
**API:**
- Must paginate through all results
- Risk of missing data if pagination fails
- Requires careful error handling
**Bulk:**
- Complete session in one file
- Guaranteed completeness
- No pagination errors
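The pagination risk on the API side can be reduced by retrying a failed page instead of skipping it, and by failing loudly when retries run out. A sketch under stated assumptions: `fetch_page` is a hypothetical callable returning `(records, has_more)`, standing in for a real API request, and the fake API below exists only to exercise the loop.

```python
import time


def fetch_all_pages(fetch_page, max_retries=3, backoff_seconds=0.0):
    """Collect every page, retrying a failed page rather than skipping it."""
    records, page = [], 1
    while True:
        for attempt in range(1, max_retries + 1):
            try:
                batch, has_more = fetch_page(page)
                break
            except ConnectionError:
                if attempt == max_retries:
                    raise  # fail loudly rather than return partial data
                time.sleep(backoff_seconds * attempt)
        records.extend(batch)
        if not has_more:
            return records
        page += 1


# Fake three-page API whose second page fails once before succeeding.
pages = {1: ["b1", "b2"], 2: ["b3", "b4"], 3: ["b5"]}
failures = {2: 1}


def fake_fetch(page):
    if failures.get(page, 0) > 0:
        failures[page] -= 1
        raise ConnectionError("transient failure")
    return pages[page], page < 3


all_bills = fetch_all_pages(fake_fetch)
print(all_bills)
```

Raising after the last retry is a deliberate choice: an incomplete dataset that looks complete is harder to detect than a failed run.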
---
## PostgreSQL Dump Option
**For power users:**
```bash
# Download complete Open States database
python scripts/bulk_legislative_download.py --postgres --month 2026-04
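# Create the target database (skip if it already exists), then restore into it
createdb openstates
pg_restore -d openstates 2026-04-public.pgdump
# Now use SQL for analysis!
# Table and column names below are illustrative; check the restored schema (\dt in psql)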
psql openstates -c "
SELECT state, COUNT(*) as bill_count
FROM bills
WHERE session_year = 2024
GROUP BY state
ORDER BY bill_count DESC;
"
```
**Benefits:**
- Complete database with relationships
- SQL queries for complex analysis
- No need for Python/pandas
- Can use PostgreSQL extensions
- Best for large-scale research
**Drawbacks:**
- Large file size (~5GB compressed)
- Requires PostgreSQL installation
- More complex setup
---
## Final Recommendation
**Default choice: Bulk Downloads**
Reasons:
1. Faster (10x speed improvement)
2. No API key setup
3. No rate limits
4. Work offline
5. Complete sessions guaranteed
**Switch to API when:**
- Need real-time status
- Tracking specific bills
- Keyword search required
- Small subset of data
**Use Both when:**
- Initial bulk download
- Periodic API updates
- Best of both worlds