Bulk Downloads vs API: Which to Use?
TL;DR
Use Bulk Downloads for:
- Historical analysis (analyzing past sessions)
- Map generation (need all states at once)
- Research projects (large datasets)
- Offline processing
- Multi-issue tracking across all states
Use API for:
- Real-time bill status (same-day updates)
- Search by specific keywords
- Individual bill lookups
- Automated alerts for bill changes
Comparison Table
| Feature | Bulk Download | API |
|---|---|---|
| Speed (50 states) | 5-10 minutes | 2-4 hours |
| API Key Required | No | Yes |
| Rate Limits | None | 50K/month |
| Internet Required | Download once | Always |
| Data Freshness | Monthly updates | Real-time |
| Bill Text | Full text (JSON) | Via API |
| Complete Sessions | All bills | Paginated |
| Cost | Free | Free (50K limit) |
| Redistribution | Allowed | Varies by state |
Real-World Example
Task: Create fluoridation legislation map for all 50 states (2024)
Method 1: Bulk Download
# Download all 50 states
python scripts/bulk_legislative_download.py --year 2024 --format csv --merge
# Time: ~5 minutes
# API calls: 0
# Result: 1 CSV file with ALL bills
Result: One 500MB file with ~100,000 bills from all states
Method 2: API
# Search each state individually
python scripts/legislative_tracker.py --issue fluoridation --year 2024
# Time: ~2-4 hours
# API calls: ~10,000 (search + pagination)
# Result: Filtered bills matching "fluoridation"
Result: Filtered dataset with ~500 matching bills
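After a bulk download, the same keyword filter can be applied locally. The following is a minimal sketch using only the standard library; the column names (`state`, `bill_id`, `title`, `subjects`) and the helper `filter_bills` are assumptions for illustration and may not match the actual layout of the merged CSV.

```python
import csv
import io

def filter_bills(csv_file, keyword, text_columns=("title", "subjects")):
    """Yield rows whose text columns mention keyword (case-insensitive)."""
    needle = keyword.lower()
    for row in csv.DictReader(csv_file):
        # Join the searchable columns; missing or empty values count as "".
        haystack = " ".join(row.get(col) or "" for col in text_columns)
        if needle in haystack.lower():
            yield row

# Tiny in-memory stand-in for the merged CSV (hypothetical columns and rows).
sample = io.StringIO(
    "state,bill_id,title,subjects\n"
    "AL,HB123,Community water fluoridation standards,health\n"
    "TX,SB9,Highway funding,transportation\n"
    "UT,HB81,Municipal water systems,Fluoridation\n"
)
matches = list(filter_bills(sample, "fluoridation"))
print([(m["state"], m["bill_id"]) for m in matches])
```

For a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.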
When API is Better
Use Case 1: Real-Time Bill Tracking
Need: Alert when a specific bill status changes
# API can check latest status
async def check_bill_status(bill_id):
    response = await client.get(f"{base_url}/bills/{bill_id}")
    return response.json()['latest_action']
# Bulk: Would need to wait for next monthly dump
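To turn status checks into alerts, each poll has to be compared against the previously seen status. A minimal sketch of that comparison follows; `detect_status_changes` and the sample bill IDs are hypothetical, and the dictionaries stand in for whatever the status check returns.

```python
def detect_status_changes(previous, current):
    """Return {bill_id: (old_action, new_action)} for bills whose latest action changed.

    Bills that appear for the first time are reported with old_action = None.
    """
    changes = {}
    for bill_id, new_action in current.items():
        old_action = previous.get(bill_id)
        if old_action != new_action:
            changes[bill_id] = (old_action, new_action)
    return changes

# Hypothetical snapshots from two consecutive polls.
previous = {"HB123": "Referred to committee", "SB9": "Introduced"}
current = {"HB123": "Passed House", "SB9": "Introduced", "HB200": "Introduced"}

changes = detect_status_changes(previous, current)
for bill_id, (old, new) in changes.items():
    print(f"ALERT {bill_id}: {old!r} -> {new!r}")
```

Persisting `current` between runs (for example as a JSON file) is enough state for a simple alerting loop.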
Use Case 2: Keyword Search
Need: Find all bills mentioning "oral health"
# API can search full text
params = {"q": "oral health", "jurisdiction": "AL"}
response = await client.get(f"{base_url}/bills", params=params)
# Bulk: Would need to download all bills, then search locally
Use Case 3: Single Bill Lookup
Need: Get details for one specific bill
# API is instant
response = await client.get(f"{base_url}/bills/AL/2024/HB123")
# Bulk: Download entire session just for one bill
When Bulk Downloads are Better
Use Case 1: All-State Analysis
Need: Map legislation across all 50 states
API Approach:
# 50 states × 100 requests per state = 5,000 API calls
# Time: ~2 hours (with rate limiting)
# Risk: Hit API quota limit
Bulk Approach:
# Download all 50 state CSV files
# Time: ~5 minutes
# API calls: 0
# No quota concerns
Winner: Bulk (roughly 24x faster by these estimates: ~2 hours vs. ~5 minutes)
Use Case 2: Historical Trends
Need: Analyze fluoridation bills from 2010-2024
API Approach:
# 50 states × 15 years × 100 requests = 75,000 API calls
# Quota: Exceeds the 50K/month free tier
# Cost: Need paid plan
Bulk Approach:
# Download 50 states × 15 years = 750 CSV files
# Time: ~30 minutes
# Cost: Free, no limits
Winner: Bulk (only viable option)
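The quota arithmetic above can be checked with a few lines. This is a sketch under the document's own figures (100 requests per state-session, 50,000 free requests per month); the helper name `months_of_quota` is made up for illustration.

```python
import math

def months_of_quota(states, years, requests_per_session, monthly_quota=50_000):
    """Return (total_requests, months of free-tier quota needed to cover them)."""
    total = states * years * requests_per_session
    return total, math.ceil(total / monthly_quota)

total, months = months_of_quota(states=50, years=15, requests_per_session=100)
print(f"{total:,} requests, {months} month(s) of free-tier quota")
```

By this estimate a 15-year pull does not fit in a single month of free quota, which is why bulk files are the practical choice for historical work.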
Use Case 3: Offline Processing
Need: Process data without internet
API Approach:
# Must cache all API responses locally
# Complex caching logic needed
# Cache invalidation issues
Bulk Approach:
# Download once, process forever
# No internet needed after download
# Simple file-based workflow
Winner: Bulk (simpler)
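A file-based workflow can be as small as a loop over the downloaded CSVs. The sketch below counts bills per file with no network access; the file naming (`AL_2024.csv`) and column names are assumptions, and temporary files stand in for the real download directory.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

def count_bills_by_file(data_dir):
    """Count data rows in every CSV under data_dir, keyed by file stem."""
    counts = Counter()
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            counts[path.stem] = sum(1 for _ in csv.DictReader(f))
    return counts

# Demo with throwaway files standing in for previously downloaded state CSVs.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "AL_2024.csv").write_text("bill_id,title\nHB1,A\nHB2,B\n", encoding="utf-8")
    Path(tmp, "TX_2024.csv").write_text("bill_id,title\nSB1,C\n", encoding="utf-8")
    counts = count_bills_by_file(tmp)

print(dict(counts))
```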
Hybrid Approach (Best of Both Worlds)
Strategy: Bulk for foundation, API for updates
# 1. Download complete 2024 session (bulk)
!python scripts/bulk_legislative_download.py --year 2024 --merge
# 2. Load bulk data
import pandas as pd
df = pd.read_csv('data/cache/legislation_bulk/all_states_2024.csv')
print(f"Loaded {len(df)} bills from bulk download")
# 3. Use API for recent updates (last 7 days)
from datetime import datetime, timedelta
recent_cutoff = datetime.now() - timedelta(days=7)
# API search for bills updated in last week
async def get_recent_updates():
    params = {
        "updated_since": recent_cutoff.isoformat(),
        "jurisdiction": "all"
    }
    return await api_client.get("/bills", params=params)
recent = await get_recent_updates()
# 4. Merge bulk + recent updates
# pd.concat needs DataFrames, so convert the API response first
# (assumes the response JSON holds the bill records under "results")
recent_df = pd.DataFrame(recent.json()["results"])
combined = pd.concat([df, recent_df], ignore_index=True)
Benefits:
- Complete historical data (bulk)
- Real-time updates (API)
- Minimal API calls (only recent changes)
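One detail the merge step needs to handle is that a bill updated in the last week appears in both sources. A minimal sketch of a merge that lets the API record win is shown below; the `bill_id` key, the `status` field, and the helper `merge_updates` are assumptions for illustration.

```python
def merge_updates(bulk_rows, recent_rows, key="bill_id"):
    """Merge two lists of bill dicts; rows from recent_rows replace bulk rows with the same key."""
    merged = {row[key]: row for row in bulk_rows}
    for row in recent_rows:
        merged[row[key]] = row  # newer API record overrides or extends the bulk data
    return list(merged.values())

# Hypothetical records: one bill updated via the API, one brand new.
bulk = [
    {"bill_id": "AL-HB1", "status": "Introduced"},
    {"bill_id": "AL-HB2", "status": "Introduced"},
]
recent = [
    {"bill_id": "AL-HB2", "status": "Passed House"},
    {"bill_id": "AL-HB3", "status": "Introduced"},
]

combined_rows = merge_updates(bulk, recent)
print(combined_rows)
```

With pandas, a comparable effect can be had by concatenating and then calling `drop_duplicates(subset=..., keep="last")` on the identifying column.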
Recommendations by Project Type
Academic Research
Use Bulk Downloads
- Need complete datasets
- Historical analysis
- No real-time requirements
- May publish/redistribute
News/Journalism
Use API
- Need latest bill status
- Breaking news coverage
- Specific bill tracking
- Real-time alerts
Advocacy Campaigns
Use Hybrid
- Bulk for initial analysis
- API for monitoring active bills
- Alerts when bills advance
- Historical context + real-time
Government Dashboards
Use Hybrid
- Bulk for historical trends
- API for current session
- Daily/weekly refresh
- Public redistribution
Cost Analysis
Free Tier Limits
API:
- 50,000 requests/month free
- ~100 bills per request (pagination)
- = ~5M bill records/month max
Bulk:
- Unlimited downloads
- ~100K bills per download
- = Unlimited bill records/month
Time to Download All States (2024)
API (50 states):
50 states × 100 API calls = 5,000 requests
5,000 requests × 0.5s rate limit = 2,500 seconds = ~42 minutes
(Not including processing time)
Bulk (50 states):
50 CSV downloads × 5s each = 250 seconds = ~4 minutes
(Includes all data, no processing needed)
Time Saved: ~38 minutes (10x faster)
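The estimate above is simple enough to reproduce and adjust for other assumptions. The sketch below uses the same figures (100 requests per state at 0.5 s each, versus one 5 s download per state); `estimate_seconds` is a hypothetical helper, and real timings will depend on network and server conditions.

```python
def estimate_seconds(units, requests_per_unit, seconds_per_request):
    """Total wall-clock seconds for sequential requests."""
    return units * requests_per_unit * seconds_per_request

api_seconds = estimate_seconds(50, 100, 0.5)   # 50 states, 100 paginated calls each
bulk_seconds = estimate_seconds(50, 1, 5)      # 50 states, one CSV download each

speedup = api_seconds / bulk_seconds
saved_minutes = (api_seconds - bulk_seconds) / 60
print(f"API: {api_seconds / 60:.0f} min, bulk: {bulk_seconds / 60:.0f} min, "
      f"speedup: {speedup:.0f}x, saved: {saved_minutes:.1f} min")
```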
Data Completeness
API:
- Must paginate through all results
- Risk of missing data if pagination fails
- Requires careful error handling
Bulk:
- Complete session in one file
- Nothing dropped by failed requests (complete as of the monthly dump)
- No pagination errors
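For the API route, the pagination risk can be reduced by retrying failed pages before moving on. The following is a sketch of that pattern, not the project's implementation: `fetch_page` is a hypothetical callable returning `(results, max_page)`, and a fake fetcher with one transient failure stands in for real HTTP calls.

```python
import time

def fetch_all_pages(fetch_page, max_retries=3, backoff_seconds=0.0):
    """Collect results from every page, retrying a failed page up to max_retries times."""
    results = []
    page, max_page = 1, 1
    while page <= max_page:
        for attempt in range(1, max_retries + 1):
            try:
                items, max_page = fetch_page(page)
                break
            except ConnectionError:
                if attempt == max_retries:
                    raise  # give up loudly rather than silently skip a page
                time.sleep(backoff_seconds * attempt)
        results.extend(items)
        page += 1
    return results

# Fake fetcher: three pages, with the first request for page 2 failing once.
calls = {"n": 0}
def fake_fetch(page):
    calls["n"] += 1
    if page == 2 and calls["n"] == 2:
        raise ConnectionError("transient failure")
    data = {1: ["HB1", "HB2"], 2: ["HB3"], 3: ["HB4"]}
    return data[page], 3

all_bills = fetch_all_pages(fake_fetch)
print(all_bills, "requests made:", calls["n"])
```

Raising after the final retry is a deliberate choice here: a hard failure is easier to notice than a dataset that is quietly missing a page.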
PostgreSQL Dump Option
For power users:
# Download complete Open States database
python scripts/bulk_legislative_download.py --postgres --month 2026-04
# Restore to local PostgreSQL
pg_restore -d openstates 2026-04-public.pgdump
# Now use SQL for analysis!
psql openstates -c "
SELECT state, COUNT(*) as bill_count
FROM bills
WHERE session_year = 2024
GROUP BY state
ORDER BY bill_count DESC;
"
Benefits:
- Complete database with relationships
- SQL queries for complex analysis
- No need for Python/pandas
- Can use PostgreSQL extensions
- Best for large-scale research
Drawbacks:
- Large file size (~5GB compressed)
- Requires PostgreSQL installation
- More complex setup
Final Recommendation
Default choice: Bulk Downloads
Reasons:
- Faster (roughly 10x or more in the estimates above)
- No API key setup
- No rate limits
- Work offline
- Complete sessions guaranteed
Switch to API when:
- Need real-time status
- Tracking specific bills
- Keyword search required
- Small subset of data
Use Both when:
- Initial bulk download
- Periodic API updates
- Best of both worlds