
Bulk Downloads vs API: Which to Use?

TL;DR

Use Bulk Downloads for:

  • ✅ Historical analysis (analyzing past sessions)
  • ✅ Map generation (need all states at once)
  • ✅ Research projects (large datasets)
  • ✅ Offline processing
  • ✅ Multi-issue tracking across all states

Use API for:

  • ✅ Real-time bill status (same-day updates)
  • ✅ Search by specific keywords
  • ✅ Individual bill lookups
  • ✅ Automated alerts for bill changes

Comparison Table

| Feature | Bulk Download | API |
|---------|---------------|-----|
| Speed (50 states) | ⚡ 5-10 minutes | 🐌 2-4 hours |
| API Key Required | ❌ No | ✅ Yes |
| Rate Limits | ❌ None | ⚠️ 50K requests/month |
| Internet Required | Download once | Always |
| Data Freshness | Monthly updates | Real-time |
| Bill Text | ✅ Full text (JSON) | ✅ Via API |
| Complete Sessions | ✅ All bills | Paginated |
| Cost | 💰 Free | 💰 Free (50K limit) |
| Redistribution | ✅ Allowed | ⚠️ Varies by state |

Real-World Example

Task: Create fluoridation legislation map for all 50 states (2024)

Method 1: Bulk Download

# Download all 50 states
python scripts/bulk_legislative_download.py --year 2024 --format csv --merge

# Time: ~5 minutes
# API calls: 0
# Result: 1 CSV file with ALL bills

Result: One 500MB file with ~100,000 bills from all states

Method 2: API

# Search each state individually
python scripts/legislative_tracker.py --issue fluoridation --year 2024

# Time: ~2-4 hours
# API calls: ~10,000 (search + pagination)
# Result: Filtered bills matching "fluoridation"

Result: Filtered dataset with ~500 matching bills


When API is Better

Use Case 1: Real-Time Bill Tracking

Need: Alert when a specific bill status changes

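# API can check latest status
# Assumes `client` is an async HTTP client (e.g. httpx.AsyncClient) and
# `base_url` is the API root; both are passed in rather than read from globals.
async def check_bill_status(client, base_url, bill_id):
    response = await client.get(f"{base_url}/bills/{bill_id}")
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return response.json()['latest_action']

# Bulk: Would need to wait for the next monthly dump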
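
An alert comes down to comparing the newest status against the last one seen. The following is a minimal sketch, assuming statuses are plain strings; `detect_status_changes`, `last_seen`, and the sample bill IDs are hypothetical names for illustration, and a real tracker would persist `last_seen` and fill `current` from the API call above.

```python
def detect_status_changes(last_seen, current):
    """Return {bill_id: (old_status, new_status)} for bills whose status changed.

    Both arguments map a bill ID to its latest status string. Bills that
    appear for the first time are reported with an old status of None.
    """
    changes = {}
    for bill_id, new_status in current.items():
        old_status = last_seen.get(bill_id)
        if old_status != new_status:
            changes[bill_id] = (old_status, new_status)
    return changes


# Hypothetical statuses; in practice `current` would come from the API.
last_seen = {"HB123": "Introduced", "SB45": "In committee"}
current = {"HB123": "Passed House", "SB45": "In committee", "HB200": "Introduced"}

changes = detect_status_changes(last_seen, current)
for bill_id, (old, new) in changes.items():
    print(f"{bill_id}: {old} -> {new}")
```

Only bills whose status differs from the stored value are reported, so unchanged bills generate no alert.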

Use Case 2: Keyword Search

Need: Find all bills mentioning "oral health"

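# API can search full text, one jurisdiction per request (Alabama here)
# (`await` must run inside an async function or a notebook cell)
params = {"q": "oral health", "jurisdiction": "AL"}
response = await client.get(f"{base_url}/bills", params=params)

# Bulk: Would need to download all bills, then search locally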
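
For comparison, the local search over a bulk file is short once the data is on disk. This is a sketch only: the column names (`bill_id`, `title`, `summary`) are assumptions, not the actual bulk CSV schema, and an in-memory string stands in for the downloaded file.

```python
import csv
import io


def search_bills(csv_file, keyword, columns=("title", "summary")):
    """Return rows whose text columns contain `keyword` (case-insensitive)."""
    needle = keyword.lower()
    matches = []
    for row in csv.DictReader(csv_file):
        # Missing or empty columns are treated as empty text.
        text = " ".join(row.get(col) or "" for col in columns)
        if needle in text.lower():
            matches.append(row)
    return matches


# Stand-in for a bulk CSV; the column names are assumed, not the real schema.
sample = io.StringIO(
    "bill_id,state,title,summary\n"
    "HB123,AL,Community Water Act,Relates to oral health programs\n"
    "SB45,AK,Road Funding,Highway maintenance\n"
    "HB9,AZ,Oral Health Study,Creates a study commission\n"
)

hits = search_bills(sample, "oral health")
print([row["bill_id"] for row in hits])
```

The trade-off is the one stated above: the search itself is cheap, but it only works after the full download has finished.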

Use Case 3: Single Bill Lookup

Need: Get details for one specific bill

# API returns a single bill in one request
response = await client.get(f"{base_url}/bills/AL/2024/HB123")

# Bulk: Would need to download the entire session for one bill

When Bulk Downloads are Better

Use Case 1: All-State Analysis

Need: Map legislation across all 50 states

API Approach:

# 50 states × 100 requests per state = 5,000 API calls
# Time: ~2 hours (with rate limiting)
# Risk: Hit API quota limit

Bulk Approach:

# Download all 50 state CSV files
# Time: ~5 minutes
# API calls: 0
# No quota concerns

Winner: Bulk (roughly 24x faster: ~5 minutes vs ~2 hours)

Use Case 2: Historical Trends

Need: Analyze fluoridation bills from 2010-2024

API Approach:

# 50 states × 15 years × 100 requests = 75,000 API calls
# Quota: exceeds the 50,000 requests/month free tier
# Cost: needs a paid plan to finish within one month

Bulk Approach:

# Download 50 states × 15 years = 750 CSV files
# Time: ~30 minutes
# Cost: Free, no limits

Winner: Bulk (only viable option)
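
The quota arithmetic can be checked before starting a crawl. A small sketch using the estimates from this section (100 requests per state per year and a 50,000-request monthly quota); `api_quota_check` is a hypothetical helper, not part of the project's scripts.

```python
import math


def api_quota_check(states, years, requests_per_state_year, monthly_quota):
    """Return (total_requests, months_of_quota_needed) for a full API crawl."""
    total = states * years * requests_per_state_year
    months = math.ceil(total / monthly_quota)
    return total, months


total, months = api_quota_check(
    states=50, years=15, requests_per_state_year=100, monthly_quota=50_000
)
print(f"{total:,} requests -> at least {months} month(s) of free-tier quota")
```

With these inputs the crawl needs more than one month of free quota, which is why the bulk route wins for multi-year history.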

Use Case 3: Offline Processing

Need: Process data without internet

API Approach:

# Must cache all API responses locally
# Complex caching logic needed
# Cache invalidation issues

Bulk Approach:

# Download once, process forever
# No internet needed after download
# Simple file-based workflow

Winner: Bulk (simpler)
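
"Download once, process forever" amounts to a check for the local file before any network call. A minimal sketch, assuming the caller supplies the download step as a function; `ensure_local` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download so the example runs offline.

```python
import tempfile
from pathlib import Path


def ensure_local(path, fetch):
    """Return the file's bytes, calling `fetch()` only when the file is missing."""
    path = Path(path)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch())
    return path.read_bytes()


calls = []


def fake_fetch():
    # Stand-in for the real download; records each call so reuse is visible.
    calls.append(1)
    return b"bill_id,state\nHB123,AL\n"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cache" / "AL_2024.csv"
    first = ensure_local(target, fake_fetch)
    second = ensure_local(target, fake_fetch)  # served from disk, no fetch

print(f"fetch called {len(calls)} time(s)")
```

There is no invalidation logic here because a finished session's bulk file does not change; that is the simplification the bulk approach buys.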


Hybrid Approach (Best of Both Worlds)

Strategy: Bulk for foundation, API for updates

# Notebook-style example: "!" runs a shell command and top-level await is allowed.
import pandas as pd
from datetime import datetime, timedelta

# 1. Download complete 2024 session (bulk)
!python scripts/bulk_legislative_download.py --year 2024 --merge

# 2. Load bulk data
df = pd.read_csv('data/cache/legislation_bulk/all_states_2024.csv')
print(f"Loaded {len(df)} bills from bulk download")

# 3. Use API for recent updates (last 7 days)
recent_cutoff = datetime.now() - timedelta(days=7)

# API search for bills updated in last week
# (`api_client` is assumed to be a configured async HTTP client)
async def get_recent_updates():
    params = {
        "updated_since": recent_cutoff.isoformat(),
        "jurisdiction": "all"
    }
    response = await api_client.get("/bills", params=params)
    # Assumes the JSON body holds the bill records under "results"
    return pd.DataFrame(response.json()["results"])

recent = await get_recent_updates()

# 4. Merge bulk + recent updates
combined = pd.concat([df, recent], ignore_index=True)

Benefits:

  • Complete historical data (bulk)
  • Real-time updates (API)
  • Minimal API calls (only recent changes)

Recommendations by Project Type

Academic Research

→ Use Bulk Downloads

  • Need complete datasets
  • Historical analysis
  • No real-time requirements
  • May publish/redistribute

News/Journalism

→ Use API

  • Need latest bill status
  • Breaking news coverage
  • Specific bill tracking
  • Real-time alerts

Advocacy Campaigns

→ Use Hybrid

  • Bulk for initial analysis
  • API for monitoring active bills
  • Alerts when bills advance
  • Historical context + real-time

Government Dashboards

→ Use Hybrid

  • Bulk for historical trends
  • API for current session
  • Daily/weekly refresh
  • Public redistribution

Cost Analysis

Free Tier Limits

API:

  • 50,000 requests/month free
  • ~100 bills per request (pagination)
  • = ~5M bill records/month max

Bulk:

  • Unlimited downloads
  • ~100K bills per download
  • = Unlimited bill records/month

Time to Download All States (2024)

API (50 states):

50 states × 100 API calls = 5,000 requests
5,000 requests × 0.5s rate limit = 2,500 seconds = ~42 minutes
(Not including processing time)

Bulk (50 states):

50 CSV downloads × 5s each = 250 seconds = ~4 minutes
(Includes all data, no processing needed)

Time Saved: ~38 minutes (10x faster)
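
The same arithmetic as a short script, so the inputs can be swapped for other workloads. The per-request and per-file timings are the estimates quoted above, not measurements, and `transfer_minutes` is a hypothetical helper.

```python
def transfer_minutes(units, seconds_per_unit):
    """Total transfer time in minutes for `units` sequential requests or files."""
    return units * seconds_per_unit / 60


api_minutes = transfer_minutes(50 * 100, 0.5)  # 5,000 requests at 0.5 s each
bulk_minutes = transfer_minutes(50, 5)         # 50 CSV files at 5 s each

print(f"API:  {api_minutes:.0f} min")
print(f"Bulk: {bulk_minutes:.0f} min")
print(f"Speedup: {api_minutes / bulk_minutes:.0f}x")
```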

Data Completeness

API:

  • Must paginate through all results
  • Risk of missing data if pagination fails
  • Requires careful error handling

Bulk:

  • Complete session in one file
  • Guaranteed completeness
  • No pagination errors
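
The "careful error handling" on the API side usually means retrying a failed page rather than skipping it. The following is a sketch under stated assumptions: `fetch_page` is a hypothetical callable that returns `(results, max_page)` and raises `ConnectionError` on a transient failure, and a fake three-page API stands in for the real one.

```python
def fetch_all_pages(fetch_page, max_retries=3):
    """Collect results from every page, retrying a failed page before giving up.

    `fetch_page(page)` must return (results, max_page) and may raise
    ConnectionError on a transient failure.
    """
    results, page, max_page = [], 1, 1
    while page <= max_page:
        for attempt in range(1, max_retries + 1):
            try:
                page_results, max_page = fetch_page(page)
                break
            except ConnectionError:
                if attempt == max_retries:
                    raise  # fail loudly instead of silently skipping a page
        results.extend(page_results)
        page += 1
    return results


# Fake three-page API that fails once on page 2, to exercise the retry path.
PAGES = {1: ["HB1", "HB2"], 2: ["HB3", "HB4"], 3: ["HB5"]}
failed_once = set()


def fake_fetch_page(page):
    if page == 2 and page not in failed_once:
        failed_once.add(page)
        raise ConnectionError("transient failure")
    return PAGES[page], len(PAGES)


bills = fetch_all_pages(fake_fetch_page)
print(bills)
```

A production version would also wait between retries; the point here is that a page is either fetched or the whole crawl fails visibly, so the dataset is never silently incomplete.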

PostgreSQL Dump Option

For power users:

# Download complete Open States database
python scripts/bulk_legislative_download.py --postgres --month 2026-04

# Restore to local PostgreSQL
pg_restore -d openstates 2026-04-public.pgdump

# Now use SQL for analysis!
psql openstates -c "
  SELECT state, COUNT(*) as bill_count
  FROM bills
  WHERE session_year = 2024
  GROUP BY state
  ORDER BY bill_count DESC;
"

Benefits:

  • Complete database with relationships
  • SQL queries for complex analysis
  • No need for Python/pandas
  • Can use PostgreSQL extensions
  • Best for large-scale research

Drawbacks:

  • Large file size (~5GB compressed)
  • Requires PostgreSQL installation
  • More complex setup

Final Recommendation

Default choice: Bulk Downloads

Reasons:

  1. Faster (10x speed improvement)
  2. No API key setup
  3. No rate limits
  4. Work offline
  5. Complete sessions guaranteed

Switch to API when:

  • Need real-time status
  • Tracking specific bills
  • Keyword search required
  • Small subset of data

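Use both when:

  • An initial bulk download can serve as the baseline
  • Periodic API updates are needed to keep that baseline current
  • You want complete history and fresh status together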