
Bulk Downloads vs API: Which to Use?

TL;DR

Use Bulk Downloads for:

✅ Historical analysis (analyzing past sessions)
✅ Map generation (need all states at once)
✅ Research projects (large datasets)
✅ Offline processing
✅ Multi-issue tracking across all states

Use API for:

✅ Real-time bill status (same-day updates)
✅ Search by specific keywords
✅ Individual bill lookups
✅ Automated alerts for bill changes

Comparison Table

Feature	Bulk Download	API
Speed (50 states)	⚡ 5-10 minutes	🐌 2-4 hours
API Key Required	❌ No	✅ Yes
Rate Limits	❌ None	⚠️ 50K requests/month
Internet Required	Download once	Always
Data Freshness	Monthly updates	Real-time
Bill Text	✅ Full text (JSON)	✅ Via API
Complete Sessions	✅ All bills	Paginated
Cost	💰 Free	💰 Free (up to 50K requests/month)
Redistribution	✅ Allowed	⚠️ Varies by state

Real-World Example

Task: Create fluoridation legislation map for all 50 states (2024)

Method 1: Bulk Download

# Download all 50 states
python scripts/bulk_legislative_download.py --year 2024 --format csv --merge

# Time: ~5 minutes
# API calls: 0
# Result: 1 CSV file with ALL bills

Result: One 500MB file with ~100,000 bills from all states
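
Because the merged file holds every bill, the fluoridation map still needs a local filtering step. The following is a minimal sketch of that step, not the project's actual script. It assumes the merged CSV has `state` and `title` columns (hypothetical names; check the real header), and it streams the file row by row so a 500MB download does not have to fit in memory.

```python
import csv
import io
from collections import Counter

def count_matching_bills(csv_file, keyword, text_column="title", state_column="state"):
    """Count bills per state whose text column mentions the keyword.

    Column names are assumptions; adjust them to the merged CSV's real header.
    """
    keyword = keyword.lower()
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if keyword in (row.get(text_column) or "").lower():
            counts[row.get(state_column) or "unknown"] += 1
    return counts

# Tiny in-memory stand-in for the merged bulk file
sample = io.StringIO(
    "state,bill_id,title\n"
    "AL,HB123,Community water fluoridation standards\n"
    "AL,HB200,Highway funding\n"
    "TX,SB9,Fluoridation reporting requirements\n"
)
per_state = count_matching_bills(sample, "fluoridation")
print(dict(per_state))
```

With the real download, pass an open file handle instead of the sample, for example `open(path, newline="", encoding="utf-8")`.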

Method 2: API

# Search each state individually
python scripts/legislative_tracker.py --issue fluoridation --year 2024

# Time: ~2-4 hours
# API calls: ~10,000 (search + pagination)
# Result: Filtered bills matching "fluoridation"

Result: Filtered dataset with ~500 matching bills

When API is Better

Use Case 1: Real-Time Bill Tracking

Need: Alert when a specific bill status changes

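# API can check the latest status
# (assumes `client` is an httpx.AsyncClient-style object and `base_url` is the API root)
async def check_bill_status(bill_id):
    response = await client.get(f"{base_url}/bills/{bill_id}")
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return response.json()['latest_action']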

# Bulk: Would need to wait for next monthly dump
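
An alert then comes down to comparing the newest status with the last one seen. The sketch below shows one possible polling loop under that assumption. `fetch_status` is a hypothetical stand-in for any async function that returns the bill's latest action (such as `check_bill_status`), and a fake fetcher replaces the network call so the example runs on its own.

```python
import asyncio

async def poll_for_change(fetch_status, bill_id, last_seen, interval_s=0.0, max_polls=3):
    """Poll until the status differs from last_seen or max_polls is reached.

    fetch_status is any async callable returning the bill's latest action.
    Returns the new status, or None if nothing changed.
    """
    for _ in range(max_polls):
        current = await fetch_status(bill_id)
        if current != last_seen:
            return current
        await asyncio.sleep(interval_s)
    return None

# Fake fetcher that reports a change on the third call
_responses = iter(["Introduced", "Introduced", "Passed House"])

async def fake_fetch(bill_id):
    return next(_responses)

changed_to = asyncio.run(poll_for_change(fake_fetch, "HB123", last_seen="Introduced"))
print(changed_to)
```

In practice the interval would be minutes or hours rather than zero, since every poll counts against the monthly request quota.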

Use Case 2: Keyword Search

Need: Find all bills mentioning "oral health"

# API can search full text
# (top-level `await` works in a notebook; elsewhere, wrap this in an async function)
params = {"q": "oral health", "jurisdiction": "AL"}
response = await client.get(f"{base_url}/bills", params=params)

# Bulk: Would need to download all bills, then search locally

Use Case 3: Single Bill Lookup

Need: Get details for one specific bill

# API is instant
response = await client.get(f"{base_url}/bills/AL/2024/HB123")

# Bulk: Download entire session just for one bill

When Bulk Downloads are Better

Use Case 1: All-State Analysis

Need: Map legislation across all 50 states

API Approach:

# 50 states × 100 requests per state = 5,000 API calls
# Time: ~2 hours (with rate limiting)
# Risk: Hit API quota limit

Bulk Approach:

# Download all 50 state CSV files
# Time: ~5 minutes
# API calls: 0
# No quota concerns

Winner: Bulk (roughly 24x faster at ~2 hours vs ~5 minutes)

Use Case 2: Historical Trends

Need: Analyze fluoridation bills from 2010-2024

API Approach:

# 50 states × 15 years × 100 requests = 75,000 API calls
# Quota: 75,000 calls exceeds the 50,000 requests/month free tier
# Cost: Need paid plan

Bulk Approach:

# Download 50 states × 15 years = 750 CSV files
# Time: ~30 minutes
# Cost: Free, no limits

Winner: Bulk (only viable option)
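
The quota arithmetic behind this comparison is easy to check before starting a large job. This is a rough sketch that reuses this document's own estimate of 100 requests per state per year and its 50,000 requests/month free tier. The function names are illustrative.

```python
FREE_TIER_REQUESTS_PER_MONTH = 50_000  # free-tier limit quoted in this document

def estimate_api_calls(states, years, requests_per_state_year=100):
    """Rough request count: one block of paginated requests per state per year."""
    return states * years * requests_per_state_year

def months_needed(total_calls, monthly_quota=FREE_TIER_REQUESTS_PER_MONTH):
    """Whole months of free quota needed (ceiling division)."""
    return -(-total_calls // monthly_quota)

historical_calls = estimate_api_calls(states=50, years=15)
print(historical_calls, months_needed(historical_calls))
```

By this estimate the 2010-2024 job cannot finish within a single month of free quota, which is the constraint the comparison above relies on.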

Use Case 3: Offline Processing

Need: Process data without internet

API Approach:

# Must cache all API responses locally
# Complex caching logic needed
# Cache invalidation issues

Bulk Approach:

# Download once, process forever
# No internet needed after download
# Simple file-based workflow

Winner: Bulk (simpler)
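
To show what "caching logic" involves on the API side, here is a minimal sketch of a file-based cache with an age limit. It is only one possible design. `cached_fetch` and the cache key format are hypothetical, and a fake API function stands in for the real request.

```python
import json
import tempfile
import time
from pathlib import Path

def cached_fetch(cache_dir, key, fetch, max_age_s=86_400, now=time.time):
    """Return cached JSON for key if it is fresh, otherwise call fetch() and store it."""
    path = Path(cache_dir) / f"{key}.json"
    if path.exists() and now() - path.stat().st_mtime < max_age_s:
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch()
    path.write_text(json.dumps(data), encoding="utf-8")
    return data

calls = []

def fake_api():
    calls.append(1)  # record each simulated network request
    return {"bill_id": "HB123", "latest_action": "Introduced"}

with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch(tmp, "AL-2024-HB123", fake_api)
    second = cached_fetch(tmp, "AL-2024-HB123", fake_api)  # served from disk

print(len(calls))
```

Even this small version has to choose a cache key, a file layout, and an expiry rule. The bulk workflow avoids all three.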

Hybrid Approach (Best of Both Worlds)

Strategy: Bulk for foundation, API for updates

# Notebook-style example: the leading "!" runs a shell command in Jupyter

# 1. Download complete 2024 session (bulk)
!python scripts/bulk_legislative_download.py --year 2024 --merge

# 2. Load bulk data
import pandas as pd
from datetime import datetime, timedelta

df = pd.read_csv('data/cache/legislation_bulk/all_states_2024.csv')
print(f"Loaded {len(df)} bills from bulk download")

# 3. Use API for recent updates (last 7 days)
recent_cutoff = datetime.now() - timedelta(days=7)

# API search for bills updated in last week
# (assumes `api_client.get` returns a list of bill records as dicts)
async def get_recent_updates():
    params = {
        "updated_since": recent_cutoff.isoformat(),
        "jurisdiction": "all"
    }
    return await api_client.get("/bills", params=params)

recent = await get_recent_updates()

# 4. Merge bulk + recent updates
# pd.concat only accepts pandas objects, so convert the API records first
recent_df = pd.DataFrame(recent)
combined = pd.concat([df, recent_df], ignore_index=True)

Benefits:

Complete historical data (bulk)
Real-time updates (API)
Minimal API calls (only recent changes)
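
One detail the merge step leaves open: a bill that appears in both the bulk file and the recent API results shows up twice after a plain concatenation. A minimal sketch of one way to handle this, assuming both sources share a `bill_id` column (a hypothetical name), is to keep the last occurrence so the API row replaces the stale bulk row.

```python
import pandas as pd

def merge_bulk_and_recent(bulk, recent, key="bill_id"):
    """Append recent rows to bulk rows, keeping the newest row for each key."""
    combined = pd.concat([bulk, recent], ignore_index=True)
    # Recent rows come last, so keep="last" lets them override bulk rows
    return combined.drop_duplicates(subset=key, keep="last").reset_index(drop=True)

bulk = pd.DataFrame(
    {"bill_id": ["AL-HB1", "AL-HB2"], "status": ["Introduced", "Introduced"]}
)
recent = pd.DataFrame(
    {"bill_id": ["AL-HB2", "AL-HB3"], "status": ["Passed House", "Introduced"]}
)
merged = merge_bulk_and_recent(bulk, recent)
print(merged)
```

This relies on the recent rows being concatenated after the bulk rows. If the two sources use different column names, they need to be aligned first.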

Recommendations by Project Type

Academic Research

→ Use Bulk Downloads

Need complete datasets
Historical analysis
No real-time requirements
May publish/redistribute

News/Journalism

→ Use API

Need latest bill status
Breaking news coverage
Specific bill tracking
Real-time alerts

Advocacy Campaigns

→ Use Hybrid

Bulk for initial analysis
API for monitoring active bills
Alerts when bills advance
Historical context + real-time

Government Dashboards

→ Use Hybrid

Bulk for historical trends
API for current session
Daily/weekly refresh
Public redistribution

Cost Analysis

Free Tier Limits

API:

50,000 requests/month free
~100 bills per request (pagination)
= ~5M bill records/month max

Bulk:

Unlimited downloads
~100K bills per download
= Unlimited bill records/month

Time to Download All States (2024)

API (50 states):

50 states × 100 API calls = 5,000 requests
5,000 requests × 0.5s rate limit = 2,500 seconds = ~42 minutes
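(Request pacing only: excludes network latency, retries, and processing time, which likely account for the longer 2-4 hour end-to-end estimates above)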

Bulk (50 states):

50 CSV downloads × 5s each = 250 seconds = ~4 minutes
(Includes all data, no processing needed)

Time Saved: ~38 minutes (10x faster)
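
The same arithmetic as a short script, so the inputs can be changed for a different number of states or a different rate limit. The per-request and per-download times are this document's estimates, not measurements.

```python
def estimate_minutes(request_count, seconds_per_request):
    """Total wall-clock minutes if requests run one after another."""
    return request_count * seconds_per_request / 60

api_minutes = estimate_minutes(50 * 100, 0.5)  # 5,000 paced API requests
bulk_minutes = estimate_minutes(50, 5)         # 50 CSV downloads at 5s each
speedup = api_minutes / bulk_minutes

print(round(api_minutes), round(bulk_minutes), round(speedup))
```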

Data Completeness

API:

Must paginate through all results
Risk of missing data if pagination fails
Requires careful error handling

Bulk:

Complete session in one file
Guaranteed completeness
No pagination errors

PostgreSQL Dump Option

For power users:

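# Download complete Open States database
python scripts/bulk_legislative_download.py --postgres --month 2026-04

# Create the target database, then restore the dump into it
# (pg_restore -d needs the database to exist already)
createdb openstates
pg_restore -d openstates 2026-04-public.pgdump

# Now use SQL for analysis!
# (table and column names below are illustrative; run \dt in psql to see the restored schema)
psql openstates -c "
  SELECT state, COUNT(*) AS bill_count
  FROM bills
  WHERE session_year = 2024
  GROUP BY state
  ORDER BY bill_count DESC;
"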

Benefits:

Complete database with relationships
SQL queries for complex analysis
No need for Python/pandas
Can use PostgreSQL extensions
Best for large-scale research

Drawbacks:

Large file size (~5GB compressed)
Requires PostgreSQL installation
More complex setup

Final Recommendation

Default choice: Bulk Downloads

Reasons:

Faster (10x or more by the estimates above)
No API key setup
No rate limits
Work offline
Complete sessions guaranteed

Switch to API when:

Need real-time status
Tracking specific bills
Keyword search required
Small subset of data

Use Both when:

Initial bulk download
Periodic API updates
Best of both worlds