	# Bulk Downloads vs API: Which to Use?

	## TL;DR

	Use Bulk Downloads for:
	- ✅ Historical analysis (analyzing past sessions)
	- ✅ Map generation (need all states at once)
	- ✅ Research projects (large datasets)
	- ✅ Offline processing
	- ✅ Multi-issue tracking across all states

	Use API for:
	- ✅ Real-time bill status (same-day updates)
	- ✅ Search by specific keywords
	- ✅ Individual bill lookups
	- ✅ Automated alerts for bill changes

	---

	## Comparison Table

	| Feature | Bulk Download | API |
	|---------|---------------|-----|
	| Speed (50 states) | ⚡ 5-10 minutes | 🐌 2-4 hours |
	| API Key Required | ❌ No | ✅ Yes |
	| Rate Limits | ❌ None | ⚠️ 50K/month |
	| Internet Required | Download once | Always |
	| Data Freshness | Monthly updates | Real-time |
	| Bill Text | ✅ Full text (JSON) | ✅ Via API |
	| Complete Sessions | ✅ All bills | Paginated |
	| Cost | 💰 Free | 💰 Free (50K limit) |
	| Redistribution | ✅ Allowed | ⚠️ Varies by state |

	---

	## Real-World Example

	### Task: Create fluoridation legislation map for all 50 states (2024)

	#### Method 1: Bulk Download

	```bash
	# Download all 50 states
	python scripts/bulk_legislative_download.py --year 2024 --format csv --merge

	# Time: ~5 minutes
	# API calls: 0
	# Result: 1 CSV file with ALL bills
	```

	Result: One 500MB file with ~100,000 bills from all states

	#### Method 2: API

	```bash
	# Search each state individually
	python scripts/legislative_tracker.py --issue fluoridation --year 2024

	# Time: ~2-4 hours
	# API calls: ~10,000 (search + pagination)
	# Result: Filtered bills matching "fluoridation"
	```

	Result: Filtered dataset with ~500 matching bills

	---

	## When API is Better

	### Use Case 1: Real-Time Bill Tracking

	Need: Alert when a specific bill status changes

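	```python
	# `client` is assumed to be an async HTTP client (e.g. httpx.AsyncClient)
	# and `base_url` the API root; both are defined elsewhere.
	async def check_bill_status(bill_id):
	    """Return the most recent action recorded for a bill."""
	    response = await client.get(f"{base_url}/bills/{bill_id}")
	    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
	    return response.json()['latest_action']

	# Bulk: Would need to wait for next monthly dump
	```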
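
An alert has to remember the last action it saw for each bill and compare it with the newest one. A minimal sketch of that comparison, assuming each poll yields a `latest_action` value per bill as in the snippet above. The function name `detect_status_changes` and the sample snapshots are hypothetical, and the fetching itself is left out:

```python
def detect_status_changes(previous, current):
    """Return {bill_id: (old_action, new_action)} for bills whose action changed.

    Both arguments map bill_id -> latest_action string. Bills that appear
    only in `current` are reported with an old action of None.
    """
    changes = {}
    for bill_id, new_action in current.items():
        old_action = previous.get(bill_id)
        if old_action != new_action:
            changes[bill_id] = (old_action, new_action)
    return changes


# Hypothetical snapshots from two polling runs
yesterday = {"HB123": "Referred to committee", "SB45": "Introduced"}
today = {"HB123": "Passed House", "SB45": "Introduced", "HB200": "Introduced"}

for bill_id, (old, new) in detect_status_changes(yesterday, today).items():
    print(f"{bill_id}: {old!r} -> {new!r}")
```

Storing the previous snapshot on disk between runs is enough for a daily cron job. A monthly bulk dump cannot support this kind of check.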

	### Use Case 2: Keyword Search

	Need: Find all bills mentioning "oral health"

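	```python
	# API can search full text
	# (`client` and `base_url` are assumed to be defined elsewhere)
	async def search_bills(query, jurisdiction):
	    params = {"q": query, "jurisdiction": jurisdiction}
	    response = await client.get(f"{base_url}/bills", params=params)
	    return response.json()

	# e.g. await search_bills("oral health", "AL")
	# Bulk: Would need to download all bills, then search locally
	```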
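
For comparison, the local search on the bulk side is short once the file is on disk. A rough sketch using only the standard library. The column names (`state`, `bill_id`, `title`) and the helper name `search_bulk_csv` are assumptions for illustration and may not match the real bulk CSV headers:

```python
import csv
import io


def search_bulk_csv(csv_file, keyword, column="title"):
    """Yield rows whose `column` contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    for row in csv.DictReader(csv_file):
        if needle in (row.get(column) or "").lower():
            yield row


# Tiny in-memory stand-in for a bulk CSV.
# A real file would be opened with open(path, newline="", encoding="utf-8").
sample = io.StringIO(
    "state,bill_id,title\n"
    "AL,HB123,Oral Health Access Act\n"
    "AL,SB9,Highway funding\n"
    "GA,HB77,Community water fluoridation and oral health\n"
)

matches = list(search_bulk_csv(sample, "oral health"))
print([m["bill_id"] for m in matches])
```

Because the reader streams one row at a time, memory use stays flat even on a merged all-state file.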

	### Use Case 3: Single Bill Lookup

	Need: Get details for one specific bill

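	```python
	# API is instant: one request returns one bill
	# (`client` and `base_url` are assumed to be defined elsewhere)
	async def get_bill(state, session, identifier):
	    response = await client.get(f"{base_url}/bills/{state}/{session}/{identifier}")
	    return response.json()

	# e.g. await get_bill("AL", "2024", "HB123")
	# Bulk: Download entire session just for one bill
	```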

	---

	## When Bulk Downloads are Better

	### Use Case 1: All-State Analysis

	Need: Map legislation across all 50 states

	API Approach:
	```python
	# 50 states × 100 requests per state = 5,000 API calls
	# Time: ~2 hours (with rate limiting)
	# Risk: Hit API quota limit
	```

	Bulk Approach:
	```python
	# Download all 50 state CSV files
	# Time: ~5 minutes
	# API calls: 0
	# No quota concerns
	```

	Winner: Bulk (about 24x faster by the estimates above: ~2 hours vs. ~5 minutes)

	### Use Case 2: Historical Trends

	Need: Analyze fluoridation bills from 2010-2024

	API Approach:
	```python
	# 50 states × 15 years × 100 requests = 75,000 API calls
	# Quota: Exceeds the 50,000 requests/month free tier
	# Cost: Need paid plan (or spread the work over two months)
	```

	Bulk Approach:
	```python
	# Download 50 states × 15 years = 750 CSV files
	# Time: ~30 minutes
	# Cost: Free, no limits
	```

	Winner: Bulk (only viable option)
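
The quota arithmetic above is easy to rerun for other year ranges. A small sketch, using the 50,000 requests/month free tier and the 100 requests per state per year quoted in this document. The function names are illustrative only:

```python
FREE_TIER_MONTHLY_QUOTA = 50_000  # figure quoted in this document


def estimate_api_calls(states, years, requests_per_state_year=100):
    """Estimated requests needed to page through every state for every year."""
    return states * years * requests_per_state_year


def months_needed(total_calls, monthly_quota=FREE_TIER_MONTHLY_QUOTA):
    """Whole months required to stay within the monthly quota (ceiling division)."""
    return -(-total_calls // monthly_quota)


calls = estimate_api_calls(states=50, years=15)
print(f"{calls:,} calls, over quota: {calls > FREE_TIER_MONTHLY_QUOTA}, "
      f"months on free tier: {months_needed(calls)}")
```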

	### Use Case 3: Offline Processing

	Need: Process data without internet

	API Approach:
	```python
	# Must cache all API responses locally
	# Complex caching logic needed
	# Cache invalidation issues
	```

	Bulk Approach:
	```python
	# Download once, process forever
	# No internet needed after download
	# Simple file-based workflow
	```

	Winner: Bulk (simpler)
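
As an illustration of the file-based workflow, the sketch below counts bills per state from a local CSV with no network access. It assumes a `state` column, which may be named differently in the real bulk files, and uses an in-memory sample in place of a downloaded file:

```python
import csv
import io
from collections import Counter


def count_bills_by_state(csv_file, state_column="state"):
    """Count rows per state in an already-downloaded bulk CSV."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[row[state_column]] += 1
    return counts


# In-memory stand-in for a downloaded file
sample = io.StringIO("state,bill_id\nAL,HB1\nAL,HB2\nGA,SB1\n")
counts = count_bills_by_state(sample)
print(counts.most_common())
```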

	---

	## Hybrid Approach (Best of Both Worlds)

	### Strategy: Bulk for foundation, API for updates

	```python
	# Notebook-style example: `!` shell commands and top-level `await`
	# work in Jupyter/IPython, not in a plain Python script.
	from datetime import datetime, timedelta

	import pandas as pd

	# 1. Download complete 2024 session (bulk)
	!python scripts/bulk_legislative_download.py --year 2024 --merge

	# 2. Load bulk data
	df = pd.read_csv('data/cache/legislation_bulk/all_states_2024.csv')
	print(f"Loaded {len(df)} bills from bulk download")

	# 3. Use API for recent updates (last 7 days)
	recent_cutoff = datetime.now() - timedelta(days=7)

	# API search for bills updated in last week
	# (`api_client` is the project's async API client, defined elsewhere)
	async def get_recent_updates():
	    params = {
	        "updated_since": recent_cutoff.isoformat(),
	        "jurisdiction": "all"
	    }
	    return await api_client.get("/bills", params=params)

	# pd.concat needs DataFrames; this assumes the client returns a list of bill records
	recent = pd.DataFrame(await get_recent_updates())

	# 4. Merge bulk + recent updates
	combined = pd.concat([df, recent], ignore_index=True)
	```

	Benefits:
	- Complete historical data (bulk)
	- Real-time updates (API)
	- Minimal API calls (only recent changes)
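
One detail the merge step needs: concatenating bulk data and API updates leaves two rows for any bill that appears in both. A minimal sketch of a merge that keeps only the newest copy, written with plain dicts so it runs without pandas. The field names `bill_id` and `updated_at` are assumptions, not confirmed column names:

```python
def merge_bills(bulk_rows, recent_rows, key="bill_id", updated_field="updated_at"):
    """Merge two lists of bill dicts, keeping the most recently updated copy of each."""
    merged = {row[key]: row for row in bulk_rows}
    for row in recent_rows:
        existing = merged.get(row[key])
        # ISO 8601 timestamps in the same format and timezone sort correctly as strings
        if existing is None or row[updated_field] >= existing[updated_field]:
            merged[row[key]] = row
    return list(merged.values())


# Hypothetical records
bulk = [
    {"bill_id": "AL-HB123", "status": "Introduced", "updated_at": "2024-03-01T00:00:00"},
    {"bill_id": "GA-SB9", "status": "Signed", "updated_at": "2024-02-15T00:00:00"},
]
recent = [
    {"bill_id": "AL-HB123", "status": "Passed House", "updated_at": "2024-04-10T00:00:00"},
    {"bill_id": "TX-HB5", "status": "Introduced", "updated_at": "2024-04-11T00:00:00"},
]

combined_rows = merge_bills(bulk, recent)
print(len(combined_rows), [r["status"] for r in combined_rows])
```

With pandas, the same effect can be had by sorting on the timestamp column and calling `drop_duplicates(subset=..., keep="last")` after the concat.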

	---

	## Recommendations by Project Type

	### Academic Research
	→ Use Bulk Downloads
	- Need complete datasets
	- Historical analysis
	- No real-time requirements
	- May publish/redistribute

	### News/Journalism
	→ Use API
	- Need latest bill status
	- Breaking news coverage
	- Specific bill tracking
	- Real-time alerts

	### Advocacy Campaigns
	→ Use Hybrid
	- Bulk for initial analysis
	- API for monitoring active bills
	- Alerts when bills advance
	- Historical context + real-time

	### Government Dashboards
	→ Use Hybrid
	- Bulk for historical trends
	- API for current session
	- Daily/weekly refresh
	- Public redistribution

	---

	## Cost Analysis

	### Free Tier Limits

	API:
	- 50,000 requests/month free
	- ~100 bills per request (pagination)
	- = ~5M bill records/month max

	Bulk:
	- Unlimited downloads
	- ~100K bills per download
	- = Unlimited bill records/month

	### Time to Download All States (2024)

	API (50 states):
	```
	50 states × 100 API calls = 5,000 requests
	5,000 requests × 0.5s rate limit = 2,500 seconds = ~42 minutes
	(Not including processing time)
	```

	Bulk (50 states):
	```
	50 CSV downloads × 5s each = 250 seconds = ~4 minutes
	(Includes all data, no processing needed)
	```

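	Time Saved: ~38 minutes (10x faster), counting request time only. The 2-4 hour API estimates elsewhere in this document presumably also cover the processing time excluded here.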
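
The same estimate as a small script, so the inputs can be swapped for measured values. The per-request figures (0.5 s between API calls, 5 s per CSV download) come from the calculation above, and the function name is illustrative:

```python
def sequential_seconds(request_count, seconds_per_request):
    """Total wall-clock time when requests run one after another."""
    return request_count * seconds_per_request


api_seconds = sequential_seconds(50 * 100, 0.5)  # 50 states x 100 calls each
bulk_seconds = sequential_seconds(50, 5)         # 50 CSV downloads

print(f"API:  {api_seconds / 60:.1f} min")
print(f"Bulk: {bulk_seconds / 60:.1f} min")
print(f"Speedup: {api_seconds / bulk_seconds:.0f}x")
```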

	### Data Completeness

	API:
	- Must paginate through all results
	- Risk of missing data if pagination fails
	- Requires careful error handling

	Bulk:
	- Complete session in one file
	- Guaranteed completeness
	- No pagination errors
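
The "careful error handling" on the API side usually means retrying a failed page and refusing to return a partial result. A hedged sketch of that loop: `fetch_page` is a hypothetical caller-supplied function returning `(results, max_page)`, not part of any real client, and the fake API below exists only to exercise the retry path:

```python
import time


def fetch_all_pages(fetch_page, max_retries=3, backoff_seconds=1.0):
    """Collect results from every page, retrying a failed page before giving up."""
    results = []
    page = 1
    while True:
        for attempt in range(1, max_retries + 1):
            try:
                page_results, max_page = fetch_page(page)
                break
            except ConnectionError:
                if attempt == max_retries:
                    raise  # fail loudly rather than return a partial dataset
                time.sleep(backoff_seconds * attempt)  # wait longer after each failure
        results.extend(page_results)
        if page >= max_page:
            return results
        page += 1


# Fake three-page API that fails once on page 2
_pages = {1: ["HB1", "HB2"], 2: ["HB3", "HB4"], 3: ["HB5"]}
_failed_once = set()


def fake_fetch(page):
    if page == 2 and page not in _failed_once:
        _failed_once.add(page)
        raise ConnectionError("simulated timeout")
    return _pages[page], len(_pages)


bills = fetch_all_pages(fake_fetch, backoff_seconds=0)
print(bills)
```

Raising after the last retry is deliberate: a silently truncated dataset is harder to notice than a failed run.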

	---

	## PostgreSQL Dump Option

	For power users:

	```bash
	# Download complete Open States database
	python scripts/bulk_legislative_download.py --postgres --month 2026-04

	# Restore to local PostgreSQL (the target database must already exist)
	createdb openstates
	pg_restore -d openstates 2026-04-public.pgdump

	# Now use SQL for analysis!
	# Table and column names below are illustrative; list the real ones with:
	#   psql openstates -c '\dt'
	psql openstates -c "
	SELECT state, COUNT(*) AS bill_count
	FROM bills
	WHERE session_year = 2024
	GROUP BY state
	ORDER BY bill_count DESC;
	"
	```

	Benefits:
	- Complete database with relationships
	- SQL queries for complex analysis
	- No need for Python/pandas
	- Can use PostgreSQL extensions
	- Best for large-scale research

	Drawbacks:
	- Large file size (~5GB compressed)
	- Requires PostgreSQL installation
	- More complex setup

	---

	## Final Recommendation

	Default choice: Bulk Downloads

	Reasons:
	1. Faster (10x or more for all-state downloads)
	2. No API key setup
	3. No rate limits
	4. Work offline
	5. Complete sessions guaranteed

	Switch to API when:
	- Need real-time status
	- Tracking specific bills
	- Keyword search required
	- Small subset of data

	Use Both when:
	- Initial bulk download
	- Periodic API updates
	- Best of both worlds