OpenStates Data Scripts

Scripts for working with OpenStates legislative data.

Data Source

Scripts

Download & Import

  • bulk_legislative_download.py - Download OpenStates bulk data dumps (PostgreSQL or CSV)
  • download_documents.py - Download the 4.5M+ bill documents referenced in the PostgreSQL database
  • load_openstates_csv.sh - Load CSV exports into database
  • load_openstates_people.py - Load legislator data from GitHub repo

Schema & Export

  • create_openstates_schema.py - Create PostgreSQL schema for OpenStates data
  • export_openstates_to_gold.py - Export from PostgreSQL to Gold Parquet files

Processing

  • aggregate_bills_from_postgres.py - Aggregate bill statistics by state/topic
  • legislative_tracker.py - Track legislative activity

Usage Examples

# Download April 2026 bulk data to PostgreSQL
python bulk_legislative_download.py --postgres --month 2026-04

# Download bill documents from database (2025 only)
python download_documents.py --years 2025 --type documents

# Download all 2024-2025 documents, resuming if interrupted
python download_documents.py --years 2024,2025 --resume

# Test document download with limit
python download_documents.py --limit 100 --dry-run

# Create database schema
python create_openstates_schema.py

# Load legislator data
python load_openstates_people.py

# Export to Gold format
python export_openstates_to_gold.py

# Aggregate bill statistics
python aggregate_bills_from_postgres.py
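
The download_documents.py flags used above (--years, --type, --limit, --resume, --dry-run) could be parsed along the following lines. This is only a sketch with argparse, not the script's actual code: the helper names parse_years and build_parser, the defaults, and the help texts are assumptions made for illustration.

```python
import argparse


def parse_years(value: str) -> list[int]:
    """Turn a comma-separated string such as '2024,2025' into [2024, 2025]."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Flag names come from the usage examples; everything else is assumed.
    parser = argparse.ArgumentParser(prog="download_documents.py")
    parser.add_argument("--years", type=parse_years, default=None,
                        help="comma-separated years; omit for all years")
    parser.add_argument("--type", dest="doc_type", default=None,
                        help="document type to download, e.g. 'documents'")
    parser.add_argument("--limit", type=int, default=None,
                        help="stop after this many documents")
    parser.add_argument("--resume", action="store_true",
                        help="continue an interrupted download")
    parser.add_argument("--dry-run", action="store_true",
                        help="test run")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--years", "2024,2025", "--resume"])
    print(args.years, args.resume, args.dry_run)
```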

Document Downloader Details

The download_documents.py script downloads actual bill documents (PDFs, Word docs, etc.) from state legislature websites.

Database Requirements:

  • PostgreSQL with OpenStates data (from bulk_legislative_download.py)
  • Default: postgresql://postgres:password@localhost:5433/openstates
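
Note that the default URL uses port 5433, not PostgreSQL's standard 5432, which is easy to miss when pointing the script at another server. As a quick check, the URL can be split into its parts with the standard library. The helper name describe_dsn is hypothetical and not part of these scripts.

```python
from urllib.parse import urlsplit

DEFAULT_DSN = "postgresql://postgres:password@localhost:5433/openstates"


def describe_dsn(dsn: str) -> dict:
    """Split a PostgreSQL connection URL into its components."""
    parts = urlsplit(dsn)
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    print(describe_dsn(DEFAULT_DSN))
```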

Document Types:

  • Bill Versions (3.5M): Text of bills as introduced, amended, enrolled, etc.
  • Bill Documents (1M): Fiscal notes, committee statements, amendments, etc.

Features:

  • Organizes by year: /mnt/d/openstates_documents/documents/2025/
  • Snake_case filenames: hb_1234_fiscal_note_introduced.pdf
  • Resume capability with JSON logging
  • Progress updates every 1,000 files
  • Rate limiting (0.1s between requests)
  • Handles 403/404 errors gracefully
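
The resume, rate-limiting, and error-handling features listed above can be combined in one loop. The following is a minimal sketch of that idea, not the script's actual implementation: the log layout ("done" and "failed" keys), the function names, and the injectable fetch parameter are all assumptions made for illustration.

```python
import json
import time
import urllib.error
import urllib.request
from pathlib import Path

REQUEST_DELAY = 0.1  # seconds between requests, as in the feature list
SAVE_EVERY = 100     # write the log to disk every N processed URLs


def load_log(log_path: Path) -> dict:
    """Return the saved log, or an empty one on the first run."""
    if log_path.exists():
        return json.loads(log_path.read_text())
    return {"done": [], "failed": {}}


def save_log(log_path: Path, log: dict, done: set) -> None:
    log["done"] = sorted(done)
    log_path.write_text(json.dumps(log))


def fetch(url: str) -> bytes:
    """Download one URL; urlopen raises HTTPError on 403/404."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(jobs, log_path: Path, fetch=fetch, delay=REQUEST_DELAY) -> dict:
    """Process (url, destination Path) pairs, skipping URLs already logged."""
    log = load_log(log_path)
    done = set(log["done"])
    since_save = 0
    for url, dest in jobs:
        if url in done:
            continue  # resume: finished in an earlier run
        try:
            data = fetch(url)
        except urllib.error.HTTPError as err:
            log["failed"][url] = err.code  # record the failure and keep going
        else:
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(data)
            done.add(url)
        since_save += 1
        if since_save >= SAVE_EVERY:
            save_log(log_path, log, done)
            since_save = 0
        time.sleep(delay)  # simple fixed-delay rate limiting
    save_log(log_path, log, done)
    return log
```

Passing fetch as a parameter keeps the loop testable without network access. Failed URLs stay out of the "done" list, so a later run retries them.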

Output Structure:

/mnt/d/openstates_documents/
├── versions/           # Bill version texts
│   ├── 2017/
│   ├── 2025/
│   └── ...
├── documents/          # Supporting documents
│   ├── 2025/
│   └── ...
└── download_log.json   # Progress tracking

Recommended Usage:

# Start with recent year (testing)
python download_documents.py --years 2025 --type documents --limit 1000

# Download all 2024-2025 documents
python download_documents.py --years 2024,2025 --resume

# Full download (WARNING: 4.5M files, ~2TB, takes days!)
python download_documents.py --resume

Important Notes:

  • 4.5 million documents will take several days to download
  • Estimated size: ~2TB (average 500KB per document)
  • Use --resume to continue interrupted downloads
  • Progress saved every 100 downloads
  • Failed downloads logged separately
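
The size and duration figures above can be checked with quick arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a single sequential downloader, and counts only the 0.1 s delay between requests. Actual transfer time comes on top of that, so the result is a lower bound.

```python
DOCUMENTS = 4_500_000  # 3.5M bill versions + 1M bill documents
AVG_SIZE_KB = 500      # average document size from the notes above
DELAY_S = 0.1          # rate-limit delay between requests

# Total size in terabytes (decimal units).
total_tb = DOCUMENTS * AVG_SIZE_KB * 1_000 / 1e12

# Time spent waiting on the rate limit alone, in days.
min_days = DOCUMENTS * DELAY_S / 86_400

print(f"{total_tb:.2f} TB")    # 2.25 TB
print(f"{min_days:.1f} days")  # 5.2 days
```

The rate limit alone accounts for about five days, which fits the "several days" estimate.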

Data License

OpenStates data is available under various open licenses. See https://openstates.org/data/ for details.