Spaces:
Running on CPU Upgrade
Running on CPU Upgrade
| # OpenStates Data Scripts | |
| Scripts for working with [OpenStates](https://openstates.org/) legislative data. | |
| ## Data Source | |
| - **Website**: https://openstates.org/ | |
| - **API Docs**: https://docs.openstates.org/api-v3/ | |
| - **Bulk Data**: https://openstates.org/data/ | |
| - **Coverage**: 52 US states and territories | |
| - **Data Types**: Bills, legislators, votes, committees | |
| ## Scripts | |
| ### Download & Import | |
| - `bulk_legislative_download.py` - Download OpenStates bulk data dumps (PostgreSQL or CSV) | |
| - `download_documents.py` - Download 4.5M+ bill documents from PostgreSQL database | |
| - `load_openstates_csv.sh` - Load CSV exports into database | |
| - `load_openstates_people.py` - Load legislator data from GitHub repo | |
| ### Schema & Export | |
| - `create_openstates_schema.py` - Create PostgreSQL schema for OpenStates data | |
| - `export_openstates_to_gold.py` - Export from PostgreSQL to Gold Parquet files | |
| ### Processing | |
| - `aggregate_bills_from_postgres.py` - Aggregate bill statistics by state/topic | |
| - `legislative_tracker.py` - Track legislative activity | |
| ## Usage Examples | |
| ```bash | |
| # Download April 2026 bulk data to PostgreSQL | |
| python bulk_legislative_download.py --postgres --month 2026-04 | |
| # Download bill documents from database (2025 only) | |
| python download_documents.py --years 2025 --type documents | |
| # Download all documents for recent years | |
| python download_documents.py --years 2024,2025 --resume | |
| # Test document download with limit | |
| python download_documents.py --limit 100 --dry-run | |
| # Create database schema | |
| python create_openstates_schema.py | |
| # Load legislator data | |
| python load_openstates_people.py | |
| # Export to Gold format | |
| python export_openstates_to_gold.py | |
| # Aggregate bill statistics | |
| python aggregate_bills_from_postgres.py | |
| ``` | |
| ## Document Downloader Details | |
| The `download_documents.py` script downloads actual bill documents (PDFs, Word docs, etc.) from state legislature websites. | |
| **Database Requirements:** | |
| - PostgreSQL with OpenStates data (from `bulk_legislative_download.py`) | |
| - Default: `postgresql://postgres:password@localhost:5433/openstates` | |
| **Document Types:** | |
| - **Bill Versions** (3.5M): Text of bills as introduced, amended, enrolled, etc. | |
| - **Bill Documents** (1M): Fiscal notes, committee statements, amendments, etc. | |
| **Features:** | |
| - Organizes by year: `/mnt/d/openstates_documents/documents/2025/` | |
| - Snake_case filenames: `hb_1234_fiscal_note_introduced.pdf` | |
| - Resume capability with JSON logging | |
| - Progress updates every 1,000 files | |
| - Rate limiting (0.1s between requests) | |
| - Handles 403/404 errors gracefully | |
| **Output Structure:** | |
| ``` | |
| /mnt/d/openstates_documents/ | |
| βββ versions/ # Bill version texts | |
| β βββ 2017/ | |
| β βββ 2025/ | |
| β βββ ... | |
| βββ documents/ # Supporting documents | |
| β βββ 2025/ | |
| β βββ ... | |
| βββ download_log.json # Progress tracking | |
| ``` | |
| **Recommended Usage:** | |
| ```bash | |
| # Start with recent year (testing) | |
| python download_documents.py --years 2025 --type documents --limit 1000 | |
| # Download all 2024-2025 documents | |
| python download_documents.py --years 2024,2025 --resume | |
| # Full download (WARNING: 4.5M files, ~2TB, takes days!) | |
| python download_documents.py --resume | |
| ``` | |
| **Important Notes:** | |
| - 4.5 million documents will take several days to download | |
| - Estimated size: ~2TB (average 500KB per document) | |
| - Use `--resume` to continue interrupted downloads | |
| - Progress saved every 100 downloads | |
| - Failed downloads logged separately | |
| ## Data License | |
| OpenStates data is available under various open licenses. See https://openstates.org/data/ for details. | |