# OpenStates Data Scripts
Scripts for working with [OpenStates](https://openstates.org/) legislative data.
## Data Source
- **Website**: https://openstates.org/
- **API Docs**: https://docs.openstates.org/api-v3/
- **Bulk Data**: https://openstates.org/data/
- **Coverage**: 52 US states and territories
- **Data Types**: Bills, legislators, votes, committees
## Scripts
### Download & Import
- `bulk_legislative_download.py` - Download OpenStates bulk data dumps (PostgreSQL or CSV)
- `download_documents.py` - Download 4.5M+ bill documents using the URLs stored in the PostgreSQL database
- `load_openstates_csv.sh` - Load CSV exports into database
- `load_openstates_people.py` - Load legislator data from GitHub repo
### Schema & Export
- `create_openstates_schema.py` - Create PostgreSQL schema for OpenStates data
- `export_openstates_to_gold.py` - Export from PostgreSQL to Gold Parquet files
### Processing
- `aggregate_bills_from_postgres.py` - Aggregate bill statistics by state/topic
- `legislative_tracker.py` - Track legislative activity
## Usage Examples
```bash
# Download April 2026 bulk data to PostgreSQL
python bulk_legislative_download.py --postgres --month 2026-04
# Download bill documents from database (2025 only)
python download_documents.py --years 2025 --type documents
# Download all documents for recent years
python download_documents.py --years 2024,2025 --resume
# Test document download with limit
python download_documents.py --limit 100 --dry-run
# Create database schema
python create_openstates_schema.py
# Load legislator data
python load_openstates_people.py
# Export to Gold format
python export_openstates_to_gold.py
# Aggregate bill statistics
python aggregate_bills_from_postgres.py
```
## Document Downloader Details
The `download_documents.py` script downloads actual bill documents (PDFs, Word docs, etc.) from state legislature websites.
**Database Requirements:**
- PostgreSQL with OpenStates data (from `bulk_legislative_download.py`)
- Default connection string: `postgresql://postgres:password@localhost:5433/openstates` (note port 5433, not the PostgreSQL default of 5432)
**Document Types:**
- **Bill Versions** (3.5M): Text of bills as introduced, amended, enrolled, etc.
- **Bill Documents** (1M): Fiscal notes, committee statements, amendments, etc.
**Features:**
- Organizes by year: `/mnt/d/openstates_documents/documents/2025/`
- Snake_case filenames: `hb_1234_fiscal_note_introduced.pdf`
- Resume capability with JSON logging
- Progress updates every 1,000 files
- Rate limiting (0.1s between requests)
- Handles 403/404 errors gracefully
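The naming and year-based layout listed above can be sketched as follows. This is only an illustration: `snake_case_filename` and `destination` are hypothetical helpers, not functions taken from `download_documents.py`, and the exact normalization rules the script applies may differ.

```python
import re
from pathlib import PurePosixPath


def snake_case_filename(bill_id: str, description: str, extension: str) -> str:
    """Build a lowercase, underscore-separated filename (hypothetical helper)."""
    raw = f"{bill_id} {description}".lower()
    # Collapse every run of non-alphanumeric characters into a single underscore.
    stem = re.sub(r"[^a-z0-9]+", "_", raw).strip("_")
    return f"{stem}.{extension.lstrip('.').lower()}"


def destination(root: str, kind: str, year: int, filename: str) -> PurePosixPath:
    """Place a file under <root>/<kind>/<year>/ (assumed layout)."""
    return PurePosixPath(root) / kind / str(year) / filename


name = snake_case_filename("HB 1234", "Fiscal Note (Introduced)", "pdf")
path = destination("/mnt/d/openstates_documents", "documents", 2025, name)
print(path)
```

Collapsing punctuation and whitespace into one underscore keeps filenames safe across filesystems, at the cost that two differently punctuated titles can map to the same name.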
**Output Structure:**
```
/mnt/d/openstates_documents/
├── versions/           # Bill version texts
│   ├── 2017/
│   ├── 2025/
│   └── ...
├── documents/          # Supporting documents
│   ├── 2025/
│   └── ...
└── download_log.json   # Progress tracking
```
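How `download_log.json` could support resuming is sketched below. The log format (a single `completed` list of URLs), the function names, and the `fetch` callback are assumptions made for illustration; the real script may record different fields.

```python
import json
import os
import tempfile


def load_log(path):
    """Return the set of URLs already downloaded, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return set(json.load(fh)["completed"])


def save_log(path, completed):
    """Write to a temp file and rename it, so an interruption cannot truncate the log."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"completed": sorted(completed)}, fh)
    os.replace(tmp, path)


def run(urls, log_path, fetch, save_every=100):
    """Fetch every URL not yet in the log, saving progress every `save_every` files."""
    completed = load_log(log_path)
    fetched = 0
    for url in urls:
        if url in completed:
            continue  # already downloaded in an earlier run
        fetch(url)
        completed.add(url)
        fetched += 1
        if fetched % save_every == 0:
            save_log(log_path, completed)
    save_log(log_path, completed)  # final save covers the last partial batch
    return fetched


# Demo with a no-op fetch: the second run skips the three URLs already logged.
log_path = os.path.join(tempfile.mkdtemp(), "download_log.json")
urls = [f"https://example.org/doc/{i}" for i in range(5)]
first = run(urls[:3], log_path, fetch=lambda url: None)
second = run(urls, log_path, fetch=lambda url: None)
print(first, second)
```

With periodic saves, an interrupted run repeats at most `save_every` downloads when it restarts.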
**Recommended Usage:**
```bash
# Start with recent year (testing)
python download_documents.py --years 2025 --type documents --limit 1000
# Download all 2024-2025 documents
python download_documents.py --years 2024,2025 --resume
# Full download (WARNING: 4.5M files, ~2TB, takes days!)
python download_documents.py --resume
```
**Important Notes:**
- 4.5 million documents will take several days to download
- Estimated size: ~2TB (average 500KB per document)
- Use `--resume` to continue interrupted downloads
- Progress saved every 100 downloads
- Failed downloads logged separately
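The size and duration figures can be checked with quick arithmetic. The sketch below uses only numbers stated in this README (4.5M documents, 500KB average, 0.1s between requests) and assumes decimal units and a single sequential downloader.

```python
TOTAL_DOCUMENTS = 4_500_000   # 3.5M bill versions + 1M bill documents
AVERAGE_SIZE_KB = 500         # average size per document
DELAY_SECONDS = 0.1           # rate-limit pause between requests

# 1 TB = 1,000,000,000 KB in decimal units.
total_tb = TOTAL_DOCUMENTS * AVERAGE_SIZE_KB / 1_000_000_000

# Time spent only on rate-limit pauses; actual transfer time comes on top.
min_days = TOTAL_DOCUMENTS * DELAY_SECONDS / 86_400

print(f"{total_tb:.2f} TB, at least {min_days:.1f} days")
```

The pauses alone account for more than five days, so the "several days" estimate is a lower bound before any network transfer time is counted.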
## Data License
OpenStates data is available under various open licenses. See https://openstates.org/data/ for details.