| # Distinct Legislators Extractor | |
| Extracts a deduplicated list of legislators from Voteview's HSall_members data, aggregating the congresses each legislator served in (96th Congress onward, i.e. 1979 and later). | |
| ## Source Data | |
| - **Source:** Voteview HSall_members.parquet | |
| - **Location:** HuggingFace dataset `Dustinhax/tyt` | |
| - **Original structure:** One row per legislator per congress session | |
| - **Output structure:** One row per legislator (aggregated) | |
| ### Filtering | |
| - **Congress filter:** >= 96 (1979-1980 onward) | |
| - **Null filter:** `bioguide_id IS NOT NULL` (excludes ~17 records without a bioguide ID) | |
| - These are typically Presidents or historical members without bioguide IDs | |
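The two filters above can be sketched in plain Python. This is only an illustration: the module itself uses DuckDB, and the `keep_row` helper and the sample rows (with made-up IDs) are hypothetical.

```python
MIN_CONGRESS = 96  # default cutoff described above

def keep_row(row: dict, min_congress: int = MIN_CONGRESS) -> bool:
    """Return True when a source row passes both the null and congress filters."""
    return row["bioguide_id"] is not None and row["congress"] >= min_congress

# Made-up rows: one too early, one valid, one without a bioguide ID.
rows = [
    {"bioguide_id": "X000001", "congress": 95},
    {"bioguide_id": "X000001", "congress": 96},
    {"bioguide_id": None, "congress": 100},
]
kept = [r for r in rows if keep_row(r)]
```

Only the second row survives: the first fails the congress filter and the third fails the null filter.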
| ## Installation | |
| Requires Python 3.13+ and the following dependencies: | |
| ```bash | |
| uv pip install duckdb pyarrow | |
| ``` | |
| ## Usage | |
| ### Command Line | |
| ```bash | |
| # Basic usage | |
| python -m distinct_legislators legislators.parquet | |
| # Custom congress filter | |
| python -m distinct_legislators legislators.parquet --min-congress 100 | |
| # Skip validation (not recommended) | |
| python -m distinct_legislators legislators.parquet --no-validate | |
| # Custom sample sizes for validation | |
| python -m distinct_legislators legislators.parquet --aggregation-sample 200 --deep-sample 100 | |
| # Reproducible validation with fixed random seed | |
| python -m distinct_legislators legislators.parquet --seed 42 | |
| ``` | |
| ### Python API | |
| ```python | |
| from distinct_legislators import extract_distinct_legislators | |
| # Extract with default settings | |
| result = extract_distinct_legislators("legislators.parquet") | |
| print(f"Extracted {result.output_count:,} legislators") | |
| print(f"Validation passed: {result.validation.all_valid}") | |
| # Custom options | |
| result = extract_distinct_legislators( | |
| "legislators.parquet", | |
| min_congress=100, # 100th congress (1987) onward | |
| aggregation_sample_size=200, | |
| deep_sample_size=100, | |
| ) | |
| ``` | |
| ## Output Schema | |
| | Column | Type | Description | | |
| |--------|------|-------------| | |
| | bioguide_id | VARCHAR | Primary key (unique per legislator) | | |
| | bioname | VARCHAR | Full name from Voteview (most recent) | | |
| | state_abbrev | VARCHAR | Most recent state represented | | |
| | party_code | DOUBLE | Most recent party (100=Dem, 200=Rep) | | |
| | congresses_served | INT16[] | Array of all congress numbers served | | |
| | first_congress | INT16 | Earliest congress (MIN) for range queries | | |
| | last_congress | INT16 | Latest congress (MAX) for range queries | | |
| | nominate_dim1 | DOUBLE | Economic liberalism/conservatism score | | |
| | nominate_dim2 | DOUBLE | Social issues score | | |
| ### Congress Numbers to Years | |
| Congress N covers years `(1787 + 2*N)` to `(1788 + 2*N)`: | |
| - Congress 96 = 1979-1980 | |
| - Congress 100 = 1987-1988 | |
| - Congress 119 = 2025-2026 | |
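The conversion above is simple enough to express as a helper. The function names below are hypothetical and not part of the module; the inverse follows the same two-calendar-year convention as the formula.

```python
def congress_years(n: int) -> tuple[int, int]:
    """Return the (first, last) calendar year covered by congress n."""
    return 1787 + 2 * n, 1788 + 2 * n

def congress_for_year(year: int) -> int:
    """Inverse mapping: the congress whose two-year span contains `year`."""
    return (year - 1787) // 2

span_96 = congress_years(96)
span_119 = congress_years(119)
```

`congress_for_year` is handy for turning a date range into a `--min-congress` value.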
| ## Data Interpretation Decisions | |
| ### Aggregation Rules | |
| | Field | Aggregation | Rationale | | |
| |-------|-------------|-----------| | |
| | bioguide_id | GROUP BY | Primary key, unique per legislator | | |
| | bioname | LAST by congress | Name format may change; use most recent | | |
| | state_abbrev | LAST by congress | Legislators may change states (rare) | | |
| | party_code | LAST by congress | Party affiliation may change over career | | |
| | congresses_served | LIST ordered | Complete history of all sessions served | | |
| | first_congress | MIN | Earliest congress served (career start) | | |
| | last_congress | MAX | Latest congress served (current or end) | | |
| | nominate_dim1 | LAST by congress | Ideology score from most recent session | | |
| | nominate_dim2 | LAST by congress | Ideology score from most recent session | | |
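As a rough illustration of these rules, the following pure-Python sketch groups rows by `bioguide_id`, orders each group by congress, and takes the last row for the "LAST by congress" fields. The actual extractor does this in DuckDB SQL; the `aggregate` function and the sample rows (a made-up legislator who switches party) are assumptions for demonstration only.

```python
from collections import defaultdict

LAST_FIELDS = ("bioname", "state_abbrev", "party_code", "nominate_dim1", "nominate_dim2")

def aggregate(rows):
    """Collapse one-row-per-congress input into one row per legislator."""
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["bioguide_id"]].append(row)

    result = []
    for bioguide_id, group in by_id.items():
        group.sort(key=lambda r: r["congress"])   # oldest first
        latest = group[-1]                        # most recent congress
        congresses = [r["congress"] for r in group]
        record = {
            "bioguide_id": bioguide_id,
            "congresses_served": congresses,      # LIST ordered
            "first_congress": congresses[0],      # MIN
            "last_congress": congresses[-1],      # MAX
        }
        record.update({f: latest[f] for f in LAST_FIELDS})  # LAST by congress
        result.append(record)
    return result

# Made-up rows, deliberately out of order.
rows = [
    {"bioguide_id": "X000001", "congress": 111, "bioname": "DOE, Jane", "state_abbrev": "PA",
     "party_code": 100.0, "nominate_dim1": -0.1, "nominate_dim2": 0.2},
    {"bioguide_id": "X000001", "congress": 109, "bioname": "DOE, Jane", "state_abbrev": "PA",
     "party_code": 200.0, "nominate_dim1": 0.1, "nominate_dim2": 0.2},
    {"bioguide_id": "X000001", "congress": 110, "bioname": "DOE, Jane", "state_abbrev": "PA",
     "party_code": 200.0, "nominate_dim1": 0.1, "nominate_dim2": 0.2},
]
legislator = aggregate(rows)[0]
```

The single output row carries the party from the 111th congress, which is the behavior the table describes for party switchers.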
| ### Party Codes | |
| | Code | Party | | |
| |------|-------| | |
| | 100 | Democrat | | |
| | 200 | Republican | | |
| | 328 | Independent | | |
| | Other | Historical parties (Whig, Federalist, etc.) | | |
| ### NOMINATE Scores | |
| - **dim1:** Economic liberalism/conservatism (-1 to +1, negative=liberal) | |
| - **dim2:** Social issues/civil rights (-1 to +1, interpretation varies by era) | |
| - Scores are session-specific; we keep the most recent for simplicity | |
| ### Known Edge Cases | |
| 1. **Party switchers:** Uses most recent party (e.g., Arlen Specter shows Democrat) | |
| 2. **State changers:** Uses most recent state (rare, but possible) | |
| 3. **Gaps in service:** `congresses_served` array handles non-consecutive terms | |
| 4. **Presidents:** Excluded (no bioguide_id in Voteview data) | |
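Because `first_congress` and `last_congress` only bound a career, gaps in service have to be read from `congresses_served`. A minimal sketch, assuming the array is a list of integers; `service_gaps` is a hypothetical helper, not part of the module:

```python
def service_gaps(congresses_served):
    """Return (start, end) ranges of congresses missing between served terms."""
    ordered = sorted(set(congresses_served))
    return [(a + 1, b - 1) for a, b in zip(ordered, ordered[1:]) if b - a > 1]

gaps = service_gaps([96, 97, 100, 101])
no_gaps = service_gaps([96, 97, 98])
```

Here `gaps` reports that congresses 98 through 99 were skipped, while a consecutive career yields an empty list.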
| ## Validation | |
| Three-tier validation ensures correct aggregation: | |
| ### Tier 1: Completeness | |
| Verifies every source bioguide_id appears exactly once in output: | |
| - Output count matches distinct source count | |
| - No missing bioguide_ids | |
| - No extra bioguide_ids | |
| - No duplicates in output | |
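The four completeness checks reduce to set arithmetic on the two ID columns. The sketch below assumes both columns are available as Python lists; `check_completeness` and its return shape are illustrative and may differ from what `validators.py` does.

```python
from collections import Counter

def check_completeness(source_ids, output_ids):
    """Compare distinct non-null source IDs against the output IDs."""
    expected = {i for i in source_ids if i is not None}
    counts = Counter(output_ids)
    actual = set(counts)
    return {
        "count_matches": len(output_ids) == len(expected),
        "missing": sorted(expected - actual),   # in source, not in output
        "extra": sorted(actual - expected),     # in output, not in source
        "duplicates": sorted(i for i, n in counts.items() if n > 1),
    }

source_ids = ["A", "A", "B", None]          # one row per legislator per congress
good = check_completeness(source_ids, ["A", "B"])
bad = check_completeness(source_ids, ["A", "A", "C"])
```

A clean run has `count_matches` set and three empty lists; the `bad` case trips all four checks at once.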
| ### Tier 2: Aggregation Integrity | |
| Randomly samples legislators and verifies: | |
| - `first_congress` = MIN(congress) from source | |
| - `last_congress` = MAX(congress) from source | |
| - `congresses_served` array length matches source row count | |
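A sampled check along these lines might look as follows. It is a sketch under assumptions: rows are dictionaries, the function name and failure tuple format are invented here, and the `seed` argument mirrors the CLI's `--seed` option only in spirit.

```python
import random

def check_aggregation_sample(source_rows, output_rows, sample_size, seed=None):
    """Return (bioguide_id, field, expected, actual) for every mismatch found."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample = rng.sample(output_rows, min(sample_size, len(output_rows)))
    failures = []
    for out in sample:
        congresses = [r["congress"] for r in source_rows
                      if r["bioguide_id"] == out["bioguide_id"]]
        if not congresses:  # Tier 1 should already have caught this case
            failures.append((out["bioguide_id"], "bioguide_id", "present in source", "absent"))
            continue
        checks = {
            "first_congress": (min(congresses), out["first_congress"]),
            "last_congress": (max(congresses), out["last_congress"]),
            "congresses_served": (len(congresses), len(out["congresses_served"])),
        }
        for field, (expected, actual) in checks.items():
            if expected != actual:
                failures.append((out["bioguide_id"], field, expected, actual))
    return failures

# Made-up source and output rows.
source = [
    {"bioguide_id": "X000001", "congress": 96},
    {"bioguide_id": "X000001", "congress": 97},
    {"bioguide_id": "X000002", "congress": 100},
]
good = [
    {"bioguide_id": "X000001", "first_congress": 96, "last_congress": 97,
     "congresses_served": [96, 97]},
    {"bioguide_id": "X000002", "first_congress": 100, "last_congress": 100,
     "congresses_served": [100]},
]
bad = [dict(good[0], first_congress=97), good[1]]
ok_failures = check_aggregation_sample(source, good, sample_size=10, seed=42)
bad_failures = check_aggregation_sample(source, bad, sample_size=10, seed=42)
```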
| ### Tier 3: Sample Verification | |
| Deep validation of randomly sampled legislators: | |
| - `congresses_served` array contains exactly the right congress numbers | |
| - `bioname` matches the most recent congress entry | |
| - `state_abbrev` matches the most recent congress entry | |
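Unlike Tier 2, this tier compares actual contents rather than counts and bounds. A minimal per-legislator sketch, with a hypothetical `deep_check` helper and made-up rows:

```python
def deep_check(source_rows, out):
    """Return the names of fields in `out` that disagree with the source rows."""
    rows = sorted(
        (r for r in source_rows if r["bioguide_id"] == out["bioguide_id"]),
        key=lambda r: r["congress"],
    )
    if not rows:
        return ["bioguide_id"]
    latest = rows[-1]  # most recent congress entry
    mismatches = []
    if list(out["congresses_served"]) != [r["congress"] for r in rows]:
        mismatches.append("congresses_served")
    for field in ("bioname", "state_abbrev"):
        if out[field] != latest[field]:
            mismatches.append(field)
    return mismatches

source = [
    {"bioguide_id": "X000001", "congress": 96, "bioname": "DOE, J.", "state_abbrev": "NY"},
    {"bioguide_id": "X000001", "congress": 98, "bioname": "DOE, Jane", "state_abbrev": "NY"},
]
correct = {"bioguide_id": "X000001", "congresses_served": [96, 98],
           "bioname": "DOE, Jane", "state_abbrev": "NY"}
# Same length as the source, so Tier 2 would pass, but the contents are wrong.
wrong = dict(correct, congresses_served=[96, 97], bioname="DOE, J.")
clean = deep_check(source, correct)
problems = deep_check(source, wrong)
```

The `wrong` record shows why the deeper tier exists: its array has the right length but the wrong congress numbers.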
| ## Technical Details | |
| ### Compression | |
| - Algorithm: ZSTD (Zstandard) | |
| - Typical output size: ~53 KB for ~2,300 legislators | |
| ### Query Engine | |
| - Uses DuckDB for efficient remote Parquet reading | |
| - Direct URL access to HuggingFace dataset | |
| ## Module Structure | |
| ``` | |
| scripts/distinct_legislators/ | |
| ├── __init__.py # Public API exports | |
| ├── __main__.py # Module entry point | |
| ├── cli.py # Command-line interface | |
| ├── extractor.py # Core extraction logic | |
| ├── exceptions.py # Exception hierarchy | |
| ├── schema.py # Schema and aggregation SQL | |
| ├── validators.py # Three-tier validation | |
| └── README.md # This file | |
| ``` | |
| ## Error Handling | |
| All errors inherit from `DistinctLegislatorsError` with detailed context: | |
| - `SourceReadError`: Source URL and error details | |
| - `CompletenessError`: Expected/actual counts, missing/extra IDs | |
| - `AggregationError`: bioguide_id, field name, expected/actual values | |
| - `SampleValidationError`: Same context as `AggregationError`, plus the sample index | |
| - `OutputWriteError`: Output path and error details | |
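The contents of `exceptions.py` are not reproduced here, but a hierarchy matching the description could be shaped like this. The class names come from the list above; the constructor signatures, attribute names, and the choice to derive `SampleValidationError` from `AggregationError` are assumptions.

```python
class DistinctLegislatorsError(Exception):
    """Base class so callers can catch every extractor error at once."""

class AggregationError(DistinctLegislatorsError):
    """An aggregated field disagrees with the source data."""
    def __init__(self, bioguide_id, field, expected, actual):
        super().__init__(
            f"{field} mismatch for {bioguide_id}: expected {expected!r}, got {actual!r}"
        )
        self.bioguide_id = bioguide_id
        self.field = field
        self.expected = expected
        self.actual = actual

class SampleValidationError(AggregationError):
    """Same context as AggregationError, plus the index of the failing sample."""
    def __init__(self, bioguide_id, field, expected, actual, sample_index):
        super().__init__(bioguide_id, field, expected, actual)
        self.sample_index = sample_index

# Catching the base class handles any error in the hierarchy.
try:
    raise SampleValidationError("X000001", "bioname", "DOE, Jane", "DOE, J.", sample_index=3)
except DistinctLegislatorsError as exc:
    caught = exc
```

Keeping the context as attributes, not only in the message, lets callers log or act on the failing `bioguide_id` and field without parsing strings.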
| ## HuggingFace Upload | |
| After extraction, upload validated files to the processed data repository: | |
| ```bash | |
| # Target location (processed data, not raw): | |
| # https://huggingface.co/datasets/Dustinhax/paper-trail-data | |
| # Using HuggingFace CLI | |
| hf upload Dustinhax/paper-trail-data distinct_legislators.parquet distinct_legislators.parquet --repo-type dataset | |
| ``` | |
| Note: Raw source data lives in `Dustinhax/tyt`. Processed/transformed data goes to `Dustinhax/paper-trail-data`. | |
| ## Citation | |
| When using this data, cite the original Voteview source: | |
| ``` | |
| Lewis, Jeffrey B., Keith Poole, Howard Rosenthal, Adam Boche, | |
| Aaron Rudkin, and Luke Sonnet. 2024. Voteview: Congressional | |
| Roll-Call Votes Database. https://voteview.com/ | |
| ``` | |