# Distinct Legislators Extractor Extracts a deduplicated list of legislators from Voteview's HSall_members data, aggregating congress sessions served for each legislator (96th congress onward = 1979+). ## Source Data - **Source:** Voteview HSall_members.parquet - **URL:** HuggingFace dataset `Dustinhax/tyt` - **Original structure:** One row per legislator per congress session - **Output structure:** One row per legislator (aggregated) ### Filtering - **Congress filter:** >= 96 (1979-1980 onward) - **Null filter:** `bioguide_id IS NOT NULL` (excludes ~17 records without bioguide) - These are typically Presidents or historical members without bioguide IDs ## Installation Requires Python 3.13+ and the following dependencies: ```bash uv pip install duckdb pyarrow ``` ## Usage ### Command Line ```bash # Basic usage python -m distinct_legislators legislators.parquet # Custom congress filter python -m distinct_legislators legislators.parquet --min-congress 100 # Skip validation (not recommended) python -m distinct_legislators legislators.parquet --no-validate # Custom sample sizes for validation python -m distinct_legislators legislators.parquet --aggregation-sample 200 --deep-sample 100 # Reproducible validation with fixed random seed python -m distinct_legislators legislators.parquet --seed 42 ``` ### Python API ```python from distinct_legislators import extract_distinct_legislators # Extract with default settings result = extract_distinct_legislators("legislators.parquet") print(f"Extracted {result.output_count:,} legislators") print(f"Validation passed: {result.validation.all_valid}") # Custom options result = extract_distinct_legislators( "legislators.parquet", min_congress=100, # 100th congress (1987) onward aggregation_sample_size=200, deep_sample_size=100, ) ``` ## Output Schema | Column | Type | Description | |--------|------|-------------| | bioguide_id | VARCHAR | Primary key (unique per legislator) | | bioname | VARCHAR | Full name from Voteview (most recent) | | state_abbrev | VARCHAR | Most recent state represented | | party_code | DOUBLE | Most recent party (100=Dem, 200=Rep) | | congresses_served | INT16[] | Array of all congress numbers served | | first_congress | INT16 | Earliest congress (MIN) for range queries | | last_congress | INT16 | Latest congress (MAX) for range queries | | nominate_dim1 | DOUBLE | Economic liberalism/conservatism score | | nominate_dim2 | DOUBLE | Social issues score | ### Congress Numbers to Years Congress N covers years `(1787 + 2*N)` to `(1788 + 2*N)`: - Congress 96 = 1979-1980 - Congress 100 = 1987-1988 - Congress 119 = 2025-2026 ## Data Interpretation Decisions ### Aggregation Rules | Field | Aggregation | Rationale | |-------|-------------|-----------| | bioguide_id | GROUP BY | Primary key, unique per legislator | | bioname | LAST by congress | Name format may change; use most recent | | state_abbrev | LAST by congress | Legislators may change states (rare) | | party_code | LAST by congress | Party affiliation may change over career | | congresses_served | LIST ordered | Complete history of all sessions served | | first_congress | MIN | Earliest congress served (career start) | | last_congress | MAX | Latest congress served (current or end) | | nominate_dim1 | LAST by congress | Ideology score from most recent session | | nominate_dim2 | LAST by congress | Ideology score from most recent session | ### Party Codes | Code | Party | |------|-------| | 100 | Democrat | | 200 | Republican | | 328 | Independent | | Other | Historical parties (Whig, Federalist, etc.) | ### NOMINATE Scores - **dim1:** Economic liberalism/conservatism (-1 to +1, negative=liberal) - **dim2:** Social issues/civil rights (-1 to +1, interpretation varies by era) - Scores are session-specific; we keep the most recent for simplicity ### Known Edge Cases 1. **Party switchers:** Uses most recent party (e.g., Arlen Specter shows Democrat) 2. **State changers:** Uses most recent state (rare, but possible) 3. **Gaps in service:** `congresses_served` array handles non-consecutive terms 4. **Presidents:** Excluded (no bioguide_id in Voteview data) ## Validation Three-tier validation ensures correct aggregation: ### Tier 1: Completeness Verifies every source bioguide_id appears exactly once in output: - Output count matches distinct source count - No missing bioguide_ids - No extra bioguide_ids - No duplicates in output ### Tier 2: Aggregation Integrity Randomly samples legislators and verifies: - `first_congress` = MIN(congress) from source - `last_congress` = MAX(congress) from source - `congresses_served` array length matches source row count ### Tier 3: Sample Verification Deep validation of random legislators: - `congresses_served` array contains exactly the right congress numbers - `bioname` matches the most recent congress entry - `state_abbrev` matches the most recent congress entry ## Technical Details ### Compression - Algorithm: ZSTD (Zstandard) - Typical output size: ~53 KB for ~2,300 legislators ### Query Engine - Uses DuckDB for efficient remote Parquet reading - Direct URL access to HuggingFace dataset ## Module Structure ``` scripts/distinct_legislators/ ├── __init__.py # Public API exports ├── __main__.py # Module entry point ├── cli.py # Command-line interface ├── extractor.py # Core extraction logic ├── exceptions.py # Exception hierarchy ├── schema.py # Schema and aggregation SQL ├── validators.py # Three-tier validation └── README.md # This file ``` ## Error Handling All errors inherit from `DistinctLegislatorsError` with detailed context: - `SourceReadError`: Source URL and error details - `CompletenessError`: Expected/actual counts, missing/extra IDs - `AggregationError`: bioguide_id, field name, expected/actual values - `SampleValidationError`: Same as above plus sample index - `OutputWriteError`: Output path and error details ## HuggingFace Upload After extraction, upload validated files to the processed data repository: ```bash # Target location (processed data, not raw) https://huggingface.co/datasets/Dustinhax/paper-trail-data # Using HuggingFace CLI hf upload Dustinhax/paper-trail-data distinct_legislators.parquet distinct_legislators.parquet --repo-type dataset ``` Note: Raw source data lives in `Dustinhax/tyt`. Processed/transformed data goes to `Dustinhax/paper-trail-data`. ## Citation When using this data, cite the original Voteview source: ``` Lewis, Jeffrey B., Keith Poole, Howard Rosenthal, Adam Boche, Aaron Rudkin, and Luke Sonnet. 2024. Voteview: Congressional Roll-Call Votes Database. https://voteview.com/ ```