Distinct Legislators Extractor
Extracts a deduplicated list of legislators from Voteview's HSall_members data, aggregating congress sessions served for each legislator (96th congress onward = 1979+).
Source Data
- Source: Voteview HSall_members.parquet
- URL: HuggingFace dataset Dustinhax/tyt
- Original structure: One row per legislator per congress session
- Output structure: One row per legislator (aggregated)
Filtering
- Congress filter: >= 96 (1979-1980 onward)
- Null filter: bioguide_id IS NOT NULL (excludes ~17 records without a bioguide_id)
  - These are typically Presidents or historical members without bioguide IDs
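The two filters can be illustrated with a small pure-Python sketch. This is not the package's implementation (which runs the filters in SQL via DuckDB); the `keep_row` helper and the sample IDs are hypothetical and exist only to show which rows survive.

```python
MIN_CONGRESS = 96  # default threshold from the Filtering section


def keep_row(row: dict, min_congress: int = MIN_CONGRESS) -> bool:
    """Return True if a source row passes both the null and congress filters."""
    return row.get("bioguide_id") is not None and row["congress"] >= min_congress


# Hypothetical rows; the bioguide IDs are made up for illustration.
rows = [
    {"bioguide_id": "X000001", "congress": 95},   # dropped: before the 96th congress
    {"bioguide_id": "X000001", "congress": 96},   # kept
    {"bioguide_id": None, "congress": 100},       # dropped: no bioguide_id
]

kept = [r for r in rows if keep_row(r)]
```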
Installation
Requires Python 3.13+ and the following dependencies:
uv pip install duckdb pyarrow
Usage
Command Line
# Basic usage
python -m distinct_legislators legislators.parquet
# Custom congress filter
python -m distinct_legislators legislators.parquet --min-congress 100
# Skip validation (not recommended)
python -m distinct_legislators legislators.parquet --no-validate
# Custom sample sizes for validation
python -m distinct_legislators legislators.parquet --aggregation-sample 200 --deep-sample 100
# Reproducible validation with fixed random seed
python -m distinct_legislators legislators.parquet --seed 42
Python API
from distinct_legislators import extract_distinct_legislators
# Extract with default settings
result = extract_distinct_legislators("legislators.parquet")
print(f"Extracted {result.output_count:,} legislators")
print(f"Validation passed: {result.validation.all_valid}")
# Custom options
result = extract_distinct_legislators(
    "legislators.parquet",
    min_congress=100,  # 100th congress (1987) onward
    aggregation_sample_size=200,
    deep_sample_size=100,
)
Output Schema
| Column | Type | Description |
|---|---|---|
| bioguide_id | VARCHAR | Primary key (unique per legislator) |
| bioname | VARCHAR | Full name from Voteview (most recent) |
| state_abbrev | VARCHAR | Most recent state represented |
| party_code | DOUBLE | Most recent party (100=Dem, 200=Rep) |
| congresses_served | INT16[] | Array of all congress numbers served |
| first_congress | INT16 | Earliest congress (MIN) for range queries |
| last_congress | INT16 | Latest congress (MAX) for range queries |
| nominate_dim1 | DOUBLE | Economic liberalism/conservatism score |
| nominate_dim2 | DOUBLE | Social issues score |
Congress Numbers to Years
Congress N covers years (1787 + 2*N) to (1788 + 2*N):
- Congress 96 = 1979-1980
- Congress 100 = 1987-1988
- Congress 119 = 2025-2026
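The conversion above is simple enough to express as a pair of helpers. The function names are hypothetical and not part of the package API. The sketch treats each congress as covering two whole calendar years, which ignores the fact that terms actually begin in early January.

```python
def congress_to_years(n: int) -> tuple[int, int]:
    """Return the (start_year, end_year) covered by congress number n."""
    if n < 1:
        raise ValueError("congress numbers start at 1")
    return 1787 + 2 * n, 1788 + 2 * n


def year_to_congress(year: int) -> int:
    """Return the congress number covering a calendar year (inverse of the above)."""
    if year < 1789:
        raise ValueError("the 1st congress began in 1789")
    return (year - 1787) // 2
```

For example, `congress_to_years(96)` gives `(1979, 1980)`, and `year_to_congress(1987)` gives `100`.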
Data Interpretation Decisions
Aggregation Rules
| Field | Aggregation | Rationale |
|---|---|---|
| bioguide_id | GROUP BY | Primary key, unique per legislator |
| bioname | LAST by congress | Name format may change; use most recent |
| state_abbrev | LAST by congress | Legislators may change states (rare) |
| party_code | LAST by congress | Party affiliation may change over career |
| congresses_served | LIST ordered | Complete history of all sessions served |
| first_congress | MIN | Earliest congress served (career start) |
| last_congress | MAX | Latest congress served (current or end) |
| nominate_dim1 | LAST by congress | Ideology score from most recent session |
| nominate_dim2 | LAST by congress | Ideology score from most recent session |
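To make the rules in the table concrete, here is a minimal pure-Python sketch of the same aggregation. The real extractor expresses this in SQL (see schema.py); the `aggregate` function and the sample rows below are hypothetical and only mirror the table's semantics.

```python
from collections import defaultdict

# Fields that take the value from the most recent congress ("LAST by congress").
LAST_FIELDS = ("bioname", "state_abbrev", "party_code", "nominate_dim1", "nominate_dim2")


def aggregate(rows: list[dict]) -> list[dict]:
    """Collapse one-row-per-congress records into one row per bioguide_id."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["bioguide_id"]].append(row)  # GROUP BY bioguide_id

    result = []
    for bioguide_id, group in groups.items():
        ordered = sorted(group, key=lambda r: r["congress"])
        latest = ordered[-1]
        congresses = [r["congress"] for r in ordered]  # LIST ordered
        record = {"bioguide_id": bioguide_id}
        record.update({field: latest.get(field) for field in LAST_FIELDS})
        record["congresses_served"] = congresses
        record["first_congress"] = congresses[0]   # MIN
        record["last_congress"] = congresses[-1]   # MAX
        result.append(record)
    return result


# Hypothetical input: one legislator who switched party in a later congress.
rows = [
    {"bioguide_id": "X000001", "congress": 101, "bioname": "DOE, Jane",
     "state_abbrev": "PA", "party_code": 200.0},
    {"bioguide_id": "X000001", "congress": 99, "bioname": "DOE, Jane",
     "state_abbrev": "PA", "party_code": 200.0},
    {"bioguide_id": "X000001", "congress": 111, "bioname": "DOE, Jane Q.",
     "state_abbrev": "PA", "party_code": 100.0},
]
legislators = {r["bioguide_id"]: r for r in aggregate(rows)}
```

Note that the list is not deduplicated, so its length equals the number of source rows for that legislator.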
Party Codes
| Code | Party |
|---|---|
| 100 | Democrat |
| 200 | Republican |
| 328 | Independent |
| Other | Historical parties (Whig, Federalist, etc.) |
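Because party_code is stored as a DOUBLE, a lookup needs to normalize the value before matching. The following is a small sketch of one way to do that; `party_label` and `PARTY_LABELS` are hypothetical names, and only the three codes listed in the table are mapped.

```python
import math

# Only the codes documented in the table above; everything else is "Other".
PARTY_LABELS = {100: "Democrat", 200: "Republican", 328: "Independent"}


def party_label(code) -> str:
    """Map a Voteview party_code (possibly a float, None, or NaN) to a label."""
    if code is None or (isinstance(code, float) and math.isnan(code)):
        return "Unknown"
    key = int(code)
    return PARTY_LABELS.get(key, f"Other ({key})")
```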
NOMINATE Scores
- dim1: Economic liberalism/conservatism (-1 to +1, negative=liberal)
- dim2: Social issues/civil rights (-1 to +1, interpretation varies by era)
- Scores are session-specific; we keep the most recent for simplicity
Known Edge Cases
- Party switchers: Uses most recent party (e.g., Arlen Specter shows Democrat)
- State changers: Uses most recent state (rare, but possible)
- Gaps in service: congresses_served array handles non-consecutive terms
- Presidents: Excluded (no bioguide_id in Voteview data)
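The gaps-in-service case is why first_congress and last_congress alone cannot answer "did this legislator serve in congress N?". A brief sketch of how a consumer of the output might handle this, using hypothetical helper names:

```python
def service_gaps(congresses_served: list[int]) -> list[int]:
    """Return congress numbers between first and last service that were not served."""
    if not congresses_served:
        return []
    served = set(congresses_served)
    return [n for n in range(min(served), max(served) + 1) if n not in served]


def served_in(record: dict, congress: int) -> bool:
    """Exact membership test; a first/last range check would miss gaps."""
    return congress in record["congresses_served"]


# Hypothetical record with a two-congress gap.
record = {"first_congress": 96, "last_congress": 100, "congresses_served": [96, 97, 100]}
```

Here `service_gaps(record["congresses_served"])` returns `[98, 99]`, and `served_in(record, 98)` is `False` even though 98 lies between first_congress and last_congress.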
Validation
Three-tier validation ensures correct aggregation:
Tier 1: Completeness
Verifies every source bioguide_id appears exactly once in output:
- Output count matches distinct source count
- No missing bioguide_ids
- No extra bioguide_ids
- No duplicates in output
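The four completeness checks reduce to set and count comparisons. The sketch below shows the idea on plain lists of IDs; the function name and the shape of the returned report are assumptions, not the actual interface of validators.py.

```python
from collections import Counter


def check_completeness(source_ids: list, output_ids: list) -> dict:
    """Compare source bioguide_ids (one per row, may repeat) with output ids."""
    expected = {i for i in source_ids if i is not None}  # distinct, non-null
    counts = Counter(output_ids)
    actual = set(counts)
    return {
        "expected_count": len(expected),
        "actual_count": len(output_ids),
        "missing": sorted(expected - actual),
        "extra": sorted(actual - expected),
        "duplicates": sorted(i for i, c in counts.items() if c > 1),
    }


def is_complete(report: dict) -> bool:
    """True only when counts match and there are no missing, extra, or duplicate ids."""
    return (
        report["expected_count"] == report["actual_count"]
        and not report["missing"]
        and not report["extra"]
        and not report["duplicates"]
    )
```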
Tier 2: Aggregation Integrity
Randomly samples legislators and verifies:
- first_congress = MIN(congress) from source
- last_congress = MAX(congress) from source
- congresses_served array length matches source row count
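A seeded random sample makes these checks reproducible, which is presumably what the --seed option is for. The sketch below is one possible shape for such a check on in-memory rows; `check_aggregation` and its failure tuples are hypothetical and not the package's real validator.

```python
import random
from collections import defaultdict


def check_aggregation(source_rows, output_rows, sample_size, seed=None):
    """Return (bioguide_id, field, expected, actual) tuples for sampled mismatches."""
    rng = random.Random(seed)  # fixed seed gives the same sample on every run
    sample = rng.sample(output_rows, min(sample_size, len(output_rows)))

    congresses_by_id = defaultdict(list)
    for row in source_rows:
        congresses_by_id[row["bioguide_id"]].append(row["congress"])

    failures = []
    for rec in sample:
        congresses = congresses_by_id.get(rec["bioguide_id"], [])
        if not congresses:
            failures.append((rec["bioguide_id"], "bioguide_id", "present in source", "absent"))
            continue
        checks = {
            "first_congress": (min(congresses), rec["first_congress"]),
            "last_congress": (max(congresses), rec["last_congress"]),
            # Array length is compared against the source row count.
            "congresses_served": (len(congresses), len(rec["congresses_served"])),
        }
        for field, (expected, actual) in checks.items():
            if expected != actual:
                failures.append((rec["bioguide_id"], field, expected, actual))
    return failures
```

An empty list means every sampled legislator passed all three checks.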
Tier 3: Sample Verification
Deep validation of random legislators:
- congresses_served array contains exactly the right congress numbers
- bioname matches the most recent congress entry
- state_abbrev matches the most recent congress entry
Technical Details
Compression
- Algorithm: ZSTD (Zstandard)
- Typical output size: ~53 KB for ~2,300 legislators
Query Engine
- Uses DuckDB for efficient remote Parquet reading
- Direct URL access to HuggingFace dataset
Module Structure
scripts/distinct_legislators/
├── __init__.py      # Public API exports
├── __main__.py      # Module entry point
├── cli.py           # Command-line interface
├── extractor.py     # Core extraction logic
├── exceptions.py    # Exception hierarchy
├── schema.py        # Schema and aggregation SQL
├── validators.py    # Three-tier validation
└── README.md        # This file
Error Handling
All errors inherit from DistinctLegislatorsError with detailed context:
- SourceReadError: Source URL and error details
- CompletenessError: Expected/actual counts, missing/extra IDs
- AggregationError: bioguide_id, field name, expected/actual values
- SampleValidationError: Same as above plus sample index
- OutputWriteError: Output path and error details
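A shared base class lets callers catch any failure from the package with one except clause while still inspecting the context fields. The sketch below shows how such a hierarchy might look for two of the classes. The class names come from the list above, but the constructor signatures, attribute names, and the `describe` helper are assumptions, not the contents of exceptions.py.

```python
class DistinctLegislatorsError(Exception):
    """Base class for all errors raised by the extractor."""


class CompletenessError(DistinctLegislatorsError):
    def __init__(self, expected_count, actual_count, missing=(), extra=()):
        super().__init__(
            f"expected {expected_count} legislators, got {actual_count} "
            f"({len(missing)} missing, {len(extra)} extra)"
        )
        self.expected_count = expected_count
        self.actual_count = actual_count
        self.missing = list(missing)
        self.extra = list(extra)


class AggregationError(DistinctLegislatorsError):
    def __init__(self, bioguide_id, field, expected, actual):
        super().__init__(
            f"{bioguide_id}: {field} expected {expected!r}, got {actual!r}"
        )
        self.bioguide_id = bioguide_id
        self.field = field
        self.expected = expected
        self.actual = actual


def describe(exc: DistinctLegislatorsError) -> str:
    """Format any package error as '<ClassName>: <message>' for logging."""
    return f"{type(exc).__name__}: {exc}"
```

Catching `DistinctLegislatorsError` handles every subclass; catching a specific subclass gives access to fields such as `exc.bioguide_id`.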
HuggingFace Upload
After extraction, upload validated files to the processed data repository:
# Target location (processed data, not raw)
https://huggingface.co/datasets/Dustinhax/paper-trail-data
# Using HuggingFace CLI
hf upload Dustinhax/paper-trail-data distinct_legislators.parquet distinct_legislators.parquet --repo-type dataset
Note: Raw source data lives in Dustinhax/tyt. Processed/transformed data goes to Dustinhax/paper-trail-data.
Citation
When using this data, cite the original Voteview source:
Lewis, Jeffrey B., Keith Poole, Howard Rosenthal, Adam Boche,
Aaron Rudkin, and Luke Sonnet. 2024. Voteview: Congressional
Roll-Call Votes Database. https://voteview.com/