Distinct Legislators Extractor

Extracts a deduplicated list of legislators from Voteview's HSall_members data and aggregates the congresses each legislator served in. By default, only the 96th Congress (1979-1980) onward is included.

Source Data

  • Source: Voteview HSall_members.parquet
  • URL: HuggingFace dataset Dustinhax/tyt
  • Original structure: One row per legislator per congress session
  • Output structure: One row per legislator (aggregated)

Filtering

  • Congress filter: >= 96 (1979-1980 onward)
  • Null filter: bioguide_id IS NOT NULL (excludes ~17 records that have no bioguide_id)
    • These are typically Presidents or historical members without bioguide IDs
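
The two filters amount to a simple predicate over source rows. The sketch below is illustrative only: it uses plain Python dictionaries instead of the DuckDB query the module runs, `keep_row` is a hypothetical helper name, and the bioguide IDs are made up.

```python
MIN_CONGRESS = 96  # default cutoff: 96th Congress (1979-1980)


def keep_row(row: dict, min_congress: int = MIN_CONGRESS) -> bool:
    """Return True if a source row passes both the congress and null filters."""
    return row["congress"] >= min_congress and row["bioguide_id"] is not None


# Made-up rows for illustration (not real Voteview records)
rows = [
    {"bioguide_id": "X000001", "congress": 95},   # dropped: before the cutoff
    {"bioguide_id": None, "congress": 100},       # dropped: no bioguide_id
    {"bioguide_id": "X000002", "congress": 96},   # kept
]
kept = [r for r in rows if keep_row(r)]
```

Passing a different `min_congress` mirrors what the `--min-congress` option does on the command line.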

Installation

Requires Python 3.13+ and the following dependencies:

uv pip install duckdb pyarrow

Usage

Command Line

# Basic usage
python -m distinct_legislators legislators.parquet

# Custom congress filter
python -m distinct_legislators legislators.parquet --min-congress 100

# Skip validation (not recommended)
python -m distinct_legislators legislators.parquet --no-validate

# Custom sample sizes for validation
python -m distinct_legislators legislators.parquet --aggregation-sample 200 --deep-sample 100

# Reproducible validation with fixed random seed
python -m distinct_legislators legislators.parquet --seed 42

Python API

from distinct_legislators import extract_distinct_legislators

# Extract with default settings
result = extract_distinct_legislators("legislators.parquet")
print(f"Extracted {result.output_count:,} legislators")
print(f"Validation passed: {result.validation.all_valid}")

# Custom options
result = extract_distinct_legislators(
    "legislators.parquet",
    min_congress=100,  # 100th congress (1987) onward
    aggregation_sample_size=200,
    deep_sample_size=100,
)

Output Schema

| Column | Type | Description |
| --- | --- | --- |
| bioguide_id | VARCHAR | Primary key (unique per legislator) |
| bioname | VARCHAR | Full name from Voteview (most recent) |
| state_abbrev | VARCHAR | Most recent state represented |
| party_code | DOUBLE | Most recent party (100=Dem, 200=Rep) |
| congresses_served | INT16[] | Array of all congress numbers served |
| first_congress | INT16 | Earliest congress (MIN) for range queries |
| last_congress | INT16 | Latest congress (MAX) for range queries |
| nominate_dim1 | DOUBLE | Economic liberalism/conservatism score |
| nominate_dim2 | DOUBLE | Social issues score |

Congress Numbers to Years

Congress N covers years (1787 + 2*N) to (1788 + 2*N):

  • Congress 96 = 1979-1980
  • Congress 100 = 1987-1988
  • Congress 119 = 2025-2026
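
The formula above translates directly into code. This is a minimal sketch; both function names are hypothetical and not part of the module. It works at calendar-year granularity and ignores the fact that a Congress convenes in early January, so the first days of an odd year technically belong to the previous Congress.

```python
def congress_years(n: int) -> tuple[int, int]:
    """Return the (first, second) calendar year covered by Congress n."""
    return 1787 + 2 * n, 1788 + 2 * n


def congress_for_year(year: int) -> int:
    """Return the Congress number whose two-year span contains `year`."""
    return (year - 1787) // 2


print(congress_years(96))       # (1979, 1980)
print(congress_for_year(2026))  # 119
```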

Data Interpretation Decisions

Aggregation Rules

| Field | Aggregation | Rationale |
| --- | --- | --- |
| bioguide_id | GROUP BY | Primary key, unique per legislator |
| bioname | LAST by congress | Name format may change; use most recent |
| state_abbrev | LAST by congress | Legislators may change states (rare) |
| party_code | LAST by congress | Party affiliation may change over career |
| congresses_served | LIST ordered | Complete history of all sessions served |
| first_congress | MIN | Earliest congress served (career start) |
| last_congress | MAX | Latest congress served (current or end) |
| nominate_dim1 | LAST by congress | Ideology score from most recent session |
| nominate_dim2 | LAST by congress | Ideology score from most recent session |
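
To make the rules concrete, here is a pure-Python sketch of the same aggregation. It is not the module's implementation (which expresses these rules in SQL); the `aggregate` function and the sample record are made up for illustration, and ties within a single congress are not handled.

```python
from collections import defaultdict


def aggregate(rows: list[dict]) -> list[dict]:
    """Collapse one-row-per-congress records into one row per legislator."""
    by_id: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_id[row["bioguide_id"]].append(row)  # GROUP BY bioguide_id

    out = []
    for bioguide_id, group in by_id.items():
        group.sort(key=lambda r: r["congress"])
        latest = group[-1]  # "LAST by congress"
        congresses = [r["congress"] for r in group]  # ordered LIST
        out.append({
            "bioguide_id": bioguide_id,
            "bioname": latest["bioname"],
            "state_abbrev": latest["state_abbrev"],
            "party_code": latest["party_code"],
            "congresses_served": congresses,
            "first_congress": congresses[0],   # MIN
            "last_congress": congresses[-1],   # MAX
            "nominate_dim1": latest["nominate_dim1"],
            "nominate_dim2": latest["nominate_dim2"],
        })
    return out


# Made-up legislator who changes party between two non-consecutive congresses
rows = [
    {"bioguide_id": "X000001", "congress": 111, "bioname": "DOE, Jane",
     "state_abbrev": "PA", "party_code": 100.0,
     "nominate_dim1": -0.1, "nominate_dim2": 0.2},
    {"bioguide_id": "X000001", "congress": 109, "bioname": "DOE, Jane",
     "state_abbrev": "PA", "party_code": 200.0,
     "nominate_dim1": 0.1, "nominate_dim2": 0.2},
]
legislator = aggregate(rows)[0]
```

The resulting row carries the party code from congress 111, the later of the two, regardless of the order in which the source rows arrive.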

Party Codes

| Code | Party |
| --- | --- |
| 100 | Democrat |
| 200 | Republican |
| 328 | Independent |
| Other | Historical parties (Whig, Federalist, etc.) |
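
Because party_code is stored as a DOUBLE, a lookup needs to convert it to an integer first. The helper below is a hypothetical convenience, not part of the module; the "Unknown" label for a missing code is an assumption made for this example.

```python
from typing import Optional

PARTY_NAMES = {100: "Democrat", 200: "Republican", 328: "Independent"}


def party_name(party_code: Optional[float]) -> str:
    """Map a Voteview party code to a label; unlisted codes become 'Other'."""
    if party_code is None:
        return "Unknown"  # assumption: how a missing code is labeled
    return PARTY_NAMES.get(int(party_code), "Other")
```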

NOMINATE Scores

  • dim1: Economic liberalism/conservatism (-1 to +1, negative=liberal)
  • dim2: Social issues/civil rights (-1 to +1, interpretation varies by era)
  • Scores are session-specific; we keep the most recent for simplicity

Known Edge Cases

  1. Party switchers: Uses most recent party (e.g., Arlen Specter shows Democrat)
  2. State changers: Uses most recent state (rare, but possible)
  3. Gaps in service: congresses_served array handles non-consecutive terms
  4. Presidents: Excluded (no bioguide_id in Voteview data)
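
Gaps in service (case 3) can be recovered from the congresses_served array alone. A minimal sketch, assuming the array holds plain integers; `service_gaps` is a hypothetical helper, not part of the module.

```python
def service_gaps(congresses_served: list[int]) -> list[tuple[int, int]]:
    """Return (last_before_gap, first_after_gap) pairs for non-consecutive service."""
    ordered = sorted(congresses_served)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > 1]


print(service_gaps([100, 101, 104, 105]))  # [(101, 104)]
print(service_gaps([96, 97, 98]))          # []
```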

Validation

Three-tier validation ensures correct aggregation:

Tier 1: Completeness

Verifies every source bioguide_id appears exactly once in output:

  • Output count matches distinct source count
  • No missing bioguide_ids
  • No extra bioguide_ids
  • No duplicates in output
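
The four checks reduce to set and count comparisons. The following is a sketch of the idea rather than the code in validators.py; the function name and the shape of the returned report are assumptions.

```python
from collections import Counter


def check_completeness(source_ids: list[str], output_ids: list[str]) -> dict:
    """Compare bioguide_ids in the source (repeated per congress) with the output."""
    expected = set(source_ids)
    counts = Counter(output_ids)
    actual = set(counts)
    return {
        "count_matches": len(output_ids) == len(expected),
        "missing": sorted(expected - actual),
        "extra": sorted(actual - expected),
        "duplicates": sorted(i for i, c in counts.items() if c > 1),
    }


# Made-up IDs: "C" is missing, "D" is extra, "B" is duplicated
report = check_completeness(["A", "A", "B", "C"], ["A", "B", "B", "D"])
```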

Tier 2: Aggregation Integrity

Randomly samples legislators and verifies:

  • first_congress = MIN(congress) from source
  • last_congress = MAX(congress) from source
  • congresses_served array length matches source row count
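
A sampled check along these lines might look like the sketch below. It is illustrative only: the function name, its parameters, and the tuple format of the failures are assumptions, and it presumes Tier 1 has already passed so that every sampled ID exists in the source.

```python
import random
from typing import Optional


def check_aggregation(source_rows: list[dict], output_rows: list[dict],
                      sample_size: int, seed: Optional[int] = None) -> list[tuple]:
    """Return (bioguide_id, field, expected, actual) for each mismatch in a random sample."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample = rng.sample(output_rows, min(sample_size, len(output_rows)))
    failures = []
    for out in sample:
        congresses = [r["congress"] for r in source_rows
                      if r["bioguide_id"] == out["bioguide_id"]]
        checks = {
            "first_congress": (min(congresses), out["first_congress"]),
            "last_congress": (max(congresses), out["last_congress"]),
            "congresses_served length": (len(congresses),
                                         len(out["congresses_served"])),
        }
        for field, (expected, actual) in checks.items():
            if expected != actual:
                failures.append((out["bioguide_id"], field, expected, actual))
    return failures


# Made-up data: the output row has the wrong first_congress
source = [{"bioguide_id": "A", "congress": 100},
          {"bioguide_id": "A", "congress": 101}]
bad_output = [{"bioguide_id": "A", "first_congress": 101,
               "last_congress": 101, "congresses_served": [100, 101]}]
failures = check_aggregation(source, bad_output, sample_size=10, seed=42)
```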

Tier 3: Sample Verification

Deep validation of random legislators:

  • congresses_served array contains exactly the right congress numbers
  • bioname matches the most recent congress entry
  • state_abbrev matches the most recent congress entry
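
For a single sampled legislator, the deep check can be sketched as follows. This is an assumption-laden illustration, not the module's code: `deep_check` is a hypothetical name and the sample records are made up.

```python
def deep_check(source_rows: list[dict], out: dict) -> list[str]:
    """Return the names of output fields that disagree with the source rows."""
    rows = sorted(
        (r for r in source_rows if r["bioguide_id"] == out["bioguide_id"]),
        key=lambda r: r["congress"],
    )
    latest = rows[-1]  # most recent congress entry
    problems = []
    if sorted(out["congresses_served"]) != [r["congress"] for r in rows]:
        problems.append("congresses_served")
    for field in ("bioname", "state_abbrev"):
        if out[field] != latest[field]:
            problems.append(field)
    return problems


# Made-up legislator whose name format changed between congresses
source = [
    {"bioguide_id": "A", "congress": 100, "bioname": "DOE, J.", "state_abbrev": "OH"},
    {"bioguide_id": "A", "congress": 102, "bioname": "DOE, Jane", "state_abbrev": "OH"},
]
stale = {"bioguide_id": "A", "congresses_served": [100, 102],
         "bioname": "DOE, J.", "state_abbrev": "OH"}
problems = deep_check(source, stale)
```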

Technical Details

Compression

  • Algorithm: ZSTD (Zstandard)
  • Typical output size: ~53 KB for ~2,300 legislators

Query Engine

  • Uses DuckDB for efficient remote Parquet reading
  • Direct URL access to HuggingFace dataset

Module Structure

scripts/distinct_legislators/
├── __init__.py      # Public API exports
├── __main__.py      # Module entry point
├── cli.py           # Command-line interface
├── extractor.py     # Core extraction logic
├── exceptions.py    # Exception hierarchy
├── schema.py        # Schema and aggregation SQL
├── validators.py    # Three-tier validation
└── README.md        # This file

Error Handling

All errors inherit from DistinctLegislatorsError with detailed context:

  • SourceReadError: Source URL and error details
  • CompletenessError: Expected/actual counts, missing/extra IDs
  • AggregationError: bioguide_id, field name, expected/actual values
  • SampleValidationError: Same as above plus sample index
  • OutputWriteError: Output path and error details
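
A hierarchy like this lets callers catch one base class while still getting field-level context. The sketch below shows the pattern for two of the classes; the constructor signatures, attribute names, and the choice to derive SampleValidationError from AggregationError are assumptions made for illustration, not the definitions in exceptions.py.

```python
class DistinctLegislatorsError(Exception):
    """Base class for all extractor errors."""


class AggregationError(DistinctLegislatorsError):
    # Assumed signature: carries the context listed above as attributes
    def __init__(self, bioguide_id, field, expected, actual):
        self.bioguide_id = bioguide_id
        self.field = field
        self.expected = expected
        self.actual = actual
        super().__init__(
            f"{field} mismatch for {bioguide_id}: "
            f"expected {expected!r}, got {actual!r}"
        )


class SampleValidationError(AggregationError):
    # Assumed: same context as AggregationError plus the sample index
    def __init__(self, bioguide_id, field, expected, actual, sample_index):
        super().__init__(bioguide_id, field, expected, actual)
        self.sample_index = sample_index


# Catching the base class handles any error in the hierarchy
try:
    raise SampleValidationError("X000001", "bioname", "DOE, Jane", "DOE, J.",
                                sample_index=3)
except DistinctLegislatorsError as exc:
    caught = exc
```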

HuggingFace Upload

After extraction, upload validated files to the processed data repository:

# Target location (processed data, not raw)
https://huggingface.co/datasets/Dustinhax/paper-trail-data

# Using HuggingFace CLI
hf upload Dustinhax/paper-trail-data distinct_legislators.parquet distinct_legislators.parquet --repo-type dataset

Note: Raw source data lives in Dustinhax/tyt. Processed/transformed data goes to Dustinhax/paper-trail-data.

Citation

When using this data, cite the original Voteview source:

Lewis, Jeffrey B., Keith Poole, Howard Rosenthal, Adam Boche,
Aaron Rudkin, and Luke Sonnet. 2024. Voteview: Congressional
Roll-Call Votes Database. https://voteview.com/