# Distinct Legislators Extractor
Extracts a deduplicated list of legislators from Voteview's `HSall_members` data, producing one row per legislator with the congresses they served in. By default only the 96th Congress (1979) onward is included.
## Source Data
- **Source:** Voteview HSall_members.parquet
- **URL:** HuggingFace dataset `Dustinhax/tyt`
- **Original structure:** One row per legislator per congress session
- **Output structure:** One row per legislator (aggregated)
### Filtering
- **Congress filter:** >= 96 (1979-1980 onward)
- **Null filter:** `bioguide_id IS NOT NULL` (excludes ~17 records without a `bioguide_id`)
  - These are typically Presidents or historical members that have no bioguide ID
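
The two filters amount to a simple row predicate. The following is a minimal pure-Python sketch, assuming each source row is a dict with `congress` and `bioguide_id` keys. The name `keep_row` is hypothetical and the sample rows are made up; the module itself presumably applies these filters in DuckDB SQL.

```python
from typing import Any, Mapping

DEFAULT_MIN_CONGRESS = 96  # 96th Congress, 1979-1980


def keep_row(row: Mapping[str, Any], min_congress: int = DEFAULT_MIN_CONGRESS) -> bool:
    """Return True if a source row passes both the congress and the null filter."""
    has_bioguide = row.get("bioguide_id") is not None
    congress = row.get("congress")
    recent_enough = congress is not None and congress >= min_congress
    return has_bioguide and recent_enough


# Illustrative rows only; the IDs are invented.
rows = [
    {"bioguide_id": "X000001", "congress": 95},   # dropped: before the 96th Congress
    {"bioguide_id": None, "congress": 100},       # dropped: no bioguide_id
    {"bioguide_id": "X000002", "congress": 96},   # kept
]
kept = [r for r in rows if keep_row(r)]
```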
## Installation
Requires Python 3.13+ and the following dependencies:
```bash
uv pip install duckdb pyarrow
```
## Usage
### Command Line
```bash
# Basic usage
python -m distinct_legislators legislators.parquet
# Custom congress filter
python -m distinct_legislators legislators.parquet --min-congress 100
# Skip validation (not recommended)
python -m distinct_legislators legislators.parquet --no-validate
# Custom sample sizes for validation
python -m distinct_legislators legislators.parquet --aggregation-sample 200 --deep-sample 100
# Reproducible validation with fixed random seed
python -m distinct_legislators legislators.parquet --seed 42
```
### Python API
```python
from distinct_legislators import extract_distinct_legislators
# Extract with default settings
result = extract_distinct_legislators("legislators.parquet")
print(f"Extracted {result.output_count:,} legislators")
print(f"Validation passed: {result.validation.all_valid}")
# Custom options
result = extract_distinct_legislators(
    "legislators.parquet",
    min_congress=100,  # 100th Congress (1987) onward
    aggregation_sample_size=200,
    deep_sample_size=100,
)
```
## Output Schema
| Column | Type | Description |
|--------|------|-------------|
| bioguide_id | VARCHAR | Primary key (unique per legislator) |
| bioname | VARCHAR | Full name from Voteview (most recent) |
| state_abbrev | VARCHAR | Most recent state represented |
| party_code | DOUBLE | Most recent party (100=Dem, 200=Rep) |
| congresses_served | INT16[] | Array of all congress numbers served |
| first_congress | INT16 | Earliest congress (MIN) for range queries |
| last_congress | INT16 | Latest congress (MAX) for range queries |
| nominate_dim1 | DOUBLE | Economic liberalism/conservatism score |
| nominate_dim2 | DOUBLE | Social issues score |
### Congress Numbers to Years
Congress N covers years `(1787 + 2*N)` to `(1788 + 2*N)`:
- Congress 96 = 1979-1980
- Congress 100 = 1987-1988
- Congress 119 = 2025-2026
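
The formula above is easy to wrap in a pair of helpers. This is a small sketch; the function names are hypothetical and not part of the module's API. It treats each congress as two whole calendar years, as the formula does.

```python
def congress_to_years(congress: int) -> tuple[int, int]:
    """Return the two calendar years covered by a congress number."""
    if congress < 1:
        raise ValueError("congress numbers start at 1")
    return 1787 + 2 * congress, 1788 + 2 * congress


def year_to_congress(year: int) -> int:
    """Inverse mapping: the congress whose two-year span contains `year`."""
    if year < 1789:
        raise ValueError("the 1st Congress covers 1789-1790")
    return (year - 1787) // 2
```

For example, `congress_to_years(96)` gives `(1979, 1980)`, and both `year_to_congress(1979)` and `year_to_congress(1980)` give `96`.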
## Data Interpretation Decisions
### Aggregation Rules
| Field | Aggregation | Rationale |
|-------|-------------|-----------|
| bioguide_id | GROUP BY | Primary key, unique per legislator |
| bioname | LAST by congress | Name format may change; use most recent |
| state_abbrev | LAST by congress | Legislators may change states (rare) |
| party_code | LAST by congress | Party affiliation may change over career |
| congresses_served | LIST ordered | Complete history of all sessions served |
| first_congress | MIN | Earliest congress served (career start) |
| last_congress | MAX | Latest congress served (current or end) |
| nominate_dim1 | LAST by congress | Ideology score from most recent session |
| nominate_dim2 | LAST by congress | Ideology score from most recent session |
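
To make the rules in the table concrete, here is a pure-Python sketch of the same aggregation. It is an illustration only: the module's `schema.py` is described as holding aggregation SQL, so the real logic runs in DuckDB. The function name `aggregate` and the sample rows are invented.

```python
from collections import defaultdict
from typing import Any

# Fields taken from the row with the highest congress number ("LAST by congress").
LAST_FIELDS = ("bioname", "state_abbrev", "party_code", "nominate_dim1", "nominate_dim2")


def aggregate(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collapse per-congress rows into one record per bioguide_id."""
    by_id: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for row in rows:
        by_id[row["bioguide_id"]].append(row)

    output = []
    for bioguide_id, group in sorted(by_id.items()):
        group.sort(key=lambda r: r["congress"])
        latest = group[-1]
        congresses = [r["congress"] for r in group]  # ordered list, one entry per source row
        record: dict[str, Any] = {"bioguide_id": bioguide_id}
        record.update({field: latest[field] for field in LAST_FIELDS})
        record["congresses_served"] = congresses
        record["first_congress"] = congresses[0]   # MIN
        record["last_congress"] = congresses[-1]   # MAX
        output.append(record)
    return output


# Invented example: one legislator, two non-consecutive congresses, a party change.
rows = [
    {"bioguide_id": "X000001", "congress": 101, "bioname": "DOE, Jane", "state_abbrev": "PA",
     "party_code": 100.0, "nominate_dim1": -0.2, "nominate_dim2": 0.1},
    {"bioguide_id": "X000001", "congress": 99, "bioname": "DOE, Jane A.", "state_abbrev": "PA",
     "party_code": 200.0, "nominate_dim1": 0.1, "nominate_dim2": 0.0},
]
legislators = aggregate(rows)
```

The single output record takes its name, state, party and scores from the 101st Congress row, and its `congresses_served` is `[99, 101]`.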
### Party Codes
| Code | Party |
|------|-------|
| 100 | Democrat |
| 200 | Republican |
| 328 | Independent |
| Other | Historical parties (Whig, Federalist, etc.) |
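
Because `party_code` is stored as a DOUBLE, a lookup needs to cast it to an integer first. This is a small sketch of such a lookup using only the codes listed in the table. The helper name `party_name` and the `"Unknown"` and `"Other (...)"` labels are assumptions made for illustration.

```python
import math
from typing import Optional

PARTY_NAMES = {100: "Democrat", 200: "Republican", 328: "Independent"}


def party_name(party_code: Optional[float]) -> str:
    """Map a Voteview party_code (a float in the output file) to a readable label."""
    if party_code is None or math.isnan(party_code):
        return "Unknown"
    code = int(party_code)
    return PARTY_NAMES.get(code, f"Other ({code})")
```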
### NOMINATE Scores
- **dim1:** Economic liberalism/conservatism (-1 to +1, negative=liberal)
- **dim2:** Social issues/civil rights (-1 to +1, interpretation varies by era)
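- Voteview's DW-NOMINATE scores (`nominate_dim1`, `nominate_dim2`) are estimated per member and normally do not change between congresses; where a legislator's rows differ, we keep the value from the most recent session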
### Known Edge Cases
1. **Party switchers:** Uses most recent party (e.g., Arlen Specter shows Democrat)
2. **State changers:** Uses most recent state (rare, but possible)
3. **Gaps in service:** `congresses_served` array handles non-consecutive terms
4. **Presidents:** Excluded (no bioguide_id in Voteview data)
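
Gaps in service matter for range queries: `first_congress` and `last_congress` bound a career but also span any gap, so an exact membership test has to consult `congresses_served`. The following sketch illustrates this; the helper names are hypothetical.

```python
def service_spans(congresses_served: list[int]) -> list[tuple[int, int]]:
    """Group congress numbers into inclusive (start, end) runs of consecutive service."""
    spans: list[tuple[int, int]] = []
    for congress in sorted(set(congresses_served)):
        if spans and congress == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], congress)  # extend the current run
        else:
            spans.append((congress, congress))    # start a new run after a gap
    return spans


def served_in(congresses_served: list[int], congress: int) -> bool:
    """Exact membership test; a first/last range check alone would ignore gaps."""
    return congress in congresses_served
```

For a legislator with `congresses_served = [96, 97, 99, 100]`, `service_spans` returns `[(96, 97), (99, 100)]` and `served_in(..., 98)` is `False`, even though 98 lies between `first_congress` and `last_congress`.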
## Validation
Three tiers of validation check the aggregation. Tier 1 is exhaustive; Tiers 2 and 3 check random samples:
### Tier 1: Completeness
Verifies every source bioguide_id appears exactly once in output:
- Output count matches distinct source count
- No missing bioguide_ids
- No extra bioguide_ids
- No duplicates in output
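
The four checks reduce to set and count comparisons. This is a minimal sketch, assuming the source and output `bioguide_id` values have already been loaded into lists. The function name and the shape of the returned report are assumptions, not the API of the module's `validators.py`.

```python
from collections import Counter


def check_completeness(source_ids: list[str], output_ids: list[str]) -> dict[str, object]:
    """Compare source bioguide_ids (with repeats) against output ids (one per legislator)."""
    expected = set(source_ids)
    actual = set(output_ids)
    duplicates = sorted(i for i, n in Counter(output_ids).items() if n > 1)
    missing = sorted(expected - actual)
    extra = sorted(actual - expected)
    return {
        "expected_count": len(expected),
        "actual_count": len(output_ids),
        "missing": missing,
        "extra": extra,
        "duplicates": duplicates,
        "valid": len(expected) == len(output_ids) and not (missing or extra or duplicates),
    }
```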
### Tier 2: Aggregation Integrity
Randomly samples legislators and verifies:
- `first_congress` = MIN(congress) from source
- `last_congress` = MAX(congress) from source
- `congresses_served` array length matches source row count
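
A seeded sample makes these checks reproducible, which is what the `--seed` option is for. The sketch below shows one way the three comparisons could be done in plain Python. The function name, the default sample size and the tuple format of the mismatches are assumptions, not the module's actual implementation.

```python
import random
from typing import Any, Optional


def check_aggregation(
    source_rows: list[dict[str, Any]],
    output_rows: list[dict[str, Any]],
    sample_size: int = 100,  # arbitrary default for this sketch
    seed: Optional[int] = None,
) -> list[tuple[str, str, Any, Any]]:
    """Return (bioguide_id, field, expected, actual) for every mismatch in the sample."""
    congresses_by_id: dict[str, list[int]] = {}
    for row in source_rows:
        congresses_by_id.setdefault(row["bioguide_id"], []).append(row["congress"])

    rng = random.Random(seed)  # same seed, same sample
    sample = rng.sample(output_rows, min(sample_size, len(output_rows)))

    mismatches = []
    for record in sample:
        source = congresses_by_id.get(record["bioguide_id"], [])
        checks = {
            "first_congress": (min(source, default=None), record["first_congress"]),
            "last_congress": (max(source, default=None), record["last_congress"]),
            "congresses_served": (len(source), len(record["congresses_served"])),
        }
        for field, (expected, actual) in checks.items():
            if expected != actual:
                mismatches.append((record["bioguide_id"], field, expected, actual))
    return mismatches
```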
### Tier 3: Sample Verification
Deep validation of a random sample of legislators:
- `congresses_served` array contains exactly the right congress numbers
- `bioname` matches the most recent congress entry
- `state_abbrev` matches the most recent congress entry
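
Unlike Tier 2, this tier compares actual values rather than counts and bounds. This is a rough sketch of the three comparisons, keeping the sample index alongside each mismatch since `SampleValidationError` is described as carrying one. The function name, default sample size and return format are assumptions.

```python
import random
from typing import Any, Optional


def deep_check(
    source_rows: list[dict[str, Any]],
    output_rows: list[dict[str, Any]],
    sample_size: int = 50,  # arbitrary default for this sketch
    seed: Optional[int] = None,
) -> list[tuple[int, str, str, Any, Any]]:
    """Return (sample_index, bioguide_id, field, expected, actual) for each mismatch."""
    rows_by_id: dict[str, list[dict[str, Any]]] = {}
    for row in source_rows:
        rows_by_id.setdefault(row["bioguide_id"], []).append(row)

    rng = random.Random(seed)
    sample = rng.sample(output_rows, min(sample_size, len(output_rows)))

    mismatches = []
    for index, record in enumerate(sample):
        bioguide_id = record["bioguide_id"]
        source = sorted(rows_by_id.get(bioguide_id, []), key=lambda r: r["congress"])
        if not source:
            mismatches.append((index, bioguide_id, "bioguide_id", None, bioguide_id))
            continue
        latest = source[-1]  # most recent congress entry
        expected = {
            "congresses_served": [r["congress"] for r in source],
            "bioname": latest["bioname"],
            "state_abbrev": latest["state_abbrev"],
        }
        for field, want in expected.items():
            if record[field] != want:
                mismatches.append((index, bioguide_id, field, want, record[field]))
    return mismatches
```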
## Technical Details
### Compression
- Algorithm: ZSTD (Zstandard)
- Typical output size: ~53 KB for ~2,300 legislators
### Query Engine
- Uses DuckDB for efficient remote Parquet reading
- Direct URL access to HuggingFace dataset
## Module Structure
```
scripts/distinct_legislators/
├── __init__.py      # Public API exports
├── __main__.py      # Module entry point
├── cli.py           # Command-line interface
├── extractor.py     # Core extraction logic
├── exceptions.py    # Exception hierarchy
├── schema.py        # Schema and aggregation SQL
├── validators.py    # Three-tier validation
└── README.md        # This file
```
## Error Handling
All errors inherit from `DistinctLegislatorsError` with detailed context:
- `SourceReadError`: Source URL and error details
- `CompletenessError`: Expected/actual counts, missing/extra IDs
- `AggregationError`: bioguide_id, field name, expected/actual values
- `SampleValidationError`: Same as above plus sample index
- `OutputWriteError`: Output path and error details
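
A shared base class lets callers catch every failure with one `except` clause while still reading the per-error context. The sketch below reuses the class names listed above. The constructor signature and the `context` attribute are assumptions; the real `exceptions.py` may expose its context differently.

```python
class DistinctLegislatorsError(Exception):
    """Base class; stores free-form context next to the message (assumed design)."""

    def __init__(self, message: str, **context: object) -> None:
        super().__init__(message)
        self.context = context


class SourceReadError(DistinctLegislatorsError):
    """Reading the source Parquet failed."""


class CompletenessError(DistinctLegislatorsError):
    """Tier 1 validation failed."""


class AggregationError(DistinctLegislatorsError):
    """Tier 2 validation failed."""


class SampleValidationError(DistinctLegislatorsError):
    """Tier 3 validation failed."""


class OutputWriteError(DistinctLegislatorsError):
    """Writing the output Parquet failed."""


# Catching the base class handles any of the specific errors.
try:
    raise AggregationError(
        "first_congress mismatch",
        bioguide_id="X000001",  # invented ID
        field="first_congress",
        expected=96,
        actual=97,
    )
except DistinctLegislatorsError as exc:
    caught = exc
```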
## HuggingFace Upload
After extraction, upload validated files to the processed data repository:
```bash
# Target location (processed data, not raw):
#   https://huggingface.co/datasets/Dustinhax/paper-trail-data
# Using HuggingFace CLI
hf upload Dustinhax/paper-trail-data distinct_legislators.parquet distinct_legislators.parquet --repo-type dataset
```
Note: Raw source data lives in `Dustinhax/tyt`. Processed/transformed data goes to `Dustinhax/paper-trail-data`.
## Citation
When using this data, cite the original Voteview source:
```
Lewis, Jeffrey B., Keith Poole, Howard Rosenthal, Adam Boche,
Aaron Rudkin, and Luke Sonnet. 2024. Voteview: Congressional
Roll-Call Votes Database. https://voteview.com/
```