	# Distinct Legislators Extractor

	Extracts a deduplicated list of legislators from Voteview's `HSall_members` data and aggregates the congresses each legislator served in. By default, only the 96th Congress (1979) onward is included.

	## Source Data

	- Source: Voteview HSall_members.parquet
	- URL: HuggingFace dataset `Dustinhax/tyt`
	- Original structure: One row per legislator per congress session
	- Output structure: One row per legislator (aggregated)

	### Filtering

	- Congress filter: >= 96 (1979-1980 onward)
	- Null filter: `bioguide_id IS NOT NULL` (excludes ~17 records, typically Presidents or historical members without bioguide IDs)

	## Installation

	Requires Python 3.13+ and the following dependencies:

	```bash
	uv pip install duckdb pyarrow
	```

	## Usage

	### Command Line

	```bash
	# Basic usage
	python -m distinct_legislators legislators.parquet

	# Custom congress filter
	python -m distinct_legislators legislators.parquet --min-congress 100

	# Skip validation (not recommended)
	python -m distinct_legislators legislators.parquet --no-validate

	# Custom sample sizes for validation
	python -m distinct_legislators legislators.parquet --aggregation-sample 200 --deep-sample 100

	# Reproducible validation with fixed random seed
	python -m distinct_legislators legislators.parquet --seed 42
	```

	### Python API

	```python
	from distinct_legislators import extract_distinct_legislators

	# Extract with default settings
	result = extract_distinct_legislators("legislators.parquet")
	print(f"Extracted {result.output_count:,} legislators")
	print(f"Validation passed: {result.validation.all_valid}")

	# Custom options
	result = extract_distinct_legislators(
	    "legislators.parquet",
	    min_congress=100,  # 100th congress (1987) onward
	    aggregation_sample_size=200,
	    deep_sample_size=100,
	)
	```

	## Output Schema

	| Column | Type | Description |
	|--------|------|-------------|
	| bioguide_id | VARCHAR | Primary key (unique per legislator) |
	| bioname | VARCHAR | Full name from Voteview (most recent) |
	| state_abbrev | VARCHAR | Most recent state represented |
	| party_code | DOUBLE | Most recent party (100=Dem, 200=Rep) |
	| congresses_served | INT16[] | Array of all congress numbers served |
	| first_congress | INT16 | Earliest congress (MIN) for range queries |
	| last_congress | INT16 | Latest congress (MAX) for range queries |
	| nominate_dim1 | DOUBLE | Economic liberalism/conservatism score |
	| nominate_dim2 | DOUBLE | Social issues score |

	### Congress Numbers to Years

	Congress N covers years `(1787 + 2N)` to `(1788 + 2N)`:
	- Congress 96 = 1979-1980
	- Congress 100 = 1987-1988
	- Congress 119 = 2025-2026
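
The formula above can be sketched as a pair of helper functions. This is a minimal illustration, not part of the package: the names `congress_to_years` and `year_to_congress` are hypothetical, and the mapping works at calendar-year granularity exactly as the formula does.

```python
def congress_to_years(congress: int) -> tuple:
    """Return the (first_year, last_year) pair covered by a congress number."""
    if congress < 1:
        raise ValueError(f"congress must be >= 1, got {congress}")
    return 1787 + 2 * congress, 1788 + 2 * congress


def year_to_congress(year: int) -> int:
    """Inverse mapping: the congress whose two-year span contains `year`."""
    if year < 1789:
        raise ValueError(f"year must be >= 1789, got {year}")
    return (year - 1787) // 2


print(congress_to_years(96))   # (1979, 1980)
print(year_to_congress(1987))  # 100
```

The inverse is handy for turning a year range into a `--min-congress` value.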

	## Data Interpretation Decisions

	### Aggregation Rules

	| Field | Aggregation | Rationale |
	|-------|-------------|-----------|
	| bioguide_id | GROUP BY | Primary key, unique per legislator |
	| bioname | LAST by congress | Name format may change; use most recent |
	| state_abbrev | LAST by congress | Legislators may change states (rare) |
	| party_code | LAST by congress | Party affiliation may change over career |
	| congresses_served | LIST ordered | Complete history of all sessions served |
	| first_congress | MIN | Earliest congress served (career start) |
	| last_congress | MAX | Latest congress served (current or end) |
	| nominate_dim1 | LAST by congress | Ideology score from most recent session |
	| nominate_dim2 | LAST by congress | Ideology score from most recent session |
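
To make the rules in the table concrete, here is a pure-Python sketch of the same aggregation. It is only an illustration: the package itself keeps its aggregation SQL in `schema.py`, which is not reproduced here. The function name `aggregate_legislators` and the sample rows are made up.

```python
from collections import defaultdict

# Fields that take their value from the most recent congress ("LAST by congress").
LAST_FIELDS = ("bioname", "state_abbrev", "party_code", "nominate_dim1", "nominate_dim2")


def aggregate_legislators(rows):
    """Collapse one-row-per-congress records into one record per bioguide_id."""
    groups = defaultdict(list)
    for row in rows:
        if row["bioguide_id"] is not None:  # mirrors the null filter
            groups[row["bioguide_id"]].append(row)

    output = {}
    for bioguide_id, group in groups.items():
        ordered = sorted(group, key=lambda r: r["congress"])  # oldest first
        latest = ordered[-1]
        congresses = [r["congress"] for r in ordered]
        record = {"bioguide_id": bioguide_id}
        record.update({field: latest[field] for field in LAST_FIELDS})
        record["congresses_served"] = congresses      # LIST ordered
        record["first_congress"] = min(congresses)    # MIN
        record["last_congress"] = max(congresses)     # MAX
        output[bioguide_id] = record
    return output


# Made-up rows for a fictional party switcher (not real Voteview data).
rows = [
    {"bioguide_id": "X000001", "congress": 111, "bioname": "DOE, Jane",
     "state_abbrev": "PA", "party_code": 100.0, "nominate_dim1": -0.1, "nominate_dim2": 0.2},
    {"bioguide_id": "X000001", "congress": 110, "bioname": "DOE, Jane",
     "state_abbrev": "PA", "party_code": 200.0, "nominate_dim1": 0.1, "nominate_dim2": 0.2},
    {"bioguide_id": None, "congress": 111, "bioname": "NO, Id",
     "state_abbrev": "USA", "party_code": 100.0, "nominate_dim1": None, "nominate_dim2": None},
]
legislators = aggregate_legislators(rows)
print(legislators["X000001"]["party_code"])         # 100.0
print(legislators["X000001"]["congresses_served"])  # [110, 111]
```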

	### Party Codes

	| Code | Party |
	|------|-------|
	| 100 | Democrat |
	| 200 | Republican |
	| 328 | Independent |
	| Other | Historical parties (Whig, Federalist, etc.) |
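
Because `party_code` is stored as DOUBLE, values arrive as floats such as `100.0`. A small lookup like the following may help when labelling output rows. It is a sketch covering only the three codes listed above; `party_label` is a hypothetical helper, not part of the package.

```python
import math
from typing import Optional

# Only the codes documented in the table above.
PARTY_LABELS = {100: "Democrat", 200: "Republican", 328: "Independent"}


def party_label(party_code: Optional[float]) -> str:
    """Map a (possibly float or missing) party code to a readable label."""
    if party_code is None or math.isnan(party_code):
        return "Unknown"
    code = int(party_code)
    return PARTY_LABELS.get(code, f"Other ({code})")


print(party_label(100.0))  # Democrat
print(party_label(999.0))  # Other (999)
```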

	### NOMINATE Scores

	- dim1: Economic liberalism/conservatism (-1 to +1, negative=liberal)
	- dim2: Social issues/civil rights (-1 to +1, interpretation varies by era)
	- Scores are session-specific; we keep the most recent for simplicity

	### Known Edge Cases

	1. Party switchers: Uses most recent party (e.g., Arlen Specter shows Democrat)
	2. State changers: Uses most recent state (rare, but possible)
	3. Gaps in service: `congresses_served` array handles non-consecutive terms
	4. Presidents: Excluded (no bioguide_id in Voteview data)
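
For the gaps-in-service case, note that `first_congress` and `last_congress` only bound a career; the array is what tells you whether someone served in a given congress. The sketch below shows one way to turn the array into contiguous spans (`service_spans` is a hypothetical helper name).

```python
def service_spans(congresses_served):
    """Group congress numbers into contiguous (start, end) spans."""
    spans = []
    for congress in sorted(set(congresses_served)):
        if spans and congress == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], congress)  # extend the current span
        else:
            spans.append((congress, congress))    # a gap starts a new span
    return spans


print(service_spans([97, 98, 101, 102, 103]))  # [(97, 98), (101, 103)]
```

A legislator with more than one span left and later returned, so a range query on `first_congress`/`last_congress` alone would overcount the congresses in between.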

	## Validation

	Three-tier validation ensures correct aggregation:

	### Tier 1: Completeness

	Verifies every source bioguide_id appears exactly once in output:
	- Output count matches distinct source count
	- No missing bioguide_ids
	- No extra bioguide_ids
	- No duplicates in output
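
The four checks above amount to set arithmetic on the two ID collections. A minimal sketch, assuming the IDs have already been loaded into Python lists (`check_completeness` and `is_complete` are illustrative names, not the functions in `validators.py`):

```python
from collections import Counter


def check_completeness(source_ids, output_ids):
    """Compare source bioguide_ids (with repeats) against output IDs."""
    expected = {i for i in source_ids if i is not None}
    counts = Counter(output_ids)
    actual = set(counts)
    return {
        "expected_count": len(expected),
        "actual_count": len(output_ids),
        "missing": sorted(expected - actual),
        "extra": sorted(actual - expected),
        "duplicates": sorted(i for i, n in counts.items() if n > 1),
    }


def is_complete(report):
    """True only when all four Tier 1 conditions hold."""
    return (
        report["expected_count"] == report["actual_count"]
        and not report["missing"]
        and not report["extra"]
        and not report["duplicates"]
    )
```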

	### Tier 2: Aggregation Integrity

	Randomly samples legislators and verifies:
	- `first_congress` = MIN(congress) from source
	- `last_congress` = MAX(congress) from source
	- `congresses_served` array length matches source row count
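
A sketch of how such a sampled check could look, assuming source rows and aggregated output are available as plain Python structures. The function name and argument names are hypothetical; the `seed` argument mirrors the idea behind the `--seed` flag, not its actual implementation.

```python
import random


def check_aggregation(source_rows, output, sample_size, seed=None):
    """Return a list of human-readable problems for a random sample of IDs."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    ids = sorted(output)
    problems = []
    for bioguide_id in rng.sample(ids, min(sample_size, len(ids))):
        congresses = [r["congress"] for r in source_rows
                      if r["bioguide_id"] == bioguide_id]
        if not congresses:
            problems.append(f"{bioguide_id}: no source rows")
            continue
        record = output[bioguide_id]
        expected = {
            "first_congress": min(congresses),
            "last_congress": max(congresses),
            "served_count": len(congresses),
        }
        actual = {
            "first_congress": record["first_congress"],
            "last_congress": record["last_congress"],
            "served_count": len(record["congresses_served"]),
        }
        for field, want in expected.items():
            if actual[field] != want:
                problems.append(
                    f"{bioguide_id}: {field} expected {want}, got {actual[field]}"
                )
    return problems
```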

	### Tier 3: Sample Verification

	Deep validation of random legislators:
	- `congresses_served` array contains exactly the right congress numbers
	- `bioname` matches the most recent congress entry
	- `state_abbrev` matches the most recent congress entry
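
The deep check differs from Tier 2 in that it compares actual values rather than summary statistics. A per-legislator sketch under the same assumptions (plain Python rows; `deep_check` is a hypothetical name):

```python
def deep_check(source_rows, record):
    """Return the names of fields in `record` that disagree with the source."""
    rows = sorted(
        (r for r in source_rows if r["bioguide_id"] == record["bioguide_id"]),
        key=lambda r: r["congress"],
    )
    if not rows:
        return ["bioguide_id"]  # nothing in the source to compare against

    mismatches = []
    if list(record["congresses_served"]) != [r["congress"] for r in rows]:
        mismatches.append("congresses_served")
    latest = rows[-1]  # row from the most recent congress
    for field in ("bioname", "state_abbrev"):
        if record[field] != latest[field]:
            mismatches.append(field)
    return mismatches
```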

	## Technical Details

	### Compression

	- Algorithm: ZSTD (Zstandard)
	- Typical output size: ~53 KB for ~2,300 legislators

	### Query Engine

	- Uses DuckDB for efficient remote Parquet reading
	- Direct URL access to HuggingFace dataset
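
As a rough illustration of remote Parquet reading with DuckDB, the sketch below builds the filtered query and runs it. The helper names are hypothetical and the Parquet URL is left as a parameter, because the exact file path inside the dataset is not given here. Reading over HTTPS relies on DuckDB's `httpfs` extension, which recent versions are expected to load automatically; older versions may need `INSTALL httpfs; LOAD httpfs;` first.

```python
def build_source_query(parquet_url, min_congress=96):
    """Build the filtered SELECT over a remote Parquet file."""
    escaped = parquet_url.replace("'", "''")  # escape single quotes for SQL
    return (
        f"SELECT * FROM read_parquet('{escaped}') "
        f"WHERE congress >= {int(min_congress)} AND bioguide_id IS NOT NULL"
    )


def read_source(parquet_url, min_congress=96):
    """Run the query on an in-memory DuckDB connection and return all rows."""
    import duckdb  # third-party; see the Installation section

    con = duckdb.connect()
    try:
        return con.execute(build_source_query(parquet_url, min_congress)).fetchall()
    finally:
        con.close()
```

Keeping query construction separate from execution makes the filter easy to inspect without network access.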

	## Module Structure

	```
	scripts/distinct_legislators/
	├── __init__.py # Public API exports
	├── __main__.py # Module entry point
	├── cli.py # Command-line interface
	├── extractor.py # Core extraction logic
	├── exceptions.py # Exception hierarchy
	├── schema.py # Schema and aggregation SQL
	├── validators.py # Three-tier validation
	└── README.md # This file
	```

	## Error Handling

	All errors inherit from `DistinctLegislatorsError` with detailed context:

	- `SourceReadError`: Source URL and error details
	- `CompletenessError`: Expected/actual counts, missing/extra IDs
	- `AggregationError`: bioguide_id, field name, expected/actual values
	- `SampleValidationError`: Same as above plus sample index
	- `OutputWriteError`: Output path and error details
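
Since every error shares one base class, callers can catch `DistinctLegislatorsError` once and still inspect the specific subclass. The sketch below re-creates two of the classes purely for illustration; the constructor arguments and attribute names are assumptions, and the real definitions live in `exceptions.py`.

```python
class DistinctLegislatorsError(Exception):
    """Stand-in base class for this sketch."""


class CompletenessError(DistinctLegislatorsError):
    """Tier 1 failure carrying counts and the offending IDs (assumed fields)."""

    def __init__(self, expected_count, actual_count, missing_ids=(), extra_ids=()):
        self.expected_count = expected_count
        self.actual_count = actual_count
        self.missing_ids = list(missing_ids)
        self.extra_ids = list(extra_ids)
        super().__init__(
            f"expected {expected_count} legislators, got {actual_count} "
            f"({len(self.missing_ids)} missing, {len(self.extra_ids)} extra)"
        )


def describe_failure(error):
    """Format any error from the hierarchy for logging."""
    return f"{type(error).__name__}: {error}"


try:
    raise CompletenessError(3, 2, missing_ids=["X000003"])
except DistinctLegislatorsError as error:  # one handler covers all subclasses
    message = describe_failure(error)
```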

	## HuggingFace Upload

	After extraction, upload validated files to the processed data repository:

	```bash
	# Target location (processed data, not raw):
	# https://huggingface.co/datasets/Dustinhax/paper-trail-data

	# Using HuggingFace CLI
	hf upload Dustinhax/paper-trail-data distinct_legislators.parquet distinct_legislators.parquet --repo-type dataset
	```

	Note: Raw source data lives in `Dustinhax/tyt`. Processed/transformed data goes to `Dustinhax/paper-trail-data`.

	## Citation

	When using this data, cite the original Voteview source:

	```
	Lewis, Jeffrey B., Keith Poole, Howard Rosenthal, Adam Boche,
	Aaron Rudkin, and Luke Sonnet. 2024. Voteview: Congressional
	Roll-Call Votes Database. https://voteview.com/
	```