Gold Tables Consolidation
Overview
The gold data directory has been consolidated from 86 files to 21 files (75% reduction) to simplify HuggingFace deployment and make the codebase easier to manage.
Changes Made
Before (86 files)
data/gold/
├── national/
│   ├── bills_map_aggregates.parquet
│   ├── events.parquet
│   ├── nonprofits_financials.parquet
│   ├── nonprofits_locations.parquet
│   ├── nonprofits_organizations.parquet
│   └── nonprofits_programs.parquet
├── reference/
│   ├── causes_everyorg_causes.parquet
│   ├── causes_ntee_codes.parquet
│   ├── domains_gsa_domains.parquet
│   ├── jurisdictions_cities.parquet
│   ├── jurisdictions_counties.parquet
│   ├── jurisdictions_school_districts.parquet
│   ├── jurisdictions_townships.parquet
│   └── zip_county_mapping.parquet
└── states/
    ├── AL/ (16 files)
    ├── GA/ (16 files)
    ├── IN/ (partial)
    ├── MA/ (17 files)
    ├── WA/ (16 files)
    └── WI/ (6 files)
After (21 files)
data/gold/
├── bills_bill_actions.parquet (52 MB)
├── bills_bill_sponsorships.parquet (39 MB)
├── bills_bills.parquet (15 MB)
├── bills_map_aggregates.parquet (142 KB)
├── causes_everyorg_causes.parquet (11 KB)
├── causes_ntee_codes.parquet (11 KB)
├── contacts_local_officials.parquet (15 KB)
├── contacts_officials.parquet (461 KB)
├── domains_gsa_domains.parquet (596 KB)
├── event_documents.parquet (366 MB)
├── event_participants.parquet (808 KB)
├── events.parquet (1.8 MB)
├── jurisdictions_cities.parquet (2.0 MB)
├── jurisdictions_counties.parquet (244 KB)
├── jurisdictions_school_districts.parquet (926 KB)
├── jurisdictions_townships.parquet (2.4 MB)
├── nonprofits_financials.parquet (77 MB)
├── nonprofits_locations.parquet (86 MB)
├── nonprofits_organizations.parquet (134 MB)
├── nonprofits_programs.parquet (65 MB)
└── zip_county_mapping.parquet (323 KB)
Key Changes
1. State Data Consolidation
Before:
- Separate files per state: data/gold/states/AL/bills_bills.parquet, data/gold/states/GA/bills_bills.parquet, etc.
- Difficult to query across states
- Many small duplicate files
After:
- Single consolidated file: data/gold/bills_bills.parquet
- Contains state column for filtering
- Easy to query across all states
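The consolidation described above amounts to tagging each state's rows with its state code and stacking the results. The sketch below is not the actual scripts/data/rebuild_consolidated_gold.py, whose internals are not shown in this document. The helper names combine_state_frames and consolidate_table are hypothetical, and the code assumes pandas with a parquet engine such as pyarrow is installed.

```python
from pathlib import Path

import pandas as pd


def combine_state_frames(frames_by_state: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-state frames into one frame with a 'state' column."""
    tagged = []
    for state, frame in sorted(frames_by_state.items()):
        frame = frame.copy()      # leave the caller's frame untouched
        frame["state"] = state    # add (or overwrite) the state code
        tagged.append(frame)
    if not tagged:
        return pd.DataFrame(columns=["state"])
    return pd.concat(tagged, ignore_index=True)


def consolidate_table(states_dir: Path, table: str, out_path: Path) -> int:
    """Merge states_dir/<STATE>/<table>.parquet into out_path; return the row count."""
    frames = {
        path.parent.name: pd.read_parquet(path)  # directory name is the state code
        for path in states_dir.glob(f"*/{table}.parquet")
    }
    combined = combine_state_frames(frames)
    combined.to_parquet(out_path, index=False)
    return len(combined)
```

Under these assumptions, consolidate_table(Path("data/gold_old/states"), "bills_bills", Path("data/gold/bills_bills.parquet")) would rebuild the bills table from the backed-up layout.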
2. API Code Updates
Old pattern:
for st in states:
    parquet_path = Path(f"data/gold/states/{st}/bills_bills.parquet")
    df = pd.read_parquet(parquet_path)
    # process...
New pattern:
parquet_path = Path("data/gold/bills_bills.parquet")
df = pd.read_parquet(parquet_path)
if state:
    df = df[df['state'] == state]
Files updated:
- api/main.py - Updated opportunities endpoint to use consolidated bills
- api/routes/stats.py - Updated stats endpoints for nonprofits, events, contacts
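Since every endpoint now follows the same read-then-filter pattern, it could be wrapped in one helper. The following is a minimal sketch, not code from api/main.py or api/routes/stats.py: load_gold_table and state_filter are hypothetical names, and it assumes the pyarrow engine is available so that the state filter is applied while the file is read rather than afterwards.

```python
from pathlib import Path
from typing import Optional

import pandas as pd

GOLD_DIR = Path("data/gold")  # consolidated directory described in this document


def state_filter(state: Optional[str]):
    """Return a parquet row filter for one state, or None for all states."""
    if not state:
        return None
    return [("state", "==", state)]


def load_gold_table(table: str, state: Optional[str] = None,
                    gold_dir: Path = GOLD_DIR) -> pd.DataFrame:
    """Read gold_dir/<table>.parquet, optionally restricted to one state."""
    path = gold_dir / f"{table}.parquet"
    if not path.exists():
        raise FileNotFoundError(f"Gold table not found: {path}")
    # With the pyarrow engine the filter is evaluated during the read,
    # so rows for other states never reach the returned DataFrame.
    return pd.read_parquet(path, engine="pyarrow", filters=state_filter(state))
```

A call such as load_gold_table("bills_bills", state="MA") would then replace the explicit path construction and boolean mask. The filter only makes sense for tables that carry a state column.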
3. File Size Compliance
All files are under HuggingFace's 500 MB recommended limit:
- Largest file: event_documents.parquet at 366 MB
- Total data size: ~840 MB
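The size figures above can be re-checked before each upload with a few lines of standard-library Python. This is an illustrative sketch only: oversized_files and total_size are hypothetical helpers, and the limit is assumed to mean 500 decimal megabytes.

```python
from pathlib import Path

LIMIT_BYTES = 500 * 1000 * 1000  # assumption: "500 MB" read as decimal megabytes


def oversized_files(gold_dir: Path, limit_bytes: int = LIMIT_BYTES) -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for parquet files above the limit."""
    sizes = [(p.name, p.stat().st_size) for p in sorted(gold_dir.glob("*.parquet"))]
    return [(name, size) for name, size in sizes if size > limit_bytes]


def total_size(gold_dir: Path) -> int:
    """Sum the sizes of all parquet files directly under gold_dir."""
    return sum(p.stat().st_size for p in gold_dir.glob("*.parquet"))
```

Calling oversized_files(Path("data/gold")) should return an empty list for the layout listed above, and total_size(Path("data/gold")) / 1e6 gives the total in megabytes.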
Benefits
- Simpler deployment - Fewer files to upload to HuggingFace
- Better queries - Can query across all states in a single operation
- Easier maintenance - One file per table type instead of 5+ copies
- Cleaner codebase - Less path juggling in API code
- Faster reads - Read once instead of multiple times for multi-state queries
Scripts
Consolidation Script
# Consolidate state-partitioned files (already done)
python scripts/data/rebuild_consolidated_gold.py
# Dry run to preview
python scripts/data/rebuild_consolidated_gold.py --dry-run
Upload to HuggingFace
# Upload all consolidated files
python scripts/huggingface/upload_consolidated_gold.py
# Upload specific file
python scripts/huggingface/upload_consolidated_gold.py --file bills_bills.parquet
# Test with row limit
python scripts/huggingface/upload_consolidated_gold.py --max-rows 1000
# Skip large files
python scripts/huggingface/upload_consolidated_gold.py --skip-large
Querying Consolidated Data
Python
import pandas as pd
# Load consolidated bills data
df = pd.read_parquet('data/gold/bills_bills.parquet')
# Filter by state
ma_bills = df[df['state'] == 'MA']
# Query across multiple states
southern_bills = df[df['state'].isin(['AL', 'GA'])]
DuckDB
-- Query all bills
SELECT * FROM read_parquet('data/gold/bills_bills.parquet');
-- Filter by state
SELECT * FROM read_parquet('data/gold/bills_bills.parquet')
WHERE state = 'MA';
-- Aggregate across states
SELECT state, COUNT(*) as bill_count
FROM read_parquet('data/gold/bills_bills.parquet')
GROUP BY state;
Backup
The original state-partitioned structure is backed up in data/gold_old/ (not committed to git).
To restore if needed:
mv data/gold data/gold_consolidated
mv data/gold_old data/gold
Migration Notes
- ✅ All files include state column where applicable
- ✅ National and reference tables copied as-is
- ✅ API code updated to use consolidated files
- ⚠️ Example scripts in examples/ and scripts/enrichment/ still reference old paths (low priority - for local dev only)
- ⚠️ Documentation files still show old paths (needs update)
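To locate the remaining references to the old layout, a small scan over the repository may help. The sketch below is a hypothetical helper, not an existing script in this project. It assumes that any line containing one of the old directory prefixes shown earlier (states/, national/, reference/ under data/gold/) needs attention.

```python
from pathlib import Path

# Path fragments that only exist in the old layout shown in this document.
OLD_FRAGMENTS = ("data/gold/states/", "data/gold/national/", "data/gold/reference/")


def find_stale_references(root: Path, suffixes=(".py", ".md")) -> list[tuple[str, int, str]]:
    """Return (relative file, line number, line) for each line mentioning an old path."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if any(fragment in line for fragment in OLD_FRAGMENTS):
                hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits
```

Running find_stale_references(Path("examples")) or find_stale_references(Path("scripts/enrichment")) would list the lines still to migrate.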
Next Steps
- ✅ Test API endpoints with consolidated data
- ⏳ Upload consolidated files to HuggingFace
- ⏳ Update documentation to reflect new structure
- ⏳ Update example scripts to use consolidated files
- ⏳ Deploy to production and verify