| # Gold Tables Consolidation | |
| ## Overview | |
| The gold data directory has been consolidated from **86 files to 21 files** (75% reduction) to simplify HuggingFace deployment and make the codebase easier to manage. | |
| ## Changes Made | |
| ### Before (86 files) | |
| ``` | |
| data/gold/ | |
| ├── national/ | |
| │   ├── bills_map_aggregates.parquet | |
| │   ├── events.parquet | |
| │   ├── nonprofits_financials.parquet | |
| │   ├── nonprofits_locations.parquet | |
| │   ├── nonprofits_organizations.parquet | |
| │   └── nonprofits_programs.parquet | |
| ├── reference/ | |
| │   ├── causes_everyorg_causes.parquet | |
| │   ├── causes_ntee_codes.parquet | |
| │   ├── domains_gsa_domains.parquet | |
| │   ├── jurisdictions_cities.parquet | |
| │   ├── jurisdictions_counties.parquet | |
| │   ├── jurisdictions_school_districts.parquet | |
| │   ├── jurisdictions_townships.parquet | |
| │   └── zip_county_mapping.parquet | |
| └── states/ | |
|     ├── AL/ (16 files) | |
|     ├── GA/ (16 files) | |
|     ├── IN/ (partial) | |
|     ├── MA/ (17 files) | |
|     ├── WA/ (16 files) | |
|     └── WI/ (6 files) | |
| ``` | |
| ### After (21 files) | |
| ``` | |
| data/gold/ | |
| ├── bills_bill_actions.parquet (52 MB) | |
| ├── bills_bill_sponsorships.parquet (39 MB) | |
| ├── bills_bills.parquet (15 MB) | |
| ├── bills_map_aggregates.parquet (142 KB) | |
| ├── causes_everyorg_causes.parquet (11 KB) | |
| ├── causes_ntee_codes.parquet (11 KB) | |
| ├── contacts_local_officials.parquet (15 KB) | |
| ├── contacts_officials.parquet (461 KB) | |
| ├── domains_gsa_domains.parquet (596 KB) | |
| ├── event_documents.parquet (366 MB) | |
| ├── event_participants.parquet (808 KB) | |
| ├── events.parquet (1.8 MB) | |
| ├── jurisdictions_cities.parquet (2.0 MB) | |
| ├── jurisdictions_counties.parquet (244 KB) | |
| ├── jurisdictions_school_districts.parquet (926 KB) | |
| ├── jurisdictions_townships.parquet (2.4 MB) | |
| ├── nonprofits_financials.parquet (77 MB) | |
| ├── nonprofits_locations.parquet (86 MB) | |
| ├── nonprofits_organizations.parquet (134 MB) | |
| ├── nonprofits_programs.parquet (65 MB) | |
| └── zip_county_mapping.parquet (323 KB) | |
| ``` | |
| ## Key Changes | |
| ### 1. State Data Consolidation | |
| **Before:** | |
| - Separate files per state: `data/gold/states/AL/bills_bills.parquet`, `data/gold/states/GA/bills_bills.parquet`, etc. | |
| - Difficult to query across states | |
| - Many small duplicate files | |
| **After:** | |
| - Single consolidated file: `data/gold/bills_bills.parquet` | |
| - Contains `state` column for filtering | |
| - Easy to query across all states | |
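The consolidation step described above can be sketched roughly as follows. This is a minimal illustration and not the contents of `scripts/data/rebuild_consolidated_gold.py`. The helper names `combine_state_frames` and `consolidate_table` are hypothetical, and the sketch assumes the state code is taken from the directory name under `states/`.

```python
from pathlib import Path

import pandas as pd


def combine_state_frames(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-state frames into one table with a `state` column."""
    tagged = []
    for state, df in sorted(frames.items()):
        # assign() returns a copy, so the caller's frames are left untouched.
        tagged.append(df.assign(state=state))
    if not tagged:
        return pd.DataFrame(columns=["state"])
    return pd.concat(tagged, ignore_index=True)


def consolidate_table(states_dir: Path, table: str) -> pd.DataFrame:
    """Read `<states_dir>/<ST>/<table>.parquet` for every state that has it."""
    # Assumption: the two-letter directory name is the state code.
    frames = {
        path.parent.name: pd.read_parquet(path)
        for path in states_dir.glob(f"*/{table}.parquet")
    }
    return combine_state_frames(frames)
```

Under these assumptions, `consolidate_table(Path("data/gold/states"), "bills_bills")` would return a single frame that could then be written with `to_parquet(..., index=False)`.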
| ### 2. API Code Updates | |
| **Old pattern:** | |
| ```python | |
| for st in states: | |
|     parquet_path = Path(f"data/gold/states/{st}/bills_bills.parquet") | |
|     df = pd.read_parquet(parquet_path) | |
|     # process... | |
| ``` | |
| **New pattern:** | |
| ```python | |
| parquet_path = Path("data/gold/bills_bills.parquet") | |
| df = pd.read_parquet(parquet_path) | |
| if state: | |
|     df = df[df['state'] == state] | |
| ``` | |
| **Files updated:** | |
| - `api/main.py` - Updated opportunities endpoint to use consolidated bills | |
| - `api/routes/stats.py` - Updated stats endpoints for nonprofits, events, contacts | |
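One way to keep the new pattern in a single place is a small loader shared by the endpoints. The sketch below is illustrative only. `GOLD_DIR`, `filter_by_state`, and `load_gold_table` are hypothetical names, not functions from `api/main.py` or `api/routes/stats.py`, and it assumes state codes are stored in upper case as in the examples in this document.

```python
from pathlib import Path
from typing import Optional

import pandas as pd

GOLD_DIR = Path("data/gold")  # assumed location, matching the paths above


def filter_by_state(df: pd.DataFrame, state: Optional[str]) -> pd.DataFrame:
    """Return rows for `state`; pass the frame through unchanged when no
    state is given or the table has no `state` column (reference tables)."""
    if not state or "state" not in df.columns:
        return df
    # Assumption: codes are stored upper-case ("MA", "AL", ...).
    return df[df["state"] == state.upper()]


def load_gold_table(table: str, state: Optional[str] = None) -> pd.DataFrame:
    """Load `data/gold/<table>.parquet`, optionally narrowed to one state."""
    return filter_by_state(pd.read_parquet(GOLD_DIR / f"{table}.parquet"), state)
```

An endpoint could then call `load_gold_table("bills_bills", state="MA")` instead of building paths itself.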
| ### 3. File Size Compliance | |
| All files are under HuggingFace's 500MB recommended limit: | |
| - Largest file: `event_documents.parquet` at 366 MB | |
| - Total data size: ~840 MB | |
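The size check can be repeated whenever the gold tables are rebuilt. The following is a minimal sketch using only the standard library. The function name `oversized_files` is hypothetical, and the limit is simply the 500 MB figure quoted above, interpreted here as 500 × 1024 × 1024 bytes.

```python
from pathlib import Path

LIMIT_BYTES = 500 * 1024 * 1024  # the 500 MB figure quoted above


def oversized_files(gold_dir: Path, limit: int = LIMIT_BYTES) -> list[tuple[str, int]]:
    """Return (name, size_in_bytes) for each parquet file larger than `limit`."""
    return sorted(
        (path.name, path.stat().st_size)
        for path in gold_dir.glob("*.parquet")
        if path.stat().st_size > limit
    )
```

Calling `oversized_files(Path("data/gold"))` should return an empty list for the file sizes listed in this document.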
| ## Benefits | |
| 1. **Simpler deployment** - Fewer files to upload to HuggingFace | |
| 2. **Better queries** - Can query across all states in a single operation | |
| 3. **Easier maintenance** - One file per table type instead of 5+ copies | |
| 4. **Cleaner codebase** - Less path juggling in API code | |
| 5. **Faster reads** - Read once instead of multiple times for multi-state queries | |
| ## Scripts | |
| ### Consolidation Script | |
| ```bash | |
| # Consolidate state-partitioned files (already done) | |
| python scripts/data/rebuild_consolidated_gold.py | |
| # Dry run to preview | |
| python scripts/data/rebuild_consolidated_gold.py --dry-run | |
| ``` | |
| ### Upload to HuggingFace | |
| ```bash | |
| # Upload all consolidated files | |
| python scripts/huggingface/upload_consolidated_gold.py | |
| # Upload specific file | |
| python scripts/huggingface/upload_consolidated_gold.py --file bills_bills.parquet | |
| # Test with row limit | |
| python scripts/huggingface/upload_consolidated_gold.py --max-rows 1000 | |
| # Skip large files | |
| python scripts/huggingface/upload_consolidated_gold.py --skip-large | |
| ``` | |
| ## Querying Consolidated Data | |
| ### Python | |
| ```python | |
| import pandas as pd | |
| # Load consolidated bills data | |
| df = pd.read_parquet('data/gold/bills_bills.parquet') | |
| # Filter by state | |
| ma_bills = df[df['state'] == 'MA'] | |
| # Query across multiple states | |
| southern_bills = df[df['state'].isin(['AL', 'GA'])] | |
| ``` | |
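The per-state aggregate shown in the DuckDB section can also be computed in pandas. This is a small sketch that relies only on the `state` column. The function name `bills_per_state` is hypothetical.

```python
import pandas as pd


def bills_per_state(df: pd.DataFrame) -> pd.Series:
    """Count rows per state, mirroring `GROUP BY state` with `COUNT(*)`."""
    return df.groupby("state").size().rename("bill_count")
```

For example, `bills_per_state(pd.read_parquet("data/gold/bills_bills.parquet"))` returns a Series indexed by state code.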
| ### DuckDB | |
| ```sql | |
| -- Query all bills | |
| SELECT * FROM read_parquet('data/gold/bills_bills.parquet'); | |
| -- Filter by state | |
| SELECT * FROM read_parquet('data/gold/bills_bills.parquet') | |
| WHERE state = 'MA'; | |
| -- Aggregate across states | |
| SELECT state, COUNT(*) as bill_count | |
| FROM read_parquet('data/gold/bills_bills.parquet') | |
| GROUP BY state; | |
| ``` | |
| ## Backup | |
| The original state-partitioned structure is backed up in `data/gold_old/` (not committed to git). | |
| To restore if needed: | |
| ```bash | |
| mv data/gold data/gold_consolidated | |
| mv data/gold_old data/gold | |
| ``` | |
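Before relying on the consolidated files alone, it may be worth checking that no rows were lost relative to the backup. The sketch below is one possible check and is not part of the existing scripts. `backup_row_counts` and `count_mismatches` are hypothetical helpers, and the layout `data/gold_old/states/<ST>/<table>.parquet` is an assumption about how the backup is organized.

```python
from pathlib import Path

import pandas as pd


def backup_row_counts(old_states_dir: Path, table: str) -> dict[str, int]:
    """Row count per state, read from parquet metadata without loading data."""
    # Imported lazily so the comparison helper below works without pyarrow.
    import pyarrow.parquet as pq

    return {
        path.parent.name: pq.read_metadata(path).num_rows
        for path in old_states_dir.glob(f"*/{table}.parquet")
    }


def count_mismatches(
    expected: dict[str, int], consolidated: pd.DataFrame
) -> dict[str, tuple[int, int]]:
    """Map state -> (expected, actual) for every state whose counts differ."""
    actual = consolidated["state"].value_counts().to_dict()
    return {
        state: (expected.get(state, 0), int(actual.get(state, 0)))
        for state in sorted(set(expected) | set(actual))
        if expected.get(state, 0) != actual.get(state, 0)
    }
```

An empty result from `count_mismatches` means every state has the same number of rows in the backup and in the consolidated file.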
| ## Migration Notes | |
| - ✅ All files include `state` column where applicable | |
| - ✅ National and reference tables copied as-is | |
| - ✅ API code updated to use consolidated files | |
| - ⚠️ Example scripts in `examples/` and `scripts/enrichment/` still reference old paths (low priority - for local dev only) | |
| - ⚠️ Documentation files still show old paths (needs update) | |
| ## Next Steps | |
| 1. ✅ Test API endpoints with consolidated data | |
| 2. ⏳ Upload consolidated files to HuggingFace | |
| 3. ⏳ Update documentation to reflect new structure | |
| 4. ⏳ Update example scripts to use consolidated files | |
| 5. ⏳ Deploy to production and verify | |