# Gold Tables Consolidation
## Overview
The gold data directory has been consolidated from **86 files to 21 files** (a reduction of roughly 76%) to simplify HuggingFace deployment and make the codebase easier to manage.
## Changes Made
### Before (86 files)
```
data/gold/
├── national/
│   ├── bills_map_aggregates.parquet
│   ├── events.parquet
│   ├── nonprofits_financials.parquet
│   ├── nonprofits_locations.parquet
│   ├── nonprofits_organizations.parquet
│   └── nonprofits_programs.parquet
├── reference/
│   ├── causes_everyorg_causes.parquet
│   ├── causes_ntee_codes.parquet
│   ├── domains_gsa_domains.parquet
│   ├── jurisdictions_cities.parquet
│   ├── jurisdictions_counties.parquet
│   ├── jurisdictions_school_districts.parquet
│   ├── jurisdictions_townships.parquet
│   └── zip_county_mapping.parquet
└── states/
    ├── AL/ (16 files)
    ├── GA/ (16 files)
    ├── IN/ (partial)
    ├── MA/ (17 files)
    ├── WA/ (16 files)
    └── WI/ (6 files)
```
### After (21 files)
```
data/gold/
├── bills_bill_actions.parquet (52 MB)
├── bills_bill_sponsorships.parquet (39 MB)
├── bills_bills.parquet (15 MB)
├── bills_map_aggregates.parquet (142 KB)
├── causes_everyorg_causes.parquet (11 KB)
├── causes_ntee_codes.parquet (11 KB)
├── contacts_local_officials.parquet (15 KB)
├── contacts_officials.parquet (461 KB)
├── domains_gsa_domains.parquet (596 KB)
├── event_documents.parquet (366 MB)
├── event_participants.parquet (808 KB)
├── events.parquet (1.8 MB)
├── jurisdictions_cities.parquet (2.0 MB)
├── jurisdictions_counties.parquet (244 KB)
├── jurisdictions_school_districts.parquet (926 KB)
├── jurisdictions_townships.parquet (2.4 MB)
├── nonprofits_financials.parquet (77 MB)
├── nonprofits_locations.parquet (86 MB)
├── nonprofits_organizations.parquet (134 MB)
├── nonprofits_programs.parquet (65 MB)
└── zip_county_mapping.parquet (323 KB)
```
## Key Changes
### 1. State Data Consolidation
**Before:**
- Separate files per state: `data/gold/states/AL/bills_bills.parquet`, `data/gold/states/GA/bills_bills.parquet`, etc.
- Difficult to query across states
- Many small duplicate files
**After:**
- Single consolidated file: `data/gold/bills_bills.parquet`
- Contains `state` column for filtering
- Easy to query across all states
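The merge described above can be sketched roughly as follows. This is an illustration only, not the contents of `scripts/data/rebuild_consolidated_gold.py`; the helper names `combine_state_frames` and `consolidate_table` are hypothetical, and the sketch assumes the state code is taken from each file's parent directory name.

```python
from pathlib import Path

import pandas as pd


def combine_state_frames(frames_by_state: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-state frames into one frame tagged with a `state` column."""
    tagged = []
    for state, frame in sorted(frames_by_state.items()):
        # assign() returns a copy, so the caller's frames are left untouched.
        tagged.append(frame.assign(state=state))
    if not tagged:
        return pd.DataFrame(columns=["state"])
    return pd.concat(tagged, ignore_index=True)


def consolidate_table(states_dir: Path, table: str, out_dir: Path) -> Path:
    """Merge states_dir/<STATE>/<table>.parquet into out_dir/<table>.parquet.

    Hypothetical helper: the state code is assumed to be the directory name.
    """
    frames = {
        path.parent.name: pd.read_parquet(path)
        for path in states_dir.glob(f"*/{table}.parquet")
    }
    out_path = out_dir / f"{table}.parquet"
    combine_state_frames(frames).to_parquet(out_path, index=False)
    return out_path
```

Under these assumptions, `consolidate_table(Path("data/gold_old/states"), "bills_bills", Path("data/gold"))` would rebuild the consolidated bills file from the backup layout.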
### 2. API Code Updates
**Old pattern:**
```python
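from pathlib import Path
import pandas as pd

# `states` is the list of state codes requested by the caller
for st in states:
    parquet_path = Path(f"data/gold/states/{st}/bills_bills.parquet")
    df = pd.read_parquet(parquet_path)
    # process...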
```
**New pattern:**
```python
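from pathlib import Path
import pandas as pd

parquet_path = Path("data/gold/bills_bills.parquet")
df = pd.read_parquet(parquet_path)
# `state` is an optional filter; when it is empty, all states are kept
if state:
    df = df[df['state'] == state]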
```
**Files updated:**
- `api/main.py` - Updated opportunities endpoint to use consolidated bills
- `api/routes/stats.py` - Updated stats endpoints for nonprofits, events, contacts
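If several endpoints repeat the new pattern, it could be factored into a small shared helper. The sketch below is one possible shape, not the code in `api/main.py` or `api/routes/stats.py`; `GOLD_DIR`, `filter_by_state`, and `load_gold_table` are hypothetical names, and the upper-casing step assumes the `state` column stores two-letter codes such as `MA`.

```python
from pathlib import Path
from typing import Optional

import pandas as pd

GOLD_DIR = Path("data/gold")  # consolidated layout described in this document


def filter_by_state(df: pd.DataFrame, state: Optional[str] = None) -> pd.DataFrame:
    """Return rows for one state, or every row when no state is given."""
    if not state:
        return df
    # Assumption: the column holds upper-case codes, so "ma" should match "MA".
    return df[df["state"] == state.upper()]


def load_gold_table(table: str, state: Optional[str] = None) -> pd.DataFrame:
    """Read data/gold/<table>.parquet and apply the optional state filter."""
    return filter_by_state(pd.read_parquet(GOLD_DIR / f"{table}.parquet"), state)
```

An endpoint could then call `load_gold_table("bills_bills", state)` instead of building the path itself.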
### 3. File Size Compliance
All files are under HuggingFace's 500MB recommended limit:
- Largest file: `event_documents.parquet` at 366 MB
- Total data size: ~840 MB
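The size claim can be rechecked before each upload with a few lines of standard-library Python. This is a minimal sketch: `oversized_files` is a hypothetical helper, and the limit is simply the 500 MB figure quoted above, interpreted as decimal megabytes.

```python
from pathlib import Path

LIMIT_BYTES = 500 * 1000 * 1000  # the 500 MB figure quoted above (assumed decimal MB)


def oversized_files(gold_dir: Path, limit_bytes: int = LIMIT_BYTES) -> list[tuple[str, int]]:
    """Return (name, size_in_bytes) for each .parquet file above limit_bytes."""
    sizes = ((p.name, p.stat().st_size) for p in sorted(gold_dir.glob("*.parquet")))
    return [(name, size) for name, size in sizes if size > limit_bytes]
```

Calling `oversized_files(Path("data/gold"))` should return an empty list for the file sizes listed in this document.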
## Benefits
1. **Simpler deployment** - Fewer files to upload to HuggingFace
2. **Better queries** - Can query across all states in single operation
3. **Easier maintenance** - One file per table type instead of 5+ copies
4. **Cleaner codebase** - Less path juggling in API code
5. **Faster reads** - Read once instead of multiple times for multi-state queries
## Scripts
### Consolidation Script
```bash
# Consolidate state-partitioned files (already done)
python scripts/data/rebuild_consolidated_gold.py
# Dry run to preview
python scripts/data/rebuild_consolidated_gold.py --dry-run
```
### Upload to HuggingFace
```bash
# Upload all consolidated files
python scripts/huggingface/upload_consolidated_gold.py
# Upload specific file
python scripts/huggingface/upload_consolidated_gold.py --file bills_bills.parquet
# Test with row limit
python scripts/huggingface/upload_consolidated_gold.py --max-rows 1000
# Skip large files
python scripts/huggingface/upload_consolidated_gold.py --skip-large
```
## Querying Consolidated Data
### Python
```python
import pandas as pd
# Load consolidated bills data
df = pd.read_parquet('data/gold/bills_bills.parquet')
# Filter by state
ma_bills = df[df['state'] == 'MA']
# Query across multiple states
southern_bills = df[df['state'].isin(['AL', 'GA'])]
```
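For a per-state summary without leaving pandas, a grouped count on the `state` column is usually enough. The function name `bills_per_state` is hypothetical; only the `state` column is taken from this document.

```python
import pandas as pd


def bills_per_state(df: pd.DataFrame) -> pd.Series:
    """Count rows per value of the `state` column, largest count first."""
    return df.groupby("state").size().sort_values(ascending=False)
```

Applied to the frame loaded from `data/gold/bills_bills.parquet`, this returns one row count per state code.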
### DuckDB
```sql
-- Query all bills
SELECT * FROM read_parquet('data/gold/bills_bills.parquet');
-- Filter by state
SELECT * FROM read_parquet('data/gold/bills_bills.parquet')
WHERE state = 'MA';
-- Aggregate across states
SELECT state, COUNT(*) as bill_count
FROM read_parquet('data/gold/bills_bills.parquet')
GROUP BY state;
```
## Backup
The original state-partitioned structure is backed up in `data/gold_old/` (not committed to git).
To restore if needed:
```bash
mv data/gold data/gold_consolidated
mv data/gold_old data/gold
```
## Migration Notes
- ✅ All files include `state` column where applicable
- ✅ National and reference tables copied as-is
- ✅ API code updated to use consolidated files
- ⚠️ Example scripts in `examples/` and `scripts/enrichment/` still reference old paths (low priority - for local dev only)
- ⚠️ Documentation files still show old paths (needs update)
## Next Steps
1. ✅ Test API endpoints with consolidated data
2. ⏳ Upload consolidated files to HuggingFace
3. ⏳ Update documentation to reflect new structure
4. ⏳ Update example scripts to use consolidated files
5. ⏳ Deploy to production and verify