Gold Tables Consolidation

Overview

The gold data directory has been consolidated from 86 files to 21 (a reduction of about 76%) to simplify HuggingFace deployment and make the codebase easier to manage.

Changes Made

Before (86 files)

data/gold/
├── national/
│   ├── bills_map_aggregates.parquet
│   ├── events.parquet
│   ├── nonprofits_financials.parquet
│   ├── nonprofits_locations.parquet
│   ├── nonprofits_organizations.parquet
│   └── nonprofits_programs.parquet
├── reference/
│   ├── causes_everyorg_causes.parquet
│   ├── causes_ntee_codes.parquet
│   ├── domains_gsa_domains.parquet
│   ├── jurisdictions_cities.parquet
│   ├── jurisdictions_counties.parquet
│   ├── jurisdictions_school_districts.parquet
│   ├── jurisdictions_townships.parquet
│   └── zip_county_mapping.parquet
└── states/
    ├── AL/  (16 files)
    ├── GA/  (16 files)
    ├── IN/  (partial)
    ├── MA/  (17 files)
    ├── WA/  (16 files)
    └── WI/  (6 files)

After (21 files)

data/gold/
├── bills_bill_actions.parquet          (52 MB)
├── bills_bill_sponsorships.parquet     (39 MB)
├── bills_bills.parquet                 (15 MB)
├── bills_map_aggregates.parquet        (142 KB)
├── causes_everyorg_causes.parquet      (11 KB)
├── causes_ntee_codes.parquet           (11 KB)
├── contacts_local_officials.parquet    (15 KB)
├── contacts_officials.parquet          (461 KB)
├── domains_gsa_domains.parquet         (596 KB)
├── event_documents.parquet             (366 MB)
├── event_participants.parquet          (808 KB)
├── events.parquet                      (1.8 MB)
├── jurisdictions_cities.parquet        (2.0 MB)
├── jurisdictions_counties.parquet      (244 KB)
├── jurisdictions_school_districts.parquet (926 KB)
├── jurisdictions_townships.parquet     (2.4 MB)
├── nonprofits_financials.parquet       (77 MB)
├── nonprofits_locations.parquet        (86 MB)
├── nonprofits_organizations.parquet    (134 MB)
├── nonprofits_programs.parquet         (65 MB)
└── zip_county_mapping.parquet          (323 KB)

Key Changes

1. State Data Consolidation

Before:

  • Separate files per state: data/gold/states/AL/bills_bills.parquet, data/gold/states/GA/bills_bills.parquet, etc.
  • Difficult to query across states
  • Many small duplicate files

After:

  • Single consolidated file: data/gold/bills_bills.parquet
  • Contains state column for filtering
  • Easy to query across all states
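
The merge itself can be sketched roughly as follows. This is an illustration only, not the contents of scripts/data/rebuild_consolidated_gold.py: the function names are hypothetical, and it assumes that each per-state file for a table shares a schema and that the state code is the name of the directory the file sits in.

```python
from pathlib import Path

import pandas as pd


def consolidate_frames(frames_by_state: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-state frames into one frame with a `state` column."""
    if not frames_by_state:
        raise ValueError("no per-state frames to consolidate")
    tagged = [
        df.assign(state=state)  # adds the column, or overwrites an existing one
        for state, df in sorted(frames_by_state.items())
    ]
    return pd.concat(tagged, ignore_index=True)


def consolidate_table(old_root: Path, table: str, out_dir: Path) -> Path:
    """Merge <old_root>/states/<ST>/<table>.parquet into <out_dir>/<table>.parquet."""
    frames = {
        path.parent.name: pd.read_parquet(path)  # directory name is the state code
        for path in old_root.glob(f"states/*/{table}.parquet")
    }
    out_path = out_dir / f"{table}.parquet"
    consolidate_frames(frames).to_parquet(out_path, index=False)
    return out_path
```

A call such as consolidate_table(Path("data/gold_old"), "bills_bills", Path("data/gold")) would then produce the single bills file. Loading every state into memory at once is the simplest approach and may not suit the largest tables.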

2. API Code Updates

Old pattern:

for st in states:
    parquet_path = Path(f"data/gold/states/{st}/bills_bills.parquet")
    df = pd.read_parquet(parquet_path)
    # process...

New pattern:

parquet_path = Path("data/gold/bills_bills.parquet")
df = pd.read_parquet(parquet_path)
if state:
    df = df[df['state'] == state]

Files updated:

  • api/main.py - Updated opportunities endpoint to use consolidated bills
  • api/routes/stats.py - Updated stats endpoints for nonprofits, events, contacts
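
Since several endpoints now read the same handful of files, the load-then-filter pattern could be factored into a small shared helper. The sketch below is an assumption about how that might look, not the code in api/main.py or api/routes/stats.py: GOLD_DIR, load_table, filter_by_state, and get_bills are hypothetical names, and upper-casing the state code assumes the state column stores two-letter upper-case codes such as 'MA'.

```python
from functools import lru_cache
from pathlib import Path
from typing import Optional

import pandas as pd

GOLD_DIR = Path("data/gold")  # assumed location of the consolidated files


@lru_cache(maxsize=None)
def load_table(name: str) -> pd.DataFrame:
    """Read a consolidated table once and reuse it on later calls."""
    return pd.read_parquet(GOLD_DIR / f"{name}.parquet")


def filter_by_state(df: pd.DataFrame, state: Optional[str]) -> pd.DataFrame:
    """Return rows for one state, or the frame unchanged when no state is given."""
    if not state:
        return df
    return df[df["state"] == state.upper()]


def get_bills(state: Optional[str] = None) -> pd.DataFrame:
    """Bills for one state, or for all states when state is None."""
    return filter_by_state(load_table("bills_bills"), state)
```

The cached frame is shared between callers, so it should be treated as read-only. Caching also keeps each table in memory for the life of the process, which may be a poor trade for the largest files.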

3. File Size Compliance

All files are under HuggingFace's 500MB recommended limit:

  • Largest file: event_documents.parquet at 366 MB
  • Total data size: ~840 MB
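
A check like this can be repeated after each rebuild with a few lines of standard-library Python. The sketch below is a suggestion rather than an existing script; size_report is a hypothetical name, and the limit is taken as 500 * 10**6 bytes, the stricter reading of the 500 MB figure.

```python
from pathlib import Path

LIMIT_BYTES = 500 * 10**6  # assumed interpretation of the 500 MB figure


def size_report(gold_dir: Path, limit_bytes: int = LIMIT_BYTES):
    """Return (total_bytes, oversized) for the .parquet files directly in gold_dir."""
    sizes = {p.name: p.stat().st_size for p in sorted(gold_dir.glob("*.parquet"))}
    oversized = {name: size for name, size in sizes.items() if size > limit_bytes}
    return sum(sizes.values()), oversized
```

Calling size_report(Path("data/gold")) returns the total size in bytes and a dict of any files over the limit; an empty dict means every file complies.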

Benefits

  1. Simpler deployment - Fewer files to upload to HuggingFace
  2. Better queries - Can query across all states in a single operation
  3. Easier maintenance - One file per table type instead of 5+ copies
  4. Cleaner codebase - Less path juggling in API code
  5. Faster reads - Read once instead of multiple times for multi-state queries

Scripts

Consolidation Script

# Consolidate state-partitioned files (already done)
python scripts/data/rebuild_consolidated_gold.py

# Dry run to preview
python scripts/data/rebuild_consolidated_gold.py --dry-run

Upload to HuggingFace

# Upload all consolidated files
python scripts/huggingface/upload_consolidated_gold.py

# Upload specific file
python scripts/huggingface/upload_consolidated_gold.py --file bills_bills.parquet

# Test with row limit
python scripts/huggingface/upload_consolidated_gold.py --max-rows 1000

# Skip large files
python scripts/huggingface/upload_consolidated_gold.py --skip-large
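
For readers without access to the upload script, a bare-bones upload can be sketched with the huggingface_hub client. This is not the implementation of upload_consolidated_gold.py and does not reproduce its flags. REPO_ID is a placeholder, and both the "dataset" repo type and the flat path_in_repo layout are assumptions.

```python
from pathlib import Path

REPO_ID = "your-org/your-dataset"  # placeholder; replace with the real repository


def build_upload_plan(gold_dir: Path, repo_id: str = REPO_ID) -> list[dict]:
    """One keyword-argument dict per parquet file, shaped for HfApi.upload_file."""
    return [
        {
            "path_or_fileobj": str(path),
            "path_in_repo": path.name,  # assumed: files sit at the repo root
            "repo_id": repo_id,
            "repo_type": "dataset",  # assumed repo type
        }
        for path in sorted(gold_dir.glob("*.parquet"))
    ]


def upload(plan: list[dict]) -> None:
    """Upload each planned file; requires huggingface_hub and a valid token."""
    # Imported here so the plan can be built and inspected without the dependency.
    from huggingface_hub import HfApi

    api = HfApi()
    for kwargs in plan:
        api.upload_file(**kwargs)
```

Building the plan separately from executing it makes it possible to print and review what would be uploaded before any network call is made.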

Querying Consolidated Data

Python

import pandas as pd

# Load consolidated bills data
df = pd.read_parquet('data/gold/bills_bills.parquet')

# Filter by state
ma_bills = df[df['state'] == 'MA']

# Query across multiple states
southern_bills = df[df['state'].isin(['AL', 'GA'])]
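
A per-state row count is another common query on the consolidated file. The helper below is a small sketch; bills_per_state is a hypothetical name, and the only column it relies on is state.

```python
import pandas as pd


def bills_per_state(df: pd.DataFrame) -> pd.DataFrame:
    """Count rows per state, largest count first."""
    return (
        df.groupby("state")
        .size()
        .reset_index(name="bill_count")
        .sort_values("bill_count", ascending=False, ignore_index=True)
    )
```

Because only one column is needed, the file can be read with pd.read_parquet('data/gold/bills_bills.parquet', columns=['state']) before calling the helper, which avoids loading the remaining columns.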

DuckDB

-- Query all bills
SELECT * FROM read_parquet('data/gold/bills_bills.parquet');

-- Filter by state
SELECT * FROM read_parquet('data/gold/bills_bills.parquet')
WHERE state = 'MA';

-- Aggregate across states
SELECT state, COUNT(*) as bill_count
FROM read_parquet('data/gold/bills_bills.parquet')
GROUP BY state;

Backup

The original state-partitioned structure is backed up in data/gold_old/ (not committed to git).

To restore if needed:

mv data/gold data/gold_consolidated
mv data/gold_old data/gold
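
The two mv commands assume that data/gold_old exists and that data/gold_consolidated does not. A more defensive version of the same swap could look like the sketch below; restore_old_layout is a hypothetical helper, not a script in the repository.

```python
import shutil
from pathlib import Path


def restore_old_layout(data_dir: Path) -> None:
    """Put gold_old back in place as gold, parking the consolidated copy."""
    gold = data_dir / "gold"
    old = data_dir / "gold_old"
    parked = data_dir / "gold_consolidated"

    # Refuse to touch anything unless the swap can complete cleanly.
    if not old.is_dir():
        raise FileNotFoundError(f"no backup found at {old}")
    if parked.exists():
        raise FileExistsError(f"{parked} already exists; move it aside first")

    if gold.exists():
        shutil.move(str(gold), str(parked))
    shutil.move(str(old), str(gold))
```

Calling restore_old_layout(Path("data")) performs the same two moves as the shell commands, but stops before changing anything if the backup is missing or a previous restore left gold_consolidated behind.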

Migration Notes

  • ✅ All files include state column where applicable
  • ✅ National and reference tables copied as-is
  • ✅ API code updated to use consolidated files
  • ⚠️ Example scripts in examples/ and scripts/enrichment/ still reference old paths (low priority - for local dev only)
  • ⚠️ Documentation files still show old paths (needs update)
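
The remaining references to the old layout can be located with a short scan. The sketch below is a suggestion, not an existing script: find_stale_references is a hypothetical name, and the pattern only catches paths written as literal strings containing data/gold/states/, data/gold/national/, or data/gold/reference/. Paths assembled piece by piece (for example with Path joins) would be missed.

```python
import re
from pathlib import Path

# Sub-directories that existed only in the old layout.
OLD_PATH = re.compile(r"data/gold/(states|national|reference)/")


def find_stale_references(roots, suffixes=(".py", ".md")):
    """Return (file, line number, line) for each line that mentions an old path."""
    hits = []
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file() or path.suffix not in suffixes:
                continue
            text = path.read_text(encoding="utf-8", errors="ignore")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if OLD_PATH.search(line):
                    hits.append((path, lineno, line.strip()))
    return hits
```

Running find_stale_references(["examples", "scripts/enrichment"]) would list the example scripts mentioned above that still need updating.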

Next Steps

  1. ✅ Test API endpoints with consolidated data
  2. ⏳ Upload consolidated files to HuggingFace
  3. ⏳ Update documentation to reflect new structure
  4. ⏳ Update example scripts to use consolidated files
  5. ⏳ Deploy to production and verify