# Gold Tables Consolidation
## Overview
The gold data directory has been consolidated from **86 files to 21 files** (75% reduction) to simplify HuggingFace deployment and make the codebase easier to manage.
## Changes Made
### Before (86 files)
```
data/gold/
├── national/
│   ├── bills_map_aggregates.parquet
│   ├── events.parquet
│   ├── nonprofits_financials.parquet
│   ├── nonprofits_locations.parquet
│   ├── nonprofits_organizations.parquet
│   └── nonprofits_programs.parquet
├── reference/
│   ├── causes_everyorg_causes.parquet
│   ├── causes_ntee_codes.parquet
│   ├── domains_gsa_domains.parquet
│   ├── jurisdictions_cities.parquet
│   ├── jurisdictions_counties.parquet
│   ├── jurisdictions_school_districts.parquet
│   ├── jurisdictions_townships.parquet
│   └── zip_county_mapping.parquet
└── states/
    ├── AL/ (16 files)
    ├── GA/ (16 files)
    ├── IN/ (partial)
    ├── MA/ (17 files)
    ├── WA/ (16 files)
    └── WI/ (6 files)
```
### After (21 files)
```
data/gold/
├── bills_bill_actions.parquet (52 MB)
├── bills_bill_sponsorships.parquet (39 MB)
├── bills_bills.parquet (15 MB)
├── bills_map_aggregates.parquet (142 KB)
├── causes_everyorg_causes.parquet (11 KB)
├── causes_ntee_codes.parquet (11 KB)
├── contacts_local_officials.parquet (15 KB)
├── contacts_officials.parquet (461 KB)
├── domains_gsa_domains.parquet (596 KB)
├── event_documents.parquet (366 MB)
├── event_participants.parquet (808 KB)
├── events.parquet (1.8 MB)
├── jurisdictions_cities.parquet (2.0 MB)
├── jurisdictions_counties.parquet (244 KB)
├── jurisdictions_school_districts.parquet (926 KB)
├── jurisdictions_townships.parquet (2.4 MB)
├── nonprofits_financials.parquet (77 MB)
├── nonprofits_locations.parquet (86 MB)
├── nonprofits_organizations.parquet (134 MB)
├── nonprofits_programs.parquet (65 MB)
└── zip_county_mapping.parquet (323 KB)
```
## Key Changes
### 1. State Data Consolidation
**Before:**
- Separate files per state: `data/gold/states/AL/bills_bills.parquet`, `data/gold/states/GA/bills_bills.parquet`, etc.
- Difficult to query across states
- Many small duplicate files
**After:**
- Single consolidated file: `data/gold/bills_bills.parquet`
- Contains `state` column for filtering
- Easy to query across all states
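The consolidation described above amounts to tagging each per-state table with its state code and concatenating the results. The following is a minimal sketch of that idea, not the contents of `scripts/data/rebuild_consolidated_gold.py`; the helper name `consolidate_state_tables` and the sample `bill_id` column are hypothetical.

```python
import pandas as pd


def consolidate_state_tables(frames_by_state: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Concatenate per-state frames into one frame with a `state` column.

    Hypothetical helper for illustration; the real script may differ.
    """
    tagged = []
    for state, frame in sorted(frames_by_state.items()):
        # assign() returns a copy, so the caller's frames are left untouched.
        tagged.append(frame.assign(state=state))
    if not tagged:
        return pd.DataFrame(columns=["state"])
    # ignore_index gives the combined table a fresh 0..n-1 index.
    return pd.concat(tagged, ignore_index=True)


# Tiny in-memory stand-ins for data/gold/states/{ST}/bills_bills.parquet.
per_state = {
    "GA": pd.DataFrame({"bill_id": ["ga-1"]}),
    "AL": pd.DataFrame({"bill_id": ["al-1", "al-2"]}),
}
combined = consolidate_state_tables(per_state)
```

In practice each frame would presumably come from `pd.read_parquet` on a per-state file, and the combined frame would be written back with `DataFrame.to_parquet`.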
### 2. API Code Updates
**Old pattern:**
```python
for st in states:
    parquet_path = Path(f"data/gold/states/{st}/bills_bills.parquet")
    df = pd.read_parquet(parquet_path)
    # process...
```
**New pattern:**
```python
parquet_path = Path("data/gold/bills_bills.parquet")
df = pd.read_parquet(parquet_path)
if state:
    df = df[df['state'] == state]
```
**Files updated:**
- `api/main.py` - Updated opportunities endpoint to use consolidated bills
- `api/routes/stats.py` - Updated stats endpoints for nonprofits, events, contacts
### 3. File Size Compliance
All files are under HuggingFace's 500MB recommended limit:
- Largest file: `event_documents.parquet` at 366 MB
- Total data size: ~840 MB
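A size check like this can be automated before an upload. The sketch below assumes that "500 MB" means 500 × 1024² bytes and that all tables sit directly under one directory. The function name `oversized_files` is hypothetical, and the code is not taken from the upload script.

```python
from pathlib import Path

# Assumption: "500 MB" is interpreted as 500 * 1024**2 bytes.
DEFAULT_LIMIT_BYTES = 500 * 1024**2


def oversized_files(gold_dir: Path, limit_bytes: int = DEFAULT_LIMIT_BYTES) -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for each Parquet file over the limit."""
    too_big = []
    for path in sorted(gold_dir.glob("*.parquet")):
        size = path.stat().st_size
        if size > limit_bytes:
            too_big.append((path.name, size))
    return too_big
```

Calling `oversized_files(Path("data/gold"))` returns an empty list when every file is within the limit.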
## Benefits
1. **Simpler deployment** - Fewer files to upload to HuggingFace
2. **Better queries** - Can query across all states in a single operation
3. **Easier maintenance** - One file per table type instead of 5+ copies
4. **Cleaner codebase** - Less path juggling in API code
5. **Faster reads** - Read once instead of multiple times for multi-state queries
## Scripts
### Consolidation Script
```bash
# Consolidate state-partitioned files (already done)
python scripts/data/rebuild_consolidated_gold.py
# Dry run to preview
python scripts/data/rebuild_consolidated_gold.py --dry-run
```
### Upload to HuggingFace
```bash
# Upload all consolidated files
python scripts/huggingface/upload_consolidated_gold.py
# Upload specific file
python scripts/huggingface/upload_consolidated_gold.py --file bills_bills.parquet
# Test with row limit
python scripts/huggingface/upload_consolidated_gold.py --max-rows 1000
# Skip large files
python scripts/huggingface/upload_consolidated_gold.py --skip-large
```
## Querying Consolidated Data
### Python
```python
import pandas as pd
# Load consolidated bills data
df = pd.read_parquet('data/gold/bills_bills.parquet')
# Filter by state
ma_bills = df[df['state'] == 'MA']
# Query across multiple states
southern_bills = df[df['state'].isin(['AL', 'GA'])]
```
### DuckDB
```sql
-- Query all bills
SELECT * FROM read_parquet('data/gold/bills_bills.parquet');
-- Filter by state
SELECT * FROM read_parquet('data/gold/bills_bills.parquet')
WHERE state = 'MA';
-- Aggregate across states
SELECT state, COUNT(*) as bill_count
FROM read_parquet('data/gold/bills_bills.parquet')
GROUP BY state;
```
## Backup
The original state-partitioned structure is backed up in `data/gold_old/` (not committed to git).
To restore if needed:
```bash
mv data/gold data/gold_consolidated
mv data/gold_old data/gold
```
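The two `mv` commands can also be wrapped with guards, so that a missing backup or a leftover `data/gold_consolidated/` directory is caught before anything is renamed. This is a sketch only; `restore_gold_backup` is a hypothetical helper, not an existing project script.

```python
from pathlib import Path


def restore_gold_backup(data_dir: Path) -> None:
    """Swap data/gold_old back into place, keeping the consolidated copy."""
    gold = data_dir / "gold"
    backup = data_dir / "gold_old"
    parked = data_dir / "gold_consolidated"

    if not backup.is_dir():
        raise FileNotFoundError(f"No backup directory at {backup}")
    if parked.exists():
        raise FileExistsError(f"{parked} already exists; move it aside first")

    if gold.exists():
        # Same effect as: mv data/gold data/gold_consolidated
        gold.rename(parked)
    # Same effect as: mv data/gold_old data/gold
    backup.rename(gold)
```

The second guard matters because `mv` moves a directory inside the target when the target already exists, rather than replacing it.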
## Migration Notes
- ✅ All files include `state` column where applicable
- ✅ National and reference tables copied as-is
- ✅ API code updated to use consolidated files
- ⚠️ Example scripts in `examples/` and `scripts/enrichment/` still reference old paths (low priority - for local dev only)
- ⚠️ Documentation files still show old paths (needs update)
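To find the remaining old-path references, a small scan for the old directory prefixes is usually enough. The sketch below assumes stale references contain one of the literal strings `data/gold/states/`, `data/gold/national/`, or `data/gold/reference/`. The function name and the default file suffixes are hypothetical choices.

```python
from pathlib import Path

# Assumption: stale references contain one of these literal prefixes.
OLD_PREFIXES = ("data/gold/states/", "data/gold/national/", "data/gold/reference/")


def find_stale_references(
    root: Path, suffixes: tuple[str, ...] = (".py", ".md")
) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line text) for lines using old gold paths."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if any(prefix in line for prefix in OLD_PREFIXES):
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits
```

Paths assembled piecewise, for example `Path("data") / "gold" / "states"`, will not match a literal-string scan like this and would need a manual check.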
## Next Steps
1. ✅ Test API endpoints with consolidated data
2. ⏳ Upload consolidated files to HuggingFace
3. ⏳ Update documentation to reflect new structure
4. ⏳ Update example scripts to use consolidated files
5. ⏳ Deploy to production and verify