# Gold Tables Consolidation

## Overview

The gold data directory has been consolidated from **86 files to 21 files** (a reduction of about 76%) to simplify HuggingFace deployment and make the codebase easier to manage.

## Changes Made

### Before (86 files)
```
data/gold/
├── national/
│   ├── bills_map_aggregates.parquet
│   ├── events.parquet
│   ├── nonprofits_financials.parquet
│   ├── nonprofits_locations.parquet
│   ├── nonprofits_organizations.parquet
│   └── nonprofits_programs.parquet
├── reference/
│   ├── causes_everyorg_causes.parquet
│   ├── causes_ntee_codes.parquet
│   ├── domains_gsa_domains.parquet
│   ├── jurisdictions_cities.parquet
│   ├── jurisdictions_counties.parquet
│   ├── jurisdictions_school_districts.parquet
│   ├── jurisdictions_townships.parquet
│   └── zip_county_mapping.parquet
└── states/
    ├── AL/  (16 files)
    ├── GA/  (16 files)
    ├── IN/  (partial)
    ├── MA/  (17 files)
    ├── WA/  (16 files)
    └── WI/  (6 files)
```

### After (21 files)
```
data/gold/
├── bills_bill_actions.parquet          (52 MB)
├── bills_bill_sponsorships.parquet     (39 MB)
├── bills_bills.parquet                 (15 MB)
├── bills_map_aggregates.parquet        (142 KB)
├── causes_everyorg_causes.parquet      (11 KB)
├── causes_ntee_codes.parquet           (11 KB)
├── contacts_local_officials.parquet    (15 KB)
├── contacts_officials.parquet          (461 KB)
├── domains_gsa_domains.parquet         (596 KB)
├── event_documents.parquet             (366 MB)
├── event_participants.parquet          (808 KB)
├── events.parquet                      (1.8 MB)
├── jurisdictions_cities.parquet        (2.0 MB)
├── jurisdictions_counties.parquet      (244 KB)
├── jurisdictions_school_districts.parquet (926 KB)
├── jurisdictions_townships.parquet     (2.4 MB)
├── nonprofits_financials.parquet       (77 MB)
├── nonprofits_locations.parquet        (86 MB)
├── nonprofits_organizations.parquet    (134 MB)
├── nonprofits_programs.parquet         (65 MB)
└── zip_county_mapping.parquet          (323 KB)
```

## Key Changes

### 1. State Data Consolidation

**Before:**
- Separate files per state: `data/gold/states/AL/bills_bills.parquet`, `data/gold/states/GA/bills_bills.parquet`, etc.
- Difficult to query across states
- Many small duplicate files

**After:**
- Single consolidated file: `data/gold/bills_bills.parquet`
- Contains `state` column for filtering
- Easy to query across all states
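
The merge itself can be pictured with a minimal sketch. This is not the contents of `rebuild_consolidated_gold.py`, only an illustration of the idea under stated assumptions: the function name `consolidate_state_frames` and the `bill_id` column are hypothetical, and small in-memory frames stand in for the per-state Parquet files.

```python
import pandas as pd


def consolidate_state_frames(frames_by_state):
    """Stack per-state DataFrames into one, tagging each row with its state.

    frames_by_state maps a two-letter state code to that state's DataFrame.
    """
    tagged = []
    for state_code, frame in sorted(frames_by_state.items()):
        # assign() returns a copy, so the caller's frames are left untouched.
        tagged.append(frame.assign(state=state_code))
    if not tagged:
        return pd.DataFrame(columns=["state"])
    # ignore_index=True gives the combined table a fresh 0..n-1 index.
    return pd.concat(tagged, ignore_index=True)


# Tiny in-memory stand-ins for data/gold/states/<ST>/bills_bills.parquet.
# The bill_id column is made up for illustration.
example = consolidate_state_frames({
    "MA": pd.DataFrame({"bill_id": ["H1", "H2"]}),
    "AL": pd.DataFrame({"bill_id": ["SB10"]}),
})
print(example)
```

Sorting by state code keeps the output order deterministic, which makes repeated rebuilds easier to compare.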

### 2. API Code Updates

**Old pattern:**
```python
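from pathlib import Path
import pandas as pd

# `states` is the list of state codes to load, e.g. ["AL", "GA", "MA"]
for st in states:
    parquet_path = Path(f"data/gold/states/{st}/bills_bills.parquet")
    df = pd.read_parquet(parquet_path)
    # process...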
```

**New pattern:**
```python
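from pathlib import Path
import pandas as pd

parquet_path = Path("data/gold/bills_bills.parquet")
df = pd.read_parquet(parquet_path)
# `state` is an optional two-letter code; None or "" keeps every state
if state:
    df = df[df['state'] == state]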
```

**Files updated:**
- `api/main.py` - Updated opportunities endpoint to use consolidated bills
- `api/routes/stats.py` - Updated stats endpoints for nonprofits, events, contacts

### 3. File Size Compliance

All files are under HuggingFace's 500MB recommended limit:
- Largest file: `event_documents.parquet` at 366 MB
- Total data size: ~840 MB
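
A check like this could be rerun before each upload. The sketch below uses only the standard library; the helper name `find_oversized` is hypothetical, and treating 500 MB as 500,000,000 bytes is an assumption (the limit may be meant in binary units).

```python
import tempfile
from pathlib import Path

# Assumption: 500 MB interpreted as decimal megabytes.
LIMIT_BYTES = 500 * 1000 * 1000


def find_oversized(directory, limit_bytes=LIMIT_BYTES):
    """Return (name, size_in_bytes) for each .parquet file above the limit."""
    oversized = []
    for path in sorted(Path(directory).glob("*.parquet")):
        size = path.stat().st_size
        if size > limit_bytes:
            oversized.append((path.name, size))
    return oversized


# Demo on a throwaway directory with a deliberately tiny limit.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "small.parquet").write_bytes(b"x" * 10)
    (Path(tmp) / "big.parquet").write_bytes(b"x" * 100)
    demo_result = find_oversized(tmp, limit_bytes=50)
    print(demo_result)
```

Against the real data this would be called as `find_oversized("data/gold")`; an empty list means every file is within the limit.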

## Benefits

1. **Simpler deployment** - Fewer files to upload to HuggingFace
2. **Better queries** - Can query across all states in single operation
3. **Easier maintenance** - One file per table type instead of 5+ copies
4. **Cleaner codebase** - Less path juggling in API code
5. **Faster reads** - Read once instead of multiple times for multi-state queries

## Scripts

### Consolidation Script
```bash
# Consolidate state-partitioned files (already done)
python scripts/data/rebuild_consolidated_gold.py

# Dry run to preview
python scripts/data/rebuild_consolidated_gold.py --dry-run
```

### Upload to HuggingFace
```bash
# Upload all consolidated files
python scripts/huggingface/upload_consolidated_gold.py

# Upload specific file
python scripts/huggingface/upload_consolidated_gold.py --file bills_bills.parquet

# Test with row limit
python scripts/huggingface/upload_consolidated_gold.py --max-rows 1000

# Skip large files
python scripts/huggingface/upload_consolidated_gold.py --skip-large
```

## Querying Consolidated Data

### Python
```python
import pandas as pd

# Load consolidated bills data
df = pd.read_parquet('data/gold/bills_bills.parquet')

# Filter by state
ma_bills = df[df['state'] == 'MA']

# Query across multiple states
southern_bills = df[df['state'].isin(['AL', 'GA'])]
```

### DuckDB
```sql
-- Query all bills
SELECT * FROM read_parquet('data/gold/bills_bills.parquet');

-- Filter by state
SELECT * FROM read_parquet('data/gold/bills_bills.parquet')
WHERE state = 'MA';

-- Aggregate across states
SELECT state, COUNT(*) as bill_count
FROM read_parquet('data/gold/bills_bills.parquet')
GROUP BY state;
```

## Backup

The original state-partitioned structure is backed up in `data/gold_old/` (not committed to git).

To restore if needed:
```bash
mv data/gold data/gold_consolidated
mv data/gold_old data/gold
```

## Migration Notes

- ✅ All files include `state` column where applicable
- ✅ National and reference tables copied as-is
- ✅ API code updated to use consolidated files
- ⚠️ Example scripts in `examples/` and `scripts/enrichment/` still reference old paths (low priority - for local dev only)
- ⚠️ Documentation files still show old paths (needs update)
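
To find the remaining references to the old layout, a small scan along these lines may help. It is a sketch, not an existing script in this repository: the function name `find_stale_references` and the choice of `.py` and `.md` suffixes are assumptions, and the pattern only covers the `states/`, `national/`, and `reference/` subdirectories listed above.

```python
import re
import tempfile
from pathlib import Path

# Matches the old partitioned layout, e.g. data/gold/states/MA/...
OLD_PATH_PATTERN = re.compile(r"data/gold/(?:states|national|reference)/")


def find_stale_references(root, suffixes=(".py", ".md")):
    """Return (relative_path, line_number, line) for each stale reference."""
    hits = []
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if OLD_PATH_PATTERN.search(line):
                rel = path.relative_to(root).as_posix()
                hits.append((rel, number, line.strip()))
    return hits


# Demo on a throwaway tree containing one old-style and one new-style path.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "examples" / "demo.py"
    script.parent.mkdir()
    script.write_text(
        'old = "data/gold/states/MA/bills_bills.parquet"\n'
        'new = "data/gold/bills_bills.parquet"\n',
        encoding="utf-8",
    )
    stale = find_stale_references(tmp)
    print(stale)
```

Run from the repository root, the scan will also flag this document, since the "Before" section quotes the old paths on purpose.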

## Next Steps

1. ✅ Test API endpoints with consolidated data
2. ⏳ Upload consolidated files to HuggingFace
3. ⏳ Update documentation to reflect new structure
4. ⏳ Update example scripts to use consolidated files
5. ⏳ Deploy to production and verify