---
sidebar_position: 5
---

# State-Split Data Files (Deprecated)

:::warning Deprecated
Splitting gold files into separate per-state files is **deprecated**.

Use **[Partitioned Datasets](./partitioned-datasets.md)** instead for:
- Same efficiency as separate files
- Ability to query across states
- Better analytics tool support
- Simpler data management
:::

All gold parquet files that contain state information were previously split into state-specific files. That layout has been replaced by partitioned datasets, which offer the same benefits and are easier to query.

## What Changed

Instead of downloading one massive file with all states:
- ❌ `nonprofits_organizations.parquet` (72 MB, 1.9M records)

With state-split files, you download only the state(s) you need:
- ✅ `nonprofits_organizations_AL.parquet` (Alabama only, ~1 MB)
- ✅ `nonprofits_organizations_CA.parquet` (California only, ~8 MB)
- ✅ `nonprofits_organizations_TX.parquet` (Texas only, ~6 MB)

## Benefits

1. **Smaller Downloads**: Only download the data you need
2. **Faster Queries**: Load and analyze state-specific data faster
3. **Better Organization**: Easier to manage and share state-level datasets
4. **HuggingFace Friendly**: Avoids file size limits, enables state-specific repos

## File Structure

State-split files are located in `data/gold/by_state/`:

```
data/gold/by_state/
├── nonprofits_organizations_AL.parquet
├── nonprofits_organizations_AK.parquet
├── nonprofits_locations_AL.parquet
├── jurisdictions_cities_AL.parquet
├── jurisdictions_counties_AL.parquet
├── jurisdictions_school_districts_AL.parquet
└── ... (388 total files)
```
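
The file names in the listing follow a `<dataset>_<STATE>.parquet` pattern. As a minimal sketch, assuming that pattern holds for every file in the directory, a file name can be split back into its dataset name and state code. The `parse_state_file` helper is hypothetical and is not part of the repository's scripts.

```python
import re
from pathlib import Path

# Assumed naming pattern: <dataset>_<2-letter code>.parquet
STATE_FILE_RE = re.compile(r"^(?P<dataset>.+)_(?P<state>[A-Z]{2})\.parquet$")


def parse_state_file(path):
    """Return (dataset, state_code) for a state-split file name, or None."""
    match = STATE_FILE_RE.match(Path(path).name)
    if match is None:
        return None  # e.g. a monolithic file with no state suffix
    return match.group("dataset"), match.group("state")


print(parse_state_file("data/gold/by_state/jurisdictions_school_districts_AL.parquet"))
print(parse_state_file("data/gold/nonprofits_organizations.parquet"))
```

Only the final path component is inspected, so the helper works on bare file names as well as full paths.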

## Files That Were Split

### Nonprofit Data (62 states/territories each)
- `nonprofits_organizations_*.parquet` - Organization details
- `nonprofits_locations_*.parquet` - Geographic locations

### Jurisdiction Data (52 states each)
- `jurisdictions_cities_*.parquet` - Cities and municipalities
- `jurisdictions_counties_*.parquet` - Counties
- `jurisdictions_school_districts_*.parquet` - School districts
- `jurisdictions_townships_*.parquet` - Townships

### Other Data (56 states each)
- `domains_gsa_domains_*.parquet` - Government domains

## Usage

### Load Alabama Nonprofits
```python
import pandas as pd

# Load only Alabama data
df = pd.read_parquet('data/gold/by_state/nonprofits_organizations_AL.parquet')
print(f"Alabama nonprofits: {len(df):,}")
```

### Load Multiple States
```python
import pandas as pd

# Load a selection of southeastern states
states = ['AL', 'GA', 'FL', 'MS', 'TN', 'SC', 'NC']
dfs = []

for state in states:
    path = f'data/gold/by_state/nonprofits_organizations_{state}.parquet'
    df = pd.read_parquet(path)
    dfs.append(df)

# Combine into one DataFrame
southeast = pd.concat(dfs, ignore_index=True)
print(f"Southeast nonprofits: {len(southeast):,}")
```

### Recreate Full Dataset
```python
import pandas as pd
from pathlib import Path

# Load all nonprofit organization files (sorted so the row order is reproducible)
files = sorted(Path('data/gold/by_state').glob('nonprofits_organizations_*.parquet'))
dfs = [pd.read_parquet(f) for f in files]

# Combine
full_dataset = pd.concat(dfs, ignore_index=True)
print(f"All nonprofits: {len(full_dataset):,}")
```

## Managing State Splits

### Create/Update State Splits
```bash
# Split all files by state
python scripts/split_gold_by_state.py --all

# Split specific file
python scripts/split_gold_by_state.py --file nonprofits_organizations.parquet

# Dry run (see what would happen)
python scripts/split_gold_by_state.py --all --dry-run

# View statistics
python scripts/split_gold_by_state.py --stats
```

### Upload to HuggingFace

Upload state-specific datasets to HuggingFace for public access:

```bash
# Upload all states
python scripts/upload_state_splits_to_hf.py --all

# Upload Alabama only
python scripts/upload_state_splits_to_hf.py --state AL

# Upload multiple states
python scripts/upload_state_splits_to_hf.py --states AL AK AZ CA

# Dry run
python scripts/upload_state_splits_to_hf.py --all --dry-run
```

This creates state-specific repos on HuggingFace:
- `CommunityOne/one-data-AL` - All Alabama data
- `CommunityOne/one-data-CA` - All California data
- `CommunityOne/one-data-TX` - All Texas data
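
The repo names above appear to follow a `CommunityOne/one-data-<STATE>` pattern. The sketch below builds a repo ID from a state code under that assumption. The `repo_id_for_state` helper is hypothetical and is not taken from the upload script.

```python
import re

# Assumed pattern, inferred from the repo names listed above
REPO_TEMPLATE = "CommunityOne/one-data-{state}"


def repo_id_for_state(state):
    """Build the per-state repo ID; reject anything that is not a 2-letter code."""
    code = state.strip().upper()
    if not re.fullmatch(r"[A-Z]{2}", code):
        raise ValueError(f"expected a 2-letter state code, got {state!r}")
    return REPO_TEMPLATE.format(state=code)


print(repo_id_for_state("al"))
```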

## Statistics

**Total State-Split Files**: 388 files  
**Total Size**: 172 MB  
**States/Territories**: 62 (all US states, DC, territories, military addresses)

**File Breakdown**:
- 62 nonprofit organization files
- 62 nonprofit location files
- 56 government domain files
- 52 jurisdiction city files
- 52 jurisdiction county files
- 52 jurisdiction school district files
- 52 jurisdiction township files
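
As a quick consistency check, the per-dataset counts in the breakdown can be summed and compared with the stated total. The dictionary keys reuse the file name prefixes from this page; the variable names are illustrative.

```python
# Per-dataset file counts, copied from the breakdown above
file_counts = {
    "nonprofits_organizations": 62,
    "nonprofits_locations": 62,
    "domains_gsa_domains": 56,
    "jurisdictions_cities": 52,
    "jurisdictions_counties": 52,
    "jurisdictions_school_districts": 52,
    "jurisdictions_townships": 52,
}

total_files = sum(file_counts.values())
print(total_files)  # 388, matching the stated total
```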

## Notes

- Original monolithic files are still in `data/gold/` for backward compatibility
- State-split files use standard 2-letter state codes (AL, AK, AZ, etc.)
- Includes US territories: PR, VI, GU, AS, MP
- Includes military addresses: AA, AE, AP
- Some datasets have fewer state files because states with no data are omitted
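
Because not every dataset has a file for every state code, code that loops over states may want to check which files exist before reading them. The following is a minimal sketch using only the standard library. The `find_state_files` helper is hypothetical, and the default directory is the one described on this page.

```python
from pathlib import Path


def find_state_files(dataset, states, base_dir="data/gold/by_state"):
    """Split the requested states into (existing file paths, missing state codes)."""
    found, missing = [], []
    for state in states:
        path = Path(base_dir) / f"{dataset}_{state}.parquet"
        if path.is_file():
            found.append(path)
        else:
            missing.append(state)
    return found, missing


found, missing = find_state_files("jurisdictions_townships", ["AL", "PR", "AA"])
print(f"found {len(found)} file(s); no file for: {missing}")
```

The paths in `found` can then be passed to `pd.read_parquet` as in the usage examples, and `missing` can be logged instead of raising an error.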