Spaces:
Running on CPU Upgrade
Running on CPU Upgrade
File size: 4,973 Bytes
61d29fc | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 | ---
sidebar_position: 5
---
# State-Split Data Files (Deprecated)
:::warning Deprecated
This approach of splitting files into separate state files is **deprecated**.
Use **[Partitioned Datasets](./partitioned-datasets.md)** instead for:
- Same efficiency as separate files
- Ability to query across states
- Better analytics tool support
- Simpler data management
:::
All gold parquet files with state information were previously split into state-specific files. This has been replaced by partitioned datasets which offer the same benefits with better queryability.
## What Changed
Instead of downloading one massive file with all states:
- β `nonprofits_organizations.parquet` (72 MB, 1.9M records)
You can now download just the state(s) you need:
- β
`nonprofits_organizations_AL.parquet` (Alabama only, ~1 MB)
- β
`nonprofits_organizations_CA.parquet` (California only, ~8 MB)
- β
`nonprofits_organizations_TX.parquet` (Texas only, ~6 MB)
## Benefits
1. **Smaller Downloads**: Only download the data you need
2. **Faster Queries**: Load and analyze state-specific data faster
3. **Better Organization**: Easier to manage and share state-level datasets
4. **HuggingFace Friendly**: Avoids file size limits, enables state-specific repos
## File Structure
State-split files are located in `data/gold/by_state/`:
```
data/gold/by_state/
βββ nonprofits_organizations_AL.parquet
βββ nonprofits_organizations_AK.parquet
βββ nonprofits_locations_AL.parquet
βββ jurisdictions_cities_AL.parquet
βββ jurisdictions_counties_AL.parquet
βββ jurisdictions_school_districts_AL.parquet
βββ ... (388 total files)
```
## Files That Were Split
### Nonprofit Data (62 states/territories each)
- `nonprofits_organizations_*.parquet` - Organization details
- `nonprofits_locations_*.parquet` - Geographic locations
### Jurisdiction Data (52 states each)
- `jurisdictions_cities_*.parquet` - Cities and municipalities
- `jurisdictions_counties_*.parquet` - Counties
- `jurisdictions_school_districts_*.parquet` - School districts
- `jurisdictions_townships_*.parquet` - Townships
### Other Data (56 states each)
- `domains_gsa_domains_*.parquet` - Government domains
## Usage
### Load Alabama Nonprofits
```python
import pandas as pd
# Load only Alabama data
df = pd.read_parquet('data/gold/by_state/nonprofits_organizations_AL.parquet')
print(f"Alabama nonprofits: {len(df):,}")
```
### Load Multiple States
```python
import pandas as pd
from pathlib import Path
# Load all southeastern states
states = ['AL', 'GA', 'FL', 'MS', 'TN', 'SC', 'NC']
dfs = []
for state in states:
path = f'data/gold/by_state/nonprofits_organizations_{state}.parquet'
df = pd.read_parquet(path)
dfs.append(df)
# Combine into one DataFrame
southeast = pd.concat(dfs, ignore_index=True)
print(f"Southeast nonprofits: {len(southeast):,}")
```
### Recreate Full Dataset
```python
import pandas as pd
from pathlib import Path
# Load all nonprofit organization files
files = Path('data/gold/by_state').glob('nonprofits_organizations_*.parquet')
dfs = [pd.read_parquet(f) for f in files]
# Combine
full_dataset = pd.concat(dfs, ignore_index=True)
print(f"All nonprofits: {len(full_dataset):,}")
```
## Managing State Splits
### Create/Update State Splits
```bash
# Split all files by state
python scripts/split_gold_by_state.py --all
# Split specific file
python scripts/split_gold_by_state.py --file nonprofits_organizations.parquet
# Dry run (see what would happen)
python scripts/split_gold_by_state.py --all --dry-run
# View statistics
python scripts/split_gold_by_state.py --stats
```
### Upload to HuggingFace
Upload state-specific datasets to HuggingFace for public access:
```bash
# Upload all states
python scripts/upload_state_splits_to_hf.py --all
# Upload Alabama only
python scripts/upload_state_splits_to_hf.py --state AL
# Upload multiple states
python scripts/upload_state_splits_to_hf.py --states AL AK AZ CA
# Dry run
python scripts/upload_state_splits_to_hf.py --all --dry-run
```
This creates state-specific repos on HuggingFace:
- `CommunityOne/one-data-AL` - All Alabama data
- `CommunityOne/one-data-CA` - All California data
- `CommunityOne/one-data-TX` - All Texas data
## Statistics
**Total State-Split Files**: 388 files
**Total Size**: 172 MB
**States/Territories**: 62 (all US states, DC, territories, military addresses)
**File Breakdown**:
- 62 nonprofit organization files
- 62 nonprofit location files
- 56 government domain files
- 52 jurisdiction city files
- 52 jurisdiction county files
- 52 jurisdiction school district files
- 52 jurisdiction township files
## Notes
- Original monolithic files are still in `data/gold/` for backward compatibility
- State-split files use standard 2-letter state codes (AL, AK, AZ, etc.)
- Includes US territories: PR, VI, GU, AS, MP
- Includes military addresses: AA, AE, AP
- Some files have fewer states if no data exists for that state
|