Spaces:

CommunityOne
/

open-navigator

Running on CPU Upgrade

App Files Files Community

open-navigator / website /docs /guides /partitioned-datasets.md

jcbowyer

Clean HuggingFace deployment without binary files

61d29fc 29 days ago

preview code

raw

history blame contribute delete

9.22 kB

	---
	sidebar_position: 6
	---

	# Partitioned Datasets

	Partitioned datasets provide the best of both worlds: efficient state-level queries and the ability to query the full national dataset.

	## Why Partitioning?

	### Problem with Separate Files
	```python
	# ❌ Must know which files to load
	df_al = pd.read_parquet('data/gold/by_state/nonprofits_organizations_AL.parquet')
	df_ga = pd.read_parquet('data/gold/by_state/nonprofits_organizations_GA.parquet')
	df = pd.concat([df_al, df_ga]) # Manual combining
	```

	### Solution: Partitioned Datasets
	```python
	# ✅ Single dataset, automatic filtering
	df = pd.read_parquet('data/gold/nonprofits_organizations',
	filters=[('state', 'in', ['AL', 'GA'])])
	# Only reads AL and GA partitions - efficient!
	```

	## Benefits

	1. Partition Pruning: Only reads relevant data
	- Query Alabama → Only reads AL partition (1 MB)
	- Query California → Only reads CA partition (8 MB)
	- Not entire 72 MB file!

	2. Single Logical Table: Query like a database
	```python
	# Filter, aggregate, join - just like SQL
	df = pd.read_parquet('data/gold/nonprofits_organizations',
	filters=[('state', '=', 'AL')])
	```

	3. Works with Analytics Tools:
	- Apache Spark (built-in partition pruning)
	- DuckDB (automatic partition detection)
	- AWS Athena (S3 partitioning)
	- Pandas (filter pushdown)

	4. Easy Updates: Update one state without touching others
	```python
	# Update only Alabama data
	df_new = pd.read_parquet('data/gold/nonprofits_organizations',
	filters=[('state', '!=', 'AL')])
	df_al_updated = get_updated_alabama_data()
	df = pd.concat([df_new, df_al_updated])
	df.to_parquet('data/gold/nonprofits_organizations',
	partition_cols=['state'])
	```

	## Directory Structure

	```
	data/gold/
	├── nonprofits_organizations/ # Partitioned dataset (207 MB)
	│ ├── state=AL/
	│ │ └── part-0.parquet (1 MB)
	│ ├── state=AK/
	│ │ └── part-0.parquet (0.5 MB)
	│ ├── state=CA/
	│ │ └── part-0.parquet (8 MB)
	│ └── ... (63 states)
	├── nonprofits_locations/ # Partitioned dataset (99 MB)
	├── nonprofits_financials/ # Partitioned dataset (78 MB)
	├── nonprofits_programs/ # Partitioned dataset (67 MB)
	├── jurisdictions_cities/ # Partitioned dataset (2.9 MB)
	├── jurisdictions_counties/ # Partitioned dataset (1.1 MB)
	├── jurisdictions_school_districts/ # Partitioned dataset (1.8 MB)
	├── jurisdictions_townships/ # Partitioned dataset (3.3 MB)
	├── domains_gsa_domains/ # Partitioned dataset (1.4 MB)
	├── causes_everyorg_causes.parquet # Lookup table (no partitioning)
	└── causes_ntee_codes.parquet # Lookup table (no partitioning)
	```

	Note: All datasets with state information are now partitioned. Lookup tables and non-state data remain as single files.

	### State Column Handling

	Different datasets get their state information in different ways:

	- Direct state column: `nonprofits_organizations`, `nonprofits_locations`
	- State via EIN join: `nonprofits_financials`, `nonprofits_programs` (joined with organizations by EIN)
	- USPS column: `jurisdictions_cities`, `jurisdictions_counties`, etc. (renamed to `state`)
	- State column: `domains_gsa_domains` (capitalized, normalized to `state`)

	The partitioning script automatically handles these differences, ensuring all datasets use a consistent `state` partition column.

	## Creating Partitioned Datasets

	```bash
	# Create all partitioned datasets
	python scripts/create_partitioned_datasets.py --all

	# Create specific dataset
	python scripts/create_partitioned_datasets.py --file nonprofits_organizations.parquet

	# Dry run (see what would be created)
	python scripts/create_partitioned_datasets.py --all --dry-run
	```

	## Query Examples

	### Pandas

	```python
	import pandas as pd

	# Read single state (only reads 1 MB, not 72 MB!)
	df = pd.read_parquet('data/gold/nonprofits_organizations',
	filters=[('state', '=', 'AL')])
	print(f"Alabama nonprofits: {len(df):,}")

	# Read multiple states
	df = pd.read_parquet('data/gold/nonprofits_organizations',
	filters=[('state', 'in', ['AL', 'GA', 'FL', 'MS', 'TN'])])
	print(f"Southeast nonprofits: {len(df):,}")

	# Read all states (reads all partitions)
	df = pd.read_parquet('data/gold/nonprofits_organizations')
	print(f"All nonprofits: {len(df):,}")

	# Complex filters (still efficient!)
	df = pd.read_parquet('data/gold/nonprofits_organizations',
	filters=[
	('state', '=', 'AL'),
	('ntee_code', '=', 'E') # Health orgs only
	])
	```

	### DuckDB

	```python
	import duckdb

	# DuckDB automatically detects partitions
	con = duckdb.connect()

	# Query with partition pruning
	result = con.execute("""
	SELECT state, COUNT(*) as org_count
	FROM 'data/gold/nonprofits_organizations/*/.parquet'
	WHERE state IN ('AL', 'GA', 'FL')
	GROUP BY state
	""").fetchdf()

	print(result)
	```

	### PySpark

	```python
	from pyspark.sql import SparkSession

	spark = SparkSession.builder.getOrCreate()

	# Spark automatically uses partition pruning
	df = spark.read.parquet('data/gold/nonprofits_organizations')

	# Only reads AL partition
	al_orgs = df.filter(df.state == 'AL')
	print(f"Alabama nonprofits: {al_orgs.count():,}")

	# Join partitioned datasets efficiently
	cities = spark.read.parquet('data/gold/jurisdictions_cities')
	nonprofits = spark.read.parquet('data/gold/nonprofits_organizations')

	# Both filter to AL before join - very efficient!
	result = nonprofits.filter(nonprofits.state == 'AL') \
	.join(cities.filter(cities.state == 'AL'),
	on='city_name')
	```

	## Performance Comparison

	### Query: Alabama Nonprofits

	\| Method \| Data Read \| Time \| Memory \|
	\|--------\|-----------\|------\|--------\|
	\| Full file \| 72 MB \| 2.5s \| 400 MB \|
	\| Separate file \| 1 MB \| 0.1s \| 8 MB \|
	\| Partitioned (filtered) \| 1 MB \| 0.1s \| 8 MB \|

	### Query: All Nonprofits

	\| Method \| Data Read \| Time \| Memory \|
	\|--------\|-----------\|------\|--------\|
	\| Full file \| 72 MB \| 2.5s \| 400 MB \|
	\| Separate files \| 72 MB (62 files) \| 3.2s \| 400 MB \|
	\| Partitioned \| 72 MB \| 2.5s \| 400 MB \|

	### Query: 5 Southeastern States

	\| Method \| Data Read \| Time \| Memory \|
	\|--------\|-----------\|------\|--------\|
	\| Full file \| 72 MB \| 2.5s \| 400 MB \|
	\| Separate files \| 5 MB (5 files) \| 0.2s \| 35 MB \|
	\| Partitioned (filtered) \| 5 MB \| 0.2s \| 35 MB \|

	Winner: Partitioned datasets - Same efficiency as separate files with full dataset queryability!

	## Available Partitioned Datasets

	All files with state information can be partitioned:

	- `nonprofits_organizations/` (62 state partitions)
	- `nonprofits_locations/` (62 state partitions)
	- `nonprofits_financials/` (62 state partitions)
	- `nonprofits_programs/` (62 state partitions)
	- `jurisdictions_cities/` (52 state partitions)
	- `jurisdictions_counties/` (52 state partitions)
	- `jurisdictions_school_districts/` (52 state partitions)
	- `jurisdictions_townships/` (52 state partitions)
	- `domains_gsa_domains/` (56 state partitions)

	## Uploading to HuggingFace

	Partitioned datasets work great with HuggingFace:

	```python
	from datasets import Dataset
	import pandas as pd

	# Read partitioned data
	df = pd.read_parquet('data/gold/nonprofits_organizations',
	filters=[('state', '=', 'AL')])

	# Upload state-specific subset
	dataset = Dataset.from_pandas(df)
	dataset.push_to_hub("CommunityOne/one-data-AL")
	```

	Or upload the entire partitioned structure:

	```bash
	# Upload partitioned directory to HuggingFace
	# Each state becomes a separate shard
	huggingface-cli upload CommunityOne/one-nonprofits \
	data/gold/nonprofits_organizations \
	--repo-type dataset
	```

	## Best Practices

	1. Always use filters when reading partitioned data for specific states
	```python
	# ✅ Efficient
	df = pd.read_parquet('path', filters=[('state', '=', 'AL')])

	# ❌ Inefficient (reads all partitions then filters)
	df = pd.read_parquet('path')
	df = df[df['state'] == 'AL']
	```

	2. Use `in` operator for multiple states
	```python
	df = pd.read_parquet('path',
	filters=[('state', 'in', ['AL', 'GA', 'FL'])])
	```

	3. Combine with other filters for maximum efficiency
	```python
	df = pd.read_parquet('path',
	filters=[
	('state', '=', 'AL'),
	('revenue', '>', 1000000)
	])
	```

	## Migration from Separate Files

	If you have code using separate files:

	```python
	# Old approach
	df = pd.read_parquet('data/gold/by_state/nonprofits_organizations_AL.parquet')

	# New approach (equivalent)
	df = pd.read_parquet('data/gold/nonprofits_organizations',
	filters=[('state', '=', 'AL')])
	```

	Both work identically! The partitioned approach is recommended for new code.