open-navigator / website /docs /guides /partitioned-datasets.md
jcbowyer's picture
Clean HuggingFace deployment without binary files
61d29fc
metadata
sidebar_position: 6

Partitioned Datasets

Partitioned datasets provide the best of both worlds: efficient state-level queries and the ability to query the full national dataset.

Why Partitioning?

Problem with Separate Files

# ❌ Must know which files to load
df_al = pd.read_parquet('data/gold/by_state/nonprofits_organizations_AL.parquet')
df_ga = pd.read_parquet('data/gold/by_state/nonprofits_organizations_GA.parquet')
df = pd.concat([df_al, df_ga])  # Manual combining

Solution: Partitioned Datasets

# βœ… Single dataset, automatic filtering
df = pd.read_parquet('data/gold/nonprofits_organizations',
                     filters=[('state', 'in', ['AL', 'GA'])])
# Only reads AL and GA partitions - efficient!

Benefits

  1. Partition Pruning: Only reads relevant data

    • Query Alabama β†’ Only reads AL partition (1 MB)
    • Query California β†’ Only reads CA partition (8 MB)
    • Not entire 72 MB file!
  2. Single Logical Table: Query like a database

    # Filter, aggregate, join - just like SQL
    df = pd.read_parquet('data/gold/nonprofits_organizations',
                         filters=[('state', '=', 'AL')])
    
  3. Works with Analytics Tools:

    • Apache Spark (built-in partition pruning)
    • DuckDB (automatic partition detection)
    • AWS Athena (S3 partitioning)
    • Pandas (filter pushdown)
  4. Easy Updates: Update one state without touching others

    # Update only Alabama data
    df_new = pd.read_parquet('data/gold/nonprofits_organizations',
                             filters=[('state', '!=', 'AL')])
    df_al_updated = get_updated_alabama_data()
    df = pd.concat([df_new, df_al_updated])
    df.to_parquet('data/gold/nonprofits_organizations',
                  partition_cols=['state'])
    

Directory Structure

data/gold/
β”œβ”€β”€ nonprofits_organizations/          # Partitioned dataset (207 MB)
β”‚   β”œβ”€β”€ state=AL/
β”‚   β”‚   └── part-0.parquet (1 MB)
β”‚   β”œβ”€β”€ state=AK/
β”‚   β”‚   └── part-0.parquet (0.5 MB)
β”‚   β”œβ”€β”€ state=CA/
β”‚   β”‚   └── part-0.parquet (8 MB)
β”‚   └── ... (63 states)
β”œβ”€β”€ nonprofits_locations/              # Partitioned dataset (99 MB)
β”œβ”€β”€ nonprofits_financials/             # Partitioned dataset (78 MB)
β”œβ”€β”€ nonprofits_programs/               # Partitioned dataset (67 MB)
β”œβ”€β”€ jurisdictions_cities/              # Partitioned dataset (2.9 MB)
β”œβ”€β”€ jurisdictions_counties/            # Partitioned dataset (1.1 MB)
β”œβ”€β”€ jurisdictions_school_districts/    # Partitioned dataset (1.8 MB)
β”œβ”€β”€ jurisdictions_townships/           # Partitioned dataset (3.3 MB)
β”œβ”€β”€ domains_gsa_domains/               # Partitioned dataset (1.4 MB)
β”œβ”€β”€ causes_everyorg_causes.parquet     # Lookup table (no partitioning)
└── causes_ntee_codes.parquet          # Lookup table (no partitioning)

Note: All datasets with state information are now partitioned. Lookup tables and non-state data remain as single files.

State Column Handling

Different datasets get their state information in different ways:

  • Direct state column: nonprofits_organizations, nonprofits_locations
  • State via EIN join: nonprofits_financials, nonprofits_programs (joined with organizations by EIN)
  • USPS column: jurisdictions_cities, jurisdictions_counties, etc. (renamed to state)
  • State column: domains_gsa_domains (capitalized, normalized to state)

The partitioning script automatically handles these differences, ensuring all datasets use a consistent state partition column.

Creating Partitioned Datasets

# Create all partitioned datasets
python scripts/create_partitioned_datasets.py --all

# Create specific dataset
python scripts/create_partitioned_datasets.py --file nonprofits_organizations.parquet

# Dry run (see what would be created)
python scripts/create_partitioned_datasets.py --all --dry-run

Query Examples

Pandas

import pandas as pd

# Read single state (only reads 1 MB, not 72 MB!)
df = pd.read_parquet('data/gold/nonprofits_organizations',
                     filters=[('state', '=', 'AL')])
print(f"Alabama nonprofits: {len(df):,}")

# Read multiple states
df = pd.read_parquet('data/gold/nonprofits_organizations',
                     filters=[('state', 'in', ['AL', 'GA', 'FL', 'MS', 'TN'])])
print(f"Southeast nonprofits: {len(df):,}")

# Read all states (reads all partitions)
df = pd.read_parquet('data/gold/nonprofits_organizations')
print(f"All nonprofits: {len(df):,}")

# Complex filters (still efficient!)
df = pd.read_parquet('data/gold/nonprofits_organizations',
                     filters=[
                         ('state', '=', 'AL'),
                         ('ntee_code', '=', 'E')  # Health orgs only
                     ])

DuckDB

import duckdb

# DuckDB automatically detects partitions
con = duckdb.connect()

# Query with partition pruning
result = con.execute("""
    SELECT state, COUNT(*) as org_count
    FROM 'data/gold/nonprofits_organizations/**/*.parquet'
    WHERE state IN ('AL', 'GA', 'FL')
    GROUP BY state
""").fetchdf()

print(result)

PySpark

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark automatically uses partition pruning
df = spark.read.parquet('data/gold/nonprofits_organizations')

# Only reads AL partition
al_orgs = df.filter(df.state == 'AL')
print(f"Alabama nonprofits: {al_orgs.count():,}")

# Join partitioned datasets efficiently
cities = spark.read.parquet('data/gold/jurisdictions_cities')
nonprofits = spark.read.parquet('data/gold/nonprofits_organizations')

# Both filter to AL before join - very efficient!
result = nonprofits.filter(nonprofits.state == 'AL') \
                   .join(cities.filter(cities.state == 'AL'), 
                         on='city_name')

Performance Comparison

Query: Alabama Nonprofits

Method Data Read Time Memory
Full file 72 MB 2.5s 400 MB
Separate file 1 MB 0.1s 8 MB
Partitioned (filtered) 1 MB 0.1s 8 MB

Query: All Nonprofits

Method Data Read Time Memory
Full file 72 MB 2.5s 400 MB
Separate files 72 MB (62 files) 3.2s 400 MB
Partitioned 72 MB 2.5s 400 MB

Query: 5 Southeastern States

Method Data Read Time Memory
Full file 72 MB 2.5s 400 MB
Separate files 5 MB (5 files) 0.2s 35 MB
Partitioned (filtered) 5 MB 0.2s 35 MB

Winner: Partitioned datasets - Same efficiency as separate files with full dataset queryability!

Available Partitioned Datasets

All files with state information can be partitioned:

  • nonprofits_organizations/ (62 state partitions)
  • nonprofits_locations/ (62 state partitions)
  • nonprofits_financials/ (62 state partitions)
  • nonprofits_programs/ (62 state partitions)
  • jurisdictions_cities/ (52 state partitions)
  • jurisdictions_counties/ (52 state partitions)
  • jurisdictions_school_districts/ (52 state partitions)
  • jurisdictions_townships/ (52 state partitions)
  • domains_gsa_domains/ (56 state partitions)

Uploading to HuggingFace

Partitioned datasets work great with HuggingFace:

from datasets import Dataset
import pandas as pd

# Read partitioned data
df = pd.read_parquet('data/gold/nonprofits_organizations',
                     filters=[('state', '=', 'AL')])

# Upload state-specific subset
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("CommunityOne/one-data-AL")

Or upload the entire partitioned structure:

# Upload partitioned directory to HuggingFace
# Each state becomes a separate shard
huggingface-cli upload CommunityOne/one-nonprofits \
  data/gold/nonprofits_organizations \
  --repo-type dataset

Best Practices

  1. Always use filters when reading partitioned data for specific states

    # βœ… Efficient
    df = pd.read_parquet('path', filters=[('state', '=', 'AL')])
    
    # ❌ Inefficient (reads all partitions then filters)
    df = pd.read_parquet('path')
    df = df[df['state'] == 'AL']
    
  2. Use in operator for multiple states

    df = pd.read_parquet('path', 
                         filters=[('state', 'in', ['AL', 'GA', 'FL'])])
    
  3. Combine with other filters for maximum efficiency

    df = pd.read_parquet('path', 
                         filters=[
                             ('state', '=', 'AL'),
                             ('revenue', '>', 1000000)
                         ])
    

Migration from Separate Files

If you have code using separate files:

# Old approach
df = pd.read_parquet('data/gold/by_state/nonprofits_organizations_AL.parquet')

# New approach (equivalent)
df = pd.read_parquet('data/gold/nonprofits_organizations',
                     filters=[('state', '=', 'AL')])

Both work identically! The partitioned approach is recommended for new code.