Spaces:
Running on CPU Upgrade
Running on CPU Upgrade
| sidebar_position: 6 | |
| # Partitioned Datasets | |
| Partitioned datasets provide the best of both worlds: efficient state-level queries **and** the ability to query the full national dataset. | |
| ## Why Partitioning? | |
| ### Problem with Separate Files | |
| ```python | |
| # ❌ Must know which files to load | |
| df_al = pd.read_parquet('data/gold/by_state/nonprofits_organizations_AL.parquet') | |
| df_ga = pd.read_parquet('data/gold/by_state/nonprofits_organizations_GA.parquet') | |
| df = pd.concat([df_al, df_ga]) # Manual combining | |
| ``` | |
| ### Solution: Partitioned Datasets | |
| ```python | |
| # ✅ Single dataset, automatic filtering | |
| df = pd.read_parquet('data/gold/nonprofits_organizations', | |
| filters=[('state', 'in', ['AL', 'GA'])]) | |
| # Only reads AL and GA partitions - efficient! | |
| ``` | |
| ## Benefits | |
| 1. **Partition Pruning**: Only reads relevant data | |
| - Query Alabama → Only reads AL partition (1 MB) | |
| - Query California → Only reads CA partition (8 MB) | |
| - Not entire 72 MB file! | |
| 2. **Single Logical Table**: Query like a database | |
| ```python | |
| # Filter, aggregate, join - just like SQL | |
| df = pd.read_parquet('data/gold/nonprofits_organizations', | |
| filters=[('state', '=', 'AL')]) | |
| ``` | |
| 3. **Works with Analytics Tools**: | |
| - Apache Spark (built-in partition pruning) | |
| - DuckDB (automatic partition detection) | |
| - AWS Athena (S3 partitioning) | |
| - Pandas (filter pushdown) | |
| 4. **Easy Updates**: Update one state without touching others | |
| ```python | |
| # Update only Alabama data | |
| df_new = pd.read_parquet('data/gold/nonprofits_organizations', | |
| filters=[('state', '!=', 'AL')]) | |
| df_al_updated = get_updated_alabama_data() | |
| df = pd.concat([df_new, df_al_updated]) | |
| df.to_parquet('data/gold/nonprofits_organizations', | |
| partition_cols=['state']) | |
| ``` | |
| ## Directory Structure | |
| ``` | |
| data/gold/ | |
| ├── nonprofits_organizations/ # Partitioned dataset (207 MB) | |
| │ ├── state=AL/ | |
| │ │ └── part-0.parquet (1 MB) | |
| │ ├── state=AK/ | |
| │ │ └── part-0.parquet (0.5 MB) | |
| │ ├── state=CA/ | |
| │ │ └── part-0.parquet (8 MB) | |
| │ └── ... (63 states) | |
| ├── nonprofits_locations/ # Partitioned dataset (99 MB) | |
| ├── nonprofits_financials/ # Partitioned dataset (78 MB) | |
| ├── nonprofits_programs/ # Partitioned dataset (67 MB) | |
| ├── jurisdictions_cities/ # Partitioned dataset (2.9 MB) | |
| ├── jurisdictions_counties/ # Partitioned dataset (1.1 MB) | |
| ├── jurisdictions_school_districts/ # Partitioned dataset (1.8 MB) | |
| ├── jurisdictions_townships/ # Partitioned dataset (3.3 MB) | |
| ├── domains_gsa_domains/ # Partitioned dataset (1.4 MB) | |
| ├── causes_everyorg_causes.parquet # Lookup table (no partitioning) | |
| └── causes_ntee_codes.parquet # Lookup table (no partitioning) | |
| ``` | |
| **Note**: All datasets with state information are now partitioned. Lookup tables and non-state data remain as single files. | |
| ### State Column Handling | |
| Different datasets get their state information in different ways: | |
| - **Direct state column**: `nonprofits_organizations`, `nonprofits_locations` | |
| - **State via EIN join**: `nonprofits_financials`, `nonprofits_programs` (joined with organizations by EIN) | |
| - **USPS column**: `jurisdictions_cities`, `jurisdictions_counties`, etc. (renamed to `state`) | |
| - **State column**: `domains_gsa_domains` (capitalized, normalized to `state`) | |
| The partitioning script automatically handles these differences, ensuring all datasets use a consistent `state` partition column. | |
| ## Creating Partitioned Datasets | |
| ```bash | |
| # Create all partitioned datasets | |
| python scripts/create_partitioned_datasets.py --all | |
| # Create specific dataset | |
| python scripts/create_partitioned_datasets.py --file nonprofits_organizations.parquet | |
| # Dry run (see what would be created) | |
| python scripts/create_partitioned_datasets.py --all --dry-run | |
| ``` | |
| ## Query Examples | |
| ### Pandas | |
| ```python | |
| import pandas as pd | |
| # Read single state (only reads 1 MB, not 72 MB!) | |
| df = pd.read_parquet('data/gold/nonprofits_organizations', | |
| filters=[('state', '=', 'AL')]) | |
| print(f"Alabama nonprofits: {len(df):,}") | |
| # Read multiple states | |
| df = pd.read_parquet('data/gold/nonprofits_organizations', | |
| filters=[('state', 'in', ['AL', 'GA', 'FL', 'MS', 'TN'])]) | |
| print(f"Southeast nonprofits: {len(df):,}") | |
| # Read all states (reads all partitions) | |
| df = pd.read_parquet('data/gold/nonprofits_organizations') | |
| print(f"All nonprofits: {len(df):,}") | |
| # Complex filters (still efficient!) | |
| df = pd.read_parquet('data/gold/nonprofits_organizations', | |
| filters=[ | |
| ('state', '=', 'AL'), | |
| ('ntee_code', '=', 'E') # Health orgs only | |
| ]) | |
| ``` | |
| ### DuckDB | |
| ```python | |
| import duckdb | |
| # DuckDB automatically detects partitions | |
| con = duckdb.connect() | |
| # Query with partition pruning | |
| result = con.execute(""" | |
| SELECT state, COUNT(*) as org_count | |
| FROM 'data/gold/nonprofits_organizations/**/*.parquet' | |
| WHERE state IN ('AL', 'GA', 'FL') | |
| GROUP BY state | |
| """).fetchdf() | |
| print(result) | |
| ``` | |
| ### PySpark | |
| ```python | |
| from pyspark.sql import SparkSession | |
| spark = SparkSession.builder.getOrCreate() | |
| # Spark automatically uses partition pruning | |
| df = spark.read.parquet('data/gold/nonprofits_organizations') | |
| # Only reads AL partition | |
| al_orgs = df.filter(df.state == 'AL') | |
| print(f"Alabama nonprofits: {al_orgs.count():,}") | |
| # Join partitioned datasets efficiently | |
| cities = spark.read.parquet('data/gold/jurisdictions_cities') | |
| nonprofits = spark.read.parquet('data/gold/nonprofits_organizations') | |
| # Both filter to AL before join - very efficient! | |
| result = nonprofits.filter(nonprofits.state == 'AL') \ | |
| .join(cities.filter(cities.state == 'AL'), | |
| on='city_name') | |
| ``` | |
| ## Performance Comparison | |
| ### Query: Alabama Nonprofits | |
| | Method | Data Read | Time | Memory | | |
| |--------|-----------|------|--------| | |
| | **Full file** | 72 MB | 2.5s | 400 MB | | |
| | **Separate file** | 1 MB | 0.1s | 8 MB | | |
| | **Partitioned (filtered)** | 1 MB | 0.1s | 8 MB | | |
| ### Query: All Nonprofits | |
| | Method | Data Read | Time | Memory | | |
| |--------|-----------|------|--------| | |
| | **Full file** | 72 MB | 2.5s | 400 MB | | |
| | **Separate files** | 72 MB (62 files) | 3.2s | 400 MB | | |
| | **Partitioned** | 72 MB | 2.5s | 400 MB | | |
| ### Query: 5 Southeastern States | |
| | Method | Data Read | Time | Memory | | |
| |--------|-----------|------|--------| | |
| | **Full file** | 72 MB | 2.5s | 400 MB | | |
| | **Separate files** | 5 MB (5 files) | 0.2s | 35 MB | | |
| | **Partitioned (filtered)** | 5 MB | 0.2s | 35 MB | | |
| **Winner: Partitioned datasets** - Same efficiency as separate files with full dataset queryability! | |
| ## Available Partitioned Datasets | |
| All files with state information can be partitioned: | |
| - `nonprofits_organizations/` (62 state partitions) | |
| - `nonprofits_locations/` (62 state partitions) | |
| - `nonprofits_financials/` (62 state partitions) | |
| - `nonprofits_programs/` (62 state partitions) | |
| - `jurisdictions_cities/` (52 state partitions) | |
| - `jurisdictions_counties/` (52 state partitions) | |
| - `jurisdictions_school_districts/` (52 state partitions) | |
| - `jurisdictions_townships/` (52 state partitions) | |
| - `domains_gsa_domains/` (56 state partitions) | |
| ## Uploading to HuggingFace | |
| Partitioned datasets work great with HuggingFace: | |
| ```python | |
| from datasets import Dataset | |
| import pandas as pd | |
| # Read partitioned data | |
| df = pd.read_parquet('data/gold/nonprofits_organizations', | |
| filters=[('state', '=', 'AL')]) | |
| # Upload state-specific subset | |
| dataset = Dataset.from_pandas(df) | |
| dataset.push_to_hub("CommunityOne/one-data-AL") | |
| ``` | |
| Or upload the entire partitioned structure: | |
| ```bash | |
| # Upload partitioned directory to HuggingFace | |
| # Each state becomes a separate shard | |
| huggingface-cli upload CommunityOne/one-nonprofits \ | |
| data/gold/nonprofits_organizations \ | |
| --repo-type dataset | |
| ``` | |
| ## Best Practices | |
| 1. **Always use filters** when reading partitioned data for specific states | |
| ```python | |
| # ✅ Efficient | |
| df = pd.read_parquet('path', filters=[('state', '=', 'AL')]) | |
| # ❌ Inefficient (reads all partitions then filters) | |
| df = pd.read_parquet('path') | |
| df = df[df['state'] == 'AL'] | |
| ``` | |
| 2. **Use `in` operator for multiple states** | |
| ```python | |
| df = pd.read_parquet('path', | |
| filters=[('state', 'in', ['AL', 'GA', 'FL'])]) | |
| ``` | |
| 3. **Combine with other filters** for maximum efficiency | |
| ```python | |
| df = pd.read_parquet('path', | |
| filters=[ | |
| ('state', '=', 'AL'), | |
| ('revenue', '>', 1000000) | |
| ]) | |
| ``` | |
| ## Migration from Separate Files | |
| If you have code using separate files: | |
| ```python | |
| # Old approach | |
| df = pd.read_parquet('data/gold/by_state/nonprofits_organizations_AL.parquet') | |
| # New approach (equivalent) | |
| df = pd.read_parquet('data/gold/nonprofits_organizations', | |
| filters=[('state', '=', 'AL')]) | |
| ``` | |
| Both work identically! The partitioned approach is recommended for new code. | |