open-navigator / website /docs /guides /huggingface-datasets.md
jcbowyer's picture
Clean HuggingFace deployment without binary files
61d29fc
metadata
sidebar_position: 8

HuggingFace Dataset Integration

Push your nonprofit data to HuggingFace Hub and query it from your React application using the free Datasets Server API (no authentication required for public datasets!).

🎯 Overview

With 1.9M+ nonprofits now available from IRS EO-BMF, you can:

  1. Upload all 4 nonprofit gold tables to HuggingFace (free unlimited storage)
  2. Query datasets from React using HuggingFace Datasets Server API
  3. Search nonprofits by name, state, NTEE code, or keywords
  4. Paginate through millions of records efficiently

Key Benefits:

  • βœ… Free unlimited storage (public datasets)
  • βœ… No authentication required for reading public datasets
  • βœ… REST API - works from any language (Python, JavaScript, curl)
  • βœ… Automatic caching and CDN delivery by HuggingFace
  • βœ… Searchable with full-text search built-in

πŸ“€ Step 1: Upload Datasets to HuggingFace

Prerequisites

# Install HuggingFace libraries
pip install huggingface_hub datasets pyarrow

# Get your token from https://huggingface.co/settings/tokens
export HUGGINGFACE_TOKEN="hf_YOUR_TOKEN_HERE"

Add to .env:

HUGGINGFACE_TOKEN=hf_your_write_token_here

Upload All Nonprofit Tables

cd /home/developer/projects/open-navigator

# Upload all 4 tables (organizations, financials, programs, locations)
python scripts/upload_nonprofits_to_hf.py --all

# Upload specific table
python scripts/upload_nonprofits_to_hf.py --table organizations

# Upload to your own repo (change username)
python scripts/upload_nonprofits_to_hf.py --all --repo "your-username/nonprofits"

Expected Output:

βœ… Logged in to Hugging Face
βœ… Repository ready: https://huggingface.co/datasets/CommunityOne/one-nonprofits
πŸ“€ Uploading organizations from data/gold/nonprofits_organizations.parquet
  Rows: 1,952,238
  Columns: 28
  Size: 156.43 MB
  Pushing to CommunityOne/one-nonprofits (split: organizations)
βœ… Uploaded organizations: 1,952,238 records
   View at: https://huggingface.co/datasets/CommunityOne/one-nonprofits/viewer/organizations
πŸ“€ Uploading financials from data/gold/nonprofits_financials.parquet
  ...
πŸŽ‰ All uploads complete!

What Gets Uploaded

Table Records Description
organizations 1.9M+ Main nonprofit data (EIN, name, NTEE, subsection)
financials 1.9M+ Assets, income, revenue, ruling date
programs 1.9M+ Activity codes, group affiliation
locations 1.9M+ Address, city, state, ZIP code

πŸ” Step 2: Query from Python

Basic Query

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("CommunityOne/one-nonprofits")

# Access specific tables (splits)
orgs = dataset["organizations"]
financials = dataset["financials"]
locations = dataset["locations"]

print(f"Total organizations: {len(orgs):,}")
# Output: Total organizations: 1,952,238

Convert to Pandas

import pandas as pd

# Load as pandas DataFrame
df = pd.DataFrame(dataset["organizations"])

# Filter by state
alabama = df[df['state'] == 'AL']
print(f"Alabama nonprofits: {len(alabama):,}")
# Output: Alabama nonprofits: 26,148

# Filter by NTEE category (E = Health)
health = df[df['ntee_code'].str.startswith('E', na=False)]
print(f"Health organizations: {len(health):,}")
# Output: Health organizations: 80,000+

Search by Keywords

# Search for "dental" in organization names
dental = df[df['name'].str.contains('dental', case=False, na=False)]
print(f"Dental organizations: {len(dental):,}")

# Filter dental orgs in California
ca_dental = dental[dental['state'] == 'CA']
print(f"California dental orgs: {len(ca_dental):,}")

Join Tables

# Join organizations with financials
orgs_df = pd.DataFrame(dataset["organizations"])
fin_df = pd.DataFrame(dataset["financials"])

# Merge on EIN
combined = orgs_df.merge(fin_df, on='ein', how='left')

# Find high-revenue health organizations in NY
ny_health = combined[
    (combined['state'] == 'NY') & 
    (combined['ntee_code'].str.startswith('E', na=False)) &
    (combined['revenue_amount'] > 1_000_000)
]
print(f"High-revenue NY health orgs: {len(ny_health):,}")

🌐 Step 3: Query from React/JavaScript

Install Utility

The HuggingFace query utility is already created at frontend/src/utils/huggingface.ts.

Basic Usage

import { fetchHFRows, searchHFDataset } from '../utils/huggingface';

// Fetch first 100 nonprofits
const response = await fetchHFRows({
  dataset: "CommunityOne/one-nonprofits",
  split: "organizations"
}, 0, 100);

const nonprofits = response.rows.map(r => r.row);
console.log(`Loaded ${nonprofits.length} nonprofits`);
console.log(`Total available: ${response.num_rows_total:,}`);

Search with React Query

import { useQuery } from '@tanstack/react-query';
import { searchNonprofits } from '../utils/huggingface';

function NonprofitSearch() {
  const [searchTerm, setSearchTerm] = useState('dental');
  const [state, setState] = useState('CA');
  
  const { data: nonprofits, isLoading } = useQuery({
    queryKey: ['nonprofits', searchTerm, state],
    queryFn: async () => {
      return await searchNonprofits({
        dataset: "CommunityOne/one-nonprofits",
        query: searchTerm,
        state: state,
        limit: 100
      });
    }
  });
  
  if (isLoading) return <div>Loading...</div>;
  
  return (
    <div>
      <h2>Found {nonprofits?.length} nonprofits</h2>
      {nonprofits?.map(org => (
        <div key={org.ein}>
          <h3>{org.name}</h3>
          <p>NTEE: {org.ntee_code} | State: {org.state}</p>
        </div>
      ))}
    </div>
  );
}

Pagination Example

import { useState } from 'react';
import { fetchHFRows } from '../utils/huggingface';

function NonprofitList() {
  const [page, setPage] = useState(0);
  const pageSize = 100;
  
  const { data, isLoading } = useQuery({
    queryKey: ['nonprofits', page],
    queryFn: async () => {
      return await fetchHFRows({
        dataset: "CommunityOne/one-nonprofits",
        split: "organizations"
      }, page * pageSize, pageSize);
    }
  });
  
  return (
    <div>
      {/* Display nonprofits */}
      {data?.rows.map(r => (
        <div key={r.row.ein}>{r.row.name}</div>
      ))}
      
      {/* Pagination controls */}
      <button onClick={() => setPage(p => Math.max(0, p - 1))}>
        Previous
      </button>
      <span>Page {page + 1}</span>
      <button onClick={() => setPage(p => p + 1)}>
        Next
      </button>
    </div>
  );
}

πŸ”„ Step 4: Update Existing Pages

Update Nonprofits Page

Edit frontend/src/pages/Nonprofits.tsx:

import { useQuery } from '@tanstack/react-query';
import { searchNonprofits } from '../utils/huggingface';

const DATASET_NAME = "CommunityOne/one-nonprofits";

export default function Nonprofits() {
  const [state, setState] = useState<string>('');
  const [nteeCode, setNteeCode] = useState<string>('');
  const [searchQuery, setSearchQuery] = useState<string>('');
  
  const { data: nonprofits, isLoading } = useQuery({
    queryKey: ['nonprofits', state, nteeCode, searchQuery],
    queryFn: async () => {
      return await searchNonprofits({
        dataset: DATASET_NAME,
        query: searchQuery || undefined,
        state: state || undefined,
        nteeCode: nteeCode || undefined,
        limit: 100
      });
    }
  });
  
  return (
    <div className="p-6">
      <h1>Nonprofits ({nonprofits?.length || 0} found)</h1>
      
      {/* Filters */}
      <div className="filters">
        <input
          type="text"
          placeholder="Search by name..."
          value={searchQuery}
          onChange={e => setSearchQuery(e.target.value)}
        />
        
        <select value={state} onChange={e => setState(e.target.value)}>
          <option value="">All States</option>
          <option value="AL">Alabama</option>
          <option value="CA">California</option>
          <option value="NY">New York</option>
          {/* Add all 50 states */}
        </select>
        
        <select value={nteeCode} onChange={e => setNteeCode(e.target.value)}>
          <option value="">All Categories</option>
          <option value="E">Health (E)</option>
          <option value="P">Human Services (P)</option>
          <option value="X">Religion (X)</option>
          {/* Add all NTEE codes */}
        </select>
      </div>
      
      {/* Results */}
      {isLoading ? (
        <div>Loading...</div>
      ) : (
        <div className="results">
          {nonprofits?.map(org => (
            <div key={org.ein} className="nonprofit-card">
              <h3>{org.name}</h3>
              <p>EIN: {org.ein}</p>
              <p>NTEE: {org.ntee_code}</p>
              <p>Location: {org.city}, {org.state} {org.zip_code}</p>
              {org.revenue_amount && (
                <p>Revenue: ${org.revenue_amount.toLocaleString()}</p>
              )}
            </div>
          ))}
        </div>
      )}
    </div>
  );
}

πŸ“Š Step 5: Add Advanced Features

Autocomplete Search

import { useState, useEffect } from 'react';
import { searchHFDataset } from '../utils/huggingface';

function NonprofitAutocomplete() {
  const [query, setQuery] = useState('');
  const [suggestions, setSuggestions] = useState<any[]>([]);
  
  useEffect(() => {
    if (query.length < 3) {
      setSuggestions([]);
      return;
    }
    
    const fetchSuggestions = async () => {
      const response = await searchHFDataset({
        dataset: "CommunityOne/one-nonprofits",
        split: "organizations"
      }, query, 0, 10);
      
      setSuggestions(response.rows.map(r => r.row));
    };
    
    const timeoutId = setTimeout(fetchSuggestions, 300);
    return () => clearTimeout(timeoutId);
  }, [query]);
  
  return (
    <div>
      <input
        type="text"
        value={query}
        onChange={e => setQuery(e.target.value)}
        placeholder="Search nonprofits..."
      />
      
      {suggestions.length > 0 && (
        <ul>
          {suggestions.map(org => (
            <li key={org.ein}>{org.name} - {org.city}, {org.state}</li>
          ))}
        </ul>
      )}
    </div>
  );
}

Map Visualization

import { useQuery } from '@tanstack/react-query';
import { fetchNonprofitsByState } from '../utils/huggingface';

function NonprofitMap() {
  const [selectedState, setSelectedState] = useState('CA');
  
  const { data: nonprofits } = useQuery({
    queryKey: ['nonprofits-map', selectedState],
    queryFn: async () => {
      return await fetchNonprofitsByState(
        "CommunityOne/one-nonprofits",
        selectedState,
        1000
      );
    }
  });
  
  return (
    <div>
      <select value={selectedState} onChange={e => setSelectedState(e.target.value)}>
        {/* State options */}
      </select>
      
      <Map
        markers={nonprofits?.map(org => ({
          lat: org.latitude,
          lng: org.longitude,
          name: org.name
        }))}
      />
    </div>
  );
}

πŸš€ API Reference

Python Functions

from datasets import load_dataset
import pandas as pd

# Load dataset
dataset = load_dataset("CommunityOne/one-nonprofits")

# Get specific split
orgs = dataset["organizations"]
financials = dataset["financials"]
programs = dataset["programs"]
locations = dataset["locations"]

# Convert to pandas
df = pd.DataFrame(orgs)

# Filter
filtered = df[df['state'] == 'CA']

# Search
results = df[df['name'].str.contains('dental', case=False)]

JavaScript Functions

import {
  fetchHFRows,           // Fetch paginated rows
  searchHFDataset,       // Full-text search
  getHFDatasetSize,      // Get total row count
  fetchAllNonprofits,    // Fetch multiple pages
  fetchNonprofitsByState,// Filter by state
  fetchNonprofitsByNTEE, // Filter by NTEE code
  searchNonprofits       // Combined search + filters
} from '../utils/huggingface';

REST API (No Auth Required!)

# Get first 100 organizations
curl "https://datasets-server.huggingface.co/rows?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&offset=0&length=100"

# Search for "dental"
curl "https://datasets-server.huggingface.co/search?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&query=dental&offset=0&length=100"

# Get dataset size
curl "https://datasets-server.huggingface.co/size?dataset=CommunityOne/one-nonprofits&config=default&split=organizations"

🎯 Next Steps

  1. Upload your datasets:

    python scripts/upload_nonprofits_to_hf.py --all
    
  2. Test the API:

    curl "https://datasets-server.huggingface.co/rows?dataset=YOUR_USERNAME/YOUR_DATASET&config=default&split=organizations&offset=0&length=10"
    
  3. Update your React pages:

    • Replace local API calls with HuggingFace queries
    • Add pagination for large datasets
    • Implement autocomplete search
    • Create map visualizations
  4. Monitor usage:

πŸ“š Additional Resources