HuggingFace Dataset Integration
Push your nonprofit data to HuggingFace Hub and query it from your React application using the free Datasets Server API (no authentication required for public datasets!).
Overview
With 1.9M+ nonprofits now available from IRS EO-BMF, you can:
- Upload all 4 nonprofit gold tables to HuggingFace (free unlimited storage)
- Query datasets from React using HuggingFace Datasets Server API
- Search nonprofits by name, state, NTEE code, or keywords
- Paginate through millions of records efficiently
Key Benefits:
- ✅ Free unlimited storage (public datasets)
- ✅ No authentication required for reading public datasets
- ✅ REST API: works from any language (Python, JavaScript, curl)
- ✅ Automatic caching and CDN delivery by HuggingFace
- ✅ Built-in full-text search
Step 1: Upload Datasets to HuggingFace
Prerequisites
# Install HuggingFace libraries
pip install huggingface_hub datasets pyarrow
# Get your token from https://huggingface.co/settings/tokens
export HUGGINGFACE_TOKEN="hf_YOUR_TOKEN_HERE"
Add to .env:
HUGGINGFACE_TOKEN=hf_your_write_token_here
Upload All Nonprofit Tables
cd /home/developer/projects/open-navigator
# Upload all 4 tables (organizations, financials, programs, locations)
python scripts/upload_nonprofits_to_hf.py --all
# Upload specific table
python scripts/upload_nonprofits_to_hf.py --table organizations
# Upload to your own repo (change username)
python scripts/upload_nonprofits_to_hf.py --all --repo "your-username/nonprofits"
Expected Output:
✅ Logged in to Hugging Face
✅ Repository ready: https://huggingface.co/datasets/CommunityOne/one-nonprofits
📤 Uploading organizations from data/gold/nonprofits_organizations.parquet
Rows: 1,952,238
Columns: 28
Size: 156.43 MB
Pushing to CommunityOne/one-nonprofits (split: organizations)
✅ Uploaded organizations: 1,952,238 records
View at: https://huggingface.co/datasets/CommunityOne/one-nonprofits/viewer/organizations
📤 Uploading financials from data/gold/nonprofits_financials.parquet
...
All uploads complete!
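The upload script itself is not shown in this guide. As a rough sketch of what such a script might do, the following pushes each Parquet file as its own split using `Dataset.from_parquet` and `push_to_hub` from the `datasets` library. The helper names are hypothetical, and the `data/gold/nonprofits_<table>.parquet` pattern is an assumption inferred from the two paths in the log output above.

```python
import os
from pathlib import Path

# Table names come from this guide; the path pattern is an assumption
# inferred from the log output (only two of the four paths are shown there).
TABLES = ("organizations", "financials", "programs", "locations")
GOLD_DIR = Path("data/gold")


def parquet_path(table: str) -> Path:
    """Return the assumed local Parquet path for one gold table."""
    if table not in TABLES:
        raise ValueError(f"unknown table: {table!r}")
    return GOLD_DIR / f"nonprofits_{table}.parquet"


def upload_table(table: str, repo_id: str) -> None:
    """Push one Parquet file to the Hub as a split named after the table."""
    # Imported lazily so the path helper works without `datasets` installed.
    from datasets import Dataset

    token = os.environ.get("HUGGINGFACE_TOKEN")
    if not token:
        raise RuntimeError("HUGGINGFACE_TOKEN is not set")

    dataset = Dataset.from_parquet(str(parquet_path(table)))
    dataset.push_to_hub(repo_id, split=table, token=token)


def upload_all(repo_id: str) -> None:
    """Upload every gold table to the same dataset repository."""
    for table in TABLES:
        upload_table(table, repo_id)
```

Calling `upload_all("your-username/nonprofits")` would correspond roughly to the `--all --repo` invocation above; the real script may differ in logging, validation, and repository creation.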
What Gets Uploaded
| Table | Records | Description |
|---|---|---|
| organizations | 1.9M+ | Main nonprofit data (EIN, name, NTEE, subsection) |
| financials | 1.9M+ | Assets, income, revenue, ruling date |
| programs | 1.9M+ | Activity codes, group affiliation |
| locations | 1.9M+ | Address, city, state, ZIP code |
Step 2: Query from Python
Basic Query
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("CommunityOne/one-nonprofits")
# Access specific tables (splits)
orgs = dataset["organizations"]
financials = dataset["financials"]
locations = dataset["locations"]
print(f"Total organizations: {len(orgs):,}")
# Output: Total organizations: 1,952,238
Convert to Pandas
import pandas as pd
# Convert the split to a pandas DataFrame (Dataset.to_pandas() avoids row-by-row iteration)
df = dataset["organizations"].to_pandas()
# Filter by state
alabama = df[df['state'] == 'AL']
print(f"Alabama nonprofits: {len(alabama):,}")
# Output: Alabama nonprofits: 26,148
# Filter by NTEE category (E = Health)
health = df[df['ntee_code'].str.startswith('E', na=False)]
print(f"Health organizations: {len(health):,}")
# Output: Health organizations: 80,000+
Search by Keywords
# Search for "dental" in organization names
dental = df[df['name'].str.contains('dental', case=False, na=False)]
print(f"Dental organizations: {len(dental):,}")
# Filter dental orgs in California
ca_dental = dental[dental['state'] == 'CA']
print(f"California dental orgs: {len(ca_dental):,}")
Join Tables
# Join organizations with financials
orgs_df = dataset["organizations"].to_pandas()
fin_df = dataset["financials"].to_pandas()
# Merge on EIN
combined = orgs_df.merge(fin_df, on='ein', how='left')
# Find high-revenue health organizations in NY
ny_health = combined[
(combined['state'] == 'NY') &
(combined['ntee_code'].str.startswith('E', na=False)) &
(combined['revenue_amount'] > 1_000_000)
]
print(f"High-revenue NY health orgs: {len(ny_health):,}")
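Loading 1.9M+ rows into pandas can take a while and uses a fair amount of memory. If only a handful of matching records is needed, one alternative is to stream the split and stop early. The sketch below assumes the `state`, `ntee_code`, and `name` columns used in the examples above; `matches`, `filter_rows`, and `stream_split` are hypothetical helper names.

```python
from typing import Any, Iterable, Iterator, Mapping, Optional

Row = Mapping[str, Any]


def matches(row: Row, state: Optional[str] = None,
            ntee_prefix: Optional[str] = None,
            keyword: Optional[str] = None) -> bool:
    """Return True if the row satisfies every criterion that was supplied."""
    if state and row.get("state") != state:
        return False
    # Missing values are treated as empty strings, like na=False in pandas.
    if ntee_prefix and not (row.get("ntee_code") or "").startswith(ntee_prefix):
        return False
    if keyword and keyword.lower() not in (row.get("name") or "").lower():
        return False
    return True


def filter_rows(rows: Iterable[Row], limit: int,
                state: Optional[str] = None,
                ntee_prefix: Optional[str] = None,
                keyword: Optional[str] = None) -> Iterator[Row]:
    """Yield up to `limit` matching rows, then stop reading the source."""
    if limit <= 0:
        return
    found = 0
    for row in rows:
        if matches(row, state, ntee_prefix, keyword):
            yield row
            found += 1
            if found >= limit:
                return


def stream_split(repo_id: str, split: str):
    """Open one split in streaming mode (rows are fetched lazily)."""
    # Imported lazily so the filtering helpers work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)
```

For example, `filter_rows(stream_split("CommunityOne/one-nonprofits", "organizations"), limit=10, state="CA", keyword="dental")` yields at most ten California records whose name contains "dental". A streamed scan is still linear, so rare matches may take long to find.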
Step 3: Query from React/JavaScript
Install Utility
The HuggingFace query utility is already created at frontend/src/utils/huggingface.ts.
Basic Usage
import { fetchHFRows, searchHFDataset } from '../utils/huggingface';
// Fetch first 100 nonprofits
const response = await fetchHFRows({
dataset: "CommunityOne/one-nonprofits",
split: "organizations"
}, 0, 100);
const nonprofits = response.rows.map(r => r.row);
console.log(`Loaded ${nonprofits.length} nonprofits`);
console.log(`Total available: ${response.num_rows_total.toLocaleString()}`);
Search with React Query
import { useState } from 'react';
import { useQuery } from '@tanstack/react-query';
import { searchNonprofits } from '../utils/huggingface';
function NonprofitSearch() {
const [searchTerm, setSearchTerm] = useState('dental');
const [state, setState] = useState('CA');
const { data: nonprofits, isLoading } = useQuery({
queryKey: ['nonprofits', searchTerm, state],
queryFn: async () => {
return await searchNonprofits({
dataset: "CommunityOne/one-nonprofits",
query: searchTerm,
state: state,
limit: 100
});
}
});
if (isLoading) return <div>Loading...</div>;
return (
<div>
<h2>Found {nonprofits?.length} nonprofits</h2>
{nonprofits?.map(org => (
<div key={org.ein}>
<h3>{org.name}</h3>
<p>NTEE: {org.ntee_code} | State: {org.state}</p>
</div>
))}
</div>
);
}
Pagination Example
import { useState } from 'react';
import { useQuery } from '@tanstack/react-query';
import { fetchHFRows } from '../utils/huggingface';
function NonprofitList() {
const [page, setPage] = useState(0);
const pageSize = 100;
const { data, isLoading } = useQuery({
queryKey: ['nonprofits', page],
queryFn: async () => {
return await fetchHFRows({
dataset: "CommunityOne/one-nonprofits",
split: "organizations"
}, page * pageSize, pageSize);
}
});
return (
<div>
{/* Display nonprofits */}
{data?.rows.map(r => (
<div key={r.row.ein}>{r.row.name}</div>
))}
{/* Pagination controls */}
<button onClick={() => setPage(p => Math.max(0, p - 1))}>
Previous
</button>
<span>Page {page + 1}</span>
<button onClick={() => setPage(p => p + 1)}>
Next
</button>
</div>
);
}
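The component above lets the user click Next indefinitely, even past the last page. The `/rows` endpoint returns at most 100 rows per request and reports `num_rows_total`, so the valid range can be computed up front. The arithmetic is sketched below in Python for brevity; `page_window` and `PageWindow` are hypothetical names, and the same logic ports directly to the React component.

```python
import math
from dataclasses import dataclass

MAX_LENGTH = 100  # the /rows endpoint returns at most 100 rows per request


@dataclass(frozen=True)
class PageWindow:
    offset: int         # value for the `offset` query parameter
    length: int         # value for the `length` query parameter
    page_count: int     # total number of pages (at least 1)
    has_previous: bool  # whether a Previous button should be enabled
    has_next: bool      # whether a Next button should be enabled


def page_window(page: int, total_rows: int, page_size: int = MAX_LENGTH) -> PageWindow:
    """Clamp a zero-based page index and compute its offset and length."""
    if not 1 <= page_size <= MAX_LENGTH:
        raise ValueError(f"page_size must be between 1 and {MAX_LENGTH}")
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")

    page_count = max(1, math.ceil(total_rows / page_size))
    page = min(max(page, 0), page_count - 1)  # clamp into the valid range
    offset = page * page_size
    length = max(0, min(page_size, total_rows - offset))
    return PageWindow(offset, length, page_count, page > 0, page < page_count - 1)
```

With the 1,952,238 organizations reported earlier, `page_window(0, 1_952_238)` gives 19,523 pages, and the last page (index 19522) holds the remaining 38 rows.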
Step 4: Update Existing Pages
Update Nonprofits Page
Edit frontend/src/pages/Nonprofits.tsx:
import { useState } from 'react';
import { useQuery } from '@tanstack/react-query';
import { searchNonprofits } from '../utils/huggingface';
const DATASET_NAME = "CommunityOne/one-nonprofits";
export default function Nonprofits() {
const [state, setState] = useState<string>('');
const [nteeCode, setNteeCode] = useState<string>('');
const [searchQuery, setSearchQuery] = useState<string>('');
const { data: nonprofits, isLoading } = useQuery({
queryKey: ['nonprofits', state, nteeCode, searchQuery],
queryFn: async () => {
return await searchNonprofits({
dataset: DATASET_NAME,
query: searchQuery || undefined,
state: state || undefined,
nteeCode: nteeCode || undefined,
limit: 100
});
}
});
return (
<div className="p-6">
<h1>Nonprofits ({nonprofits?.length || 0} found)</h1>
{/* Filters */}
<div className="filters">
<input
type="text"
placeholder="Search by name..."
value={searchQuery}
onChange={e => setSearchQuery(e.target.value)}
/>
<select value={state} onChange={e => setState(e.target.value)}>
<option value="">All States</option>
<option value="AL">Alabama</option>
<option value="CA">California</option>
<option value="NY">New York</option>
{/* Add all 50 states */}
</select>
<select value={nteeCode} onChange={e => setNteeCode(e.target.value)}>
<option value="">All Categories</option>
<option value="E">Health (E)</option>
<option value="P">Human Services (P)</option>
<option value="X">Religion (X)</option>
{/* Add all NTEE codes */}
</select>
</div>
{/* Results */}
{isLoading ? (
<div>Loading...</div>
) : (
<div className="results">
{nonprofits?.map(org => (
<div key={org.ein} className="nonprofit-card">
<h3>{org.name}</h3>
<p>EIN: {org.ein}</p>
<p>NTEE: {org.ntee_code}</p>
<p>Location: {org.city}, {org.state} {org.zip_code}</p>
{org.revenue_amount != null && (
<p>Revenue: ${org.revenue_amount.toLocaleString()}</p>
)}
</div>
))}
</div>
)}
</div>
);
}
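`searchNonprofits` lives in `frontend/src/utils/huggingface.ts`, whose implementation is not shown in this guide. One plausible way to apply the state and NTEE filters on the server is the Datasets Server `/filter` endpoint, which takes a SQL-style `where` parameter. The sketch below (in Python for brevity) only builds such a URL. The helper names are hypothetical, only equality conditions are covered, and the exact `where` syntax should be checked against the Datasets Server documentation.

```python
from typing import Mapping
from urllib.parse import quote, urlencode

FILTER_URL = "https://datasets-server.huggingface.co/filter"


def sql_literal(value: str) -> str:
    """Quote a string for a SQL condition, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def where_clause(conditions: Mapping[str, str]) -> str:
    """Join non-empty column/value pairs into "col"='value' AND ... form."""
    parts = []
    for column, value in conditions.items():
        if not value:
            continue  # skip filters the user left blank
        if '"' in column:
            raise ValueError(f"invalid column name: {column!r}")
        parts.append(f'"{column}"={sql_literal(value)}')
    return " AND ".join(parts)


def filter_url(dataset: str, split: str, conditions: Mapping[str, str],
               offset: int = 0, length: int = 100, config: str = "default") -> str:
    """Build a /filter request URL for the given equality conditions."""
    where = where_clause(conditions)
    if not where:
        raise ValueError("no filter conditions given; use /rows instead")
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "where": where,
        "offset": offset,
        "length": length,
    }
    # quote (not quote_plus) encodes spaces as %20 rather than '+'.
    return f"{FILTER_URL}?{urlencode(params, quote_via=quote)}"
```

Filtering on the server this way keeps the browser from downloading pages of rows only to discard most of them.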
Step 5: Add Advanced Features
Autocomplete Search
import { useState, useEffect } from 'react';
import { searchHFDataset } from '../utils/huggingface';
function NonprofitAutocomplete() {
const [query, setQuery] = useState('');
const [suggestions, setSuggestions] = useState<any[]>([]);
useEffect(() => {
if (query.length < 3) {
setSuggestions([]);
return;
}
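// Guard against a slow response overwriting suggestions for a newer query
let cancelled = false;
const fetchSuggestions = async () => {
const response = await searchHFDataset({
dataset: "CommunityOne/one-nonprofits",
split: "organizations"
}, query, 0, 10);
if (!cancelled) setSuggestions(response.rows.map(r => r.row));
};
// Debounce: wait 300 ms after the last keystroke before searching
const timeoutId = setTimeout(fetchSuggestions, 300);
return () => {
cancelled = true;
clearTimeout(timeoutId);
};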
}, [query]);
return (
<div>
<input
type="text"
value={query}
onChange={e => setQuery(e.target.value)}
placeholder="Search nonprofits..."
/>
{suggestions.length > 0 && (
<ul>
{suggestions.map(org => (
<li key={org.ein}>{org.name} - {org.city}, {org.state}</li>
))}
</ul>
)}
</div>
);
}
Map Visualization
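import { useState } from 'react';
import { useQuery } from '@tanstack/react-query';
import { fetchNonprofitsByState } from '../utils/huggingface';
// Note: <Map> below stands in for your own map component; it is not imported here,
// and the latitude/longitude fields are assumed to exist on each record.
function NonprofitMap() {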
const [selectedState, setSelectedState] = useState('CA');
const { data: nonprofits } = useQuery({
queryKey: ['nonprofits-map', selectedState],
queryFn: async () => {
return await fetchNonprofitsByState(
"CommunityOne/one-nonprofits",
selectedState,
1000
);
}
});
return (
<div>
<select value={selectedState} onChange={e => setSelectedState(e.target.value)}>
{/* State options */}
</select>
<Map
markers={nonprofits?.map(org => ({
lat: org.latitude,
lng: org.longitude,
name: org.name
}))}
/>
</div>
);
}
API Reference
Python Functions
from datasets import load_dataset
import pandas as pd
# Load dataset
dataset = load_dataset("CommunityOne/one-nonprofits")
# Get specific split
orgs = dataset["organizations"]
financials = dataset["financials"]
programs = dataset["programs"]
locations = dataset["locations"]
# Convert to pandas
df = orgs.to_pandas()
# Filter
filtered = df[df['state'] == 'CA']
# Search (na=False treats missing names as non-matches)
results = df[df['name'].str.contains('dental', case=False, na=False)]
JavaScript Functions
import {
fetchHFRows, // Fetch paginated rows
searchHFDataset, // Full-text search
getHFDatasetSize, // Get total row count
fetchAllNonprofits, // Fetch multiple pages
fetchNonprofitsByState,// Filter by state
fetchNonprofitsByNTEE, // Filter by NTEE code
searchNonprofits // Combined search + filters
} from '../utils/huggingface';
REST API (No Auth Required!)
# Get first 100 organizations
curl "https://datasets-server.huggingface.co/rows?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&offset=0&length=100"
# Search for "dental"
curl "https://datasets-server.huggingface.co/search?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&query=dental&offset=0&length=100"
# Get dataset size
curl "https://datasets-server.huggingface.co/size?dataset=CommunityOne/one-nonprofits&config=default&split=organizations"
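The same `/rows` request can be made from Python with only the standard library. This is a minimal sketch: `build_url`, `extract_rows`, and `fetch_rows` are hypothetical helper names, and the response shape (`rows[].row`, `num_rows_total`) follows the fields used by the JavaScript examples in this guide.

```python
import json
from typing import Any
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://datasets-server.huggingface.co"


def build_url(endpoint: str, dataset: str, split: str,
              config: str = "default", **params: Any) -> str:
    """Build a Datasets Server URL such as /rows or /search."""
    query = {"dataset": dataset, "config": config, "split": split, **params}
    return f"{BASE_URL}/{endpoint}?{urlencode(query)}"


def extract_rows(payload: dict) -> list[dict]:
    """Unwrap the per-row envelopes, keeping only the record under "row"."""
    return [item["row"] for item in payload.get("rows", [])]


def fetch_rows(dataset: str, split: str, offset: int = 0,
               length: int = 100, timeout: float = 30.0) -> tuple[list[dict], int]:
    """Fetch one page of rows and return (records, total row count)."""
    url = build_url("rows", dataset, split, offset=offset, length=length)
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return extract_rows(payload), payload["num_rows_total"]
```

For a search request, `build_url("search", "CommunityOne/one-nonprofits", "organizations", query="dental", offset=0, length=100)` produces the equivalent of the second curl command.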
Next Steps
Upload your datasets:
python scripts/upload_nonprofits_to_hf.py --all
Test the API:
curl "https://datasets-server.huggingface.co/rows?dataset=YOUR_USERNAME/YOUR_DATASET&config=default&split=organizations&offset=0&length=10"
Update your React pages:
- Replace local API calls with HuggingFace queries
- Add pagination for large datasets
- Implement autocomplete search
- Create map visualizations
Monitor usage:
- Visit: https://huggingface.co/datasets/YOUR_USERNAME/YOUR_DATASET
- Check downloads, views, and API usage
Additional Resources
- HuggingFace Datasets Docs: https://huggingface.co/docs/datasets
- Datasets Server API: https://huggingface.co/docs/datasets-server
- IRS EO-BMF Data Source: https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf
- NTEE Codes Reference: IRS Bulk Data Integration