--- sidebar_position: 8 --- # HuggingFace Dataset Integration Push your nonprofit data to HuggingFace Hub and query it from your React application using the **free** Datasets Server API (no authentication required for public datasets!). ## 🎯 Overview With 1.9M+ nonprofits now available from IRS EO-BMF, you can: 1. **Upload** all 4 nonprofit gold tables to HuggingFace (free unlimited storage) 2. **Query** datasets from React using HuggingFace Datasets Server API 3. **Search** nonprofits by name, state, NTEE code, or keywords 4. **Paginate** through millions of records efficiently **Key Benefits:** - ✅ **Free unlimited storage** (public datasets) - ✅ **No authentication required** for reading public datasets - ✅ **REST API** - works from any language (Python, JavaScript, curl) - ✅ **Automatic caching** and CDN delivery by HuggingFace - ✅ **Searchable** with full-text search built-in ## 📤 Step 1: Upload Datasets to HuggingFace ### Prerequisites ```bash # Install HuggingFace libraries pip install huggingface_hub datasets pyarrow # Get your token from https://huggingface.co/settings/tokens export HUGGINGFACE_TOKEN="hf_YOUR_TOKEN_HERE" ``` Add to `.env`: ```bash HUGGINGFACE_TOKEN=hf_your_write_token_here ``` ### Upload All Nonprofit Tables ```bash cd /home/developer/projects/open-navigator # Upload all 4 tables (organizations, financials, programs, locations) python scripts/upload_nonprofits_to_hf.py --all # Upload specific table python scripts/upload_nonprofits_to_hf.py --table organizations # Upload to your own repo (change username) python scripts/upload_nonprofits_to_hf.py --all --repo "your-username/nonprofits" ``` **Expected Output:** ``` ✅ Logged in to Hugging Face ✅ Repository ready: https://huggingface.co/datasets/CommunityOne/one-nonprofits 📤 Uploading organizations from data/gold/nonprofits_organizations.parquet Rows: 1,952,238 Columns: 28 Size: 156.43 MB Pushing to CommunityOne/one-nonprofits (split: organizations) ✅ Uploaded organizations: 1,952,238 records View at: https://huggingface.co/datasets/CommunityOne/one-nonprofits/viewer/organizations 📤 Uploading financials from data/gold/nonprofits_financials.parquet ... 🎉 All uploads complete! ``` ### What Gets Uploaded | Table | Records | Description | |-------|---------|-------------| | **organizations** | 1.9M+ | Main nonprofit data (EIN, name, NTEE, subsection) | | **financials** | 1.9M+ | Assets, income, revenue, ruling date | | **programs** | 1.9M+ | Activity codes, group affiliation | | **locations** | 1.9M+ | Address, city, state, ZIP code | ## 🔍 Step 2: Query from Python ### Basic Query ```python from datasets import load_dataset # Load the dataset dataset = load_dataset("CommunityOne/one-nonprofits") # Access specific tables (splits) orgs = dataset["organizations"] financials = dataset["financials"] locations = dataset["locations"] print(f"Total organizations: {len(orgs):,}") # Output: Total organizations: 1,952,238 ``` ### Convert to Pandas ```python import pandas as pd # Load as pandas DataFrame df = pd.DataFrame(dataset["organizations"]) # Filter by state alabama = df[df['state'] == 'AL'] print(f"Alabama nonprofits: {len(alabama):,}") # Output: Alabama nonprofits: 26,148 # Filter by NTEE category (E = Health) health = df[df['ntee_code'].str.startswith('E', na=False)] print(f"Health organizations: {len(health):,}") # Output: Health organizations: 80,000+ ``` ### Search by Keywords ```python # Search for "dental" in organization names dental = df[df['name'].str.contains('dental', case=False, na=False)] print(f"Dental organizations: {len(dental):,}") # Filter dental orgs in California ca_dental = dental[dental['state'] == 'CA'] print(f"California dental orgs: {len(ca_dental):,}") ``` ### Join Tables ```python # Join organizations with financials orgs_df = pd.DataFrame(dataset["organizations"]) fin_df = pd.DataFrame(dataset["financials"]) # Merge on EIN combined = orgs_df.merge(fin_df, on='ein', how='left') # Find high-revenue health organizations in NY ny_health = combined[ (combined['state'] == 'NY') & (combined['ntee_code'].str.startswith('E', na=False)) & (combined['revenue_amount'] > 1_000_000) ] print(f"High-revenue NY health orgs: {len(ny_health):,}") ``` ## 🌐 Step 3: Query from React/JavaScript ### Install Utility The HuggingFace query utility is already created at [`frontend/src/utils/huggingface.ts`](../../frontend/src/utils/huggingface.ts). ### Basic Usage ```typescript import { fetchHFRows, searchHFDataset } from '../utils/huggingface'; // Fetch first 100 nonprofits const response = await fetchHFRows({ dataset: "CommunityOne/one-nonprofits", split: "organizations" }, 0, 100); const nonprofits = response.rows.map(r => r.row); console.log(`Loaded ${nonprofits.length} nonprofits`); console.log(`Total available: ${response.num_rows_total:,}`); ``` ### Search with React Query ```typescript import { useQuery } from '@tanstack/react-query'; import { searchNonprofits } from '../utils/huggingface'; function NonprofitSearch() { const [searchTerm, setSearchTerm] = useState('dental'); const [state, setState] = useState('CA'); const { data: nonprofits, isLoading } = useQuery({ queryKey: ['nonprofits', searchTerm, state], queryFn: async () => { return await searchNonprofits({ dataset: "CommunityOne/one-nonprofits", query: searchTerm, state: state, limit: 100 }); } }); if (isLoading) return
Loading...
; return (

Found {nonprofits?.length} nonprofits

{nonprofits?.map(org => (

{org.name}

NTEE: {org.ntee_code} | State: {org.state}

))}
); } ``` ### Pagination Example ```typescript import { useState } from 'react'; import { fetchHFRows } from '../utils/huggingface'; function NonprofitList() { const [page, setPage] = useState(0); const pageSize = 100; const { data, isLoading } = useQuery({ queryKey: ['nonprofits', page], queryFn: async () => { return await fetchHFRows({ dataset: "CommunityOne/one-nonprofits", split: "organizations" }, page * pageSize, pageSize); } }); return (
{/* Display nonprofits */} {data?.rows.map(r => (
{r.row.name}
))} {/* Pagination controls */} Page {page + 1}
); } ``` ## 🔄 Step 4: Update Existing Pages ### Update Nonprofits Page Edit [`frontend/src/pages/Nonprofits.tsx`](../../frontend/src/pages/Nonprofits.tsx): ```typescript import { useQuery } from '@tanstack/react-query'; import { searchNonprofits } from '../utils/huggingface'; const DATASET_NAME = "CommunityOne/one-nonprofits"; export default function Nonprofits() { const [state, setState] = useState(''); const [nteeCode, setNteeCode] = useState(''); const [searchQuery, setSearchQuery] = useState(''); const { data: nonprofits, isLoading } = useQuery({ queryKey: ['nonprofits', state, nteeCode, searchQuery], queryFn: async () => { return await searchNonprofits({ dataset: DATASET_NAME, query: searchQuery || undefined, state: state || undefined, nteeCode: nteeCode || undefined, limit: 100 }); } }); return (

Nonprofits ({nonprofits?.length || 0} found)

{/* Filters */}
setSearchQuery(e.target.value)} />
{/* Results */} {isLoading ? (
Loading...
) : (
{nonprofits?.map(org => (

{org.name}

EIN: {org.ein}

NTEE: {org.ntee_code}

Location: {org.city}, {org.state} {org.zip_code}

{org.revenue_amount && (

Revenue: ${org.revenue_amount.toLocaleString()}

)}
))}
)}
); } ``` ## 📊 Step 5: Add Advanced Features ### Autocomplete Search ```typescript import { useState, useEffect } from 'react'; import { searchHFDataset } from '../utils/huggingface'; function NonprofitAutocomplete() { const [query, setQuery] = useState(''); const [suggestions, setSuggestions] = useState([]); useEffect(() => { if (query.length < 3) { setSuggestions([]); return; } const fetchSuggestions = async () => { const response = await searchHFDataset({ dataset: "CommunityOne/one-nonprofits", split: "organizations" }, query, 0, 10); setSuggestions(response.rows.map(r => r.row)); }; const timeoutId = setTimeout(fetchSuggestions, 300); return () => clearTimeout(timeoutId); }, [query]); return (
setQuery(e.target.value)} placeholder="Search nonprofits..." /> {suggestions.length > 0 && (
    {suggestions.map(org => (
  • {org.name} - {org.city}, {org.state}
  • ))}
)}
); } ``` ### Map Visualization ```typescript import { useQuery } from '@tanstack/react-query'; import { fetchNonprofitsByState } from '../utils/huggingface'; function NonprofitMap() { const [selectedState, setSelectedState] = useState('CA'); const { data: nonprofits } = useQuery({ queryKey: ['nonprofits-map', selectedState], queryFn: async () => { return await fetchNonprofitsByState( "CommunityOne/one-nonprofits", selectedState, 1000 ); } }); return (
({ lat: org.latitude, lng: org.longitude, name: org.name }))} />
); } ``` ## 🚀 API Reference ### Python Functions ```python from datasets import load_dataset import pandas as pd # Load dataset dataset = load_dataset("CommunityOne/one-nonprofits") # Get specific split orgs = dataset["organizations"] financials = dataset["financials"] programs = dataset["programs"] locations = dataset["locations"] # Convert to pandas df = pd.DataFrame(orgs) # Filter filtered = df[df['state'] == 'CA'] # Search results = df[df['name'].str.contains('dental', case=False)] ``` ### JavaScript Functions ```typescript import { fetchHFRows, // Fetch paginated rows searchHFDataset, // Full-text search getHFDatasetSize, // Get total row count fetchAllNonprofits, // Fetch multiple pages fetchNonprofitsByState,// Filter by state fetchNonprofitsByNTEE, // Filter by NTEE code searchNonprofits // Combined search + filters } from '../utils/huggingface'; ``` ### REST API (No Auth Required!) ```bash # Get first 100 organizations curl "https://datasets-server.huggingface.co/rows?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&offset=0&length=100" # Search for "dental" curl "https://datasets-server.huggingface.co/search?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&query=dental&offset=0&length=100" # Get dataset size curl "https://datasets-server.huggingface.co/size?dataset=CommunityOne/one-nonprofits&config=default&split=organizations" ``` ## 🎯 Next Steps 1. **Upload your datasets:** ```bash python scripts/upload_nonprofits_to_hf.py --all ``` 2. **Test the API:** ```bash curl "https://datasets-server.huggingface.co/rows?dataset=YOUR_USERNAME/YOUR_DATASET&config=default&split=organizations&offset=0&length=10" ``` 3. **Update your React pages:** - Replace local API calls with HuggingFace queries - Add pagination for large datasets - Implement autocomplete search - Create map visualizations 4. **Monitor usage:** - Visit: https://huggingface.co/datasets/YOUR_USERNAME/YOUR_DATASET - Check downloads, views, and API usage ## 📚 Additional Resources - **HuggingFace Datasets Docs:** https://huggingface.co/docs/datasets - **Datasets Server API:** https://huggingface.co/docs/datasets-server - **IRS EO-BMF Data Source:** https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf - **NTEE Codes Reference:** [IRS Bulk Data Integration](../data-sources/irs-bulk-data.md#ntee-national-taxonomy-of-exempt-entities)