---
sidebar_position: 9
---
# HuggingFace Dataset Integration - Quick Start Guide
## Overview
You now have **4 new files** for pushing the nonprofit datasets (1.9M+ organizations) to HuggingFace and querying them from React:
1. **`scripts/upload_nonprofits_to_hf.py`** - Upload script
2. **`frontend/src/utils/huggingface.ts`** - TypeScript API client
3. **`frontend/src/pages/NonprofitsHF.tsx`** - Example React page
4. **`website/docs/guides/huggingface-datasets.md`** - Complete documentation
---
## Quick Start (5 Steps)
### Step 1: Get HuggingFace Token
1. Visit: https://huggingface.co/settings/tokens
2. Click "New token"
3. Name it: `oral-health-upload`
4. Permission: **Write**
5. Copy the token (starts with `hf_...`)
### Step 2: Set Environment Variable
```bash
# Add to .env file
echo 'HUGGINGFACE_TOKEN=hf_YOUR_TOKEN_HERE' >> .env
# Or export for current session
export HUGGINGFACE_TOKEN="hf_YOUR_TOKEN_HERE"
```
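As a rough sketch of how a script might pick this variable up, the helper below reads `HUGGINGFACE_TOKEN` and fails early with a readable message. The function name `read_hf_token` is hypothetical and is not taken from `upload_nonprofits_to_hf.py`; the `hf_` prefix check simply mirrors the note in Step 1.

```python
import os


def read_hf_token(env=None):
    """Return the HuggingFace token from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACE_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HUGGINGFACE_TOKEN is not set; export it or add it to .env")
    if not token.startswith("hf_"):
        # Assumption: tokens start with "hf_", as described in Step 1.
        raise RuntimeError("HUGGINGFACE_TOKEN does not start with 'hf_'")
    return token
```

Keep in mind that Python does not read a `.env` file on its own. The value only reaches `os.environ` if the shell exports it or if the script loads the file explicitly, for example with a package such as python-dotenv.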
### Step 3: Install Dependencies
```bash
# Python dependencies
pip install huggingface_hub datasets pyarrow
# Note: datasets and huggingface_hub may already be installed in your project
```
### Step 4: Upload Datasets
```bash
cd /home/developer/projects/open-navigator
# Upload all 4 nonprofit tables
python scripts/upload_nonprofits_to_hf.py --all
# Output:
# ✅ Logged in to Hugging Face
# ✅ Repository ready: https://huggingface.co/datasets/CommunityOne/one-nonprofits
# 📤 Uploading organizations from data/gold/nonprofits_organizations.parquet
#    Rows: 1,952,238
#    Columns: 28
#    Size: 156.43 MB
# ✅ Uploaded organizations: 1,952,238 records
# ... (uploads financials, programs, locations)
# All uploads complete!
```
**What gets uploaded:**
- `nonprofits_organizations.parquet` → 1.9M+ orgs (split: "organizations")
- `nonprofits_financials.parquet` → Financial data (split: "financials")
- `nonprofits_programs.parquet` → Programs (split: "programs")
- `nonprofits_locations.parquet` → Locations (split: "locations")
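The file-to-split mapping above follows one naming pattern, so a pre-flight check before uploading can be sketched in a few lines. This is an illustration only: `TABLES`, `parquet_path`, and `missing_tables` are hypothetical names, and the `data/gold` directory is assumed from the sample output in Step 4.

```python
from pathlib import Path

# Split names listed above; each maps to nonprofits_<split>.parquet.
TABLES = ["organizations", "financials", "programs", "locations"]


def parquet_path(table, gold_dir="data/gold"):
    """Build the expected parquet path for one table."""
    return Path(gold_dir) / f"nonprofits_{table}.parquet"


def missing_tables(gold_dir="data/gold"):
    """Return the tables whose parquet file does not exist yet."""
    return [t for t in TABLES if not parquet_path(t, gold_dir).exists()]


if __name__ == "__main__":
    missing = missing_tables()
    if missing:
        print("Missing gold tables:", ", ".join(missing))
    else:
        print("All four gold tables are present.")
```

Running a check like this first makes a missing gold table show up before any upload starts, rather than partway through.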
### Step 5: Test the Dataset
```bash
# Test with curl (no auth required for public datasets!)
curl "https://datasets-server.huggingface.co/rows?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&offset=0&length=10" | jq .
# Search for "dental"
curl "https://datasets-server.huggingface.co/search?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&query=dental" | jq .
```
Expected response:
```json
{
  "features": [...],
  "rows": [
    {
      "row_idx": 0,
      "row": {
        "ein": "630123456",
        "name": "ALABAMA DENTAL ASSOCIATION",
        "city": "MONTGOMERY",
        "state": "AL",
        "ntee_code": "E12",
        ...
      }
    }
  ],
  "num_rows_total": 1952238,
  "num_rows_per_page": 100
}
```
```
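Each record in this response is wrapped in an object with `row_idx` and `row`, so client code usually unwraps it first. The sketch below shows one way to do that in Python; `rows_from_response` is a hypothetical helper, and the sample payload is a trimmed copy of the response above.

```python
def rows_from_response(payload):
    """Unwrap the `rows` array of a /rows or /search response into plain dicts."""
    return [item["row"] for item in payload.get("rows", [])]


# Trimmed copy of the expected response shown above.
sample = {
    "rows": [
        {
            "row_idx": 0,
            "row": {
                "ein": "630123456",
                "name": "ALABAMA DENTAL ASSOCIATION",
                "state": "AL",
            },
        }
    ],
    "num_rows_total": 1952238,
}

records = rows_from_response(sample)
print(records[0]["name"])  # ALABAMA DENTAL ASSOCIATION
```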
---
## Using in React
### Option A: Replace Current Nonprofits Page
```bash
# Backup current page
mv frontend/src/pages/Nonprofits.tsx frontend/src/pages/Nonprofits.backup.tsx
# Use HuggingFace version
mv frontend/src/pages/NonprofitsHF.tsx frontend/src/pages/Nonprofits.tsx
```
### Option B: Add New Route
Edit `frontend/src/App.tsx`:
```typescript
import NonprofitsHF from './pages/NonprofitsHF'
// Add route
<Route path="/nonprofits-hf" element={<NonprofitsHF />} />
```
### Test Locally
```bash
cd frontend
npm run dev
# Visit: http://localhost:5173/nonprofits
# or: http://localhost:5173/nonprofits-hf
```
---
## Query Examples
### Python
```python
from datasets import load_dataset
# Load dataset (returns a DatasetDict keyed by split name)
dataset = load_dataset("CommunityOne/one-nonprofits")
# Get organizations table
orgs = dataset["organizations"]
# Convert to pandas (to_pandas() avoids building the frame row by row)
df = orgs.to_pandas()
# Filter by state
alabama = df[df['state'] == 'AL']
print(f"Alabama nonprofits: {len(alabama):,}")
# Output: Alabama nonprofits: 26,148
# Filter by NTEE (E = Health)
health = df[df['ntee_code'].str.startswith('E', na=False)]
print(f"Health organizations: {len(health):,}")
# Output: Health organizations: 80,000+
# Search for "dental"
dental = df[df['name'].str.contains('dental', case=False, na=False)]
print(f"Dental organizations: {len(dental):,}")
```
### JavaScript/TypeScript
```typescript
import { searchNonprofits } from '../utils/huggingface'
// Search for dental orgs in California
const results = await searchNonprofits({
  dataset: "CommunityOne/one-nonprofits",
  query: "dental",
  state: "CA",
  nteeCode: "E",
  limit: 100
})
console.log(`Found ${results.length} dental orgs in California`)
```
### REST API (curl)
```bash
# Get first 100 organizations
curl "https://datasets-server.huggingface.co/rows?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&offset=0&length=100"
# Search for "dental"
curl "https://datasets-server.huggingface.co/search?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&query=dental"
# Get dataset size
curl "https://datasets-server.huggingface.co/size?dataset=CommunityOne/one-nonprofits&config=default&split=organizations"
```
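When these URLs are built in code rather than typed by hand, the query string is easier to get right with a small helper. The following is a minimal sketch: `rows_url` is a hypothetical function, its parameters mirror the curl examples above, and the 100-row cap is the per-request limit noted under Key Features.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"


def rows_url(dataset, split, offset=0, length=100, config="default"):
    """Build a /rows URL in the same shape as the curl examples above."""
    if not 1 <= length <= 100:
        raise ValueError("length must be between 1 and 100")
    if offset < 0:
        raise ValueError("offset must not be negative")
    query = urlencode(
        {
            "dataset": dataset,
            "config": config,
            "split": split,
            "offset": offset,
            "length": length,
        },
        safe="/",  # keep the slash in "owner/name" readable, as in the curl URLs
    )
    return f"{BASE_URL}/rows?{query}"


print(rows_url("CommunityOne/one-nonprofits", "organizations"))
```

The printed URL is the same one used in the first curl command above.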
---
## What's in the Dataset?
### organizations (main table)
- **Records:** 1,952,238
- **Fields:** ein, name, sort_name, city, state, zip_code, street_address, ntee_code, subsection_code, foundation_code, tax_exempt_status, deductibility_status, ruling_date, organization_code, activity_codes, group_exemption, affiliation_code, data_source
### financials
- **Records:** 1,952,238
- **Fields:** ein, asset_amount, income_amount, revenue_amount, tax_period
### programs
- **Records:** 1,952,238
- **Fields:** ein, activity_codes, group_exemption, affiliation_code
### locations
- **Records:** 1,952,238
- **Fields:** ein, street_address, city, state, zip_code
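All four tables share the `ein` field, so records from different tables can be combined on it. The sketch below joins organizations with financials using plain dictionaries; `join_on_ein` is a hypothetical helper, and the financial figures in the sample are made up for illustration.

```python
def join_on_ein(organizations, financials):
    """Left-join organization rows with financial rows on the shared `ein` key."""
    financials_by_ein = {row["ein"]: row for row in financials}
    joined = []
    for org in organizations:
        fin = financials_by_ein.get(org["ein"], {})
        # Organization fields win if both rows carry the same key.
        joined.append({**fin, **org})
    return joined


# Sample rows: the EIN comes from the response example; amounts are invented.
orgs = [
    {"ein": "630123456", "name": "ALABAMA DENTAL ASSOCIATION", "state": "AL"},
    {"ein": "000000000", "name": "ORG WITHOUT FINANCIALS", "state": "CA"},
]
fins = [{"ein": "630123456", "revenue_amount": 1000, "tax_period": "202212"}]

for row in join_on_ein(orgs, fins):
    print(row["name"], row.get("revenue_amount"))
```

For the full 1.9M-row tables, a pandas `merge` on `ein` after `to_pandas()` is likely the more practical route; the dictionary version only shows the shape of the join.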
---
## Key Features
### ✅ FREE
- **Free storage** for public datasets (subject to HuggingFace's current limits)
- **No authentication** required for reading public datasets
- **Free bandwidth** and API calls
### ✅ FAST
- **CDN-backed** by HuggingFace
- **Automatic caching**
- **Pagination** built-in (100 rows max per request)
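Because each request returns at most 100 rows, reading a larger range means stepping through offsets. A minimal sketch of that arithmetic, with the hypothetical helper name `page_requests`:

```python
def page_requests(total_rows, page_size=100):
    """Yield (offset, length) pairs that cover `total_rows` rows."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    for offset in range(0, total_rows, page_size):
        yield offset, min(page_size, total_rows - offset)


print(list(page_requests(250)))
# [(0, 100), (100, 100), (200, 50)]

# The organizations table has 1,952,238 rows:
print(sum(1 for _ in page_requests(1_952_238)))
# 19523
```

At roughly 19,500 requests for the full organizations table, the REST API suits browsing and search. For bulk analysis, loading the dataset with `load_dataset` as in the Python query example is likely the better fit.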
### ✅ SEARCHABLE
- **Full-text search** included
- **Filter by columns** (state, NTEE code, etc.)
- **REST API** - works from any language
### ✅ SCALABLE
- **1.9M+ records** available instantly
- **No database** setup required
- **Global availability**
---
## Customization
### Change Dataset Name
Edit `scripts/upload_nonprofits_to_hf.py`:
```python
# Line 84
self.repo_name = repo_name or "YOUR_USERNAME/YOUR_DATASET_NAME"
```
Then upload:
```bash
python scripts/upload_nonprofits_to_hf.py --all --repo "your-username/nonprofits"
```
### Update React Components
Edit `frontend/src/pages/NonprofitsHF.tsx`:
```typescript
// Line 115
const DATASET_NAME = "your-username/nonprofits"
```
---
## Documentation
### Full Guide
- **Location:** `website/docs/guides/huggingface-datasets.md`
- **URL:** http://localhost:3000/docs/guides/huggingface-datasets
### HuggingFace Docs
- **Datasets:** https://huggingface.co/docs/datasets
- **API:** https://huggingface.co/docs/datasets-server
- **Hub:** https://huggingface.co/docs/hub
### IRS Data Source
- **EO-BMF:** https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf
- **Search Tool:** https://www.irs.gov/charities-non-profits/tax-exempt-organization-search
---
## Troubleshooting
### Error: "Hugging Face token required"
**Solution:**
```bash
export HUGGINGFACE_TOKEN="hf_YOUR_TOKEN"
# Or add to .env file
```
### Error: "File not found: nonprofits_organizations.parquet"
**Solution:** Generate the gold tables first:
```bash
python scripts/create_all_gold_tables.py --nonprofits-only --use-irs --download-all-irs
```
### Error: "Repository does not exist"
**Solution:** Change the repo name or create it manually:
1. Visit: https://huggingface.co/new-dataset
2. Name: `one-nonprofits`
3. License: CC0-1.0 (Public Domain)
4. Click "Create"
### Dataset shows 0 rows
**Solution:** Wait 5-10 minutes after upload for HuggingFace to process the dataset. Then refresh the viewer.
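If you would rather script the wait than refresh by hand, a simple polling loop is enough. The sketch below is an assumption-laden illustration: `wait_until_ready` is a hypothetical helper, and `get_row_count` stands for any callable you supply that returns the current row count (for example, one that reads `num_rows_total` from a `/rows` response).

```python
import time


def wait_until_ready(get_row_count, attempts=10, delay_seconds=60, sleep=time.sleep):
    """Poll until `get_row_count()` reports rows, or give up after `attempts` tries.

    The defaults wait up to about 9 minutes in total, in line with the
    5-10 minute processing window mentioned above.
    """
    for attempt in range(attempts):
        if get_row_count() > 0:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # no need to sleep after the final attempt
    return False


# Example with a stand-in counter: empty twice, then populated.
counts = iter([0, 0, 1952238])
print(wait_until_ready(lambda: next(counts), sleep=lambda seconds: None))  # True
```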
---
## Next Steps
1. **Upload datasets:** `python scripts/upload_nonprofits_to_hf.py --all`
2. **Test API:** Visit https://huggingface.co/datasets/CommunityOne/one-nonprofits
3. **Update React app:** Use `NonprofitsHF.tsx` example
4. **Add features:**
   - Map visualization with locations table
   - Financial charts with financials table
   - Advanced filters (subsection_code, foundation_code)
   - Autocomplete search
   - Export to CSV
---
## Support
- **Documentation:** `website/docs/guides/huggingface-datasets.md`
- **HuggingFace Support:** https://discuss.huggingface.co
- **IRS EO-BMF Guide:** `website/docs/data-sources/irs-bulk-data.md`
**Happy querying!**