open-navigator / website /docs /guides /huggingface-features.md
jcbowyer's picture
Clean HuggingFace deployment without binary files
61d29fc

βœ… HuggingFace Dataset Sharing Added!

What's New

You can now publish your jurisdiction discovery datasets to HuggingFace Hub for public sharing and collaboration!


🎯 New Capabilities

1. HuggingFace Publisher Module

2. CLI Command

python main.py publish-to-hf --dataset all

3. 5 Publishable Datasets

  • census-gid - Census Bureau GID (90,735 jurisdictions)
  • gov-domains - CISA .gov domains (15,000+)
  • nces-schools - NCES school districts (13,000+)
  • discovered-urls - Discovered URLs with metadata
  • scraping-targets - Prioritized scraping targets

πŸ“¦ Files Added/Updated

New Files

Updated Files

  • βœ… requirements.txt - Added datasets>=2.16.0 and huggingface-hub>=0.20.0
  • βœ… config/settings.py - Added huggingface_token, hf_organization, hf_dataset_prefix
  • βœ… .env.example - Added HuggingFace configuration
  • βœ… main.py - Added publish-to-hf CLI command
  • βœ… README.md - Added HuggingFace publishing section

πŸš€ Quick Start

1. Get HuggingFace Token

Visit: https://huggingface.co/settings/tokens

Create a Write token

2. Configure

Add to .env:

HUGGINGFACE_TOKEN=hf_your_write_token_here
HF_ORGANIZATION=CommunityOne
HF_DATASET_PREFIX=oral-health-policy-pulse

3. Install Dependencies

pip install datasets huggingface-hub

4. Publish

# Publish all datasets
python main.py publish-to-hf --dataset all

# Or publish individually
python main.py publish-to-hf --dataset census
python main.py publish-to-hf --dataset discovered-urls

πŸ“Š What Gets Published

Dataset URLs

Your datasets will be available at:

Public Access

Anyone can load your datasets:

from datasets import load_dataset

# Load census data
census = load_dataset("CommunityOne/oral-health-policy-pulse-census-gid")

# Load discovered URLs
urls = load_dataset("CommunityOne/oral-health-policy-pulse-discovered-urls")

# Access specific split
counties = census["counties"]
print(f"Total counties: {len(counties)}")

πŸ’‘ Use Cases

For Researchers

# Analyze jurisdiction coverage
from datasets import load_dataset
import pandas as pd

census = load_dataset("CommunityOne/oral-health-policy-pulse-census-gid")
df = pd.DataFrame(census["municipalities"])

# Cities by state
df.groupby("state_name")["population"].sum().sort_values(ascending=False)

For Civic Hackers

# Get all county .gov domains
domains = load_dataset("CommunityOne/oral-health-policy-pulse-gov-domains")
counties = domains.filter(lambda x: x['Domain Type'] == 'County')

For Data Scientists

# High-confidence discovered URLs
urls = load_dataset("CommunityOne/oral-health-policy-pulse-discovered-urls")
high_conf = urls.filter(lambda x: x['confidence_score'] > 0.8)

πŸ”„ Update Workflow

After Each Discovery Run

# Run discovery
python main.py discover-jurisdictions

# Publish updated datasets
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset scraping-targets

Monthly Source Data Updates

# Re-ingest source data
python main.py discover-jurisdictions

# Publish refreshed datasets
python main.py publish-to-hf --dataset census
python main.py publish-to-hf --dataset gov-domains
python main.py publish-to-hf --dataset nces-schools

🎯 CLI Options

# Publish all datasets
python main.py publish-to-hf --dataset all

# Publish specific dataset
python main.py publish-to-hf --dataset census
python main.py publish-to-hf --dataset gov-domains
python main.py publish-to-hf --dataset nces-schools
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset scraping-targets

# Make datasets private
python main.py publish-to-hf --dataset all --private

# Sample census data (faster for testing)
python main.py publish-to-hf --dataset census --sample

πŸ”’ Privacy & Security

What's Safe to Publish

βœ… Public Data:

  • Census Bureau GID (already public)
  • CISA .gov domains (already public)
  • NCES school districts (already public)
  • Discovered government URLs (public websites)
  • Scraping targets (public information)

⚠️ Use --private for:

  • Scraped meeting minutes content
  • Internal analysis results
  • Custom annotations

❌ Never Publish:

  • Personal information (PII)
  • API keys or tokens
  • Internal comments/notes

Token Security

  • Store token in .env file (gitignored)
  • Use write token (not fine-grained)
  • Revoke token if compromised

πŸ“š Documentation

Complete guide: HUGGINGFACE_PUBLISHING.md

Covers:

  • Detailed setup instructions
  • Dataset structure and schemas
  • Programmatic publishing in Python
  • Loading datasets in Python/R
  • Collaboration features
  • Troubleshooting

🌍 Community Impact

By publishing your datasets, you enable:

  • πŸ“Š Reproducible research on government accessibility
  • 🀝 Cross-project collaboration
  • πŸ” Discovery of missing government websites
  • πŸ“ˆ Tracking government digital infrastructure over time
  • πŸŽ“ Educational use for civic tech training

Your jurisdiction discovery data helps the entire civic tech community! πŸ™


βœ… Benefits

Feature Before After
Data Storage Local only Local + HuggingFace Hub
Data Sharing Manual export One-command publish
Collaboration Email/Dropbox Public datasets w/ versioning
Discovery None Searchable on HuggingFace
Access Your team only Anyone worldwide
Versioning Manual Automatic Git-style tracking

Ready to share your jurisdiction discovery data with the world! 🌍🦷✨