
HuggingFace Dataset Publishing Guide

Share your jurisdiction discovery datasets and run outputs on HuggingFace Hub for public collaboration!


🎯 What Gets Published

Available Datasets

| Dataset | Description | Size | Update Frequency |
|---|---|---|---|
| census-gid | Census Bureau Government Integrated Directory | 90,735 jurisdictions | Annual |
| gov-domains | CISA .gov domain master list | 15,000+ domains | Daily* |
| nces-schools | NCES school district data | 13,000+ districts | Annual |
| discovered-urls | Discovered government URLs with metadata | Varies | Per run |
| scraping-targets | Prioritized scraping targets | Varies | Per run |

* CISA updates the source list daily; republish your copy as often as you need.


🔧 Setup

1. Get HuggingFace Token

Visit: https://huggingface.co/settings/tokens

Create a Write Token:

  1. Click "New token"
  2. Name: "open-navigator-upload"
  3. Token type: Write ⚠️ (required for publishing)
  4. Repository permissions: All repositories
  5. Copy the token (starts with hf_)

Why Write Access?

  • Creates dataset repositories on HuggingFace
  • Uploads Parquet files with your scraped data
  • Updates dataset cards and metadata
  • Read-only tokens cannot publish datasets

2. Configure Environment

Add to your .env file:

# HuggingFace Configuration
HUGGINGFACE_TOKEN=hf_your_write_token_here
HF_ORGANIZATION=CommunityOne  # Optional: your org name
HF_DATASET_PREFIX=open-navigator
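
The variables above can be sanity-checked before a publish run. The following is a minimal sketch, not the project's actual loader: the function name `load_hf_config`, the fallback prefix, and the treatment of a missing organization are assumptions for illustration. It also assumes the `.env` file has already been loaded into the process environment.

```python
import os
from typing import Mapping, Optional


def load_hf_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the HuggingFace settings from an environment mapping.

    Hypothetical helper: the real pipeline may read these values differently.
    """
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACE_TOKEN", "").strip()
    if not token:
        raise ValueError("HUGGINGFACE_TOKEN is not set")
    if not token.startswith("hf_"):
        raise ValueError("HUGGINGFACE_TOKEN should start with 'hf_'")
    return {
        "token": token,
        # HF_ORGANIZATION is optional; None here means "no organization set" (assumption)
        "organization": env.get("HF_ORGANIZATION") or None,
        # Fallback taken from the example .env in this guide (assumption)
        "prefix": env.get("HF_DATASET_PREFIX", "open-navigator"),
    }
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching real credentials.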

3. Install Dependencies

pip install datasets huggingface-hub

🚀 Publishing Datasets

Publish All Datasets

python main.py publish-to-hf --dataset all

Output:

🚀 Publishing datasets to HuggingFace Hub...

📊 Published Datasets:
  ✓ census: https://huggingface.co/datasets/CommunityOne/open-navigator-census-gid
  ✓ gov_domains: https://huggingface.co/datasets/CommunityOne/open-navigator-gov-domains
  ✓ nces_schools: https://huggingface.co/datasets/CommunityOne/open-navigator-nces-schools
  ✓ discovered_urls: https://huggingface.co/datasets/CommunityOne/open-navigator-discovered-urls
  ✓ scraping_targets: https://huggingface.co/datasets/CommunityOne/open-navigator-scraping-targets

🎉 Publishing complete!

Publish Individual Datasets

# Publish census data only
python main.py publish-to-hf --dataset census

# Publish discovered URLs
python main.py publish-to-hf --dataset discovered-urls

# Publish .gov domains
python main.py publish-to-hf --dataset gov-domains

# Publish school districts
python main.py publish-to-hf --dataset nces-schools

# Publish scraping targets
python main.py publish-to-hf --dataset scraping-targets

Options

Make datasets private:

python main.py publish-to-hf --dataset all --private

Sample census data (faster for testing):

python main.py publish-to-hf --dataset census --sample
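
If you would rather drive the CLI from a script, the commands above can be assembled programmatically. This is a rough sketch under the assumption that `main.py` sits in the current working directory and accepts exactly the flags shown in this guide; the helper names `build_publish_command` and `publish_many` are made up for illustration.

```python
import subprocess
import sys
from typing import List

# Dataset names as used with --dataset in this guide
DATASETS = ["census", "gov-domains", "nces-schools", "discovered-urls", "scraping-targets"]


def build_publish_command(dataset: str, private: bool = False, sample: bool = False) -> List[str]:
    """Return the argv list for one `publish-to-hf` invocation."""
    if dataset != "all" and dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    cmd = [sys.executable, "main.py", "publish-to-hf", "--dataset", dataset]
    if private:
        cmd.append("--private")
    if sample:
        cmd.append("--sample")
    return cmd


def publish_many(datasets: List[str], private: bool = False) -> None:
    """Run the CLI once per dataset, stopping at the first failure."""
    for name in datasets:
        subprocess.run(build_publish_command(name, private=private), check=True)
```

Building the argument list separately from running it keeps the flag handling easy to check without starting a real upload.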

📦 Programmatic Publishing

Use the publisher directly in Python:

from pipeline.huggingface_publisher import HuggingFacePublisher

# Initialize publisher
publisher = HuggingFacePublisher(token="hf_your_token")

# Publish specific dataset
result = publisher.publish_discovered_urls(private=False)
print(f"Published to: {result['url']}")

# Publish all datasets
results = publisher.publish_all(private=False, sample_census=False)
for name, info in results.items():
    print(f"{name}: {info['url']}")
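
For readers curious what a publish step can look like underneath, here is a minimal sketch built directly on the `datasets` library. It is not the implementation of `HuggingFacePublisher`, which this guide does not show; the function name `publish_rows` is hypothetical, and the rows are assumed to be plain dictionaries with identical keys.

```python
from typing import Any, Dict, List


def publish_rows(rows: List[Dict[str, Any]], repo_id: str, token: str, private: bool = False) -> str:
    """Upload a list of row dicts as a dataset and return its Hub URL."""
    # Imported lazily so this module loads even where `datasets` is not installed
    from datasets import Dataset

    dataset = Dataset.from_list(rows)
    # push_to_hub creates the repository if needed and uploads the data as Parquet
    dataset.push_to_hub(repo_id, private=private, token=token)
    return f"https://huggingface.co/datasets/{repo_id}"
```

A call such as `publish_rows(rows, "CommunityOne/open-navigator-discovered-urls", token)` would need a write token for the target namespace.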

🌐 Accessing Published Datasets

View on HuggingFace Hub

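Visit your dataset pages at https://huggingface.co/datasets/{organization}/{prefix}-{dataset-name}, for example https://huggingface.co/datasets/CommunityOne/open-navigator-census-gid.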

Load in Python

from datasets import load_dataset

# Load census data
census = load_dataset("CommunityOne/open-navigator-census-gid")

# Load discovered URLs
urls = load_dataset("CommunityOne/open-navigator-discovered-urls")

# Access specific split
counties = census["counties"]
print(f"Total counties: {len(counties)}")

Load in R

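# R's built-in "datasets" package has no load_dataset(); call the Python library through reticulate
library(reticulate)

# Import the Python datasets library (it must be installed in the Python environment reticulate uses)
hf_datasets <- import("datasets")

# Load the counties split
counties <- hf_datasets$load_dataset("CommunityOne/open-navigator-census-gid", split = "counties")

# View data
head(counties$to_pandas())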

Access via API

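# -G sends the -d values as a GET query string, which the /rows endpoint expects
# "counties" is a split (see Dataset Structure); config=default is assumed, adjust if your repo uses named configs
curl -G https://datasets-server.huggingface.co/rows \
  -d dataset=CommunityOne/open-navigator-census-gid \
  -d config=default \
  -d split=counties \
  -d offset=0 \
  -d length=100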
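
The same request can be made from Python with only the standard library. This is a sketch under assumptions: the config name `default` and the split name `counties` follow the Dataset Structure section of this guide and may differ for your repository, and the helper names `rows_url` and `fetch_rows` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

ROWS_ENDPOINT = "https://datasets-server.huggingface.co/rows"


def rows_url(dataset: str, config: str, split: str, offset: int = 0, length: int = 100) -> str:
    """Build the GET URL for one page of rows."""
    query = urllib.parse.urlencode(
        {"dataset": dataset, "config": config, "split": split, "offset": offset, "length": length}
    )
    return f"{ROWS_ENDPOINT}?{query}"


def fetch_rows(dataset: str, config: str, split: str, offset: int = 0, length: int = 100) -> dict:
    """Fetch one page of rows and return the decoded JSON response."""
    url = rows_url(dataset, config, split, offset, length)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)
```

To page through a split, call `fetch_rows` repeatedly while increasing `offset` by `length`.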

📊 Dataset Structure

Census GID

Splits: counties, municipalities, townships, school_districts, special_districts

Columns:

  • jurisdiction_id: Unique identifier
  • jurisdiction_name: Official name
  • state_name: State
  • county_name: County (if applicable)
  • population: Population count
  • fips_code: FIPS code

.gov Domains

Single split: train

Columns:

  • Domain Name: Official .gov domain
  • Domain Type: City, County, State, School District, etc.
  • Organization Name: Government entity name
  • State: State abbreviation

Discovered URLs

Single split: train

Columns:

  • jurisdiction_id: Link to jurisdiction
  • jurisdiction_name: Government entity
  • state: State
  • homepage_url: Discovered homepage
  • minutes_url: Meeting minutes page (if found)
  • discovery_method: gsa_registry, pattern_match, not_found
  • confidence_score: 0.0-1.0
  • cms_platform: Granicus, CivicClerk, etc. (if detected)
  • last_verified: Timestamp
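
The column list above can be mirrored as a small record type with a sanity check, which is handy before publishing. This is an illustrative sketch: the Python types, the optional fields, and the class name `DiscoveredUrl` are assumptions, since this guide only names the columns and their allowed values.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values as listed for discovery_method in this guide
DISCOVERY_METHODS = {"gsa_registry", "pattern_match", "not_found"}


@dataclass
class DiscoveredUrl:
    jurisdiction_id: str
    jurisdiction_name: str
    state: str
    homepage_url: Optional[str]
    minutes_url: Optional[str]      # only present if a minutes page was found
    discovery_method: str
    confidence_score: float
    cms_platform: Optional[str]     # only present if a platform was detected
    last_verified: str              # timestamp format is not specified in this guide

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the row looks valid."""
        issues = []
        if self.discovery_method not in DISCOVERY_METHODS:
            issues.append(f"unknown discovery_method: {self.discovery_method}")
        if not 0.0 <= self.confidence_score <= 1.0:
            issues.append(f"confidence_score out of range: {self.confidence_score}")
        return issues
```

Rows that return a non-empty list from `problems()` are worth inspecting before they reach the Hub.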

🔄 Update Workflow

After Each Discovery Run

# Run discovery
python main.py discover-jurisdictions

# Publish updated datasets
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset scraping-targets

Monthly Updates

# Re-ingest source data
python main.py discover-jurisdictions --bronze-only

# Publish refreshed datasets
python main.py publish-to-hf --dataset census
python main.py publish-to-hf --dataset gov-domains
python main.py publish-to-hf --dataset nces-schools

📝 Dataset Cards

Each published dataset includes auto-generated metadata:

dataset_info:
  features:
    - name: jurisdiction_name
      dtype: string
    - name: state
      dtype: string
  splits:
    - name: train
      num_examples: 90735
  
license: cc-by-4.0
task_categories:
  - text-classification
  - information-retrieval
language:
  - en
tags:
  - government
  - open-data
  - civic-tech
  - jurisdiction-discovery
  - oral-health-policy
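
On the Hub, this metadata lives in a YAML block delimited by `---` at the top of the dataset's README.md. The sketch below renders such a block for flat keys only; it is an illustration, not how the publisher generates cards, and the function name `render_card_metadata` is hypothetical. It does not handle nested structures such as `dataset_info` or values that need YAML quoting, for which a real YAML library is the safer choice.

```python
from typing import Dict, List, Union

CardValue = Union[str, List[str]]


def render_card_metadata(metadata: Dict[str, CardValue]) -> str:
    """Render flat card metadata as a YAML front-matter block."""
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, list):
            # Lists become one "- item" entry per line under the key
            lines.append(f"{key}:")
            lines.extend(f"  - {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"


card_header = render_card_metadata(
    {"license": "cc-by-4.0", "language": ["en"], "tags": ["government", "open-data", "civic-tech"]}
)
```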

🀝 Collaboration Features

Dataset Discussions

Enable community discussions on your dataset pages for:

  • Questions and answers
  • Error reporting
  • Feature requests
  • Use case sharing

Versioning

HuggingFace automatically tracks versions:

  • Each push creates a new commit
  • View version history on dataset page
  • Pin to specific version in code:
from datasets import load_dataset

dataset = load_dataset(
    "CommunityOne/open-navigator-discovered-urls",
    revision="main"  # or a specific commit hash
)
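
To pin to an exact version you need the commit hash. One way to look it up is through `huggingface_hub`, as in this short sketch; the helper name `latest_commit_hash` is hypothetical, and private repositories would need a token with read access.

```python
from typing import Optional


def latest_commit_hash(repo_id: str, token: Optional[str] = None) -> str:
    """Return the newest commit hash of a dataset repository on the Hub."""
    # Imported lazily so this module loads even where huggingface_hub is not installed
    from huggingface_hub import HfApi

    commits = HfApi(token=token).list_repo_commits(repo_id, repo_type="dataset")
    return commits[0].commit_id  # commits are listed newest first
```

The returned hash can be passed as `revision=` to `load_dataset` so that later pushes do not change what your code reads.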

Dataset Viewer

HuggingFace provides automatic dataset preview:

  • Browse first 100 rows
  • Filter and search
  • Export to CSV/JSON
  • Embed in documentation

💡 Best Practices

Privacy Considerations

  • ✅ Public datasets: Census, CISA, NCES data (already public)
  • ✅ Discovered URLs: Government website URLs (public)
  • ⚠️ Scraped content: Consider using --private flag
  • ❌ PII data: Never publish personal information
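
A quick scan before publishing can catch the most obvious personal data. The sketch below is a rough heuristic, not a complete PII detector: it only flags strings that look like email addresses or US-style phone numbers, and the function name `find_pii_candidates` is hypothetical. Hits are candidates for human review, since official contact details of government offices may be legitimately public.

```python
import re
from typing import Dict, Iterable, List, Tuple

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches patterns such as 555-123-4567, 555.123.4567 or 555 123 4567
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def find_pii_candidates(rows: Iterable[Dict[str, object]]) -> List[Tuple[int, str, str]]:
    """Return (row_index, column, kind) for every string value that looks like PII."""
    hits = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # only text fields are scanned
            if EMAIL_RE.search(value):
                hits.append((index, column, "email"))
            if PHONE_RE.search(value):
                hits.append((index, column, "phone"))
    return hits
```

If the scan returns anything for scraped content, publishing with `--private` or dropping the affected columns is the cautious option.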

Storage Limits

  • Free tier: Unlimited public datasets
  • Size limit: ~100GB per dataset (contact HF for larger)
  • Recommend splitting very large datasets

Naming Conventions

Your datasets will be named:

{organization}/{prefix}-{dataset-name}

Examples:
  CommunityOne/open-navigator-census-gid
  CommunityOne/open-navigator-discovered-urls
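
The naming pattern can be expressed as a small helper. This is a sketch with a hypothetical function name, `build_repo_id`; returning the bare name when no organization is set is an assumption about how the repository would then be resolved to your user namespace.

```python
from typing import Optional


def build_repo_id(dataset_name: str, prefix: str = "open-navigator",
                  organization: Optional[str] = None) -> str:
    """Compose {organization}/{prefix}-{dataset-name}."""
    name = f"{prefix}-{dataset_name}"
    # Without an organization, only the repository name is returned (assumption)
    return f"{organization}/{name}" if organization else name


census_repo = build_repo_id("census-gid", organization="CommunityOne")
```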

🔍 Use Cases

For Researchers:

# Load all discovered government URLs and keep the high-confidence ones
from datasets import load_dataset

urls = load_dataset("CommunityOne/open-navigator-discovered-urls")
high_confidence = urls.filter(lambda x: x['confidence_score'] > 0.8)

For Civic Hackers:

# Get all .gov domains by type
from datasets import load_dataset

domains = load_dataset("CommunityOne/open-navigator-gov-domains")
counties = domains.filter(lambda x: x['Domain Type'] == 'County')

For Data Scientists:

# Analyze jurisdiction coverage
from datasets import load_dataset

census = load_dataset("CommunityOne/open-navigator-census-gid")
df = census["counties"].to_pandas()  # convert the split to a pandas DataFrame
print(df.groupby("state_name")["population"].sum())

🎯 Example: Complete Publishing Workflow

# 1. Run discovery
python main.py discover-jurisdictions --limit 1000

# 2. Check what you have
python main.py discovery-stats

# 3. Test publish with sample data
python main.py publish-to-hf --dataset census --sample --private

# 4. Publish public datasets
python main.py publish-to-hf --dataset all

# 5. View on HuggingFace (open is macOS-only; use xdg-open on Linux)
open https://huggingface.co/datasets/CommunityOne/open-navigator-discovered-urls

🆘 Troubleshooting

Authentication Error

❌ Configuration error: HuggingFace token required

Solution: Set HUGGINGFACE_TOKEN in your .env file.

Repository Not Found

❌ Failed to create repo: 404 Not Found

Solution:

  • Check organization name in .env
  • Verify token has write access
  • Create organization on HuggingFace first

Import Error

❌ HuggingFace libraries not installed!

Solution:

pip install datasets huggingface-hub

Large Dataset Timeout

For very large datasets (>1M rows), publish in batches:

from pipeline.huggingface_publisher import HuggingFacePublisher

publisher = HuggingFacePublisher()
publisher.publish_census_data(sample_size=100000)  # Publish 100k at a time
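
Splitting rows into fixed-size batches is a generic step that does not depend on the publisher. The sketch below shows one way to do it; the function name `batched` is our own, and whether `HuggingFacePublisher` accepts batches directly is not covered by this guide, so treat the pairing as an assumption.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch
```

Each batch could then be uploaded on its own, for example as a separate Parquet shard, so that a timeout only costs one batch rather than the whole run.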


Ready to share your jurisdiction discovery data with the world! 🌍🦷✨