# ✅ HuggingFace Dataset Sharing Added!
## What's New
You can now **publish your jurisdiction discovery datasets to HuggingFace Hub** for public sharing and collaboration!
---
## 🎯 New Capabilities
### 1. **HuggingFace Publisher Module**
- File: [pipeline/huggingface_publisher.py](../pipeline/huggingface_publisher.py)
- Publishes datasets to HuggingFace Hub
- Supports all discovery data layers (Bronze/Silver/Gold)
### 2. **CLI Command**
```bash
python main.py publish-to-hf --dataset all
```
### 3. **5 Publishable Datasets**
- `census-gid` - Census Bureau GID (90,735 jurisdictions)
- `gov-domains` - CISA .gov domains (15,000+)
- `nces-schools` - NCES school districts (13,000+)
- `discovered-urls` - Discovered URLs with metadata
- `scraping-targets` - Prioritized scraping targets
---
## 📦 Files Added/Updated
### New Files
- ✅ [pipeline/huggingface_publisher.py](../pipeline/huggingface_publisher.py) - HuggingFace publisher (~400 lines)
- ✅ [docs/HUGGINGFACE_PUBLISHING.md](HUGGINGFACE_PUBLISHING.md) - Complete publishing guide
### Updated Files
- ✅ [requirements.txt](../requirements.txt) - Added `datasets>=2.16.0` and `huggingface-hub>=0.20.0`
- ✅ [config/settings.py](../config/settings.py) - Added `huggingface_token`, `hf_organization`, `hf_dataset_prefix`
- ✅ [.env.example](../.env.example) - Added HuggingFace configuration
- ✅ [main.py](../main.py) - Added `publish-to-hf` CLI command
- ✅ [README.md](../README.md) - Added HuggingFace publishing section
---
## 🚀 Quick Start
### 1. Get HuggingFace Token
Visit https://huggingface.co/settings/tokens and create a token with **Write** access.
### 2. Configure
Add to `.env`:
```bash
HUGGINGFACE_TOKEN=hf_your_write_token_here
HF_ORGANIZATION=CommunityOne
HF_DATASET_PREFIX=oral-health-policy-pulse
```
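These three variables correspond to the `huggingface_token`, `hf_organization`, and `hf_dataset_prefix` settings. As a minimal sketch of how they might be read and validated before publishing (the `HFSettings` class and `load_hf_settings` function are hypothetical names, not the contents of `config/settings.py`, and the sketch assumes the variables have already been exported into the environment):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class HFSettings:
    """Hypothetical container for the three values from .env."""
    token: str
    organization: str
    dataset_prefix: str


def load_hf_settings(env=None):
    """Read the HuggingFace settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    required = ("HUGGINGFACE_TOKEN", "HF_ORGANIZATION", "HF_DATASET_PREFIX")
    # Treat unset and empty values the same way: both block publishing.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return HFSettings(
        token=env["HUGGINGFACE_TOKEN"],
        organization=env["HF_ORGANIZATION"],
        dataset_prefix=env["HF_DATASET_PREFIX"],
    )
```

Failing early with the names of the missing variables is usually easier to debug than an authentication error midway through an upload.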
### 3. Install Dependencies
```bash
pip install "datasets>=2.16.0" "huggingface-hub>=0.20.0"
```
### 4. Publish
```bash
# Publish all datasets
python main.py publish-to-hf --dataset all
# Or publish individually
python main.py publish-to-hf --dataset census
python main.py publish-to-hf --dataset discovered-urls
```
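For orientation, here is a rough sketch of what publishing one dataset could look like when done directly with the `datasets` library. It is not the code in `pipeline/huggingface_publisher.py`: the `publish_records` function and its `dry_run` flag are hypothetical, and only `Dataset.from_list` and `Dataset.push_to_hub` are real library calls.

```python
def publish_records(records, repo_id, token, private=False, dry_run=False):
    """Push a list of row dicts to the Hub as a dataset (hypothetical helper).

    With dry_run=True nothing is uploaded; the function only reports
    what it would publish.
    """
    if not records:
        raise ValueError("Refusing to publish an empty dataset")

    summary = {
        "repo_id": repo_id,
        "rows": len(records),
        "columns": sorted(records[0].keys()),
        "private": private,
        "pushed": False,
    }
    if dry_run:
        return summary

    # Imported lazily so a dry run works without the datasets package installed.
    from datasets import Dataset

    dataset = Dataset.from_list(records)
    dataset.push_to_hub(repo_id, token=token, private=private)
    summary["pushed"] = True
    return summary
```

Calling it with `dry_run=True` first is a cheap way to confirm the target repository id and column names before anything is uploaded.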
---
## 📊 What Gets Published
### Dataset URLs
Your datasets will be available at:
- https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-census-gid
- https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-gov-domains
- https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-nces-schools
- https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-discovered-urls
- https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-scraping-targets
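Each URL above follows the pattern `<HF_ORGANIZATION>/<HF_DATASET_PREFIX>-<dataset name>`. A small sketch of that naming rule, inferred from the URLs listed here rather than taken from the publisher module (the function names are hypothetical):

```python
DATASET_NAMES = [
    "census-gid",
    "gov-domains",
    "nces-schools",
    "discovered-urls",
    "scraping-targets",
]


def repo_id(organization, prefix, name):
    """Build the Hub repository id in the form 'organization/prefix-name'."""
    return f"{organization}/{prefix}-{name}"


def dataset_url(organization, prefix, name):
    """Build the public dataset page URL for a repository id."""
    return "https://huggingface.co/datasets/" + repo_id(organization, prefix, name)


for name in DATASET_NAMES:
    print(dataset_url("CommunityOne", "oral-health-policy-pulse", name))
```

Changing `HF_ORGANIZATION` or `HF_DATASET_PREFIX` in `.env` would change every URL accordingly.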
### Public Access
Anyone can load your datasets:
```python
from datasets import load_dataset
# Load census data
census = load_dataset("CommunityOne/oral-health-policy-pulse-census-gid")
# Load discovered URLs
urls = load_dataset("CommunityOne/oral-health-policy-pulse-discovered-urls")
# Access specific split
counties = census["counties"]
print(f"Total counties: {len(counties)}")
```
---
## 💡 Use Cases
### For Researchers
```python
# Analyze jurisdiction coverage
from datasets import load_dataset

census = load_dataset("CommunityOne/oral-health-policy-pulse-census-gid")
# Convert the municipalities split to a pandas DataFrame
df = census["municipalities"].to_pandas()
# Total population by state, largest first
population_by_state = df.groupby("state_name")["population"].sum().sort_values(ascending=False)
print(population_by_state.head(10))
```
### For Civic Hackers
```python
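# Get all county .gov domains
from datasets import load_dataset

domains = load_dataset("CommunityOne/oral-health-policy-pulse-gov-domains")
# filter() on the returned DatasetDict applies the predicate to every split
counties = domains.filter(lambda x: x['Domain Type'] == 'County')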
```
### For Data Scientists
```python
# High-confidence discovered URLs
from datasets import load_dataset

urls = load_dataset("CommunityOne/oral-health-policy-pulse-discovered-urls")
# Keep only rows whose confidence score exceeds 0.8
high_conf = urls.filter(lambda x: x['confidence_score'] > 0.8)
```
---
## 🔄 Update Workflow
### After Each Discovery Run
```bash
# Run discovery
python main.py discover-jurisdictions
# Publish updated datasets
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset scraping-targets
```
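If you prefer to drive these steps from a script, one possible approach is to chain the same CLI commands with `subprocess` and stop at the first failure. This is a sketch only: the `build_commands` and `run_update` helpers are hypothetical, and it assumes `main.py` is in the current working directory.

```python
import subprocess
import sys

# Datasets that change after a discovery run, per the steps above.
DISCOVERY_DATASETS = ["discovered-urls", "scraping-targets"]


def build_commands(datasets):
    """Return the argv lists for discovery followed by one publish per dataset."""
    commands = [[sys.executable, "main.py", "discover-jurisdictions"]]
    for name in datasets:
        commands.append(
            [sys.executable, "main.py", "publish-to-hf", "--dataset", name]
        )
    return commands


def run_update(datasets=DISCOVERY_DATASETS):
    """Run each command in order; check=True raises CalledProcessError on failure."""
    for argv in build_commands(datasets):
        print("Running:", " ".join(argv))
        subprocess.run(argv, check=True)
```

Call `run_update()` from the project root. Because `check=True` raises on a non-zero exit code, a failed discovery run prevents stale data from being published.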
### Monthly Source Data Updates
```bash
# Re-ingest source data
python main.py discover-jurisdictions
# Publish refreshed datasets
python main.py publish-to-hf --dataset census
python main.py publish-to-hf --dataset gov-domains
python main.py publish-to-hf --dataset nces-schools
```
---
## 🎯 CLI Options
```bash
# Publish all datasets
python main.py publish-to-hf --dataset all
# Publish specific dataset
python main.py publish-to-hf --dataset census
python main.py publish-to-hf --dataset gov-domains
python main.py publish-to-hf --dataset nces-schools
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset scraping-targets
# Make datasets private
python main.py publish-to-hf --dataset all --private
# Sample census data (faster for testing)
python main.py publish-to-hf --dataset census --sample
```
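To make the option semantics concrete, the flags above could be modeled with `argparse` roughly as follows. This is an illustrative sketch, not the parser in `main.py` (which may use a different CLI library); `build_parser` is a hypothetical name, and the `all` default for `--dataset` is an assumption.

```python
import argparse

DATASET_CHOICES = [
    "all",
    "census",
    "gov-domains",
    "nces-schools",
    "discovered-urls",
    "scraping-targets",
]


def build_parser():
    """Build a parser mirroring the publish-to-hf options shown above."""
    parser = argparse.ArgumentParser(prog="main.py")
    subcommands = parser.add_subparsers(dest="command", required=True)

    publish = subcommands.add_parser(
        "publish-to-hf", help="Publish datasets to the HuggingFace Hub"
    )
    # Defaulting to "all" is an assumption made for this sketch.
    publish.add_argument("--dataset", choices=DATASET_CHOICES, default="all",
                         help="Which dataset to publish")
    publish.add_argument("--private", action="store_true",
                         help="Create the dataset repositories as private")
    publish.add_argument("--sample", action="store_true",
                         help="Publish only a sample of the census data")
    return parser


args = build_parser().parse_args(["publish-to-hf", "--dataset", "census", "--sample"])
print(args.dataset, args.private, args.sample)  # census False True
```

Both `--private` and `--sample` are boolean switches, so they are off unless passed explicitly.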
---
## 🔒 Privacy & Security
### What's Safe to Publish
✅ **Public Data:**
- Census Bureau GID (already public)
- CISA .gov domains (already public)
- NCES school districts (already public)
- Discovered government URLs (public websites)
- Scraping targets (public information)
⚠️ **Use `--private` for:**
- Scraped meeting minutes content
- Internal analysis results
- Custom annotations
❌ **Never Publish:**
- Personal information (PII)
- API keys or tokens
- Internal comments/notes
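
A lightweight pre-publish check can help catch items from the "Never Publish" list before they reach the Hub. The sketch below is a hypothetical helper, not part of the publisher module. It flags string values that look like HuggingFace tokens (which begin with `hf_`) or email addresses; the regular expressions are deliberately simple and will miss many forms of PII.

```python
import re

# Simple heuristics; both patterns are illustrative assumptions.
TOKEN_PATTERN = re.compile(r"\bhf_[A-Za-z0-9]{20,}\b")
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_sensitive_values(records):
    """Return (row index, column, reason) for each suspicious string value."""
    findings = []
    for index, row in enumerate(records):
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # only string fields are scanned
            if TOKEN_PATTERN.search(value):
                findings.append((index, column, "possible access token"))
            if EMAIL_PATTERN.search(value):
                findings.append((index, column, "possible email address"))
    return findings


rows = [
    {"url": "https://example.gov/meetings", "notes": "ok"},
    {"url": "https://example.gov", "notes": "contact clerk@example.gov"},
]
print(find_sensitive_values(rows))
# [(1, 'notes', 'possible email address')]
```

An empty result does not prove a dataset is safe to publish; treat it as one check alongside manual review.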
### Token Security
- Store the token in the `.env` file (gitignored)
- Use a token with Write access (not a fine-grained token)
- Revoke the token if it is compromised
---
## 📚 Documentation
Complete guide: [HUGGINGFACE_PUBLISHING.md](HUGGINGFACE_PUBLISHING.md)
Covers:
- Detailed setup instructions
- Dataset structure and schemas
- Programmatic publishing in Python
- Loading datasets in Python/R
- Collaboration features
- Troubleshooting
---
## 🌍 Community Impact
**By publishing your datasets, you enable:**
- 📊 Reproducible research on government accessibility
- 🤝 Cross-project collaboration
- 🔍 Discovery of missing government websites
- 📈 Tracking government digital infrastructure over time
- 🎓 Educational use for civic tech training
**Your jurisdiction discovery data helps the entire civic tech community!** 🙏
---
## ✅ Benefits
| Feature | Before | After |
|---------|--------|-------|
| **Data Storage** | Local only | Local + HuggingFace Hub |
| **Data Sharing** | Manual export | One-command publish |
| **Collaboration** | Email/Dropbox | Public datasets with versioning |
| **Discovery** | None | Searchable on HuggingFace |
| **Access** | Your team only | Anyone worldwide |
| **Versioning** | Manual | Automatic Git-style tracking |
---
**Ready to share your jurisdiction discovery data with the world!** 🌍🦷✨