open-navigator / docs /HUGGINGFACE_PUBLISHING.md
jcbowyer's picture
Deploy: Consolidated gold tables, fixed nginx docs routing
896453f verified
# HuggingFace Dataset Publishing Guide
Share your jurisdiction discovery datasets and run outputs on HuggingFace Hub for public collaboration!
---
## 🎯 What Gets Published
### Available Datasets
| Dataset | Description | Size | Update Frequency |
|---------|-------------|------|------------------|
| **census-gid** | Census Bureau Government Integrated Directory | 90,735 jurisdictions | Annual |
| **gov-domains** | CISA .gov domain master list | 15,000+ domains | Daily* |
| **nces-schools** | NCES school district data | 13,000+ districts | Annual |
| **discovered-urls** | Discovered government URLs with metadata | Varies | Per run |
| **scraping-targets** | Prioritized scraping targets | Varies | Per run |
\* Daily on CISA side, you update as needed
---
## πŸ”§ Setup
### 1. Get HuggingFace Token
Visit: https://huggingface.co/settings/tokens
**Create a Write Token:**
1. Click "New token"
2. **Name:** "open-navigator-upload"
3. **Token type:** Write ⚠️ (required for publishing)
4. **Repository permissions:** All repositories
5. Copy the token (starts with `hf_`)
**Why Write Access?**
- Creates dataset repositories on HuggingFace
- Uploads Parquet files with your scraped data
- Updates dataset cards and metadata
- Read-only tokens cannot publish datasets
### 2. Configure Environment
Add to your `.env` file:
```bash
# HuggingFace Configuration
HUGGINGFACE_TOKEN=hf_your_write_token_here
HF_ORGANIZATION=CommunityOne # Optional: your org name
HF_DATASET_PREFIX=open-navigator
```
### 3. Install Dependencies
```bash
pip install datasets huggingface-hub
```
---
## πŸš€ Publishing Datasets
### Publish All Datasets
```bash
python main.py publish-to-hf --dataset all
```
**Output:**
```
πŸš€ Publishing datasets to HuggingFace Hub...
πŸ“Š Published Datasets:
βœ“ census: https://huggingface.co/datasets/CommunityOne/open-navigator-census-gid
βœ“ gov_domains: https://huggingface.co/datasets/CommunityOne/open-navigator-gov-domains
βœ“ nces_schools: https://huggingface.co/datasets/CommunityOne/open-navigator-nces-schools
βœ“ discovered_urls: https://huggingface.co/datasets/CommunityOne/open-navigator-discovered-urls
βœ“ scraping_targets: https://huggingface.co/datasets/CommunityOne/open-navigator-scraping-targets
πŸŽ‰ Publishing complete!
```
### Publish Individual Datasets
```bash
# Publish census data only
python main.py publish-to-hf --dataset census
# Publish discovered URLs
python main.py publish-to-hf --dataset discovered-urls
# Publish .gov domains
python main.py publish-to-hf --dataset gov-domains
# Publish school districts
python main.py publish-to-hf --dataset nces-schools
# Publish scraping targets
python main.py publish-to-hf --dataset scraping-targets
```
### Options
**Make datasets private:**
```bash
python main.py publish-to-hf --dataset all --private
```
**Sample census data (faster for testing):**
```bash
python main.py publish-to-hf --dataset census --sample
```
---
## πŸ“¦ Programmatic Publishing
Use the publisher directly in Python:
```python
from pipeline.huggingface_publisher import HuggingFacePublisher
# Initialize publisher
publisher = HuggingFacePublisher(token="hf_your_token")
# Publish specific dataset
result = publisher.publish_discovered_urls(private=False)
print(f"Published to: {result['url']}")
# Publish all datasets
results = publisher.publish_all(private=False, sample_census=False)
for name, info in results.items():
print(f"{name}: {info['url']}")
```
---
## 🌐 Accessing Published Datasets
### View on HuggingFace Hub
Visit your dataset pages:
- https://huggingface.co/datasets/YOUR_ORG/open-navigator-census-gid
- https://huggingface.co/datasets/YOUR_ORG/open-navigator-gov-domains
- https://huggingface.co/datasets/YOUR_ORG/open-navigator-discovered-urls
### Load in Python
```python
from datasets import load_dataset
# Load census data
census = load_dataset("CommunityOne/open-navigator-census-gid")
# Load discovered URLs
urls = load_dataset("CommunityOne/open-navigator-discovered-urls")
# Access specific split
counties = census["counties"]
print(f"Total counties: {len(counties)}")
```
### Load in R
```r
library(datasets)
# Load dataset
census <- load_dataset("CommunityOne/open-navigator-census-gid")
# View data
head(census$counties)
```
### Access via API
```bash
curl https://datasets-server.huggingface.co/rows \
-d dataset=CommunityOne/open-navigator-census-gid \
-d config=counties \
-d split=train
```
---
## πŸ“Š Dataset Structure
### Census GID
**Splits:** `counties`, `municipalities`, `townships`, `school_districts`, `special_districts`
**Columns:**
- `jurisdiction_id`: Unique identifier
- `jurisdiction_name`: Official name
- `state_name`: State
- `county_name`: County (if applicable)
- `population`: Population count
- `fips_code`: FIPS code
### .gov Domains
**Single split:** `train`
**Columns:**
- `Domain Name`: Official .gov domain
- `Domain Type`: City, County, State, School District, etc.
- `Organization Name`: Government entity name
- `State`: State abbreviation
### Discovered URLs
**Single split:** `train`
**Columns:**
- `jurisdiction_id`: Link to jurisdiction
- `jurisdiction_name`: Government entity
- `state`: State
- `homepage_url`: Discovered homepage
- `minutes_url`: Meeting minutes page (if found)
- `discovery_method`: gsa_registry, pattern_match, not_found
- `confidence_score`: 0.0-1.0
- `cms_platform`: Granicus, CivicClerk, etc. (if detected)
- `last_verified`: Timestamp
---
## πŸ”„ Update Workflow
### After Each Discovery Run
```bash
# Run discovery
python main.py discover-jurisdictions
# Publish updated datasets
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset scraping-targets
```
### Monthly Updates
```bash
# Re-ingest source data
python main.py discover-jurisdictions --bronze-only
# Publish refreshed datasets
python main.py publish-to-hf --dataset census
python main.py publish-to-hf --dataset gov-domains
python main.py publish-to-hf --dataset nces-schools
```
---
## πŸ“ Dataset Cards
Each published dataset includes auto-generated metadata:
```yaml
dataset_info:
features:
- name: jurisdiction_name
dtype: string
- name: state
dtype: string
splits:
- name: train
num_examples: 90735
license: cc-by-4.0
task_categories:
- text-classification
- information-retrieval
language:
- en
tags:
- government
- open-data
- civic-tech
- jurisdiction-discovery
- oral-health-policy
```
---
## 🀝 Collaboration Features
### Dataset Discussions
Enable community discussions on your dataset pages for:
- Questions and answers
- Error reporting
- Feature requests
- Use case sharing
### Versioning
HuggingFace automatically tracks versions:
- Each push creates a new commit
- View version history on dataset page
- Pin to specific version in code:
```python
dataset = load_dataset(
"CommunityOne/open-navigator-discovered-urls",
revision="main" # or specific commit hash
)
```
### Dataset Viewer
HuggingFace provides automatic dataset preview:
- Browse first 100 rows
- Filter and search
- Export to CSV/JSON
- Embed in documentation
---
## πŸ’‘ Best Practices
### Privacy Considerations
- βœ… **Public datasets:** Census, CISA, NCES data (already public)
- βœ… **Discovered URLs:** Government website URLs (public)
- ⚠️ **Scraped content:** Consider using `--private` flag
- ❌ **PII data:** Never publish personal information
### Storage Limits
- Free tier: Unlimited public datasets
- Size limit: ~100GB per dataset (contact HF for larger)
- Recommend splitting very large datasets
### Naming Conventions
Your datasets will be named:
```
{organization}/{prefix}-{dataset-name}
Examples:
CommunityOne/open-navigator-census-gid
CommunityOne/open-navigator-discovered-urls
```
---
## πŸ” Use Cases
**For Researchers:**
```python
# Load all discovered government URLs
urls = load_dataset("CommunityOne/open-navigator-discovered-urls")
high_confidence = urls.filter(lambda x: x['confidence_score'] > 0.8)
```
**For Civic Hackers:**
```python
# Get all .gov domains by type
domains = load_dataset("CommunityOne/open-navigator-gov-domains")
counties = domains.filter(lambda x: x['Domain Type'] == 'County')
```
**For Data Scientists:**
```python
# Analyze jurisdiction coverage
census = load_dataset("CommunityOne/open-navigator-census-gid")
import pandas as pd
df = pd.DataFrame(census["counties"])
df.groupby("state_name")["population"].sum()
```
---
## 🎯 Example: Complete Publishing Workflow
```bash
# 1. Run discovery
python main.py discover-jurisdictions --limit 1000
# 2. Check what you have
python main.py discovery-stats
# 3. Test publish with sample data
python main.py publish-to-hf --dataset census --sample --private
# 4. Publish public datasets
python main.py publish-to-hf --dataset all
# 5. View on HuggingFace
open https://huggingface.co/datasets/CommunityOne/open-navigator-discovered-urls
```
---
## πŸ†˜ Troubleshooting
### Authentication Error
```
❌ Configuration error: HuggingFace token required
```
**Solution:** Set `HUGGINGFACE_TOKEN` in `.env` file
### Repository Not Found
```
❌ Failed to create repo: 404 Not Found
```
**Solution:**
- Check organization name in `.env`
- Verify token has write access
- Create organization on HuggingFace first
### Import Error
```
❌ HuggingFace libraries not installed!
```
**Solution:**
```bash
pip install datasets huggingface-hub
```
### Large Dataset Timeout
For very large datasets (>1M rows), publish in batches:
```python
publisher = HuggingFacePublisher()
publisher.publish_census_data(sample_size=100000) # Publish 100k at a time
```
---
## πŸ“š Additional Resources
- **HuggingFace Datasets Docs:** https://huggingface.co/docs/datasets
- **Dataset Card Guide:** https://huggingface.co/docs/hub/datasets-cards
- **Hub Python Library:** https://huggingface.co/docs/huggingface_hub
---
**Ready to share your jurisdiction discovery data with the world!** 🌍🦷✨