| # HuggingFace Dataset Sharing Added! | |
| ## What's New | |
| You can now **publish your jurisdiction discovery datasets to HuggingFace Hub** for public sharing and collaboration! | |
| --- | |
| ## 🎯 New Capabilities | |
| ### 1. **HuggingFace Publisher Module** | |
| - File: [pipeline/huggingface_publisher.py](../pipeline/huggingface_publisher.py) | |
| - Publishes datasets to HuggingFace Hub | |
| - Supports all discovery data layers (Bronze/Silver/Gold) | |
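As a rough illustration of what a publisher like this can do, the sketch below checks rows for consistency and then pushes them to the Hub as named splits. It is not the contents of `pipeline/huggingface_publisher.py`: the function names `validate_splits` and `publish_splits` and the list-of-dicts row layout are assumptions made for the example. `Dataset.from_list` and `DatasetDict.push_to_hub` are calls from the `datasets` library.

```python
from typing import Dict, List


def validate_splits(splits: Dict[str, List[dict]]) -> None:
    """Raise ValueError if a split is empty or its rows disagree on columns."""
    if not splits:
        raise ValueError("no splits to publish")
    for name, rows in splits.items():
        if not rows:
            raise ValueError(f"split {name!r} is empty")
        columns = set(rows[0])
        for row in rows:
            if set(row) != columns:
                raise ValueError(f"split {name!r} has inconsistent columns")


def publish_splits(repo_id: str, splits: Dict[str, List[dict]],
                   token: str, private: bool = False) -> None:
    """Push each list of rows to the Hub as one split of a dataset repo."""
    # Imported lazily so validate_splits works without the dependency installed.
    from datasets import Dataset, DatasetDict

    validate_splits(splits)
    dataset = DatasetDict(
        {name: Dataset.from_list(rows) for name, rows in splits.items()}
    )
    # Creates the repo if needed and uploads the data (requires network access).
    dataset.push_to_hub(repo_id, private=private, token=token)
```

Validating before the upload keeps a malformed split from reaching a public repo; the real module may structure this differently.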
| ### 2. **CLI Command** | |
| ```bash | |
| python main.py publish-to-hf --dataset all | |
| ``` | |
| ### 3. **5 Publishable Datasets** | |
| - `census-gid` - Census Bureau GID (90,735 jurisdictions); published with `--dataset census` | |
| - `gov-domains` - CISA .gov domains (15,000+) | |
| - `nces-schools` - NCES school districts (13,000+) | |
| - `discovered-urls` - Discovered URLs with metadata | |
| - `scraping-targets` - Prioritized scraping targets | |
| --- | |
| ## 📦 Files Added/Updated | |
| ### New Files | |
| - [pipeline/huggingface_publisher.py](../pipeline/huggingface_publisher.py) - HuggingFace publisher (~400 lines) | |
| - [docs/HUGGINGFACE_PUBLISHING.md](HUGGINGFACE_PUBLISHING.md) - Complete publishing guide | |
| ### Updated Files | |
| - [requirements.txt](../requirements.txt) - Added `datasets>=2.16.0` and `huggingface-hub>=0.20.0` | |
| - [config/settings.py](../config/settings.py) - Added `huggingface_token`, `hf_organization`, `hf_dataset_prefix` | |
| - [.env.example](../.env.example) - Added HuggingFace configuration | |
| - [main.py](../main.py) - Added `publish-to-hf` CLI command | |
| - [README.md](../README.md) - Added HuggingFace publishing section | |
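To make the three new settings concrete, here is a minimal sketch of how they could be read from the environment. The field names come from the list above and the variable names from `.env`; the `HFSettings` dataclass and `load_hf_settings` function are hypothetical and may not match what `config/settings.py` actually does.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HFSettings:
    huggingface_token: str
    hf_organization: Optional[str]
    hf_dataset_prefix: str


def load_hf_settings(env: Mapping[str, str] = os.environ) -> HFSettings:
    """Build settings from environment variables, failing fast without a token."""
    token = env.get("HUGGINGFACE_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HUGGINGFACE_TOKEN is not set")
    return HFSettings(
        huggingface_token=token,
        # An empty or missing organization is treated as "not configured".
        hf_organization=env.get("HF_ORGANIZATION", "").strip() or None,
        hf_dataset_prefix=env.get("HF_DATASET_PREFIX", "").strip(),
    )
```

Passing the mapping as a parameter makes the loader easy to exercise with a plain dict instead of the real environment.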
| --- | |
| ## Quick Start | |
| ### 1. Get HuggingFace Token | |
| Visit: https://huggingface.co/settings/tokens | |
| Create a **Write** token | |
| ### 2. Configure | |
| Add to `.env`: | |
| ```bash | |
| HUGGINGFACE_TOKEN=hf_your_write_token_here | |
| HF_ORGANIZATION=CommunityOne | |
| HF_DATASET_PREFIX=oral-health-policy-pulse | |
| ``` | |
| ### 3. Install Dependencies | |
| ```bash | |
| pip install "datasets>=2.16.0" "huggingface-hub>=0.20.0" | |
| ``` | |
| ### 4. Publish | |
| ```bash | |
| # Publish all datasets | |
| python main.py publish-to-hf --dataset all | |
| # Or publish individually | |
| python main.py publish-to-hf --dataset census | |
| python main.py publish-to-hf --dataset discovered-urls | |
| ``` | |
| --- | |
| ## What Gets Published | |
| ### Dataset URLs | |
| Your datasets will be available at: | |
| - https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-census-gid | |
| - https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-gov-domains | |
| - https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-nces-schools | |
| - https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-discovered-urls | |
| - https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-scraping-targets | |
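The URLs above follow one pattern: organization, then the dataset prefix joined to the dataset name with a hyphen. The sketch below reproduces that pattern; the helper names `repo_id` and `dataset_url` are illustrative and are not taken from the publisher module.

```python
DATASET_NAMES = [
    "census-gid",
    "gov-domains",
    "nces-schools",
    "discovered-urls",
    "scraping-targets",
]


def repo_id(organization: str, prefix: str, name: str) -> str:
    """Return the Hub repo ID, e.g. 'CommunityOne/oral-health-policy-pulse-census-gid'."""
    return f"{organization}/{prefix}-{name}"


def dataset_url(repo: str) -> str:
    """Return the public dataset page for a repo ID."""
    return f"https://huggingface.co/datasets/{repo}"


def all_dataset_urls(organization: str, prefix: str) -> list:
    """List the page URL of every publishable dataset."""
    return [dataset_url(repo_id(organization, prefix, n)) for n in DATASET_NAMES]
```

Calling `all_dataset_urls("CommunityOne", "oral-health-policy-pulse")` yields the five URLs listed above, so changing `HF_ORGANIZATION` or `HF_DATASET_PREFIX` changes every URL in the same way.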
| ### Public Access | |
| Anyone can load your datasets: | |
| ```python | |
| from datasets import load_dataset | |
| # Load census data | |
| census = load_dataset("CommunityOne/oral-health-policy-pulse-census-gid") | |
| # Load discovered URLs | |
| urls = load_dataset("CommunityOne/oral-health-policy-pulse-discovered-urls") | |
| # Access specific split | |
| counties = census["counties"] | |
| print(f"Total counties: {len(counties)}") | |
| ``` | |
| --- | |
| ## 💡 Use Cases | |
| ### For Researchers | |
| ```python | |
| # Analyze jurisdiction coverage | |
| from datasets import load_dataset | |
| census = load_dataset("CommunityOne/oral-health-policy-pulse-census-gid") | |
| # Convert the municipalities split to a pandas DataFrame (requires pandas) | |
| df = census["municipalities"].to_pandas() | |
| # Total municipal population by state, largest first | |
| population_by_state = df.groupby("state_name")["population"].sum().sort_values(ascending=False) | |
| print(population_by_state.head()) | |
| ``` | |
| ### For Civic Hackers | |
| ```python | |
| from datasets import load_dataset | |
| # Get all county .gov domains (the filter is applied to every split) | |
| domains = load_dataset("CommunityOne/oral-health-policy-pulse-gov-domains") | |
| counties = domains.filter(lambda x: x['Domain Type'] == 'County') | |
| ``` | |
| ### For Data Scientists | |
| ```python | |
| from datasets import load_dataset | |
| # Keep only discovered URLs with a confidence score above 0.8 | |
| urls = load_dataset("CommunityOne/oral-health-policy-pulse-discovered-urls") | |
| high_conf = urls.filter(lambda x: x['confidence_score'] > 0.8) | |
| ``` | |
| --- | |
| ## Update Workflow | |
| ### After Each Discovery Run | |
| ```bash | |
| # Run discovery | |
| python main.py discover-jurisdictions | |
| # Publish updated datasets | |
| python main.py publish-to-hf --dataset discovered-urls | |
| python main.py publish-to-hf --dataset scraping-targets | |
| ``` | |
| ### Monthly Source Data Updates | |
| ```bash | |
| # Re-ingest source data | |
| python main.py discover-jurisdictions | |
| # Publish refreshed datasets | |
| python main.py publish-to-hf --dataset census | |
| python main.py publish-to-hf --dataset gov-domains | |
| python main.py publish-to-hf --dataset nces-schools | |
| ``` | |
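Either workflow above can be scripted so that publishing stops as soon as one step fails. The following is a sketch only: it assumes it runs from the repository root next to `main.py`, and the helper names `build_commands` and `run_all` are made up for the example.

```python
import subprocess
import sys
from typing import List, Sequence


def build_commands(datasets: Sequence[str]) -> List[List[str]]:
    """Return the discovery command followed by one publish command per dataset."""
    commands = [[sys.executable, "main.py", "discover-jurisdictions"]]
    for name in datasets:
        commands.append(
            [sys.executable, "main.py", "publish-to-hf", "--dataset", name]
        )
    return commands


def run_all(commands: Sequence[Sequence[str]]) -> None:
    """Run the commands in order; check=True raises CalledProcessError on failure."""
    for command in commands:
        subprocess.run(list(command), check=True)


# Example (not executed here):
# run_all(build_commands(["discovered-urls", "scraping-targets"]))
```

Because `check=True` aborts on the first non-zero exit code, a failed discovery run never leads to publishing stale data.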
| --- | |
| ## 🎯 CLI Options | |
| ```bash | |
| # Publish all datasets | |
| python main.py publish-to-hf --dataset all | |
| # Publish specific dataset | |
| python main.py publish-to-hf --dataset census | |
| python main.py publish-to-hf --dataset gov-domains | |
| python main.py publish-to-hf --dataset nces-schools | |
| python main.py publish-to-hf --dataset discovered-urls | |
| python main.py publish-to-hf --dataset scraping-targets | |
| # Make datasets private | |
| python main.py publish-to-hf --dataset all --private | |
| # Sample census data (faster for testing) | |
| python main.py publish-to-hf --dataset census --sample | |
| ``` | |
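For readers who want to see how these options fit together, the sketch below models them with the standard-library `argparse`. It is an assumption about the shape of the interface, not a copy of `main.py`, which may use a different CLI framework; making `--dataset` required is also a choice made for this example.

```python
import argparse

DATASET_CHOICES = [
    "all",
    "census",
    "gov-domains",
    "nces-schools",
    "discovered-urls",
    "scraping-targets",
]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a publish-to-hf subcommand and its three options."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    publish = subparsers.add_parser(
        "publish-to-hf", help="publish datasets to HuggingFace Hub"
    )
    publish.add_argument("--dataset", choices=DATASET_CHOICES, required=True)
    publish.add_argument("--private", action="store_true",
                         help="create the dataset repos as private")
    publish.add_argument("--sample", action="store_true",
                         help="publish a sample of the census data for testing")
    return parser
```

With `choices`, a misspelled dataset name is rejected before any upload starts.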
| --- | |
| ## Privacy & Security | |
| ### What's Safe to Publish | |
| ✅ **Public Data:** | |
| - Census Bureau GID (already public) | |
| - CISA .gov domains (already public) | |
| - NCES school districts (already public) | |
| - Discovered government URLs (public websites) | |
| - Scraping targets (public information) | |
| ⚠️ **Use `--private` for:** | |
| - Scraped meeting minutes content | |
| - Internal analysis results | |
| - Custom annotations | |
| ❌ **Never Publish:** | |
| - Personal information (PII) | |
| - API keys or tokens | |
| - Internal comments/notes | |
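A simple pre-publish scan can catch some of these cases before rows leave your machine. The sketch below flags string values that look like a HuggingFace token (the `hf_` prefix) or an email address. The regular expressions are deliberately rough, the function name `find_sensitive_values` is hypothetical, and a clean result does not prove a dataset is free of PII.

```python
import re
from typing import Iterable, List, Tuple

# Heuristic patterns: token-like strings and email-like strings.
TOKEN_PATTERN = re.compile(r"hf_[A-Za-z0-9]{20,}")
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_sensitive_values(rows: Iterable[dict]) -> List[Tuple[int, str, str]]:
    """Return (row index, column, kind) for every suspicious string value."""
    findings = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # only string fields are scanned
            if TOKEN_PATTERN.search(value):
                findings.append((index, column, "token"))
            if EMAIL_PATTERN.search(value):
                findings.append((index, column, "email"))
    return findings
```

Treat any finding as a reason to review the row by hand, or to publish with `--private`, rather than as an automatic filter.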
| ### Token Security | |
| - Store the token in your `.env` file and keep that file gitignored | |
| - Use a **Write** token (not a fine-grained token) | |
| - Revoke the token if it is compromised | |
| --- | |
| ## Documentation | |
| Complete guide: [HUGGINGFACE_PUBLISHING.md](HUGGINGFACE_PUBLISHING.md) | |
| Covers: | |
| - Detailed setup instructions | |
| - Dataset structure and schemas | |
| - Programmatic publishing in Python | |
| - Loading datasets in Python/R | |
| - Collaboration features | |
| - Troubleshooting | |
| --- | |
| ## Community Impact | |
| **By publishing your datasets, you enable:** | |
| - Reproducible research on government accessibility | |
| - Cross-project collaboration | |
| - Discovery of missing government websites | |
| - Tracking government digital infrastructure over time | |
| - Educational use for civic tech training | |
| **Your jurisdiction discovery data helps the entire civic tech community!** | |
| --- | |
| ## Benefits | |
| | Feature | Before | After | | |
| |---------|--------|-------| | |
| | **Data Storage** | Local only | Local + HuggingFace Hub | | |
| | **Data Sharing** | Manual export | One-command publish | | |
| | **Collaboration** | Email/Dropbox | Public datasets with versioning | | |
| | **Discovery** | None | Searchable on HuggingFace | | |
| | **Access** | Your team only | Anyone worldwide | | |
| | **Versioning** | Manual | Automatic Git-style tracking | | |
| --- | |
| **Ready to share your jurisdiction discovery data with the world!** 🦷✨ | |