# HuggingFace Dataset Management
Scripts for preparing and uploading datasets to HuggingFace.
## Setup & Configuration
### check-hf-vars.py
Verify HuggingFace environment variables are properly configured.
**Usage:**
```bash
python scripts/huggingface/check-hf-vars.py
```
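As an illustration of the kind of check this script performs, here is a minimal sketch. It is not the contents of `check-hf-vars.py`. The variable names come from the Environment Variables section of this README, and the helper names `missing_hf_vars` and `mask` are hypothetical.

```python
import os

# Names taken from the "Environment Variables" section of this README.
REQUIRED_VARS = ("HF_ORGANIZATION", "HF_USERNAME", "HF_TOKEN")


def missing_hf_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def mask(value, visible=4):
    """Hide most of a secret so it can be printed in a report."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)
```

Calling `missing_hf_vars()` with no arguments inspects the current process environment. An empty list means every variable is present.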
### setup-huggingface.sh
Initial setup for HuggingFace integration (credentials, organization).
**Usage:**
```bash
./scripts/huggingface/setup-huggingface.sh
```
## Preparation
### reorganize_for_huggingface.py
Reorganizes data files into a HuggingFace-compatible structure.
**Usage:**
```bash
python scripts/huggingface/reorganize_for_huggingface.py
```
### finalize_huggingface_structure.py
Final validation and preparation of HuggingFace datasets.
**Usage:**
```bash
python scripts/huggingface/finalize_huggingface_structure.py
```
## Upload Scripts
### upload_to_huggingface.py
**Main upload script** - uploads all datasets to HuggingFace.
**Usage:**
```bash
python scripts/huggingface/upload_to_huggingface.py
```
**Requirements:**
- HuggingFace token available in the environment (`HF_TOKEN`)
- `HF_ORGANIZATION` set in `.env`
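The two requirements can be checked before any upload starts. The sketch below assumes that datasets are published under the organization namespace, so the repo id takes the form `organization/dataset`. That usage of `HF_ORGANIZATION`, and the function name `resolve_upload_target`, are assumptions for illustration, not the upload script's actual code.

```python
import os


def resolve_upload_target(dataset_name, env=None):
    """Build an 'organization/dataset' repo id and fetch the token.

    Raises RuntimeError when one of the requirements is missing.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    organization = env.get("HF_ORGANIZATION", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set")
    if not organization:
        raise RuntimeError("HF_ORGANIZATION is not set")
    return f"{organization}/{dataset_name}", token
```

Failing early this way avoids a partially completed upload when the configuration is incomplete.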
### Specific Uploads
- `upload_nonprofits_to_hf.py` - Upload nonprofit datasets
- `upload_meetings_to_hf.py` - Upload meeting datasets
- `upload_state_splits_to_hf.py` - Upload state-partitioned data
## Publishing & Deployment
### deploy-huggingface.sh
**Main deployment script** - builds and deploys to HuggingFace Spaces.
**Usage:**
```bash
./scripts/huggingface/deploy-huggingface.sh
```
### publish_gold_datasets.py
Publish processed gold datasets to HuggingFace.
**Usage:**
```bash
python scripts/huggingface/publish_gold_datasets.py
```
### delete_and_publish_all_datasets.py
**Dangerous!** Deletes all existing datasets and republishes them from scratch (fresh start).
**Usage:**
```bash
python scripts/huggingface/delete_and_publish_all_datasets.py
```
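Because this script is destructive, a confirmation guard is a reasonable safeguard. Whether the script already prompts is not documented here; the sketch below shows one way such a guard could look, and `confirm_destructive` is a hypothetical helper.

```python
def confirm_destructive(action, expected="DELETE", ask=input):
    """Return True only if the user types the exact confirmation word.

    `ask` defaults to the built-in input() and can be replaced in tests.
    """
    answer = ask(f"{action}\nType {expected} to continue: ")
    return answer.strip() == expected
```

A caller would exit without touching any dataset when the function returns `False`.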
## Error Recovery
### retry_failed_datasets.py
Retry uploading datasets that failed previously.
**Usage:**
```bash
python scripts/huggingface/retry_failed_datasets.py
```
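A common pattern for this kind of recovery is retrying with exponential backoff. The following is a sketch of that pattern under stated assumptions, not the script's implementation: `upload` stands for any callable that uploads one dataset and raises on failure, and the attempt count and delays are illustrative defaults.

```python
import time


def retry_failed(names, upload, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call upload(name) for each name, retrying with exponential backoff.

    Returns the names that still failed after the final attempt.
    """
    still_failed = []
    for name in names:
        for attempt in range(attempts):
            try:
                upload(name)
                break  # success, move on to the next dataset
            except Exception as exc:
                print(f"{name}: attempt {attempt + 1} failed: {exc}")
                if attempt + 1 < attempts:
                    # Wait base_delay, then 2x, 4x, ... between attempts.
                    sleep(base_delay * 2 ** attempt)
        else:
            # The inner loop never hit `break`, so every attempt failed.
            still_failed.append(name)
    return still_failed
```

The returned list can be written to a file so a later run only revisits the datasets that are still failing.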
### fix_and_publish_failed.py
Fix and republish specific failed datasets.
**Usage:**
```bash
python scripts/huggingface/fix_and_publish_failed.py
```
## Maintenance
### hf-dataset-cleanup.sh
Clean up old/orphaned HuggingFace datasets.
**Usage:**
```bash
./scripts/huggingface/hf-dataset-cleanup.sh
```
### force-hf-rebuild.sh
Forces a complete rebuild and re-upload (clears the cache).
**Usage:**
```bash
./scripts/huggingface/force-hf-rebuild.sh
```
## Workflow
1. Setup: `setup-huggingface.sh`
2. Check config: `check-hf-vars.py`
3. Prepare data: `reorganize_for_huggingface.py`
4. Finalize: `finalize_huggingface_structure.py`
5. Upload: `upload_to_huggingface.py`
6. Deploy: `deploy-huggingface.sh`
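The steps above can also be chained from a small driver that stops at the first failure. This is a hypothetical wrapper, not a script in the repository. It assumes it is run from the repository root, that the shell scripts are executable, and that none of the steps needs extra arguments.

```python
import subprocess
import sys
from pathlib import Path

SCRIPT_DIR = Path("scripts/huggingface")

# Same order as the numbered workflow above.
WORKFLOW = [
    "setup-huggingface.sh",
    "check-hf-vars.py",
    "reorganize_for_huggingface.py",
    "finalize_huggingface_structure.py",
    "upload_to_huggingface.py",
    "deploy-huggingface.sh",
]


def command_for(script):
    """Build the argument list used to launch one workflow script."""
    path = str(SCRIPT_DIR / script)
    if script.endswith(".py"):
        return [sys.executable, path]
    return [path]  # shell scripts are executed directly


def run_workflow(steps=WORKFLOW, run=subprocess.run):
    """Run each step in order; check=True aborts on a non-zero exit."""
    for script in steps:
        print(f"Running {script}")
        run(command_for(script), check=True)
```

Calling `run_workflow()` raises `subprocess.CalledProcessError` as soon as a step fails, so later steps never run against a half-prepared state.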
## Environment Variables
Required in `.env`:
```bash
HF_ORGANIZATION=CommunityOne
HF_USERNAME=CommunityOne
HF_TOKEN=hf_...
```
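How the scripts load `.env` is not stated in this README. As a rough sketch, a file in the format above can be read with a few lines of standard-library Python; `parse_env` is a hypothetical helper and handles only simple `KEY=VALUE` lines, not the full syntax that dedicated dotenv libraries support.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values
```

For example, `parse_env(open(".env").read())` returns a dictionary that can be merged into `os.environ` before the other scripts run.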