# DAHS_2: Hugging Face Space Upload & Run Guide
End-to-end procedure to run the Q1 training pipeline on a Hugging Face Space
with bulletproof artifact persistence to a Hub model repo.
---
## 0. Recommended hardware tier
This project is **CPU-bound** (SimPy + scikit-learn + XGBoost on tabular data).
Do **NOT** select a GPU tier; it will burn your credits at 5–10× the cost
without any speedup.
| Tier | Approx $/hr | Pipeline time (5000 scen, 1000 eval seeds) |
|-------------------------|-------------|---------------------------------------------|
| **CPU upgrade (16 vCPU, 64 GB)** | **~$0.05–0.10** | **~2–4 h** ← recommended |
| CPU basic (2 vCPU, 16 GB) | free | ~12 h (works, just slow) |
| Any GPU | $1+/hr | identical wall time, all GPUs idle |
At 16 vCPU you should finish a full Q1 run for **well under $1** of your $23.
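
As a rough check on that figure, the sketch below multiplies the wall-time bounds from the table by its hourly-rate bounds. The function name `estimate_cost` is illustrative, and the rates are the approximate values quoted in the table, not current Hugging Face pricing.

```python
def estimate_cost(hours: float, usd_per_hour: float) -> float:
    """Return the estimated cost of a run in USD, rounded to cents."""
    return round(hours * usd_per_hour, 2)


# Bounds taken from the hardware table above (approximate values).
best_case = estimate_cost(hours=2, usd_per_hour=0.05)
worst_case = estimate_cost(hours=4, usd_per_hour=0.10)
print(f"CPU upgrade estimate: ${best_case:.2f} to ${worst_case:.2f}")
```

Even the upper bound (4 h at $0.10/h) stays well below $1.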
---
## 1. Files to upload to the Space
Upload the **entire repository tree below**. Do NOT upload `__pycache__/`,
`.pytest_cache/`, `.git/`, `node_modules/`, `website/dist/`, or local
`models/`/`data/`/`results/` folders; those are produced by the run and
pushed to the model repo automatically.
```
DAHS_2/
├── Dockerfile
├── requirements.txt
├── README.md
├── HF_UPLOAD_GUIDE.md
├── server.py              # only needed if you also serve the demo from the Space
├── start.py
├── src/
│   ├── __init__.py
│   ├── data_generator.py
│   ├── evaluator.py
│   ├── features.py
│   ├── heuristics.py
│   ├── hf_persistence.py  ← new: bulletproof Hub uploader
│   ├── hybrid_scheduler.py
│   ├── presets.py
│   ├── references.py
│   ├── simulator.py
│   ├── train_priority.py
│   └── train_selector.py
├── scripts/
│   ├── hf_runner.py       ← Space entrypoint (matches Dockerfile CMD)
│   ├── run_pipeline.py
│   ├── calibrate_real_data.py
│   ├── foolproof_retrain.py
│   ├── run_preset_benchmark.py
│   └── download_hf_artifacts.py
├── tests/                 # optional but small; keep for paper reproducibility
└── data/                  # only data/benchmarks/* if you have curated benchmarks;
                           # data/raw/ is regenerated each run
```
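
The exclusion rules above can be checked before uploading. The following is a minimal sketch using only the standard library; `should_upload` and `files_to_upload` are hypothetical helpers, not part of the repository, and they assume that generated folders sit at the repository root and that only `data/benchmarks/` is kept from `data/`.

```python
from pathlib import Path, PurePosixPath
from typing import List

# Directory names that are skipped wherever they appear in the tree.
SKIP_ANYWHERE = {"__pycache__", ".pytest_cache", ".git", "node_modules"}
# Generated folders at the repository root (assumed layout).
SKIP_PREFIXES = ("models/", "results/", "website/dist/")


def should_upload(rel_path: str) -> bool:
    """Decide whether a repo-relative POSIX path belongs in the Space."""
    if SKIP_ANYWHERE.intersection(PurePosixPath(rel_path).parts):
        return False
    if rel_path.startswith(SKIP_PREFIXES):
        return False
    # Keep curated benchmarks; everything else under data/ is regenerated.
    if rel_path.startswith("data/") and not rel_path.startswith("data/benchmarks/"):
        return False
    return True


def files_to_upload(root: str) -> List[str]:
    """List the repo-relative paths under `root` that pass the filter."""
    base = Path(root)
    rels = (p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file())
    return sorted(r for r in rels if should_upload(r))
```

Running `files_to_upload(".")` from the project root before pushing gives a quick view of what would land in the Space.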
The pipeline writes the following artifacts and pushes them to your **model repo**:
```
<your-username>/DAHS-Models/
├── data/raw/selector_dataset.csv
├── data/raw/priority_dataset.csv
├── models/selector_dt.joblib
├── models/selector_rf.joblib
├── models/selector_xgb.joblib
├── models/priority_gbr.joblib
├── models/feature_names.json
├── models/feature_ranges.json
├── models/dt_structure.json
├── results/run_manifest.json
├── results/pip_freeze.txt
├── results/run_status.txt
├── results/selector_metrics.json
├── results/selector_metrics_table.csv
├── results/priority_metrics.json
├── results/benchmark_results.csv
├── results/benchmark_summary.json
├── results/statistical_tests.json
├── results/switching_analysis.json
├── results/paper_summary_table.csv
└── results/plots/*.png
```
---
## 2. Create the model repo (one-time)
This is where artifacts go and **survive runtime termination**.
1. Go to https://huggingface.co/new and choose **Model**, not Space.
2. Owner: your username. Name: `DAHS-Models`. Visibility: your choice.
3. Click **Create repository**. Leave it empty; the run populates it.
Note the full id: `your-username/DAHS-Models`.
---
## 3. Create a fine-grained access token
1. https://huggingface.co/settings/tokens → **Create new token** → **Fine-grained**.
2. **Repository permissions** → click **Add repository** → select `your-username/DAHS-Models` → check **Write access to contents and discussions**.
3. (Optional) also grant **Manage repo** to the Space if you want auto-pause on completion.
4. Copy the token starting with `hf_…`; you'll paste it in step 5.
---
## 4. Create the Space
1. https://huggingface.co/new-space → name `DAHS-Training`.
2. **SDK**: Docker.
3. **Hardware**: pick **CPU upgrade** (16 vCPU, 64 GB RAM).
4. Visibility: your choice. Click **Create Space**.
---
## 5. Configure secrets (Space → Settings → Variables and secrets)
| Name | Type | Value |
|-------------|--------|-------------------------------------------------|
| `HF_TOKEN`  | Secret | `hf_…` token from step 3                        |
| `REPO_ID` | Variable | `your-username/DAHS-Models` |
| `SPACE_ID` | Variable | `your-username/DAHS-Training` (auto-pause target) |
| `DAHS_SCENARIOS` | Variable (optional) | Override default 5000 scenarios |
| `DAHS_EVAL_SEEDS` | Variable (optional) | Override default 1000 eval seeds |
`SPACE_ID` controls auto-pause after the run; without it you must pause
manually to stop billing.
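
Inside the container these settings arrive as environment variables. The sketch below shows one way a runner could read them, using the names and defaults from the table. `RunConfig` and `load_config` are hypothetical; the actual `scripts/hf_runner.py` may parse its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunConfig:
    token: str
    repo_id: str
    space_id: Optional[str]  # None means no auto-pause target
    scenarios: int
    eval_seeds: int


def load_config(env: Mapping[str, str] = os.environ) -> RunConfig:
    """Read the Space settings, failing fast if a required one is missing."""
    missing = [key for key in ("HF_TOKEN", "REPO_ID") if not env.get(key)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return RunConfig(
        token=env["HF_TOKEN"],
        repo_id=env["REPO_ID"],
        space_id=env.get("SPACE_ID") or None,
        # Defaults match the table: 5000 scenarios, 1000 eval seeds.
        scenarios=int(env.get("DAHS_SCENARIOS") or 5000),
        eval_seeds=int(env.get("DAHS_EVAL_SEEDS") or 1000),
    )
```

Failing at startup on a missing `HF_TOKEN` or `REPO_ID` is cheaper than discovering the problem after billed compute time.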
---
## 6. Push the code to the Space
From the project root, with your Hub credentials configured:
```bash
git lfs install # only once per machine
git remote add space https://huggingface.co/spaces/your-username/DAHS-Training
git add Dockerfile requirements.txt src/ scripts/ tests/
git add README.md HF_UPLOAD_GUIDE.md server.py start.py
git commit -m "DAHS_2 Q1 pipeline"
git push space main
```
Alternatively, drag the files into the Space's web file browser. Either
way, the **Dockerfile** at the repo root is what the Space builds, and its
`CMD ["python", "scripts/hf_runner.py"]` is the entrypoint.
---
## 7. Watch the build and run
1. Open the Space; the **Logs** tab shows the Docker build (3–5 min on first push).
2. Once the container starts you should see:
```
--- DAHS_2 HF RUNNER STARTING ---
CPUs : 16, workers=15
Repo : your-username/DAHS-Models
[hub] periodic uploader started (every 300s)
[ok] dummy health server on :7860
--- PIPELINE: 5000 scenarios, 1000 eval seeds, 15 workers ---
```
3. Within ~5 min the model repo should receive its first commit
(`results/run_manifest.json` and `results/pip_freeze.txt`). Verify at
`https://huggingface.co/your-username/DAHS-Models/commits/main`.
**If no commit appears within 10 minutes, the token or `REPO_ID` is wrong.
Stop the Space immediately and re-check steps 3 and 5.**
4. New commits land every 5 minutes. Per-step commits (`selector_dataset`,
`priority_dataset`, `selector_models`, `priority_model`, `evaluation`)
land as each pipeline phase finishes.
Total expected wall time on 16 vCPU: **2–4 hours**.
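
The first-commit check from item 3 above can also be scripted instead of refreshing the commits page. This is a sketch under assumptions: `has_commit_since` and `fetch_commits` are hypothetical helpers, `fetch_commits` relies on `HfApi.list_repo_commits` from `huggingface_hub` and needs network access, and the demonstration at the bottom uses stand-in objects rather than real Hub data.

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


def has_commit_since(commits, started_at: datetime) -> bool:
    """Return True if any commit was created at or after `started_at`.

    Each item only needs a timezone-aware `created_at` attribute.
    """
    return any(c.created_at >= started_at for c in commits)


def fetch_commits(repo_id: str, token: str):
    """Fetch the commit history from the Hub (network access required)."""
    from huggingface_hub import HfApi  # lazy import so the demo below runs offline

    return HfApi(token=token).list_repo_commits(repo_id)


# Offline demonstration with stand-in commit objects (hypothetical data).
run_start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
old = SimpleNamespace(created_at=run_start - timedelta(days=1))
new = SimpleNamespace(created_at=run_start + timedelta(minutes=4))
print(has_commit_since([old], run_start))
print(has_commit_since([old, new], run_start))
```

Against the real repo, the call would be `has_commit_since(fetch_commits(repo_id, token), run_start)`, where `run_start` is the time the Space container started.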
---
## 8. After the run
* `results/run_status.txt` will read `SUCCESS` or `FAILED (exit N)`.
* The Space auto-pauses if `SPACE_ID` was set. Verify the **Status** badge
shows `Paused` so you stop being billed.
* All artifacts are in `your-username/DAHS-Models`. Pull them locally with:
```bash
python scripts/download_hf_artifacts.py
```
or via:
```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="your-username/DAHS-Models",
                  local_dir="./pulled_artifacts")
```
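
After pulling the artifacts, the status file can be checked programmatically. The sketch below parses the two formats listed above (`SUCCESS` and `FAILED (exit N)`); `parse_run_status` is a hypothetical helper, and any other file content is treated as unknown.

```python
import re
from typing import Optional

_FAILED = re.compile(r"^FAILED \(exit (-?\d+)\)$")


def parse_run_status(text: str) -> Optional[int]:
    """Return 0 for SUCCESS, N for 'FAILED (exit N)', or None if unrecognized."""
    status = text.strip()
    if status == "SUCCESS":
        return 0
    match = _FAILED.match(status)
    return int(match.group(1)) if match else None


print(parse_run_status("SUCCESS\n"))
print(parse_run_status("FAILED (exit 137)"))
```

A `None` result usually means the file is missing its final line, for example because the run is still in progress.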
---
## 9. What survives if the runtime is killed mid-run?
Three independent persistence layers protect against the previous "models
disappeared" failure:
| Layer | Trigger | What it uploads |
|-------|---------|------------------|
| **Per-step** | After each pipeline phase | The folder produced by that phase |
| **Periodic** | Every 5 min (background thread) | All of `data/`, `models/`, `results/`, `logs/` |
| **Terminal** | SIGTERM / SIGINT / `atexit` | Final consolidated upload |
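
The periodic and terminal layers in the table can be sketched with the standard library alone. This is a minimal illustration, not the contents of `src/hf_persistence.py`: `PeriodicUploader`, `install_terminal_hooks`, and the `upload` callback are hypothetical names, and the demo uses a very short interval in place of the 300 s from the guide.

```python
import atexit
import signal
import threading
import time
from typing import Callable


class PeriodicUploader:
    """Call `upload` every `interval` seconds, plus once more at shutdown."""

    def __init__(self, upload: Callable[[str], None], interval: float = 300.0):
        self._upload = upload
        self._interval = interval
        self._stop = threading.Event()
        self._finalized = False
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _loop(self) -> None:
        # Event.wait returns False on timeout and True once stop is requested.
        while not self._stop.wait(self._interval):
            self._upload("periodic")

    def finalize(self) -> None:
        """Stop the background loop and run the terminal upload exactly once."""
        if self._finalized:
            return
        self._finalized = True
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join(timeout=5)
        self._upload("terminal")


def install_terminal_hooks(uploader: PeriodicUploader) -> None:
    """Trigger the final upload on normal exit, SIGTERM, or SIGINT.

    Must be called from the main thread, as required by signal.signal.
    """
    atexit.register(uploader.finalize)

    def _handler(signum, frame):
        uploader.finalize()
        raise SystemExit(128 + signum)

    signal.signal(signal.SIGTERM, _handler)
    signal.signal(signal.SIGINT, _handler)


# Demo: record the upload reasons instead of contacting the Hub.
calls = []
uploader = PeriodicUploader(calls.append, interval=0.01)
uploader.start()
time.sleep(0.2)
uploader.finalize()
```

In a real runner the `upload` callback would push `data/`, `models/`, `results/`, and `logs/` to the model repo, and `install_terminal_hooks(uploader)` would be called once at startup.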
Worst-case loss: ~5 min of work between periodic uploads, **never the whole
run**. Each upload is retried with exponential backoff (4 attempts) so a
flaky Hub call won't lose state.
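
The retry behaviour can be illustrated as follows. This is a sketch of exponential backoff with 4 attempts, not the project's actual code: `with_retries`, the 1 s base delay, and the doubling factor are assumptions, and the `sleep` parameter exists only so the delays can be observed without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on any exception with delays of 1x, 2x, 4x, ..."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Demo: a call that fails twice and then succeeds.
delays = []
state = {"calls": 0}


def flaky_upload() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated Hub hiccup")
    return "ok"


result = with_retries(flaky_upload, sleep=delays.append)
```

With these defaults, a call that keeps failing is attempted 4 times with waits of 1 s, 2 s, and 4 s in between before the error is re-raised.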
---
## 10. Sanity-check checklist before clicking "Run"
Before you spend any credits, verify the local checks pass:
```bash
# from repo root
pip install -r requirements.txt
python -c "from src.hf_persistence import HubPersistor, from_env; print('OK')"
python -m pytest tests/ -q # unit tests
python scripts/run_pipeline.py --quick # 50 scenarios, 20 eval seeds
# finishes in ~2-3 minutes locally
```
If `--quick` produces `models/*.joblib`, `results/selector_metrics.json`,
`results/priority_metrics.json`, `results/benchmark_summary.json`, and
`results/paper_summary_table.csv`, the pipeline is verified end-to-end.
You can then push to the Space with confidence.
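
The artifact check after a `--quick` run can be automated with a short script. `missing_quick_artifacts` is a hypothetical helper, and the file list is copied from the paragraph above.

```python
from pathlib import Path
from typing import List

REQUIRED_FILES = (
    "results/selector_metrics.json",
    "results/priority_metrics.json",
    "results/benchmark_summary.json",
    "results/paper_summary_table.csv",
)


def missing_quick_artifacts(root: str = ".") -> List[str]:
    """Return the expected outputs that are absent under `root`."""
    base = Path(root)
    missing = [rel for rel in REQUIRED_FILES if not (base / rel).is_file()]
    models_dir = base / "models"
    # At least one serialized model must exist.
    if not (models_dir.is_dir() and any(models_dir.glob("*.joblib"))):
        missing.append("models/*.joblib")
    return missing
```

An empty list from `missing_quick_artifacts(".")` at the repo root means every expected output is present.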
---
## 11. Re-running
To re-run with different scenario/seed counts without rebuilding:
1. Open the Space β†’ **Settings β†’ Variables and secrets**
2. Edit `DAHS_SCENARIOS` / `DAHS_EVAL_SEEDS`
3. **Restart Space** (not Factory rebuild; a restart is much faster)
Each re-run produces a new commit on the model repo, so you can compare
runs side-by-side without overwriting prior artifacts.