| # DAHS_2: Hugging Face Space Upload & Run Guide | |
| End-to-end procedure to run the Q1 training pipeline on a Hugging Face Space | |
| with bulletproof artifact persistence to a Hub model repo. | |
| --- | |
| ## 0. Recommended hardware tier | |
| This project is **CPU-bound** (SimPy + scikit-learn + XGBoost on tabular data). | |
| Do **NOT** select a GPU tier: it will burn your credits at 5–10× the cost | |
| without any speedup. | |
| | Tier | Approx $/hr | Pipeline time (5000 scen, 1000 eval seeds) | | |
| |-------------------------|-------------|---------------------------------------------| | |
| | **CPU upgrade (16 vCPU, 64 GB)** | **~$0.05–0.10** | **~2–4 h** (recommended) | | |
| | CPU basic (2 vCPU, 16 GB) | free | ~12 h (works, just slow) | | |
| | Any GPU | $1+/hr | identical wall time, all GPUs idle | | |
| At 16 vCPU you should finish a full Q1 run for **well under $1** of your $23. | |
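The "well under $1" figure follows directly from the table. A minimal sketch of the arithmetic, using the approximate rates and wall times quoted above (the helper name `run_cost_usd` is illustrative, not part of the project):

```python
def run_cost_usd(rate_per_hour: float, hours: float) -> float:
    """Cost of one run: hourly rate times wall-clock hours."""
    return rate_per_hour * hours


# Approximate figures for the CPU upgrade tier, taken from the table above.
best = run_cost_usd(0.05, 2)    # lowest rate, fastest run
worst = run_cost_usd(0.10, 4)   # highest rate, slowest run

print(f"CPU upgrade: ${best:.2f} to ${worst:.2f} per full run")
print(f"Full runs covered by $23 in the worst case: {int(23 // worst)}")
```

Even the worst case stays below half a dollar per run, which leaves room for many re-runs on the same budget.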
| --- | |
| ## 1. Files to upload to the Space | |
| Upload the **entire repository tree below**. Do NOT upload `__pycache__/`, | |
| `.pytest_cache/`, `.git/`, `node_modules/`, `website/dist/`, or local | |
| `models/`/`data/`/`results/` folders; those are produced by the run and | |
| pushed to the model repo automatically. | |
| ``` | |
| DAHS_2/ | |
| ├── Dockerfile | |
| ├── requirements.txt | |
| ├── README.md | |
| ├── HF_UPLOAD_GUIDE.md | |
| ├── server.py               # only needed if you also serve the demo from the Space | |
| ├── start.py | |
| ├── src/ | |
| │   ├── __init__.py | |
| │   ├── data_generator.py | |
| │   ├── evaluator.py | |
| │   ├── features.py | |
| │   ├── heuristics.py | |
| │   ├── hf_persistence.py   ← new, bulletproof Hub uploader | |
| │   ├── hybrid_scheduler.py | |
| │   ├── presets.py | |
| │   ├── references.py | |
| │   ├── simulator.py | |
| │   ├── train_priority.py | |
| │   └── train_selector.py | |
| ├── scripts/ | |
| │   ├── hf_runner.py        ← Space entrypoint (matches Dockerfile CMD) | |
| │   ├── run_pipeline.py | |
| │   ├── calibrate_real_data.py | |
| │   ├── foolproof_retrain.py | |
| │   ├── run_preset_benchmark.py | |
| │   └── download_hf_artifacts.py | |
| ├── tests/                  # optional but small; keep for paper reproducibility | |
| └── data/                   # only data/benchmarks/* if you have curated benchmarks; | |
|                             # data/raw/ is regenerated each run | |
| ``` | |
| The pipeline writes the following artifacts and pushes them to your **model repo**: | |
| ``` | |
| <your-username>/DAHS-Models/ | |
| ├── data/raw/selector_dataset.csv | |
| ├── data/raw/priority_dataset.csv | |
| ├── models/selector_dt.joblib | |
| ├── models/selector_rf.joblib | |
| ├── models/selector_xgb.joblib | |
| ├── models/priority_gbr.joblib | |
| ├── models/feature_names.json | |
| ├── models/feature_ranges.json | |
| ├── models/dt_structure.json | |
| ├── results/run_manifest.json | |
| ├── results/pip_freeze.txt | |
| ├── results/run_status.txt | |
| ├── results/selector_metrics.json | |
| ├── results/selector_metrics_table.csv | |
| ├── results/priority_metrics.json | |
| ├── results/benchmark_results.csv | |
| ├── results/benchmark_summary.json | |
| ├── results/statistical_tests.json | |
| ├── results/switching_analysis.json | |
| ├── results/paper_summary_table.csv | |
| └── results/plots/*.png | |
| --- | |
| ## 2. Create the model repo (one-time) | |
| This is where artifacts go and **survive runtime termination**. | |
| 1. Go to https://huggingface.co/new → choose **Model**, not Space. | |
| 2. Owner: your username. Name: `DAHS-Models`. Visibility: your choice. | |
| 3. Click **Create repository**. Keep the repo empty; the run populates it. | |
| Note the full id: `your-username/DAHS-Models`. | |
| --- | |
| ## 3. Create a fine-grained access token | |
| 1. https://huggingface.co/settings/tokens → **Create new token** → **Fine-grained**. | |
| 2. **Repository permissions** → click **Add repository** → select `your-username/DAHS-Models` → check **Write access to contents and discussions**. | |
| 3. (Optional) If you want auto-pause on completion, also grant the token **Manage repo** on the Space. | |
| 4. Copy the token (it starts with `hf_…`); you'll paste it in step 5. | |
| --- | |
| ## 4. Create the Space | |
| 1. https://huggingface.co/new-space β name `DAHS-Training`. | |
| 2. **SDK**: Docker. | |
| 3. **Hardware**: pick **CPU upgrade** (16 vCPU, 64 GB RAM). | |
| 4. Visibility: your choice. Click **Create Space**. | |
| --- | |
| ## 5. Configure secrets (Space → Settings → Variables and secrets) | |
| | Name | Type | Value | | |
| |-------------|--------|-------------------------------------------------| | |
| | `HF_TOKEN` | Secret | `hf_…` token from step 3 | | |
| | `REPO_ID` | Variable | `your-username/DAHS-Models` | | |
| | `SPACE_ID` | Variable | `your-username/DAHS-Training` (auto-pause target) | | |
| | `DAHS_SCENARIOS` | Variable (optional) | Override default 5000 scenarios | | |
| | `DAHS_EVAL_SEEDS` | Variable (optional) | Override default 1000 eval seeds | | |
| `SPACE_ID` controls auto-pause after the run; without it you must pause | |
| manually to stop billing. | |
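For orientation, the settings in the table could be read inside the container roughly as follows. This is a sketch under stated assumptions, not the project's actual `hf_runner.py`: the names `RunConfig` and `load_run_config` are hypothetical, while the variable names and defaults (5000 scenarios, 1000 eval seeds) come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunConfig:
    token: str
    repo_id: str
    space_id: Optional[str]   # None means no auto-pause target
    scenarios: int
    eval_seeds: int


def load_run_config(env: Mapping[str, str] = os.environ) -> RunConfig:
    """Read the Space's variables and secrets, failing fast on missing ones."""
    missing = [name for name in ("HF_TOKEN", "REPO_ID") if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return RunConfig(
        token=env["HF_TOKEN"],
        repo_id=env["REPO_ID"],
        space_id=env.get("SPACE_ID") or None,
        # Optional overrides; unset or empty values fall back to the defaults.
        scenarios=int(env.get("DAHS_SCENARIOS") or 5000),
        eval_seeds=int(env.get("DAHS_EVAL_SEEDS") or 1000),
    )
```

Failing fast on a missing `HF_TOKEN` or `REPO_ID` is the design point: a misconfigured Space should stop in the first second rather than compute for hours with nowhere to upload.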
| --- | |
| ## 6. Push the code to the Space | |
| From the project root, with your Hub credentials configured: | |
| ```bash | |
| git lfs install # only once per machine | |
| git remote add space https://huggingface.co/spaces/your-username/DAHS-Training | |
| git add Dockerfile requirements.txt src/ scripts/ tests/ | |
| git add README.md HF_UPLOAD_GUIDE.md server.py start.py | |
| git commit -m "DAHS_2 Q1 pipeline" | |
| git push space main | |
| ``` | |
| Alternatively, drag the files into the Space's web file browser. Either | |
| way, the **Dockerfile** at the repo root is what the Space builds, and its | |
| `CMD ["python", "scripts/hf_runner.py"]` is the entrypoint. | |
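A third option is to push from Python with `huggingface_hub`. The sketch below assumes the exclusion list from section 1 can be expressed as `ignore_patterns` (shell-style wildcards matched against repo-relative paths); the pattern list, `is_ignored`, and `upload_code` are illustrative names, and `data/raw/*` is excluded on the assumption that only `data/benchmarks/` should be uploaded.

```python
from fnmatch import fnmatch

# Assumed translation of the "do NOT upload" list in section 1.
IGNORE_PATTERNS = [
    "*__pycache__/*",
    "*.pytest_cache/*",
    "*node_modules/*",
    "website/dist/*",
    "models/*",
    "data/raw/*",
    "results/*",
]


def is_ignored(rel_path: str) -> bool:
    """Local preview of which repo-relative paths the patterns exclude."""
    return any(fnmatch(rel_path, pattern) for pattern in IGNORE_PATTERNS)


def upload_code(space_id: str, folder: str = ".") -> None:
    """Upload the project tree to the Space in a single commit."""
    from huggingface_hub import HfApi  # imported lazily; needs Hub credentials

    HfApi().upload_folder(
        repo_id=space_id,
        repo_type="space",
        folder_path=folder,
        ignore_patterns=IGNORE_PATTERNS,
        commit_message="DAHS_2 Q1 pipeline",
    )
```

Calling `is_ignored("src/simulator.py")` before uploading is a cheap way to confirm that source files pass and run outputs do not; `upload_code("your-username/DAHS-Training")` then performs the push.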
| --- | |
| ## 7. Watch the build and run | |
| 1. Space opens → **Logs** tab shows Docker build (3–5 min on first push). | |
| 2. Once the container starts you should see: | |
| ``` | |
| --- DAHS_2 HF RUNNER STARTING --- | |
| CPUs : 16, workers=15 | |
| Repo : your-username/DAHS-Models | |
| [hub] periodic uploader started (every 300s) | |
| [ok] dummy health server on :7860 | |
| --- PIPELINE: 5000 scenarios, 1000 eval seeds, 15 workers --- | |
| ``` | |
| 3. Within ~5 min the model repo should receive its first commit | |
| (`results/run_manifest.json` and `results/pip_freeze.txt`). Verify at | |
| `https://huggingface.co/your-username/DAHS-Models/commits/main`. | |
| **If no commit appears within 10 minutes, the token or `REPO_ID` is wrong. | |
| Stop the Space immediately and re-check steps 3 and 5.** | |
| 4. New commits land every 5 minutes. Per-step commits (`selector_dataset`, | |
| `priority_dataset`, `selector_models`, `priority_model`, `evaluation`) | |
| land as each pipeline phase finishes. | |
| Total expected wall time on 16 vCPU: **2–4 hours**. | |
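The 10-minute check in step 3 can be automated instead of refreshing the commits page. The following is a rough sketch, assuming you start it right after the container starts: it records the current commit count and polls until a new commit lands or the deadline passes. `wait_for_new_commit` and `hub_commit_counter` are hypothetical helpers; `HfApi.list_repo_commits` is the `huggingface_hub` call used to count commits.

```python
import time
from typing import Callable


def wait_for_new_commit(
    count_commits: Callable[[], int],
    timeout_s: float = 600.0,
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the commit count exceeds its starting value."""
    baseline = count_commits()   # a freshly created repo may already have a commit
    deadline = clock() + timeout_s
    while clock() < deadline:
        if count_commits() > baseline:
            return True
        sleep(poll_s)
    return False


def hub_commit_counter(repo_id: str) -> Callable[[], int]:
    """Build a counter backed by the Hub (requires network and a valid token)."""
    from huggingface_hub import HfApi  # imported lazily

    api = HfApi()
    return lambda: len(api.list_repo_commits(repo_id, repo_type="model"))
```

Usage: `wait_for_new_commit(hub_commit_counter("your-username/DAHS-Models"))` returning `False` after 10 minutes is the signal to stop the Space and re-check the token and `REPO_ID`.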
| --- | |
| ## 8. After the run | |
| * `results/run_status.txt` will read `SUCCESS` or `FAILED (exit N)`. | |
| * The Space auto-pauses if `SPACE_ID` was set. Verify the **Status** badge | |
| shows `Paused` so you stop being billed. | |
| * All artifacts are in `your-username/DAHS-Models`. Pull them locally with: | |
| ```bash | |
| python scripts/download_hf_artifacts.py | |
| ``` | |
| or via: | |
| ```python | |
| from huggingface_hub import snapshot_download | |
| snapshot_download(repo_id="your-username/DAHS-Models", | |
|                   local_dir="./pulled_artifacts") | |
| ``` | |
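Once the artifacts are pulled, the status file can be checked programmatically. A minimal sketch, assuming `results/run_status.txt` contains exactly one of the two forms listed above (`SUCCESS` or `FAILED (exit N)`); `parse_run_status` is an illustrative helper, not part of the repository.

```python
import re
from typing import Optional, Tuple

_FAILED = re.compile(r"^FAILED \(exit (-?\d+)\)$")


def parse_run_status(text: str) -> Tuple[bool, Optional[int]]:
    """Return (succeeded, exit_code); exit_code is None on success."""
    status = text.strip()
    if status == "SUCCESS":
        return True, None
    match = _FAILED.match(status)
    if match:
        return False, int(match.group(1))
    # Anything else (empty file, partial write) is treated as an error.
    raise ValueError(f"unrecognised run status: {status!r}")
```

For example, `parse_run_status(open("pulled_artifacts/results/run_status.txt").read())` lets a follow-up script refuse to build paper tables from a failed run.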
| --- | |
| ## 9. What survives if the runtime is killed mid-run? | |
| Three independent persistence layers protect against the previous "models | |
| disappeared" failure: | |
| | Layer | Trigger | What it uploads | | |
| |-------|---------|------------------| | |
| | **Per-step** | After each pipeline phase | The folder produced by that phase | | |
| | **Periodic** | Every 5 min (background thread) | All of `data/`, `models/`, `results/`, `logs/` | | |
| | **Terminal** | SIGTERM / SIGINT / `atexit` | Final consolidated upload | | |
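To make the periodic and terminal layers concrete, here is one way such a mechanism can be built with the standard library. It is a sketch under assumptions, not the contents of `src/hf_persistence.py`: `PeriodicUploader` and `install_terminal_hooks` are hypothetical names, and `upload` stands for whatever function pushes the folders to the Hub.

```python
import atexit
import signal
import sys
import threading
from typing import Callable


class PeriodicUploader:
    """Calls `upload` every `interval_s` seconds, plus once more on stop()."""

    def __init__(self, upload: Callable[[], None], interval_s: float = 300.0):
        self._upload = upload
        self._interval_s = interval_s
        self._stop_event = threading.Event()
        self._lock = threading.Lock()   # at most one upload at a time
        self._finalized = False
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _loop(self) -> None:
        # Event.wait() returns False on timeout and True once stop() sets it.
        while not self._stop_event.wait(self._interval_s):
            self._safe_upload()

    def _safe_upload(self) -> None:
        with self._lock:
            try:
                self._upload()
            except Exception as exc:   # an upload error must not kill the run
                print(f"[hub] upload failed: {exc}")

    def stop(self) -> None:
        """Terminal layer: stop the thread, then do one final upload."""
        self._stop_event.set()
        if self._thread.is_alive():
            self._thread.join()
        if not self._finalized:        # guard against signal + atexit both firing
            self._finalized = True
            self._safe_upload()


def install_terminal_hooks(uploader: PeriodicUploader) -> None:
    """Run uploader.stop() on normal exit, SIGTERM and SIGINT (main thread only)."""
    atexit.register(uploader.stop)

    def _on_signal(signum, _frame):
        # sys.exit raises SystemExit, which lets the atexit handler run.
        sys.exit(128 + signum)

    signal.signal(signal.SIGTERM, _on_signal)
    signal.signal(signal.SIGINT, _on_signal)
```

Note that no handler can run on SIGKILL or a hard host failure, which is why the periodic layer exists alongside the terminal one.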
| Worst-case loss: ~5 min of work between periodic uploads, **never the whole | |
| run**. Each upload is retried with exponential backoff (4 attempts) so a | |
| flaky Hub call won't lose state. | |
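The retry behaviour described above can be illustrated with a small wrapper. This is a sketch, not the project's uploader: the 4-attempt limit comes from the text, while the 2-second base delay and the name `with_retries` are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait base, 2*base, 4*base, ... and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise                  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With these defaults a persistently failing call waits 2 s, 4 s, and 8 s before giving up, so a brief Hub outage is absorbed without stalling the pipeline for long.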
| --- | |
| ## 10. Sanity-check checklist before clicking "Run" | |
| Before you spend any credits, verify the local checks pass: | |
| ```bash | |
| # from repo root | |
| pip install -r requirements.txt | |
| python -c "from src.hf_persistence import HubPersistor, from_env; print('OK')" | |
| python -m pytest tests/ -q # unit tests | |
| python scripts/run_pipeline.py --quick # 50 scenarios, 20 eval seeds | |
| # finishes in ~2-3 minutes locally | |
| ``` | |
| If `--quick` produces `models/*.joblib`, `results/selector_metrics.json`, | |
| `results/priority_metrics.json`, `results/benchmark_summary.json`, and | |
| `results/paper_summary_table.csv`, the pipeline is verified end-to-end. | |
| You can then push to the Space with confidence. | |
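The artifact check after `--quick` can be scripted rather than done by eye. A minimal sketch, using only the file names listed above; `missing_artifacts` is an illustrative helper, not one of the repository's scripts.

```python
from pathlib import Path
from typing import List

# Files the text above names as proof of a complete --quick run.
REQUIRED_FILES = [
    "results/selector_metrics.json",
    "results/priority_metrics.json",
    "results/benchmark_summary.json",
    "results/paper_summary_table.csv",
]


def missing_artifacts(root: str = ".") -> List[str]:
    """Return the expected artifacts that are absent under `root`."""
    base = Path(root)
    missing = [name for name in REQUIRED_FILES if not (base / name).is_file()]
    # At least one trained model must exist; exact model names are not checked.
    if not any((base / "models").glob("*.joblib")):
        missing.append("models/*.joblib")
    return missing


if __name__ == "__main__":
    absent = missing_artifacts()
    print("pipeline verified" if not absent else f"missing: {absent}")
```

An empty list means the local run produced everything the guide expects.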
| --- | |
| ## 11. Re-running | |
| To re-run with different scenario/seed counts without rebuilding: | |
| 1. Open the Space → **Settings → Variables and secrets** | |
| 2. Edit `DAHS_SCENARIOS` / `DAHS_EVAL_SEEDS` | |
| 3. **Restart Space** (not Factory rebuild; a restart is much faster) | |
| Each re-run produces a new commit on the model repo, so you can compare | |
| runs side-by-side without overwriting prior artifacts. | |
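Because each run is a separate commit, two runs can be compared by downloading the same metrics file at two revisions. The sketch below assumes the metrics JSON is a (possibly nested) dictionary with numeric leaves, which this guide does not specify; `flatten`, `diff_metrics`, and `load_metrics` are hypothetical helpers, and the commit hashes are ones you copy from the model repo's history.

```python
import json
from typing import Any, Dict


def flatten(obj: Any, prefix: str = "") -> Dict[str, float]:
    """Flatten nested dicts to {"a.b": number}, keeping numeric leaves only."""
    out: Dict[str, float] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        out[prefix[:-1]] = float(obj)
    return out


def diff_metrics(old: Any, new: Any) -> Dict[str, float]:
    """Return new minus old for every numeric metric present in both runs."""
    a, b = flatten(old), flatten(new)
    return {key: b[key] - a[key] for key in sorted(a.keys() & b.keys())}


def load_metrics(repo_id: str, revision: str,
                 filename: str = "results/selector_metrics.json") -> Any:
    """Download one metrics file as it existed at a given commit."""
    from huggingface_hub import hf_hub_download  # imported lazily

    path = hf_hub_download(repo_id=repo_id, filename=filename, revision=revision)
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Usage would look like `diff_metrics(load_metrics(repo, old_sha), load_metrics(repo, new_sha))`, where `repo` is `your-username/DAHS-Models` and the two hashes identify the runs being compared.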