DAHS_2: Hugging Face Space Upload & Run Guide
End-to-end procedure to run the Q1 training pipeline on a Hugging Face Space with bulletproof artifact persistence to a Hub model repo.
0. Recommended hardware tier
This project is CPU-bound (SimPy + scikit-learn + XGBoost on tabular data). Do NOT select a GPU tier: it will burn your credits at 5–10× the cost without any speedup.
| Tier | Approx $/hr | Pipeline time (5000 scen, 1000 eval seeds) |
|---|---|---|
| CPU upgrade (16 vCPU, 64 GB) | ~$0.05–0.10 | ~2–4 h (recommended) |
| CPU basic (2 vCPU, 16 GB) | free | ~12 h (works, just slow) |
| Any GPU | $1+/hr | identical wall time, all GPUs idle |
At 16 vCPU you should finish a full Q1 run for well under $1 of your $23.
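As a rough check of that claim, the arithmetic can be written out. This is a minimal sketch that uses only the approximate rates and durations from the table above; `estimate_cost` is a hypothetical helper for illustration, not part of the project.

```python
def estimate_cost(hours: float, usd_per_hour: float) -> float:
    """Return the estimated spend for a run of the given length."""
    return hours * usd_per_hour


# Approximate figures from the tier table (not measured values).
best_case = estimate_cost(hours=2, usd_per_hour=0.05)   # shortest run, lowest rate
worst_case = estimate_cost(hours=4, usd_per_hour=0.10)  # longest run, highest rate

print(f"CPU upgrade: ${best_case:.2f} to ${worst_case:.2f}")
```

Even the worst case in this estimate stays below $1, which is consistent with the statement above.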
1. Files to upload to the Space
Upload the entire repository tree below. Do NOT upload __pycache__/,
.pytest_cache/, .git/, node_modules/, website/dist/, or local
models/, data/, and results/ folders; those are produced by the run and
pushed to the model repo automatically.
DAHS_2/
├── Dockerfile
├── requirements.txt
├── README.md
├── HF_UPLOAD_GUIDE.md
├── server.py                  # only needed if you also serve the demo from the Space
├── start.py
├── src/
│   ├── __init__.py
│   ├── data_generator.py
│   ├── evaluator.py
│   ├── features.py
│   ├── heuristics.py
│   ├── hf_persistence.py      ← new: bulletproof Hub uploader
│   ├── hybrid_scheduler.py
│   ├── presets.py
│   ├── references.py
│   ├── simulator.py
│   ├── train_priority.py
│   └── train_selector.py
├── scripts/
│   ├── hf_runner.py           ← Space entrypoint (matches Dockerfile CMD)
│   ├── run_pipeline.py
│   ├── calibrate_real_data.py
│   ├── foolproof_retrain.py
│   ├── run_preset_benchmark.py
│   └── download_hf_artifacts.py
├── tests/                     # optional but small; keep for paper reproducibility
└── data/                      # only data/benchmarks/* if you have curated benchmarks;
                               # data/raw/ is regenerated each run
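The exclusion rules above can be expressed as a small path filter, which is handy for checking a file list before uploading. This is a sketch under assumptions: `should_upload` and the two constants are hypothetical names, and paths are taken to be POSIX-style and relative to the repo root. If you upload with `huggingface_hub` instead of git, its `upload_folder` function accepts an `ignore_patterns` argument that serves a similar purpose.

```python
from pathlib import PurePosixPath

# Directory names the guide says never to upload, wherever they appear.
EXCLUDED_DIRS = {"__pycache__", ".pytest_cache", ".git", "node_modules"}
# Top-level folders that are build output or produced by the run itself.
EXCLUDED_PREFIXES = ("website/dist/", "models/", "results/")


def should_upload(rel_path: str) -> bool:
    """Return True if a repo-relative POSIX path belongs in the Space upload."""
    parts = PurePosixPath(rel_path).parts
    if any(part in EXCLUDED_DIRS for part in parts):
        return False
    if rel_path.startswith(EXCLUDED_PREFIXES):
        return False
    # data/ is regenerated each run, except for curated benchmarks.
    if rel_path.startswith("data/") and not rel_path.startswith("data/benchmarks/"):
        return False
    return True


if __name__ == "__main__":
    for path in ["src/simulator.py", "src/__pycache__/x.pyc", "data/raw/a.csv"]:
        print(path, should_upload(path))
```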
The pipeline writes the following artifacts and pushes them to your model repo:
<your-username>/DAHS-Models/
├── data/raw/selector_dataset.csv
├── data/raw/priority_dataset.csv
├── models/selector_dt.joblib
├── models/selector_rf.joblib
├── models/selector_xgb.joblib
├── models/priority_gbr.joblib
├── models/feature_names.json
├── models/feature_ranges.json
├── models/dt_structure.json
├── results/run_manifest.json
├── results/pip_freeze.txt
├── results/run_status.txt
├── results/selector_metrics.json
├── results/selector_metrics_table.csv
├── results/priority_metrics.json
├── results/benchmark_results.csv
├── results/benchmark_summary.json
├── results/statistical_tests.json
├── results/switching_analysis.json
├── results/paper_summary_table.csv
└── results/plots/*.png
2. Create the model repo (one-time)
This is where artifacts go and survive runtime termination.
- Go to https://huggingface.co/new → choose Model, not Space.
- Owner: your username. Name: `DAHS-Models`. Visibility: your choice.
- Click Create repository. Done. Keep it empty; the run populates it.
Note the full id: your-username/DAHS-Models.
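If you prefer to script this step, the repo can also be created from Python. The sketch below assumes the `huggingface_hub` package is installed and that you are already logged in with a token that may create repos; `ensure_model_repo` and `main` are hypothetical helper names. The `api` object is passed in so the helper can be exercised without network access.

```python
def ensure_model_repo(api, repo_id: str, private: bool = True) -> str:
    """Create the model repo if it does not exist yet and return its id."""
    # exist_ok=True makes the call safe to repeat on an existing repo.
    api.create_repo(repo_id=repo_id, repo_type="model", private=private, exist_ok=True)
    return repo_id


def main() -> None:
    # Imported lazily so the helper above stays importable without the package.
    from huggingface_hub import HfApi

    repo_id = ensure_model_repo(HfApi(), "your-username/DAHS-Models")
    print(f"Model repo ready: {repo_id}")
```

Call `main()` once from a shell session where your Hub credentials are configured; replace `your-username` with your own account name first.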
3. Create a fine-grained access token
- https://huggingface.co/settings/tokens → Create new token → Fine-grained.
- Repository permissions → click Add repository → select `your-username/DAHS-Models` → check Write access to contents and discussions.
- (Optional) also grant Manage repo to the Space if you want auto-pause on completion.
- Copy the token starting with `hf_…` (you'll paste it in step 5).
4. Create the Space
- https://huggingface.co/new-space → name `DAHS-Training`.
- SDK: Docker.
- Hardware: pick CPU upgrade (16 vCPU, 64 GB RAM).
- Visibility: your choice. Click Create Space.
5. Configure secrets (Space → Settings → Variables and secrets)
| Name | Type | Value |
|---|---|---|
| `HF_TOKEN` | Secret | `hf_…` token from step 3 |
| `REPO_ID` | Variable | `your-username/DAHS-Models` |
| `SPACE_ID` | Variable | `your-username/DAHS-Training` (auto-pause target) |
| `DAHS_SCENARIOS` | Variable (optional) | Override default 5000 scenarios |
| `DAHS_EVAL_SEEDS` | Variable (optional) | Override default 1000 eval seeds |
SPACE_ID controls auto-pause after the run; without it you must pause
manually to stop billing.
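To make the role of each setting concrete, here is one way a runner could read them from the environment. This is an illustrative sketch, not the actual code in `scripts/hf_runner.py`: `RunConfig` and `load_config` are hypothetical names, and only the variable names and the defaults (5000 and 1000) come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunConfig:
    hf_token: str
    repo_id: str
    space_id: Optional[str]  # None means no auto-pause target
    scenarios: int
    eval_seeds: int


def load_config(env: Mapping[str, str] = os.environ) -> RunConfig:
    """Build a RunConfig from environment-style settings, failing fast on gaps."""
    missing = [key for key in ("HF_TOKEN", "REPO_ID") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return RunConfig(
        hf_token=env["HF_TOKEN"],
        repo_id=env["REPO_ID"],
        space_id=env.get("SPACE_ID") or None,
        # Empty or unset overrides fall back to the documented defaults.
        scenarios=int(env.get("DAHS_SCENARIOS") or 5000),
        eval_seeds=int(env.get("DAHS_EVAL_SEEDS") or 1000),
    )
```

Failing fast on a missing `HF_TOKEN` or `REPO_ID` is a deliberate choice in this sketch: it surfaces a misconfiguration before any compute time is spent.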
6. Push the code to the Space
From the project root, with your Hub credentials configured:
git lfs install # only once per machine
git remote add space https://huggingface.co/spaces/your-username/DAHS-Training
git add Dockerfile requirements.txt src/ scripts/ tests/
git add README.md HF_UPLOAD_GUIDE.md server.py start.py
git commit -m "DAHS_2 Q1 pipeline"
git push space main
Alternatively, drag the files into the Space's web file browser. Either
way, the Dockerfile at the repo root is what the Space builds, and its
CMD ["python", "scripts/hf_runner.py"] is the entrypoint.
7. Watch the build and run
- Space opens → Logs tab shows the Docker build (3–5 min on first push).
- Once the container starts you should see:

      --- DAHS_2 HF RUNNER STARTING ---
      CPUs : 16, workers=15
      Repo : your-username/DAHS-Models
      [hub] periodic uploader started (every 300s)
      [ok] dummy health server on :7860
      --- PIPELINE: 5000 scenarios, 1000 eval seeds, 15 workers ---

- Within ~5 min the model repo should receive its first commit (`results/run_manifest.json` and `results/pip_freeze.txt`). Verify at https://huggingface.co/your-username/DAHS-Models/commits/main. If no commit appears within 10 minutes, the token or `REPO_ID` is wrong. Stop the Space immediately and re-check steps 3 and 5.
- New commits land every 5 minutes. Per-step commits (`selector_dataset`, `priority_dataset`, `selector_models`, `priority_model`, `evaluation`) land as each pipeline phase finishes.
Total expected wall time on 16 vCPU: 2–4 hours.
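The first-commit check can also be done from a local terminal instead of the web UI. The sketch below assumes `huggingface_hub` is installed and your local token can read the model repo; `first_commit_landed` and `main` are hypothetical names. It looks for the two files the guide says arrive first, using `HfApi.list_repo_files`.

```python
EXPECTED_FIRST_FILES = ("results/run_manifest.json", "results/pip_freeze.txt")


def first_commit_landed(api, repo_id: str) -> bool:
    """Return True once both early artifacts are present in the model repo."""
    files = set(api.list_repo_files(repo_id, repo_type="model"))
    return all(name in files for name in EXPECTED_FIRST_FILES)


def main() -> None:
    # Imported lazily so the check above can be tested without the package.
    from huggingface_hub import HfApi

    if first_commit_landed(HfApi(), "your-username/DAHS-Models"):
        print("First commit received.")
    else:
        print("No artifacts yet; re-check the token and REPO_ID if 10 minutes have passed.")
```

Note that on a re-run these files may already exist from an earlier run, so this check is most informative on the first run against an empty repo.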
8. After the run
- `results/run_status.txt` will read `SUCCESS` or `FAILED (exit N)`.
- The Space auto-pauses if `SPACE_ID` was set. Verify the Status badge shows `Paused` so you stop being billed.
- All artifacts are in `your-username/DAHS-Models`. Pull them locally with:

      python scripts/download_hf_artifacts.py

  or via:

      from huggingface_hub import snapshot_download
      snapshot_download(repo_id="your-username/DAHS-Models", local_dir="./pulled_artifacts")
9. What survives if the runtime is killed mid-run?
Three independent persistence layers protect against the previous "models disappeared" failure:
| Layer | Trigger | What it uploads |
|---|---|---|
| Per-step | After each pipeline phase | The folder produced by that phase |
| Periodic | Every 5 min (background thread) | All of data/, models/, results/, logs/ |
| Terminal | SIGTERM / SIGINT / atexit | Final consolidated upload |
Worst-case loss: ~5 min of work between periodic uploads, never the whole run. Each upload is retried with exponential backoff (4 attempts) so a flaky Hub call won't lose state.
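To illustrate how the periodic layer and the retry policy fit together, here is a minimal sketch. It is not the code in `src/hf_persistence.py`: `retry_with_backoff` and `PeriodicUploader` are hypothetical names, the 1-second base delay is an assumption, and the `upload` callable stands in for whatever actually pushes files to the Hub. Only the 4 attempts and the 300-second interval come from this guide.

```python
import threading
import time
from typing import Callable


def retry_with_backoff(fn, attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**i, then retry, up to `attempts` tries."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** i)


class PeriodicUploader:
    """Run `upload` every `interval_s` seconds on a background thread."""

    def __init__(self, upload: Callable[[], None], interval_s: float = 300.0):
        self._upload = upload
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _loop(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            try:
                retry_with_backoff(self._upload)
            except Exception as exc:
                # A failed cycle is logged, and the next interval tries again.
                print(f"[hub] periodic upload failed: {exc}")
```

The terminal layer is not shown here. A process killed with SIGKILL cannot run any handler, which is why the periodic layer bounds the worst-case loss rather than the terminal one.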
10. Sanity-check checklist before clicking "Run"
Before you spend any credits, verify the local checks pass:
# from repo root
pip install -r requirements.txt
python -c "from src.hf_persistence import HubPersistor, from_env; print('OK')"
python -m pytest tests/ -q # unit tests
python scripts/run_pipeline.py --quick # 50 scenarios, 20 eval seeds
# finishes in ~2-3 minutes locally
If --quick produces models/*.joblib, results/selector_metrics.json,
results/priority_metrics.json, results/benchmark_summary.json, and
results/paper_summary_table.csv, the pipeline is verified end-to-end.
You can then push to the Space with confidence.
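The file check described above can be automated so that nothing is missed. This is a small sketch: `missing_artifacts` is a hypothetical helper, and the expected paths are exactly the ones listed in this section.

```python
from pathlib import Path
from typing import List

EXPECTED_FILES = (
    "results/selector_metrics.json",
    "results/priority_metrics.json",
    "results/benchmark_summary.json",
    "results/paper_summary_table.csv",
)


def missing_artifacts(root: str = ".") -> List[str]:
    """Return the expected --quick outputs that are absent under `root`."""
    base = Path(root)
    missing = [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]
    # At least one serialized model must exist under models/.
    if not any((base / "models").glob("*.joblib")):
        missing.append("models/*.joblib")
    return missing


if __name__ == "__main__":
    gaps = missing_artifacts()
    print("All artifacts present." if not gaps else f"Missing: {gaps}")
```

Run it from the repo root right after the `--quick` pipeline finishes; an empty list means every expected output was produced.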
11. Re-running
To re-run with different scenario/seed counts without rebuilding:
- Open the Space → Settings → Variables and secrets.
- Edit `DAHS_SCENARIOS` / `DAHS_EVAL_SEEDS`.
- Restart the Space (not a Factory rebuild; a restart is much faster).
Each re-run produces a new commit on the model repo, so you can compare runs side-by-side without overwriting prior artifacts.
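The same re-run can be triggered from Python rather than the web UI. This sketch assumes `huggingface_hub` is installed and that your local token has permission to manage the Space (the fine-grained token from step 3 may not, unless you granted it); `rerun_with` and `main` are hypothetical names. It uses `HfApi.add_space_variable` and `HfApi.restart_space`.

```python
def rerun_with(api, space_id: str, scenarios: int, eval_seeds: int) -> None:
    """Update the two override variables, then restart the Space."""
    # Space variables are stored as strings.
    api.add_space_variable(space_id, "DAHS_SCENARIOS", str(scenarios))
    api.add_space_variable(space_id, "DAHS_EVAL_SEEDS", str(eval_seeds))
    # A plain restart reuses the built image; no factory rebuild is requested.
    api.restart_space(space_id)


def main() -> None:
    # Imported lazily so the helper above can be tested without the package.
    from huggingface_hub import HfApi

    rerun_with(HfApi(), "your-username/DAHS-Training", scenarios=2000, eval_seeds=500)
```

The scenario and seed counts in `main` are placeholder values for illustration only.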