DAHS_2 — Hugging Face Space Upload & Run Guide

End-to-end procedure to run the Q1 training pipeline on a Hugging Face Space with bulletproof artifact persistence to a Hub model repo.

0. Recommended hardware tier

This project is CPU-bound (SimPy + scikit-learn + XGBoost on tabular data). Do NOT select a GPU tier — it will burn your credits at 5–10× the cost without any speedup.

| Tier | Approx. $/hr | Pipeline time (5000 scenarios, 1000 eval seeds) |
| --- | --- | --- |
| CPU upgrade (16 vCPU, 64 GB) | ~$0.05–0.10 | ~2–4 h (recommended) |
| CPU basic (2 vCPU, 16 GB) | free | ~12 h (works, just slow) |
| Any GPU | $1+ | identical wall time, all GPUs idle |

At 16 vCPU you should finish a full Q1 run for well under $1 of your $23.
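
The cost estimate follows from multiplying the hourly rates by the expected wall times in the table above. A minimal sketch of that arithmetic, using only the approximate figures quoted in this guide (the function name is illustrative, and actual Hugging Face pricing may differ):

```python
def run_cost(rate_per_hour: float, hours: float) -> float:
    """Estimated cost in dollars for a run billed by the hour."""
    return rate_per_hour * hours


# Approximate figures taken from the hardware table in this guide.
cpu_low = run_cost(0.05, 2)    # lowest quoted rate, fastest run
cpu_high = run_cost(0.10, 4)   # highest quoted rate, slowest run
gpu_min = run_cost(1.00, 2)    # "$1+/hr" GPU tier, same wall time as CPU

print(f"CPU upgrade: ${cpu_low:.2f} to ${cpu_high:.2f}")
print(f"Any GPU:     at least ${gpu_min:.2f}")
```

Even the pessimistic CPU case stays below $1, while the cheapest GPU case already exceeds it for the same wall time.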

1. Files to upload to the Space

Upload the entire repository tree below. Do NOT upload __pycache__/, .pytest_cache/, .git/, node_modules/, website/dist/, or the local models/, data/, and results/ folders; those are produced by the run and pushed to the model repo automatically.

DAHS_2/
├── Dockerfile
├── requirements.txt
├── README.md
├── HF_UPLOAD_GUIDE.md
├── server.py               # only needed if you also serve the demo from the Space
├── start.py
├── src/
│   ├── __init__.py
│   ├── data_generator.py
│   ├── evaluator.py
│   ├── features.py
│   ├── heuristics.py
│   ├── hf_persistence.py        ← new — bulletproof Hub uploader
│   ├── hybrid_scheduler.py
│   ├── presets.py
│   ├── references.py
│   ├── simulator.py
│   ├── train_priority.py
│   └── train_selector.py
├── scripts/
│   ├── hf_runner.py             ← Space entrypoint (matches Dockerfile CMD)
│   ├── run_pipeline.py
│   ├── calibrate_real_data.py
│   ├── foolproof_retrain.py
│   ├── run_preset_benchmark.py
│   └── download_hf_artifacts.py
├── tests/                       # optional but small; keep for paper reproducibility
└── data/                        # only data/benchmarks/* if you have curated benchmarks;
                                 # data/raw/ is regenerated each run
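
The exclusion rules above can be expressed as a small path filter, which may help if you script the upload instead of staging files by hand. This is a sketch only: the names `EXCLUDED_DIR_NAMES`, `EXCLUDED_PREFIXES`, and `should_upload` are hypothetical and not part of the repository. It also assumes that only `data/raw/` is excluded from `data/`, following the tree comment that curated `data/benchmarks/*` files may be kept.

```python
from pathlib import PurePosixPath

# Directory names that should never be uploaded, wherever they appear.
EXCLUDED_DIR_NAMES = {"__pycache__", ".pytest_cache", ".git", "node_modules"}
# Top-level paths that are build output or regenerated by the run (assumption:
# data/raw/ is excluded, data/benchmarks/ is allowed).
EXCLUDED_PREFIXES = ("website/dist/", "models/", "results/", "data/raw/")


def should_upload(rel_path: str) -> bool:
    """Return True if a repo-relative POSIX path belongs in the Space."""
    parts = PurePosixPath(rel_path).parts
    if any(part in EXCLUDED_DIR_NAMES for part in parts):
        return False
    return not rel_path.startswith(EXCLUDED_PREFIXES)


# The .pyc and benchmark file names below are made up for illustration.
for path in (
    "src/simulator.py",
    "src/__pycache__/simulator.cpython-311.pyc",
    "models/selector_rf.joblib",
    "data/benchmarks/example.csv",
):
    print(f"{path}: {'upload' if should_upload(path) else 'skip'}")
```

If you upload with `huggingface_hub` rather than git, `HfApi.upload_folder` accepts an `ignore_patterns` argument that can serve the same purpose.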

The pipeline writes the following artifacts and pushes them to your model repo:

<your-username>/DAHS-Models/
├── data/raw/selector_dataset.csv
├── data/raw/priority_dataset.csv
├── models/selector_dt.joblib
├── models/selector_rf.joblib
├── models/selector_xgb.joblib
├── models/priority_gbr.joblib
├── models/feature_names.json
├── models/feature_ranges.json
├── models/dt_structure.json
├── results/run_manifest.json
├── results/pip_freeze.txt
├── results/run_status.txt
├── results/selector_metrics.json
├── results/selector_metrics_table.csv
├── results/priority_metrics.json
├── results/benchmark_results.csv
├── results/benchmark_summary.json
├── results/statistical_tests.json
├── results/switching_analysis.json
├── results/paper_summary_table.csv
└── results/plots/*.png

2. Create the model repo (one-time)

This is where artifacts go and survive runtime termination.

Go to https://huggingface.co/new — choose Model, not Space.
Owner: your username. Name: DAHS-Models. Visibility: your choice.
Click Create repository. Done — keep it empty; the run populates it.

Note the full id: your-username/DAHS-Models.

3. Create a fine-grained access token

https://huggingface.co/settings/tokens → Create new token → Fine-grained.
Repository permissions → click Add repository → select your-username/DAHS-Models → check Write access to contents and discussions.
(Optional) If you want auto-pause on completion, also give the token permission to manage the Space repo; the Space itself is created in step 4.
Copy the token starting with hf_… — you'll paste it in step 5.

4. Create the Space

https://huggingface.co/new-space → name DAHS-Training.
SDK: Docker.
Hardware: pick CPU upgrade (16 vCPU, 64 GB RAM).
Visibility: your choice. Click Create Space.

5. Configure secrets (Space → Settings → Variables and secrets)

| Name | Type | Value |
| --- | --- | --- |
| `HF_TOKEN` | Secret | `hf_…` token from step 3 |
| `REPO_ID` | Variable | `your-username/DAHS-Models` |
| `SPACE_ID` | Variable | `your-username/DAHS-Training` (auto-pause target) |
| `DAHS_SCENARIOS` | Variable (optional) | Override default 5000 scenarios |
| `DAHS_EVAL_SEEDS` | Variable (optional) | Override default 1000 eval seeds |

SPACE_ID controls auto-pause after the run; without it you must pause manually to stop billing.
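
To make the role of each setting concrete, here is a minimal sketch of how a runner might read them from the environment. It is not the code in `scripts/hf_runner.py`; `RunConfig` and `load_config` are hypothetical names, and the only facts taken from this guide are the variable names, which ones are optional, and the defaults of 5000 scenarios and 1000 eval seeds.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunConfig:
    token: str
    repo_id: str
    space_id: Optional[str]   # None means no auto-pause target
    scenarios: int
    eval_seeds: int


def load_config(env: Optional[Mapping[str, str]] = None) -> RunConfig:
    """Build a RunConfig from environment variables, failing fast if required ones are missing."""
    env = os.environ if env is None else env
    missing = [name for name in ("HF_TOKEN", "REPO_ID") if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return RunConfig(
        token=env["HF_TOKEN"],
        repo_id=env["REPO_ID"],
        space_id=env.get("SPACE_ID") or None,
        scenarios=int(env.get("DAHS_SCENARIOS") or 5000),
        eval_seeds=int(env.get("DAHS_EVAL_SEEDS") or 1000),
    )


# Example with placeholder values instead of the real environment.
example = load_config({"HF_TOKEN": "hf_placeholder", "REPO_ID": "your-username/DAHS-Models"})
print(example.scenarios, example.eval_seeds, example.space_id)
```

Failing fast on a missing `HF_TOKEN` or `REPO_ID` is a design choice in this sketch: it surfaces a misconfiguration before any paid compute is spent.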

6. Push the code to the Space

From the project root, with your Hub credentials configured:

git lfs install                                    # only once per machine
git remote add space https://huggingface.co/spaces/your-username/DAHS-Training
git add Dockerfile requirements.txt src/ scripts/ tests/
git add README.md HF_UPLOAD_GUIDE.md server.py start.py
git commit -m "DAHS_2 Q1 pipeline"
git push space main

Alternatively, drag the files into the Space's web file browser. Either way, the Dockerfile at the repo root is what the Space builds, and its CMD ["python", "scripts/hf_runner.py"] is the entrypoint.

7. Watch the build and run

Open the Space and switch to the Logs tab to watch the Docker build (3–5 min on first push).

Once the container starts you should see:

--- DAHS_2 HF RUNNER STARTING ---
CPUs : 16, workers=15
Repo : your-username/DAHS-Models
[hub] periodic uploader started (every 300s)
[ok] dummy health server on :7860
--- PIPELINE: 5000 scenarios, 1000 eval seeds, 15 workers ---
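
The `CPUs : 16, workers=15` line suggests that one core is held back from the worker pool. The guide does not say how the runner computes this, so the following is only an assumed rule that is consistent with the log; `default_workers` is a hypothetical name.

```python
import os
from typing import Optional


def default_workers(cpu_count: Optional[int] = None) -> int:
    """Assumed rule: use all CPUs but one, and never fewer than one worker."""
    n = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, n - 1)


print(default_workers(16))  # matches the 16-CPU log line above
print(default_workers(2))   # CPU basic tier
```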

Within ~5 min the model repo should receive its first commit (results/run_manifest.json and results/pip_freeze.txt). Verify at https://huggingface.co/your-username/DAHS-Models/commits/main. If no commit appears within 10 minutes, the token or REPO_ID is most likely wrong: stop the Space immediately and re-check steps 3 and 5.
After the first commit, periodic commits land every 5 minutes, and per-step commits (selector_dataset, priority_dataset, selector_models, priority_model, evaluation) land as each pipeline phase finishes.

Total expected wall time on 16 vCPU: 2–4 hours.

8. After the run

results/run_status.txt will read SUCCESS or FAILED (exit N).
The Space auto-pauses if SPACE_ID was set. Verify the Status badge shows Paused so you stop being billed.
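
If you check the outcome from a script, the two status strings can be parsed as follows. This sketch assumes the file contains exactly `SUCCESS` or `FAILED (exit N)` as described above, with optional surrounding whitespace; `parse_run_status` is a hypothetical helper, not part of the repository.

```python
import re
from typing import Tuple


def parse_run_status(text: str) -> Tuple[bool, int]:
    """Return (succeeded, exit_code) for the contents of results/run_status.txt."""
    text = text.strip()
    if text == "SUCCESS":
        return True, 0
    match = re.fullmatch(r"FAILED \(exit (-?\d+)\)", text)
    if match:
        return False, int(match.group(1))
    raise ValueError(f"unrecognized run status: {text!r}")


print(parse_run_status("SUCCESS\n"))
print(parse_run_status("FAILED (exit 137)"))
```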

All artifacts are in your-username/DAHS-Models. Pull them locally with:

python scripts/download_hf_artifacts.py

or via:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="your-username/DAHS-Models",
                  local_dir="./pulled_artifacts")

9. What survives if the runtime is killed mid-run?

Three independent persistence layers protect against the previous "models disappeared" failure:

| Layer | Trigger | What it uploads |
| --- | --- | --- |
| Per-step | After each pipeline phase | The folder produced by that phase |
| Periodic | Every 5 min (background thread) | All of `data/`, `models/`, `results/`, `logs/` |
| Terminal | SIGTERM / SIGINT / `atexit` | Final consolidated upload |
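
The periodic and terminal layers in the table can be sketched with the standard library alone. This is an illustration of the general pattern, not the implementation in `src/hf_persistence.py`: `PeriodicUploader`, `install_terminal_hooks`, and the `upload` callable are hypothetical, and the 300-second default mirrors the interval stated in this guide.

```python
import atexit
import signal
import threading


class PeriodicUploader:
    """Calls `upload` every `interval` seconds, then once more when stopped."""

    def __init__(self, upload, interval=300.0):
        self._upload = upload          # hypothetical callable that pushes folders to the Hub
        self._interval = interval
        self._stop = threading.Event()
        self._lock = threading.RLock()
        self._finished = False
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def _loop(self):
        # Event.wait returns False on timeout (time to upload) and True once stop is set.
        while not self._stop.wait(self._interval):
            self._upload("periodic")

    def stop(self):
        """Terminal layer: end the loop and run one final upload, at most once."""
        with self._lock:
            if self._finished:
                return
            self._finished = True
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()        # wait for any in-flight periodic upload
        self._upload("terminal")


def install_terminal_hooks(uploader):
    """Run the final upload on normal exit, SIGTERM, or SIGINT (main thread only)."""
    atexit.register(uploader.stop)

    def _on_signal(signum, frame):
        uploader.stop()
        raise SystemExit(128 + signum)

    signal.signal(signal.SIGTERM, _on_signal)
    signal.signal(signal.SIGINT, _on_signal)
```

Because `stop` is guarded by a flag, the `atexit` hook that fires after a signal-triggered exit does not upload a second time.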

Worst-case loss: ~5 min of work between periodic uploads, never the whole run. Each upload is retried with exponential backoff (4 attempts) so a flaky Hub call won't lose state.
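
Retrying with exponential backoff can be sketched as below. Only the attempt count of 4 comes from this guide; the 2-second base delay, the doubling factor, and the name `upload_with_retry` are assumptions made for illustration.

```python
import time


def upload_with_retry(upload, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call `upload()` up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return upload()
        except Exception as exc:
            if attempt == attempts:
                raise                      # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"[hub] attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

The `sleep` parameter is injectable so the behavior can be exercised in tests without real waiting. With the assumed defaults, a call that keeps failing waits 2 s, 4 s, and 8 s before the fourth and final attempt raises.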

10. Sanity-check checklist before clicking "Run"

Before you spend any credits, verify the local checks pass:

# from repo root
pip install -r requirements.txt
python -c "from src.hf_persistence import HubPersistor, from_env; print('OK')"
python -m pytest tests/ -q                 # unit tests
python scripts/run_pipeline.py --quick     # 50 scenarios, 20 eval seeds
                                           # finishes in ~2-3 minutes locally

If --quick produces models/*.joblib, results/selector_metrics.json, results/priority_metrics.json, results/benchmark_summary.json, and results/paper_summary_table.csv, the pipeline is verified end-to-end. You can then push to the Space with confidence.
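
The output check above can be automated with a short script run from the repo root after `--quick` finishes. The file list is taken from the paragraph above; `REQUIRED_FILES` and `missing_outputs` are hypothetical names, not part of the repository.

```python
from pathlib import Path

# Result files that the --quick run is expected to produce.
REQUIRED_FILES = [
    "results/selector_metrics.json",
    "results/priority_metrics.json",
    "results/benchmark_summary.json",
    "results/paper_summary_table.csv",
]


def missing_outputs(root="."):
    """Return the expected outputs that are absent under `root`."""
    root = Path(root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # At least one serialized model must exist in models/.
    if not any((root / "models").glob("*.joblib")):
        missing.insert(0, "models/*.joblib")
    return missing


if __name__ == "__main__":
    missing = missing_outputs(".")
    print("all outputs present" if not missing else f"missing: {missing}")
```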

11. Re-running

To re-run with different scenario/seed counts without rebuilding:

Open the Space → Settings → Variables and secrets
Edit DAHS_SCENARIOS / DAHS_EVAL_SEEDS
Restart Space (not Factory rebuild — much faster)

Each re-run adds new commits to the model repo, so earlier artifacts remain available in the commit history and you can compare runs side by side.