
DAHS_2: Hugging Face Space Upload & Run Guide

End-to-end procedure to run the Q1 training pipeline on a Hugging Face Space with bulletproof artifact persistence to a Hub model repo.


0. Recommended hardware tier

This project is CPU-bound (SimPy + scikit-learn + XGBoost on tabular data). Do NOT select a GPU tier; it will burn your credits at 5–10× the cost without any speedup.

| Tier | Approx $/hr | Pipeline time (5000 scen, 1000 eval seeds) |
|---|---|---|
| CPU upgrade (16 vCPU, 64 GB) | ~$0.05–0.10 | ~2–4 h (recommended) |
| CPU basic (2 vCPU, 16 GB) | free | ~12 h (works, just slow) |
| Any GPU | $1+/hr | identical wall time, all GPUs idle |

At 16 vCPU you should finish a full Q1 run for well under $1 of your $23.
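
As a rough check of that estimate, the arithmetic can be sketched as follows. The rates and durations are the approximate figures from the tier table; `estimate_cost` is a hypothetical helper for illustration, not part of the repository.

```python
def estimate_cost(rate_per_hour: float, hours: float) -> float:
    """Return the approximate cost in dollars of a run at a flat hourly rate."""
    return rate_per_hour * hours


# Approximate figures for the CPU upgrade tier, taken from the tier table.
low = estimate_cost(0.05, 2)   # cheapest rate, fastest run
high = estimate_cost(0.10, 4)  # priciest rate, slowest run
print(f"Estimated cost: ${low:.2f} to ${high:.2f}")
```

Even the pessimistic end of the range stays below $1, which is where the "well under $1" figure comes from.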


1. Files to upload to the Space

Upload the entire repository tree below. Do NOT upload __pycache__/, .pytest_cache/, .git/, node_modules/, website/dist/, or local models/, data/, and results/ folders; those are produced by the run and pushed to the model repo automatically.

DAHS_2/
├── Dockerfile
├── requirements.txt
├── README.md
├── HF_UPLOAD_GUIDE.md
├── server.py               # only needed if you also serve the demo from the Space
├── start.py
├── src/
│   ├── __init__.py
│   ├── data_generator.py
│   ├── evaluator.py
│   ├── features.py
│   ├── heuristics.py
│   ├── hf_persistence.py        ← new: bulletproof Hub uploader
│   ├── hybrid_scheduler.py
│   ├── presets.py
│   ├── references.py
│   ├── simulator.py
│   ├── train_priority.py
│   └── train_selector.py
├── scripts/
│   ├── hf_runner.py             ← Space entrypoint (matches Dockerfile CMD)
│   ├── run_pipeline.py
│   ├── calibrate_real_data.py
│   ├── foolproof_retrain.py
│   ├── run_preset_benchmark.py
│   └── download_hf_artifacts.py
├── tests/                       # optional but small; keep for paper reproducibility
└── data/                        # only data/benchmarks/* if you have curated benchmarks;
                                 # data/raw/ is regenerated each run
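
If you prefer to upload from Python rather than through git or the web UI, the exclusion list can be expressed as glob patterns. This is a minimal sketch, assuming `huggingface_hub` is installed and you are already logged in; the exact patterns and the `should_upload` / `push_to_space` helpers are illustrative assumptions, not part of the repository.

```python
from fnmatch import fnmatch

# The exclusions from the list above, written as glob patterns.
# data/raw/* is excluded (regenerated each run); data/benchmarks/* is kept.
IGNORE_PATTERNS = [
    "__pycache__/*", "*/__pycache__/*",
    ".pytest_cache/*", ".git/*", "node_modules/*", "website/dist/*",
    "models/*", "data/raw/*", "results/*",
]


def should_upload(rel_path: str) -> bool:
    """Return True if a repo-relative path is not matched by any exclusion."""
    return not any(fnmatch(rel_path, pattern) for pattern in IGNORE_PATTERNS)


def push_to_space(space_id: str, folder: str = ".") -> None:
    """Upload the project folder to a Space, skipping the excluded paths."""
    from huggingface_hub import HfApi  # imported lazily; needs network + auth

    HfApi().upload_folder(
        repo_id=space_id,
        folder_path=folder,
        repo_type="space",
        ignore_patterns=IGNORE_PATTERNS,
    )
```

Calling `should_upload` on a few paths before the real upload is a cheap way to confirm the patterns do what you expect.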

The pipeline writes the following artifacts and pushes them to your model repo:

<your-username>/DAHS-Models/
├── data/raw/selector_dataset.csv
├── data/raw/priority_dataset.csv
├── models/selector_dt.joblib
├── models/selector_rf.joblib
├── models/selector_xgb.joblib
├── models/priority_gbr.joblib
├── models/feature_names.json
├── models/feature_ranges.json
├── models/dt_structure.json
├── results/run_manifest.json
├── results/pip_freeze.txt
├── results/run_status.txt
├── results/selector_metrics.json
├── results/selector_metrics_table.csv
├── results/priority_metrics.json
├── results/benchmark_results.csv
├── results/benchmark_summary.json
├── results/statistical_tests.json
├── results/switching_analysis.json
├── results/paper_summary_table.csv
└── results/plots/*.png
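
Once a run has finished, the listing above can be checked against the actual repo contents. The sketch below assumes `huggingface_hub` is installed and that your credentials can read the model repo; `EXPECTED_ARTIFACTS` covers only a subset of the listing, and both helper names are hypothetical.

```python
from typing import Iterable, List

# A subset of the artifact listing above; extend it as needed.
EXPECTED_ARTIFACTS = [
    "models/selector_dt.joblib",
    "models/selector_rf.joblib",
    "models/selector_xgb.joblib",
    "models/priority_gbr.joblib",
    "results/run_manifest.json",
    "results/run_status.txt",
    "results/selector_metrics.json",
    "results/priority_metrics.json",
]


def missing_artifacts(present: Iterable[str],
                      expected: Iterable[str] = EXPECTED_ARTIFACTS) -> List[str]:
    """Return the expected paths that are absent from the repo, sorted."""
    return sorted(set(expected) - set(present))


def check_model_repo(repo_id: str) -> List[str]:
    """List the model repo's files on the Hub and report what is missing."""
    from huggingface_hub import HfApi  # imported lazily; needs network + auth

    return missing_artifacts(HfApi().list_repo_files(repo_id))
```

An empty list from `check_model_repo("your-username/DAHS-Models")` means every artifact in the subset was pushed.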

2. Create the model repo (one-time)

This is where artifacts go and survive runtime termination.

  1. Go to https://huggingface.co/new and choose Model, not Space.
  2. Owner: your username. Name: DAHS-Models. Visibility: your choice.
  3. Click Create repository. Keep it empty; the run populates it.

Note the full id: your-username/DAHS-Models.
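
The same repo can also be created from Python instead of the web form. This is a sketch under the assumption that `huggingface_hub` is installed and you are logged in with an account that can create repos; `full_repo_id` and `create_model_repo` are hypothetical helpers.

```python
def full_repo_id(username: str, name: str = "DAHS-Models") -> str:
    """Build the full repo id in the owner/name form used throughout this guide."""
    return f"{username}/{name}"


def create_model_repo(username: str, private: bool = True) -> str:
    """Create the (empty) model repo if it does not exist yet; return its id."""
    from huggingface_hub import create_repo  # imported lazily; needs network + auth

    repo_id = full_repo_id(username)
    # exist_ok=True makes the call safe to repeat.
    create_repo(repo_id, repo_type="model", private=private, exist_ok=True)
    return repo_id
```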


3. Create a fine-grained access token

  1. https://huggingface.co/settings/tokens → Create new token → Fine-grained.
  2. Repository permissions → click Add repository → select your-username/DAHS-Models → check Write access to contents and discussions.
  3. (Optional) also grant Manage repo to the Space if you want auto-pause on completion.
  4. Copy the token starting with hf_…; you'll paste it in step 5.

4. Create the Space

  1. https://huggingface.co/new-space → name DAHS-Training.
  2. SDK: Docker.
  3. Hardware: pick CPU upgrade (16 vCPU, 64 GB RAM).
  4. Visibility: your choice. Click Create Space.

5. Configure secrets (Space → Settings → Variables and secrets)

| Name | Type | Value |
|---|---|---|
| HF_TOKEN | Secret | hf_… token from step 3 |
| REPO_ID | Variable | your-username/DAHS-Models |
| SPACE_ID | Variable | your-username/DAHS-Training (auto-pause target) |
| DAHS_SCENARIOS | Variable (optional) | Override default 5000 scenarios |
| DAHS_EVAL_SEEDS | Variable (optional) | Override default 1000 eval seeds |

SPACE_ID controls auto-pause after the run; without it you must pause manually to stop billing.
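
Secrets and variables reach the container as environment variables. The sketch below shows one way a runner could read them with the defaults from the table; it is an assumption about the general shape, not the actual code in `scripts/hf_runner.py`, and `RunConfig` / `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RunConfig:
    token: str
    repo_id: str
    space_id: Optional[str]  # None means no auto-pause: pause manually
    scenarios: int
    eval_seeds: int


def load_config(env: Mapping[str, str] = os.environ) -> RunConfig:
    """Read the Space settings, failing fast if a required one is missing."""
    missing = [key for key in ("HF_TOKEN", "REPO_ID") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return RunConfig(
        token=env["HF_TOKEN"],
        repo_id=env["REPO_ID"],
        space_id=env.get("SPACE_ID") or None,
        # Unset or empty optional variables fall back to the defaults.
        scenarios=int(env.get("DAHS_SCENARIOS") or 5000),
        eval_seeds=int(env.get("DAHS_EVAL_SEEDS") or 1000),
    )
```

Failing at startup on a missing HF_TOKEN or REPO_ID is cheaper than discovering the problem after hours of unsaved work.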


6. Push the code to the Space

From the project root, with your Hub credentials configured:

git lfs install                                    # only once per machine
git remote add space https://huggingface.co/spaces/your-username/DAHS-Training
git add Dockerfile requirements.txt src/ scripts/ tests/
git add README.md HF_UPLOAD_GUIDE.md server.py start.py
git commit -m "DAHS_2 Q1 pipeline"
git push space main

Alternatively, drag the files into the Space's web file browser. Either way, the Dockerfile at the repo root is what the Space builds, and its CMD ["python", "scripts/hf_runner.py"] is the entrypoint.


7. Watch the build and run

  1. Space opens → Logs tab shows Docker build (3–5 min on first push).
  2. Once the container starts you should see:
    --- DAHS_2 HF RUNNER STARTING ---
    CPUs : 16, workers=15
    Repo : your-username/DAHS-Models
    [hub] periodic uploader started (every 300s)
    [ok] dummy health server on :7860
    --- PIPELINE: 5000 scenarios, 1000 eval seeds, 15 workers ---
    
  3. Within ~5 min the model repo should receive its first commit (results/run_manifest.json and results/pip_freeze.txt). Verify at https://huggingface.co/your-username/DAHS-Models/commits/main. If no commit appears within 10 minutes, the token or REPO_ID is most likely wrong. Stop the Space immediately and re-check steps 3 and 5.
  4. New commits land every 5 minutes. Per-step commits (selector_dataset, priority_dataset, selector_models, priority_model, evaluation) land as each pipeline phase finishes.

Total expected wall time on 16 vCPU: 2–4 hours.
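
The commit check from step 3 can be scripted instead of refreshing the commits page. This is a minimal sketch, assuming `huggingface_hub` is installed and your credentials can read the model repo; the 10-minute threshold comes from the guide, while `looks_stalled` and `check_repo_progress` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def looks_stalled(commit_times: Iterable[datetime], started_at: datetime,
                  now: datetime, max_gap_s: float = 600.0) -> bool:
    """True if no commit has landed for max_gap_s since the run started.

    Commits older than started_at (for example the repo's initial commit)
    are ignored, so an untouched repo counts from the start time.
    """
    recent = [t for t in commit_times if t >= started_at]
    reference = max(recent, default=started_at)
    return (now - reference).total_seconds() > max_gap_s


def check_repo_progress(repo_id: str, started_at: datetime) -> bool:
    """Fetch commit timestamps from the Hub and apply the stall check."""
    from huggingface_hub import HfApi  # imported lazily; needs network + auth

    commits = HfApi().list_repo_commits(repo_id)
    times = [c.created_at for c in commits]
    return looks_stalled(times, started_at, datetime.now(timezone.utc))
```

Pass a timezone-aware `started_at` (UTC) so it compares cleanly with the timestamps returned by the Hub.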


8. After the run

  • results/run_status.txt will read SUCCESS or FAILED (exit N).
  • The Space auto-pauses if SPACE_ID was set. Verify the Status badge shows Paused so you stop being billed.
  • All artifacts are in your-username/DAHS-Models. Pull them locally with:
    python scripts/download_hf_artifacts.py
    
    or via:
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id="your-username/DAHS-Models",
                      local_dir="./pulled_artifacts")
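
To check the outcome without opening the repo in a browser, the status file can be fetched and parsed. The sketch assumes the file contains exactly one of the two forms listed above, SUCCESS or FAILED (exit N); `parse_run_status` and `fetch_run_status` are hypothetical helpers.

```python
import re
from typing import Tuple


def parse_run_status(text: str) -> Tuple[bool, int]:
    """Parse run_status.txt into (succeeded, exit_code)."""
    text = text.strip()
    if text == "SUCCESS":
        return True, 0
    match = re.fullmatch(r"FAILED \(exit (-?\d+)\)", text)
    if match:
        return False, int(match.group(1))
    raise ValueError(f"Unrecognized run status: {text!r}")


def fetch_run_status(repo_id: str) -> Tuple[bool, int]:
    """Download results/run_status.txt from the model repo and parse it."""
    from huggingface_hub import hf_hub_download  # needs network + auth

    path = hf_hub_download(repo_id=repo_id, filename="results/run_status.txt")
    with open(path, encoding="utf-8") as handle:
        return parse_run_status(handle.read())
```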
    

9. What survives if the runtime is killed mid-run?

Three independent persistence layers protect against the previous "models disappeared" failure:

| Layer | Trigger | What it uploads |
|---|---|---|
| Per-step | After each pipeline phase | The folder produced by that phase |
| Periodic | Every 5 min (background thread) | All of data/, models/, results/, logs/ |
| Terminal | SIGTERM / SIGINT / atexit | Final consolidated upload |

Worst-case loss: ~5 min of work between periodic uploads, never the whole run. Each upload is retried with exponential backoff (4 attempts) so a flaky Hub call won't lose state.
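
The retry and periodic layers can be sketched as below. This illustrates the general technique (exponential backoff plus a background timer thread) and is not the actual code in `src/hf_persistence.py`; the function names, the 2-second base delay, and the injected `sleep` parameter are assumptions.

```python
import threading
import time
from typing import Callable


def upload_with_retry(upload: Callable[[], None], attempts: int = 4,
                      base_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call upload(), retrying on failure with delays of 2 s, 4 s, 8 s, ..."""
    last_error = None
    for attempt in range(attempts):
        try:
            upload()
            return True
        except Exception as exc:  # any Hub or network error triggers a retry
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    print(f"[hub] upload failed after {attempts} attempts: {last_error}")
    return False


def start_periodic_uploader(upload: Callable[[], None],
                            interval_s: float = 300.0) -> threading.Event:
    """Run upload_with_retry every interval_s seconds in a daemon thread.

    Returns an Event; call .set() on it to stop the loop.
    """
    stop = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            upload_with_retry(upload)

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Returning False instead of raising after the last attempt keeps a failed upload from killing the training run; the next periodic tick simply tries again.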


10. Sanity-check checklist before clicking "Run"

Before you spend any credits, verify the local checks pass:

# from repo root
pip install -r requirements.txt
python -c "from src.hf_persistence import HubPersistor, from_env; print('OK')"
python -m pytest tests/ -q                 # unit tests
python scripts/run_pipeline.py --quick     # 50 scenarios, 20 eval seeds
                                           # finishes in ~2-3 minutes locally

If --quick produces models/*.joblib, results/selector_metrics.json, results/priority_metrics.json, results/benchmark_summary.json, and results/paper_summary_table.csv, the pipeline is verified end-to-end. You can then push to the Space with confidence.
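
That file check can be automated with a few lines. The sketch below looks only for the outputs named in the previous paragraph; `quick_run_problems` is a hypothetical helper, not a script shipped in the repository.

```python
from pathlib import Path
from typing import List

# The outputs the --quick run is expected to produce, per the text above.
REQUIRED_FILES = [
    "results/selector_metrics.json",
    "results/priority_metrics.json",
    "results/benchmark_summary.json",
    "results/paper_summary_table.csv",
]


def quick_run_problems(root: str = ".") -> List[str]:
    """Return the expected outputs that are missing under root."""
    base = Path(root)
    problems = [rel for rel in REQUIRED_FILES if not (base / rel).is_file()]
    # At least one serialized model must exist in models/.
    if not list((base / "models").glob("*.joblib")):
        problems.append("models/*.joblib")
    return problems
```

Run it from the repo root after `--quick`; an empty list means the pipeline produced everything the guide expects.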


11. Re-running

To re-run with different scenario/seed counts without rebuilding:

  1. Open the Space → Settings → Variables and secrets
  2. Edit DAHS_SCENARIOS / DAHS_EVAL_SEEDS
  3. Restart Space (not Factory rebuild; a restart is much faster)

Each re-run produces a new commit on the model repo, so you can compare runs side-by-side without overwriting prior artifacts.
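
Because every run is a separate commit, two runs can be compared by downloading the same file at two revisions. The sketch assumes `huggingface_hub` is installed and, importantly, that the metrics JSON has numeric values at the top level; the real structure of `results/selector_metrics.json` is not documented here, so adapt `metric_deltas` to whatever the file actually contains. Both helper names are hypothetical.

```python
import json
from typing import Any, Dict


def metric_deltas(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, float]:
    """Return new - old for every top-level numeric key present in both."""
    def is_number(value: Any) -> bool:
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    return {
        key: new[key] - old[key]
        for key in sorted(set(old) & set(new))
        if is_number(old[key]) and is_number(new[key])
    }


def load_metrics_at(repo_id: str, revision: str,
                    filename: str = "results/selector_metrics.json") -> Dict[str, Any]:
    """Download a metrics file as it existed at a given commit and parse it."""
    from huggingface_hub import hf_hub_download  # needs network + auth

    path = hf_hub_download(repo_id=repo_id, filename=filename, revision=revision)
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

The commit hashes to pass as `revision` are listed on the model repo's commits page.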