hieu3636
/

cxr-vlm-code

convitom committed 19 days ago

Commit

9dadb47

1 Parent(s): c369576

f

Files changed (2)

scripts/cxrvlm_colab_train.ipynb +107 -365
training/train.py +214 -15
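
The notebook text in this commit says that `train.py --mode resume` reuses the latest matching `{DATASET}_run_N` folder and auto-detects which stage to continue by inspecting checkpoints on disk. The `train.py` hunk is not shown here, so the following is only a minimal sketch of how such detection could work, not the commit's implementation. The helper names (`latest_run_id`, `latest_checkpoint`, `detect_resume_point`) and the use of a `<stage>_final.pt` file as the local completion marker are assumptions made for illustration; the folder names `stage1_projection/`, `stage2_instruct/` and `checkpoint-<step>/` come from the notebook.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional, Tuple

# Stage folder names are taken from the notebook text. Treating
# "<stage>_final.pt" as the "stage finished" marker is an ASSUMPTION.
STAGES = [
    (1, "stage1_projection", "stage1_final.pt"),
    (2, "stage2_instruct", "stage2_final.pt"),
]


def latest_run_id(ckpt_root: Path, dataset: str) -> Optional[str]:
    """Return the '<dataset>_run_N' folder name with the largest N, or None."""
    pattern = re.compile(rf"^{re.escape(dataset)}_run_(\d+)$")
    best = None
    for p in ckpt_root.iterdir():
        m = pattern.match(p.name)
        if p.is_dir() and m and (best is None or int(m.group(1)) > best[0]):
            best = (int(m.group(1)), p.name)
    return best[1] if best else None


def latest_checkpoint(stage_dir: Path) -> Optional[Path]:
    """Return the 'checkpoint-<step>' folder with the highest step, or None."""
    ckpts = [p for p in stage_dir.glob("checkpoint-*")
             if p.name.split("-")[1].isdigit()]
    # Compare steps as integers, not as strings.
    return max(ckpts, key=lambda p: int(p.name.split("-")[1]), default=None)


def detect_resume_point(run_dir: Path) -> Tuple[Optional[int], Optional[Path]]:
    """Return (stage, checkpoint). (None, None) means both stages are done."""
    for stage, subdir, final_name in STAGES:
        stage_dir = run_dir / subdir
        if (stage_dir / final_name).exists():
            continue  # this stage finished, check the next one
        ckpt = latest_checkpoint(stage_dir) if stage_dir.exists() else None
        return stage, ckpt  # ckpt is None -> start this stage from scratch
    return None, None


# Hypothetical layout: stage 1 finished, stage 2 interrupted after step 1000.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "IU-Xray_run_1").mkdir()
    run = root / "IU-Xray_run_2"
    (run / "stage1_projection").mkdir(parents=True)
    (run / "stage1_projection" / "stage1_final.pt").touch()
    (run / "stage2_instruct" / "checkpoint-200").mkdir(parents=True)
    (run / "stage2_instruct" / "checkpoint-1000").mkdir()

    run_id = latest_run_id(root, "IU-Xray")
    stage, ckpt = detect_resume_point(root / run_id)
    print(run_id, stage, ckpt.name)  # IU-Xray_run_2 2 checkpoint-1000
```

Sorting by the integer step rather than by folder name matters here: a lexical sort would rank `checkpoint-200` above `checkpoint-1000` and resume from the older checkpoint.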

scripts/cxrvlm_colab_train.ipynb CHANGED

@@ -966,174 +966,91 @@
   {
    "cell_type": "markdown",
    "metadata": {
-    "id": "cell-resume-md"
    },
    "source": [
-    "## 5b. Resume a previous run (only if you were interrupted)\n",
     "\n",
-    "**Skip this section if you're starting fresh.** Set `RESUME_STAGE = None` in the cell below and run Stage 1 → Stage 2 normally.\n",
     "\n",
-    "### When to resume\n",
     "\n",
-    "| Situation | What to set |\n",
-    "|---|---|\n",
-    "| Stage 1 interrupted mid-training, same VM | `RESUME_STAGE=1`, `EXPLICIT_RUN_ID=None` |\n",
-    "| Stage 1 interrupted, **new VM** (Colab disconnect) | `RESUME_STAGE=1`, `EXPLICIT_RUN_ID='IU-Xray_run_1'` |\n",
-    "| Stage 1 finished, continue with stage 2 | `RESUME_STAGE=2`, `EXPLICIT_RUN_ID=None` (same VM) or set it (new VM) |\n",
-    "| Stage 2 interrupted mid-training, same VM | `RESUME_STAGE=2`, `EXPLICIT_RUN_ID=None` |\n",
-    "| Stage 2 interrupted, **new VM** | `RESUME_STAGE=2`, `EXPLICIT_RUN_ID='IU-Xray_run_1'` |\n",
     "\n",
-    "### If the VM is new (Colab has disconnected at least once)\n",
-    "\n",
-    "You must re-run **all cells from the top** before reaching this point:\n",
-    "1. `cell-select` → `cell-env` → `cell-paths` (pull code + data from HF)\n",
-    "2. `cell-pip` → **Runtime → Restart session** → `cell-pip-verify`\n",
-    "3. `cell-hf-token` → `cell-cfg` → `cell-sanity`\n",
-    "4. **Skip the default `cell-stage1` / `cell-stage2`**: run `cell-resume` just below, then run the matching train cell.\n",
-    "\n",
-    "The resume cell below automatically:\n",
-    "- Pulls the latest `checkpoint-<step>` folders from HF Hub (from `stage1/` or `stage2/` on the hub).\n",
-    "- Renames `stage1/` → `stage1_projection/` and `stage2/` → `stage2_instruct/` to match the local layout that `train.py` expects.\n",
-    "- Rewrites `run_id.txt` in `CKPT_ROOT`.\n",
-    "- Finds the checkpoint with the highest step and sets `RESUME_FROM` for the train cell.\n",
-    "\n",
-    "### Where is continued training saved on HF?\n",
-    "\n",
-    "**Still the same `<RUN_ID>/` folder on HF.** `HFRunTracker` reuses the resolved `run_id`, so:\n",
-    "- `hieu3636/cxr-vlm-runs/IU-Xray_run_1/stage2/checkpoint-XXXX/` (intermediate, callback every 200 steps)\n",
-    "- `hieu3636/cxr-vlm-runs/IU-Xray_run_1/stage2/stage2_final.pt` (final)\n",
-    "- `hieu3636/cxr-vlm-runs/IU-Xray_run_1/meta.json` (merged with the old meta, increments `resume_count`)\n",
-    "\n",
-    "No new `run_2` is created.\n",
-    "\n",
-    "### Auto-detect run_id\n",
-    "\n",
-    "`train.py` resolves `run_id` in this order:\n",
-    "1. `--run_id` CLI (the resume cell passes this when you set `EXPLICIT_RUN_ID`).\n",
-    "2. `run_id.txt` in `CKPT_ROOT` (written by the resume cell).\n",
-    "3. If both are empty and `--resume_from` is given → find the latest run on HF automatically.\n"
    ],
    "id": "cell-resume-md"
   },
   {
    "cell_type": "code",
    "metadata": {
-    "id": "cell-resume",
     "colab": {
      "base_uri": "https://localhost:8080/"
     },
     "outputId": "6f1b15fb-e751-4209-ea52-7a98a8c40212"
    },
    "execution_count": null,
-   "outputs": [
-    {
-     "output_type": "stream",
-     "name": "stdout",
-     "text": [
-      "RESUME_STAGE=None — fresh run. Skip this cell; go straight to cell-stage1.\n"
-     ]
-    }
-   ],
    "source": [
-    "# Resume controller — edit 2 variables, run once, then run cell-stage1 or cell-stage2.\n",
-    "RESUME_STAGE    = None       # None | 1 | 2    (None = fresh run, skip this cell)\n",
-    "EXPLICIT_RUN_ID = \"IU-Xray_run_1\"       # None | \"IU-Xray_run_1\"   (set this if VM is fresh)\n",
-    "\n",
-    "RESUME_FROM   = None\n",
-    "RESUME_RUN_ID = None\n",
-    "\n",
-    "if RESUME_STAGE is not None:\n",
-    "    assert RESUME_STAGE in (1, 2), \"RESUME_STAGE must be 1 or 2\"\n",
-    "\n",
-    "    # 1) Resolve run_id: explicit > local state_file\n",
-    "    if EXPLICIT_RUN_ID:\n",
-    "        RESUME_RUN_ID = EXPLICIT_RUN_ID\n",
-    "        CKPT_ROOT.mkdir(parents=True, exist_ok=True)\n",
-    "        (CKPT_ROOT / \"run_id.txt\").write_text(RESUME_RUN_ID)\n",
-    "        print(f\"Using EXPLICIT_RUN_ID = {RESUME_RUN_ID} (wrote run_id.txt)\")\n",
-    "    else:\n",
-    "        state_file = CKPT_ROOT / \"run_id.txt\"\n",
-    "        assert state_file.exists(), (\n",
-    "            \"No local run_id.txt — looks like a fresh VM. \"\n",
-    "            \"Set EXPLICIT_RUN_ID to the run folder on HF (e.g. \\\"IU-Xray_run_1\\\").\"\n",
-    "        )\n",
-    "        RESUME_RUN_ID = state_file.read_text().strip()\n",
-    "        print(f\"Using run_id from state file: {RESUME_RUN_ID}\")\n",
-    "\n",
-    "    # 2) Local subdir names (code expects long names; HF uses short \"stage1\"/\"stage2\")\n",
-    "    local_subdir  = \"stage1_projection\" if RESUME_STAGE == 1 else \"stage2_instruct\"\n",
-    "    remote_subdir = \"stage1\"             if RESUME_STAGE == 1 else \"stage2\"\n",
-    "    local_stage_dir = CKPT_ROOT / RESUME_RUN_ID / local_subdir\n",
-    "    local_stage_dir.mkdir(parents=True, exist_ok=True)\n",
-    "\n",
-    "    # 3) If no local checkpoints, pull from HF\n",
-    "    existing = sorted(local_stage_dir.glob(\"checkpoint-*\"),\n",
-    "                      key=lambda p: int(p.name.split(\"-\")[1]))\n",
-    "    if not existing:\n",
-    "        print(f\"No local checkpoints under {local_stage_dir} — pulling from HF Hub …\")\n",
-    "        from huggingface_hub import snapshot_download\n",
-    "        hub_prefix = f\"{RESUME_RUN_ID}/{remote_subdir}/\"\n",
-    "        pulled = snapshot_download(\n",
-    "            repo_id   = train_cfg.hf_hub.repo_id,\n",
-    "            repo_type = \"model\",\n",
-    "            token     = os.environ[\"HF_TOKEN\"],\n",
-    "            allow_patterns = [f\"{hub_prefix}checkpoint-*/**\"],\n",
-    "            local_dir = str(WORK / \"hf_pull\"),\n",
-    "        )\n",
-    "        hub_stage_dir = Path(pulled) / RESUME_RUN_ID / remote_subdir\n",
-    "        assert hub_stage_dir.exists() and any(hub_stage_dir.glob(\"checkpoint-*\")), (\n",
-    "            f\"No checkpoints found under {hub_prefix} on HF repo \"\n",
-    "            f\"{train_cfg.hf_hub.repo_id}. Did you set the right EXPLICIT_RUN_ID?\"\n",
-    "        )\n",
-    "        for ck in hub_stage_dir.glob(\"checkpoint-*\"):\n",
-    "            dst = local_stage_dir / ck.name\n",
-    "            if dst.exists():\n",
-    "                shutil.rmtree(dst)\n",
-    "            shutil.move(str(ck), str(dst))\n",
-    "        print(f\"Pulled {len(list(local_stage_dir.glob('checkpoint-*')))} checkpoint(s) → {local_stage_dir}\")\n",
-    "        existing = sorted(local_stage_dir.glob(\"checkpoint-*\"),\n",
-    "                          key=lambda p: int(p.name.split(\"-\")[1]))\n",
-    "\n",
-    "    assert existing, f\"Still no checkpoints under {local_stage_dir}\"\n",
-    "\n",
-    "    # 4) Latest checkpoint = highest global_step\n",
-    "    RESUME_FROM = existing[-1]\n",
-    "    print()\n",
-    "    print(f\"✓ Ready to resume STAGE {RESUME_STAGE} of run {RESUME_RUN_ID}\")\n",
-    "    print(f\"  checkpoints on disk : {[c.name for c in existing]}\")\n",
-    "    print(f\"  will resume from    : {RESUME_FROM}\")\n",
-    "    print()\n",
-    "    print(f\"→ Now run cell-stage{RESUME_STAGE} below.\")\n",
     "else:\n",
-    "    print(\"RESUME_STAGE=None — fresh run. Skip this cell; go straight to cell-stage1.\")\n"
    ],
    "id": "cell-resume"
   },
   {
    "cell_type": "markdown",
    "metadata": {
-    "id": "cell-stage1-md"
    },
    "source": [
-    "## 6. Stage 1 — projection layer only (~2 epochs)\n",
     "\n",
-    "First launch creates `{DATASET_NAME}_run_1/` on HF and on disk. Subsequent fresh launches auto-increment to `run_2`, `run_3`, … — tracked via `ckpt/run_id.txt`.\n",
     "\n",
-    "If you need to continue training from an existing checkpoint, pass `--resume_from <ckpt>` — that reuses the same `run_N` folder."
    ],
    "id": "cell-stage1-md"
   },
   {
    "cell_type": "code",
    "metadata": {
-    "id": "cell-stage1",
     "colab": {
      "base_uri": "https://localhost:8080/"
     },
     "outputId": "c7d6c209-6790-473c-c1b7-a44441141785"
    },
    "source": [
-    "# Picks up RESUME_FROM / RESUME_RUN_ID from cell-resume (None if fresh run).\n",
     "import time as _time, json as _json\n",
     "from datetime import datetime as _dt, timezone as _tz\n",
     "from pathlib import Path as _Path\n",
@@ -1146,16 +1063,13 @@
     "    try:\n",
     "        from huggingface_hub import hf_hub_download\n",
     "        hf_hub_download(\n",
-    "            repo_id   = repo_id,\n",
-    "            repo_type = \"model\",\n",
-    "            filename  = f\"{run_id}/timing.json\",\n",
-    "            token     = token,\n",
-    "            local_dir = str(ckpt_root),\n",
     "        )\n",
     "        print(f\"[TIMING] pulled previous timing.json from HF → {local}\")\n",
-    "    except Exception as e:\n",
-    "        # First run for this run_id → no remote file yet. That's fine.\n",
-    "        pass\n",
     "\n",
     "def _push_timing_to_hf(run_id, ckpt_root, repo_id, token):\n",
     "    # Upload local timing.json to HF Hub under {run_id}/timing.json.\n",
@@ -1165,247 +1079,87 @@
     "    try:\n",
     "        from huggingface_hub import HfApi\n",
     "        HfApi(token=token).upload_file(\n",
-    "            path_or_fileobj = str(local),\n",
-    "            path_in_repo   = f\"{run_id}/timing.json\",\n",
-    "            repo_id        = repo_id,\n",
-    "            repo_type      = \"model\",\n",
-    "            commit_message = f\"timing.json @ {run_id}\",\n",
     "        )\n",
     "        print(f\"[TIMING] uploaded timing.json to HF → {repo_id}/{run_id}/timing.json\")\n",
     "    except Exception as e:\n",
     "        print(f\"[TIMING] upload failed (non-fatal): {e}\")\n",
     "\n",
     "\n",
-    "_resume_args = \"\"\n",
-    "_is_resume = False\n",
-    "if \"RESUME_FROM\" in dir() and RESUME_FROM and RESUME_STAGE == 1:\n",
-    "    _resume_args = f'--resume_from \"{RESUME_FROM}\" --run_id \"{RESUME_RUN_ID}\"'\n",
-    "    _is_resume = True\n",
-    "    print(\"▶ STAGE 1 resuming from\", RESUME_FROM)\n",
-    "else:\n",
-    "    print(\"▶ STAGE 1 fresh run\")\n",
-    "\n",
-    "# ─── Pre-pull timing.json from HF if resuming (best-effort) ────────────────\n",
     "_hf_repo  = getattr(train_cfg.hf_hub, \"repo_id\", None) if train_cfg.hf_hub.enabled else None\n",
     "_hf_token = os.environ.get(\"HF_TOKEN\")\n",
-    "if _is_resume and \"RESUME_RUN_ID\" in dir() and RESUME_RUN_ID:\n",
-    "    _pull_timing_from_hf(RESUME_RUN_ID, CKPT_ROOT, _hf_repo, _hf_token)\n",
     "\n",
-    "# ─── Start-of-stage timer ──────────────────────────────────────────────────\n",
-    "_t0_stage1 = _time.time()\n",
-    "_iso_start_stage1 = _dt.now(_tz.utc).isoformat(timespec=\"seconds\")\n",
     "\n",
-    "!HF_HUB_DISABLE_PROGRESS_BARS=1 TRANSFORMERS_VERBOSITY=warning TOKENIZERS_PARALLELISM=false BITSANDBYTES_NOWELCOME=1 PYTHONUNBUFFERED=1 \\\n",
-    "python -u -m training.train \\\n",
-    "    --model_config configs/model_config.yaml \\\n",
-    "    --train_config configs/train_config.yaml \\\n",
-    "    --stage 1 {_resume_args}\n",
-    "\n",
-    "# ─── End-of-stage timer + persist cumulative time ──────────────────────────\n",
-    "_elapsed_stage1 = _time.time() - _t0_stage1\n",
-    "_run_id_file = CKPT_ROOT / \"run_id.txt\"\n",
-    "if _run_id_file.exists():\n",
-    "    _run_id_now = _run_id_file.read_text().strip()\n",
-    "    _timing_path = CKPT_ROOT / _run_id_now / \"timing.json\"\n",
-    "    _timing_path.parent.mkdir(parents=True, exist_ok=True)\n",
-    "    _t = _json.loads(_timing_path.read_text()) if _timing_path.exists() else {\n",
-    "        \"stage1_elapsed_sec\":   0.0,\n",
-    "        \"stage2_elapsed_sec\":   0.0,\n",
-    "        \"resume_count_stage1\":  0,\n",
-    "        \"resume_count_stage2\":  0,\n",
-    "        \"first_started_at\":     None,\n",
-    "        \"last_finished_at\":     None,\n",
-    "        \"session_history\":      [],\n",
-    "    }\n",
-    "    if _t.get(\"first_started_at\") is None:\n",
-    "        _t[\"first_started_at\"] = _iso_start_stage1\n",
-    "    _t[\"stage1_elapsed_sec\"]  = float(_t.get(\"stage1_elapsed_sec\", 0.0)) + _elapsed_stage1\n",
-    "    _t[\"resume_count_stage1\"] = int(_t.get(\"resume_count_stage1\", 0)) + (1 if _is_resume else 0)\n",
-    "    _t[\"last_finished_at\"]    = _dt.now(_tz.utc).isoformat(timespec=\"seconds\")\n",
-    "    _t.setdefault(\"session_history\", []).append({\n",
-    "        \"stage\":      1,\n",
-    "        \"resumed\":    _is_resume,\n",
-    "        \"started\":    _iso_start_stage1,\n",
-    "        \"finished\":   _t[\"last_finished_at\"],\n",
-    "        \"elapsed_sec\": _elapsed_stage1,\n",
-    "    })\n",
-    "    _timing_path.write_text(_json.dumps(_t, indent=2))\n",
-    "\n",
-    "    # ─── Push to HF Hub so the timer survives a fresh VM ─────────────────\n",
-    "    _push_timing_to_hf(_run_id_now, CKPT_ROOT, _hf_repo, _hf_token)\n",
     "\n",
-    "    def _fmt(sec):\n",
-    "        h, r = divmod(int(sec), 3600); m, s = divmod(r, 60); return f\"{h:d}h {m:02d}m {s:02d}s\"\n",
-    "    print()\n",
-    "    print(f\"[TIMING] Stage 1 this session : {_fmt(_elapsed_stage1)}\")\n",
-    "    print(f\"[TIMING] Stage 1 cumulative   : {_fmt(_t['stage1_elapsed_sec'])}  (resumes so far: {_t['resume_count_stage1']})\")\n",
-    "    print(f\"[TIMING] persisted to         : {_timing_path}\")\n",
-    "else:\n",
-    "    print(\"[TIMING] run_id.txt missing — could not persist timing (training likely failed before resolve_run_id ran).\")\n"
-   ],
-   "execution_count": null,
-   "outputs": [],
-   "id": "cell-stage1"
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "id": "cell-stage2-md"
-   },
-   "source": [
-    "## 7. Stage 2 — projection + LoRA instruction tuning\n",
-    "\n",
-    "Kaggle caps a GPU session at 9h. If Stage 2 doesn't finish, Persistence keeps the Trainer checkpoints in `/kaggle/working/ckpt/{RUN_ID}/stage2_instruct/checkpoint-XXXX/` — resume next session with:\n",
-    "```\n",
-    "!python -m training.train --stage 2 \\\n",
-    "    --model_config configs/model_config.yaml --train_config configs/train_config.yaml \\\n",
-    "    --resume_from /kaggle/working/ckpt/{RUN_ID}/stage2_instruct/checkpoint-XXXX\n",
-    "```\n",
-    "This **does not** create a new `run_N+1` on HF — it reuses the existing run."
-   ],
-   "id": "cell-stage2-md"
-  },
-  {
-   "cell_type": "code",
-   "metadata": {
-    "id": "cell-stage2",
-    "colab": {
-     "base_uri": "https://localhost:8080/"
-    },
-    "outputId": "d9cd4a96-ec88-4907-fc0e-38b6eb2f66a7"
-   },
-   "source": [
-    "# Picks up RESUME_FROM / RESUME_RUN_ID from cell-resume (None if fresh run).\n",
-    "import time as _time, json as _json\n",
-    "from datetime import datetime as _dt, timezone as _tz\n",
-    "from pathlib import Path as _Path\n",
-    "\n",
-    "def _pull_timing_from_hf(run_id, ckpt_root, repo_id, token):\n",
-    "    # Pull timing.json from HF Hub for this run if not present locally.\n",
-    "    local = ckpt_root / run_id / \"timing.json\"\n",
-    "    if local.exists() or not repo_id or not token:\n",
-    "        return\n",
-    "    try:\n",
-    "        from huggingface_hub import hf_hub_download\n",
-    "        hf_hub_download(\n",
-    "            repo_id   = repo_id,\n",
-    "            repo_type = \"model\",\n",
-    "            filename  = f\"{run_id}/timing.json\",\n",
-    "            token     = token,\n",
-    "            local_dir = str(ckpt_root),\n",
-    "        )\n",
-    "        print(f\"[TIMING] pulled previous timing.json from HF → {local}\")\n",
-    "    except Exception as e:\n",
-    "        # First run for this run_id → no remote file yet. That's fine.\n",
-    "        pass\n",
-    "\n",
-    "def _push_timing_to_hf(run_id, ckpt_root, repo_id, token):\n",
-    "    # Upload local timing.json to HF Hub under {run_id}/timing.json.\n",
-    "    local = ckpt_root / run_id / \"timing.json\"\n",
-    "    if not local.exists() or not repo_id or not token:\n",
-    "        return\n",
-    "    try:\n",
-    "        from huggingface_hub import HfApi\n",
-    "        HfApi(token=token).upload_file(\n",
-    "            path_or_fileobj = str(local),\n",
-    "            path_in_repo   = f\"{run_id}/timing.json\",\n",
-    "            repo_id        = repo_id,\n",
-    "            repo_type      = \"model\",\n",
-    "            commit_message = f\"timing.json @ {run_id}\",\n",
-    "        )\n",
-    "        print(f\"[TIMING] uploaded timing.json to HF → {repo_id}/{run_id}/timing.json\")\n",
-    "    except Exception as e:\n",
-    "        print(f\"[TIMING] upload failed (non-fatal): {e}\")\n",
-    "\n",
-    "\n",
-    "_resume_args = \"\"\n",
-    "_is_resume = False\n",
-    "if \"RESUME_FROM\" in dir() and RESUME_FROM and RESUME_STAGE == 2:\n",
-    "    _resume_args = f'--resume_from \"{RESUME_FROM}\" --run_id \"{RESUME_RUN_ID}\"'\n",
-    "    _is_resume = True\n",
-    "    print(\"▶ STAGE 2 resuming from\", RESUME_FROM)\n",
-    "elif \"RESUME_RUN_ID\" in dir() and RESUME_RUN_ID:\n",
-    "    _resume_args = f'--run_id \"{RESUME_RUN_ID}\"'\n",
-    "    print(\"▶ STAGE 2 fresh start, pinned to run_id\", RESUME_RUN_ID)\n",
-    "else:\n",
-    "    # ─── FIX: pin stage 2 to the run_id stage 1 just wrote ────────────────\n",
-    "    # Without this, train.py treats stage 2 as a brand-new launch and\n",
-    "    # allocates a NEW run_N folder, splitting stage1/stage2 across two runs.\n",
-    "    _state_file = CKPT_ROOT / \"run_id.txt\"\n",
-    "    if _state_file.exists():\n",
-    "        _pinned = _state_file.read_text().strip()\n",
-    "        _resume_args = f'--run_id \"{_pinned}\"'\n",
-    "        print(f\"▶ STAGE 2 fresh, auto-pinned to run_id from state file: {_pinned}\")\n",
-    "    else:\n",
-    "        print(\"▶ STAGE 2 fresh (no state file — train.py will allocate a new run_id)\")\n",
-    "\n",
-    "# ─── Pre-pull timing.json from HF (in case of fresh VM) ───────────────────\n",
-    "_hf_repo  = getattr(train_cfg.hf_hub, \"repo_id\", None) if train_cfg.hf_hub.enabled else None\n",
-    "_hf_token = os.environ.get(\"HF_TOKEN\")\n",
-    "# Best guess at run_id BEFORE training (may be missing if stage 1 wasn't run here)\n",
-    "_pre_state = CKPT_ROOT / \"run_id.txt\"\n",
-    "if _pre_state.exists():\n",
-    "    _pull_timing_from_hf(_pre_state.read_text().strip(), CKPT_ROOT, _hf_repo, _hf_token)\n",
     "\n",
-    "# ─── Start-of-stage timer ──────────────────────────────────────────────────\n",
-    "_t0_stage2 = _time.time()\n",
-    "_iso_start_stage2 = _dt.now(_tz.utc).isoformat(timespec=\"seconds\")\n",
     "\n",
     "!HF_HUB_DISABLE_PROGRESS_BARS=1 TRANSFORMERS_VERBOSITY=warning TOKENIZERS_PARALLELISM=false BITSANDBYTES_NOWELCOME=1 PYTHONUNBUFFERED=1 \\\n",
     "python -u -m training.train \\\n",
     "    --model_config configs/model_config.yaml \\\n",
     "    --train_config configs/train_config.yaml \\\n",
-    "    --stage 2 {_resume_args}\n",
     "\n",
-    "# ─── End-of-stage timer + persist cumulative time ──────────────────────────\n",
-    "_elapsed_stage2 = _time.time() - _t0_stage2\n",
     "_run_id_file = CKPT_ROOT / \"run_id.txt\"\n",
     "if _run_id_file.exists():\n",
     "    _run_id_now = _run_id_file.read_text().strip()\n",
     "    _timing_path = CKPT_ROOT / _run_id_now / \"timing.json\"\n",
     "    _timing_path.parent.mkdir(parents=True, exist_ok=True)\n",
     "    _t = _json.loads(_timing_path.read_text()) if _timing_path.exists() else {\n",
-    "        \"stage1_elapsed_sec\":   0.0,\n",
-    "        \"stage2_elapsed_sec\":   0.0,\n",
-    "        \"resume_count_stage1\":  0,\n",
-    "        \"resume_count_stage2\":  0,\n",
-    "        \"first_started_at\":     None,\n",
-    "        \"last_finished_at\":     None,\n",
-    "        \"session_history\":      [],\n",
     "    }\n",
     "    if _t.get(\"first_started_at\") is None:\n",
-    "        _t[\"first_started_at\"] = _iso_start_stage2\n",
-    "    _t[\"stage2_elapsed_sec\"]  = float(_t.get(\"stage2_elapsed_sec\", 0.0)) + _elapsed_stage2\n",
-    "    _t[\"resume_count_stage2\"] = int(_t.get(\"resume_count_stage2\", 0)) + (1 if _is_resume else 0)\n",
-    "    _t[\"last_finished_at\"]    = _dt.now(_tz.utc).isoformat(timespec=\"seconds\")\n",
     "    _t.setdefault(\"session_history\", []).append({\n",
-    "        \"stage\":      2,\n",
-    "        \"resumed\":    _is_resume,\n",
-    "        \"started\":    _iso_start_stage2,\n",
-    "        \"finished\":   _t[\"last_finished_at\"],\n",
-    "        \"elapsed_sec\": _elapsed_stage2,\n",
     "    })\n",
     "    _timing_path.write_text(_json.dumps(_t, indent=2))\n",
-    "\n",
-    "    # ─── Push to HF Hub ──────────────────────────────────────────────────\n",
     "    _push_timing_to_hf(_run_id_now, CKPT_ROOT, _hf_repo, _hf_token)\n",
     "\n",
     "    def _fmt(sec):\n",
     "        h, r = divmod(int(sec), 3600); m, s = divmod(r, 60); return f\"{h:d}h {m:02d}m {s:02d}s\"\n",
-    "    _total = _t[\"stage1_elapsed_sec\"] + _t[\"stage2_elapsed_sec\"]\n",
     "    print()\n",
-    "    print(f\"[TIMING] Stage 2 this session : {_fmt(_elapsed_stage2)}\")\n",
-    "    print(f\"[TIMING] Stage 2 cumulative   : {_fmt(_t['stage2_elapsed_sec'])}  (resumes so far: {_t['resume_count_stage2']})\")\n",
-    "    print(f\"[TIMING] Stage 1 + Stage 2    : {_fmt(_total)}\")\n",
-    "    print(f\"[TIMING] first started at     : {_t.get('first_started_at')}\")\n",
-    "    print(f\"[TIMING] last finished at     : {_t.get('last_finished_at')}\")\n",
-    "    print(f\"[TIMING] persisted to         : {_timing_path}\")\n",
     "else:\n",
     "    print(\"[TIMING] run_id.txt missing — could not persist timing.\")\n"
    ],
    "execution_count": null,
    "outputs": [],
-   "id": "cell-stage2"
   },
   {
    "cell_type": "markdown",
@@ -1557,16 +1311,13 @@
     "    try:\n",
     "        from huggingface_hub import hf_hub_download\n",
     "        hf_hub_download(\n",
-    "            repo_id   = repo_id,\n",
-    "            repo_type = \"model\",\n",
-    "            filename  = f\"{run_id}/timing.json\",\n",
-    "            token     = token,\n",
-    "            local_dir = str(ckpt_root),\n",
     "        )\n",
     "        print(f\"[TIMING] pulled previous timing.json from HF → {local}\")\n",
-    "    except Exception as e:\n",
-    "        # First run for this run_id → no remote file yet. That's fine.\n",
-    "        pass\n",
     "\n",
     "def _push_timing_to_hf(run_id, ckpt_root, repo_id, token):\n",
     "    # Upload local timing.json to HF Hub under {run_id}/timing.json.\n",
@@ -1576,11 +1327,10 @@
     "    try:\n",
     "        from huggingface_hub import HfApi\n",
     "        HfApi(token=token).upload_file(\n",
-    "            path_or_fileobj = str(local),\n",
-    "            path_in_repo   = f\"{run_id}/timing.json\",\n",
-    "            repo_id        = repo_id,\n",
-    "            repo_type      = \"model\",\n",
-    "            commit_message = f\"timing.json @ {run_id}\",\n",
     "        )\n",
     "        print(f\"[TIMING] uploaded timing.json to HF → {repo_id}/{run_id}/timing.json\")\n",
     "    except Exception as e:\n",
@@ -1588,10 +1338,10 @@
     "\n",
     "\n",
     "_run_id_file = CKPT_ROOT / \"run_id.txt\"\n",
-    "assert _run_id_file.exists(), \"No run_id.txt — train at least one stage first.\"\n",
     "_run_id = _run_id_file.read_text().strip()\n",
     "\n",
-    "# Pull the latest timing.json from HF in case we're on a fresh VM.\n",
     "_hf_repo  = getattr(train_cfg.hf_hub, \"repo_id\", None) if train_cfg.hf_hub.enabled else None\n",
     "_hf_token = os.environ.get(\"HF_TOKEN\")\n",
     "_pull_timing_from_hf(_run_id, CKPT_ROOT, _hf_repo, _hf_token)\n",
@@ -1599,32 +1349,24 @@
     "_timing_path = CKPT_ROOT / _run_id / \"timing.json\"\n",
     "assert _timing_path.exists(), (\n",
     "    f\"No timing.json under {_timing_path.parent} (also not on HF). \"\n",
-    "    f\"Was the stage cell run via the wrapped version?\"\n",
     ")\n",
     "\n",
     "_t = _json.loads(_timing_path.read_text())\n",
     "\n",
     "def _fmt(sec):\n",
-    "    h, r = divmod(int(sec or 0), 3600)\n",
-    "    m, s = divmod(r, 60)\n",
-    "    return f\"{h:d}h {m:02d}m {s:02d}s\"\n",
-    "\n",
-    "_s1 = float(_t.get(\"stage1_elapsed_sec\", 0.0))\n",
-    "_s2 = float(_t.get(\"stage2_elapsed_sec\", 0.0))\n",
-    "_total = _s1 + _s2\n",
     "\n",
     "print(f\"Run                : {_run_id}\")\n",
     "print(f\"First started at   : {_t.get('first_started_at')}\")\n",
     "print(f\"Last finished at   : {_t.get('last_finished_at')}\")\n",
-    "print()\n",
-    "print(f\"Stage 1 cumulative : {_fmt(_s1)}   (resumes: {_t.get('resume_count_stage1', 0)})\")\n",
-    "print(f\"Stage 2 cumulative : {_fmt(_s2)}   (resumes: {_t.get('resume_count_stage2', 0)})\")\n",
-    "print(f\"TOTAL              : {_fmt(_total)}\")\n",
     "print()\n",
     "print(\"Session history    :\")\n",
     "for _i, _s in enumerate(_t.get(\"session_history\", []), 1):\n",
-    "    _tag = \"(resume)\" if _s.get(\"resumed\") else \"(fresh) \"\n",
-    "    print(f\"  {_i:2d}. stage {_s['stage']} {_tag}  {_fmt(_s['elapsed_sec'])}   {_s['started']} → {_s['finished']}\")\n"
    ],
    "outputs": [],
    "execution_count": null

   {
    "cell_type": "markdown",
    "metadata": {
+    "id": "cell-mode-md"
    },
    "source": [
+    "## 5b. Resume controller\n",
     "\n",
+    "Single switch. No more \"which stage\" — `train.py` auto-detects which stage\n",
+    "to continue from by inspecting checkpoints on disk.\n",
     "\n",
+    "| MODE                | What happens |\n",
+    "|---------------------|--------------|\n",
+    "| `'fresh'`           | Allocate a brand-new `{DATASET}_run_N+1` folder. Train both stages from scratch. |\n",
+    "| `'resume'`          | Reuse latest matching `{DATASET}_run_N` (or `EXPLICIT_RUN_ID`). Auto-detect: stage 1 mid-checkpoint, stage 1 done → stage 2 fresh, stage 2 mid-checkpoint, or both done. |\n",
     "\n",
+    "`EXPLICIT_RUN_ID` is optional (set to `None` to auto-pick the latest run on\n",
+    "disk or HF Hub that matches the current dataset prefix).\n",
     "\n",
+    "When `MODE='resume'` on a fresh VM the train cell will pull the previous\n",
+    "run's checkpoints from HF before training. The `--mode resume` flag in\n",
+    "`train.py` does the auto-detect — no further action needed in the notebook."
    ],
    "id": "cell-resume-md"
   },
   {
    "cell_type": "code",
    "metadata": {
+    "id": "cell-mode",
     "colab": {
      "base_uri": "https://localhost:8080/"
     },
     "outputId": "6f1b15fb-e751-4209-ea52-7a98a8c40212"
    },
    "execution_count": null,
+   "outputs": [],
    "source": [
+    "# Resume controller — set MODE once, run the train cell below.\n",
+    "MODE            = 'fresh'        # 'fresh' | 'resume'\n",
+    "EXPLICIT_RUN_ID = None           # None  | 'IU-Xray_run_5'   (only matters when MODE='resume')\n",
+    "\n",
+    "assert MODE in ('fresh', 'resume'), \"MODE must be 'fresh' or 'resume'\"\n",
+    "if MODE == 'resume' and EXPLICIT_RUN_ID:\n",
+    "    CKPT_ROOT.mkdir(parents=True, exist_ok=True)\n",
+    "    (CKPT_ROOT / 'run_id.txt').write_text(EXPLICIT_RUN_ID)\n",
+    "    print(f\"MODE=resume, pinning run_id = {EXPLICIT_RUN_ID}\")\n",
+    "elif MODE == 'resume':\n",
+    "    print(\"MODE=resume, run_id will be auto-resolved to the latest \"\n",
+    "          f\"'{DATASET_NAME}_run_*' on disk (or HF Hub).\")\n",
     "else:\n",
+    "    print(\"MODE=fresh — train.py will allocate a new run folder.\")\n"
    ],
    "id": "cell-resume"
   },
   {
    "cell_type": "markdown",
    "metadata": {
+    "id": "cell-train-md"
    },
    "source": [
+    "## 6. Train (both stages, one shot)\n",
     "\n",
+    "Single cell runs `train.py --mode {MODE}` which:\n",
     "\n",
+    "1. Resolves `run_id` (new vs. latest matching).\n",
+    "2. Prints the total step plan (stage 1 + stage 2 = global step count).\n",
+    "3. Auto-detects which stage to resume from by scanning the run folder.\n",
+    "4. Runs stage 1 (skip if already done), then stage 2.\n",
+    "5. Pushes intermediate `checkpoint-NNN/` + `best/` of each stage to HF Hub\n",
+    "   as before. `timing.json` is updated and uploaded after each stage.\n",
+    "\n",
+    "Kaggle / Colab caps a GPU session at ~9h. If the session dies mid-stage,\n",
+    "just re-run this cell with `MODE='resume'` — `train.py` picks up at the\n",
+    "last `checkpoint-NNN/` of whichever stage was in progress."
    ],
    "id": "cell-stage1-md"
   },
   {
    "cell_type": "code",
    "metadata": {
+    "id": "cell-train",
     "colab": {
      "base_uri": "https://localhost:8080/"
     },
     "outputId": "c7d6c209-6790-473c-c1b7-a44441141785"
    },
    "source": [
+    "# Unified train cell — drives both stages in one shot, with auto-resume.\n",
     "import time as _time, json as _json\n",
     "from datetime import datetime as _dt, timezone as _tz\n",
     "from pathlib import Path as _Path\n",
     "    try:\n",
     "        from huggingface_hub import hf_hub_download\n",
     "        hf_hub_download(\n",
+    "            repo_id=repo_id, repo_type=\"model\",\n",
+    "            filename=f\"{run_id}/timing.json\",\n",
+    "            token=token, local_dir=str(ckpt_root),\n",
     "        )\n",
     "        print(f\"[TIMING] pulled previous timing.json from HF → {local}\")\n",
+    "    except Exception:\n",
+    "        pass  # first time for this run_id → no remote file yet, fine\n",
     "\n",
     "def _push_timing_to_hf(run_id, ckpt_root, repo_id, token):\n",
     "    # Upload local timing.json to HF Hub under {run_id}/timing.json.\n",
     "    try:\n",
     "        from huggingface_hub import HfApi\n",
     "        HfApi(token=token).upload_file(\n",
+    "            path_or_fileobj=str(local),\n",
+    "            path_in_repo=f\"{run_id}/timing.json\",\n",
+    "            repo_id=repo_id, repo_type=\"model\",\n",
+    "            commit_message=f\"timing.json @ {run_id}\",\n",
     "        )\n",
     "        print(f\"[TIMING] uploaded timing.json to HF → {repo_id}/{run_id}/timing.json\")\n",
     "    except Exception as e:\n",
     "        print(f\"[TIMING] upload failed (non-fatal): {e}\")\n",
     "\n",
     "\n",
+    "assert MODE in ('fresh', 'resume')\n",
     "_hf_repo  = getattr(train_cfg.hf_hub, \"repo_id\", None) if train_cfg.hf_hub.enabled else None\n",
     "_hf_token = os.environ.get(\"HF_TOKEN\")\n",
     "\n",
+    "# ─── Pre-pull timing.json if resuming ──────────────────────────────────────\n",
+    "if MODE == 'resume':\n",
+    "    _pre_state = CKPT_ROOT / \"run_id.txt\"\n",
+    "    if _pre_state.exists():\n",
+    "        _pull_timing_from_hf(_pre_state.read_text().strip(), CKPT_ROOT, _hf_repo, _hf_token)\n",
     "\n",
+    "_run_id_arg = \"\"\n",
+    "if MODE == 'resume' and EXPLICIT_RUN_ID:\n",
+    "    _run_id_arg = f'--run_id \"{EXPLICIT_RUN_ID}\"'\n",
     "\n",
+    "print(f\"▶ Launching train.py --mode {MODE} {_run_id_arg}\")\n",
+    "print(\"  (Both stages dispatched in one call. Stage will be auto-detected for resume.)\")\n",
+    "print()\n",
     "\n",
+    "# ─── Wall-clock timing (entire training session) ──────────────────────────\n",
+    "_t0 = _time.time()\n",
+    "_iso_start = _dt.now(_tz.utc).isoformat(timespec=\"seconds\")\n",
     "\n",
     "!HF_HUB_DISABLE_PROGRESS_BARS=1 TRANSFORMERS_VERBOSITY=warning TOKENIZERS_PARALLELISM=false BITSANDBYTES_NOWELCOME=1 PYTHONUNBUFFERED=1 \\\n",
     "python -u -m training.train \\\n",
     "    --model_config configs/model_config.yaml \\\n",
     "    --train_config configs/train_config.yaml \\\n",
+    "    --mode {MODE} {_run_id_arg}\n",
+    "\n",
+    "_elapsed = _time.time() - _t0\n",
     "\n",
+    "# ─── Persist timing.json (cumulative across resumes) ──────────────────────\n",
     "_run_id_file = CKPT_ROOT / \"run_id.txt\"\n",
     "if _run_id_file.exists():\n",
     "    _run_id_now = _run_id_file.read_text().strip()\n",
     "    _timing_path = CKPT_ROOT / _run_id_now / \"timing.json\"\n",
     "    _timing_path.parent.mkdir(parents=True, exist_ok=True)\n",
     "    _t = _json.loads(_timing_path.read_text()) if _timing_path.exists() else {\n",
+    "        \"total_elapsed_sec\":   0.0,\n",
+    "        \"session_count\":       0,\n",
+    "        \"first_started_at\":    None,\n",
+    "        \"last_finished_at\":    None,\n",
+    "        \"session_history\":     [],\n",
     "    }\n",
     "    if _t.get(\"first_started_at\") is None:\n",
+    "        _t[\"first_started_at\"] = _iso_start\n",
+    "    _t[\"total_elapsed_sec\"] = float(_t.get(\"total_elapsed_sec\", 0.0)) + _elapsed\n",
+    "    _t[\"session_count\"]     = int(_t.get(\"session_count\", 0)) + 1\n",
+    "    _t[\"last_finished_at\"]  = _dt.now(_tz.utc).isoformat(timespec=\"seconds\")\n",
     "    _t.setdefault(\"session_history\", []).append({\n",
+    "        \"mode\":        MODE,\n",
+    "        \"started\":     _iso_start,\n",
+    "        \"finished\":    _t[\"last_finished_at\"],\n",
+    "        \"elapsed_sec\": _elapsed,\n",
     "    })\n",
     "    _timing_path.write_text(_json.dumps(_t, indent=2))\n",
     "    _push_timing_to_hf(_run_id_now, CKPT_ROOT, _hf_repo, _hf_token)\n",
     "\n",
     "    def _fmt(sec):\n",
     "        h, r = divmod(int(sec), 3600); m, s = divmod(r, 60); return f\"{h:d}h {m:02d}m {s:02d}s\"\n",
     "    print()\n",
+    "    print(f\"[TIMING] This session   : {_fmt(_elapsed)}\")\n",
+    "    print(f\"[TIMING] Cumulative     : {_fmt(_t['total_elapsed_sec'])}  ({_t['session_count']} session(s))\")\n",
+    "    print(f\"[TIMING] first started  : {_t.get('first_started_at')}\")\n",
+    "    print(f\"[TIMING] last finished  : {_t.get('last_finished_at')}\")\n",
+    "    print(f\"[TIMING] persisted to   : {_timing_path}\")\n",
     "else:\n",
     "    print(\"[TIMING] run_id.txt missing — could not persist timing.\")\n"
    ],
    "execution_count": null,
    "outputs": [],
+   "id": "cell-stage1"
   },
   {
    "cell_type": "markdown",
     "    try:\n",
     "        from huggingface_hub import hf_hub_download\n",
     "        hf_hub_download(\n",
+    "            repo_id=repo_id, repo_type=\"model\",\n",
+    "            filename=f\"{run_id}/timing.json\",\n",
+    "            token=token, local_dir=str(ckpt_root),\n",
     "        )\n",
     "        print(f\"[TIMING] pulled previous timing.json from HF → {local}\")\n",
+    "    except Exception:\n",
+    "        pass  # first time for this run_id → no remote file yet, fine\n",
     "\n",
     "def _push_timing_to_hf(run_id, ckpt_root, repo_id, token):\n",
     "    # Upload local timing.json to HF Hub under {run_id}/timing.json.\n",
     "    try:\n",
     "        from huggingface_hub import HfApi\n",
     "        HfApi(token=token).upload_file(\n",
+    "            path_or_fileobj=str(local),\n",
+    "            path_in_repo=f\"{run_id}/timing.json\",\n",
+    "            repo_id=repo_id, repo_type=\"model\",\n",
+    "            commit_message=f\"timing.json @ {run_id}\",\n",
     "        )\n",
     "        print(f\"[TIMING] uploaded timing.json to HF → {repo_id}/{run_id}/timing.json\")\n",
     "    except Exception as e:\n",
     "\n",
     "\n",
     "_run_id_file = CKPT_ROOT / \"run_id.txt\"\n",
+    "assert _run_id_file.exists(), \"No run_id.txt — run the train cell at least once.\"\n",
     "_run_id = _run_id_file.read_text().strip()\n",
     "\n",
+    "# Pull latest timing.json from HF in case this is a fresh VM\n",
     "_hf_repo  = getattr(train_cfg.hf_hub, \"repo_id\", None) if train_cfg.hf_hub.enabled else None\n",
     "_hf_token = os.environ.get(\"HF_TOKEN\")\n",
     "_pull_timing_from_hf(_run_id, CKPT_ROOT, _hf_repo, _hf_token)\n",
     "_timing_path = CKPT_ROOT / _run_id / \"timing.json\"\n",
     "assert _timing_path.exists(), (\n",
     "    f\"No timing.json under {_timing_path.parent} (also not on HF). \"\n",
+    "    \"Did the train cell run?\"\n",
     ")\n",
     "\n",
     "_t = _json.loads(_timing_path.read_text())\n",
     "\n",
     "def _fmt(sec):\n",
+    "    h, r = divmod(int(sec or 0), 3600); m, s = divmod(r, 60); return f\"{h:d}h {m:02d}m {s:02d}s\"\n",
     "\n",
     "print(f\"Run                : {_run_id}\")\n",
     "print(f\"First started at   : {_t.get('first_started_at')}\")\n",
     "print(f\"Last finished at   : {_t.get('last_finished_at')}\")\n",
+    "print(f\"Session count      : {_t.get('session_count', 0)}\")\n",
+    "print(f\"TOTAL elapsed      : {_fmt(_t.get('total_elapsed_sec', 0.0))}\")\n",
     "print()\n",
     "print(\"Session history    :\")\n",
     "for _i, _s in enumerate(_t.get(\"session_history\", []), 1):\n",
+    "    print(f\"  {_i:2d}. mode={_s.get('mode','?'):6s}  {_fmt(_s['elapsed_sec'])}   \"\n",
+    "          f\"{_s['started']} → {_s['finished']}\")\n"
    ],
    "outputs": [],
    "execution_count": null
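
Review note: the train cell above accumulates wall-clock time across resumed sessions in `timing.json`. The bookkeeping can be checked in isolation with a minimal sketch like the one below. It assumes the same field names as the notebook cell; `record_session` and `fmt_hms` are hypothetical helper names that do not exist in the repository, and the Hub pull/push step is deliberately left out.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_session(timing_path, mode, started_iso, elapsed_sec):
    """Add one training session to timing.json and return the updated dict.

    Hypothetical helper: mirrors the fields written by the notebook cell,
    but works on a local file only (no Hugging Face Hub sync).
    """
    timing_path = Path(timing_path)
    timing_path.parent.mkdir(parents=True, exist_ok=True)
    if timing_path.exists():
        t = json.loads(timing_path.read_text())
    else:
        t = {
            "total_elapsed_sec": 0.0,
            "session_count": 0,
            "first_started_at": None,
            "last_finished_at": None,
            "session_history": [],
        }
    # The first session fixes the start timestamp; later sessions keep it.
    if t.get("first_started_at") is None:
        t["first_started_at"] = started_iso
    finished_iso = datetime.now(timezone.utc).isoformat(timespec="seconds")
    t["total_elapsed_sec"] = float(t.get("total_elapsed_sec", 0.0)) + elapsed_sec
    t["session_count"] = int(t.get("session_count", 0)) + 1
    t["last_finished_at"] = finished_iso
    t.setdefault("session_history", []).append({
        "mode": mode,
        "started": started_iso,
        "finished": finished_iso,
        "elapsed_sec": elapsed_sec,
    })
    timing_path.write_text(json.dumps(t, indent=2))
    return t


def fmt_hms(sec):
    """Format seconds as 'Hh MMm SSs', treating None as zero."""
    h, r = divmod(int(sec or 0), 3600)
    m, s = divmod(r, 60)
    return f"{h:d}h {m:02d}m {s:02d}s"
```

Calling `record_session` once per training launch with the measured elapsed time keeps `total_elapsed_sec` cumulative, provided the previous `timing.json` was restored (for example from the Hub) before the call on a fresh VM.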

training/train.py CHANGED

@@ -69,11 +69,19 @@ def parse_args():
     )
     parser.add_argument(
         "--stage", type=int, default=None,
-        help="Run only stage 1 or stage 2 (default: run both)"
     )
     parser.add_argument(
         "--resume_from", type=str, default=None,
-        help="Path to checkpoint to resume from"
     )
     parser.add_argument(
         "--run_id", type=str, default=None,
@@ -87,6 +95,134 @@ def parse_args():
     return parser.parse_args()
 def get_trainer(
     model,
     train_dataset,
@@ -469,11 +605,21 @@ def main():
         train_cfg.hf_hub.token_env, os.environ.get("HF_TOKEN")
     ) if train_cfg.hf_hub.enabled else None
     hf_repo_id  = train_cfg.hf_hub.repo_id if train_cfg.hf_hub.enabled else None
     run_id = resolve_run_id(
         dataset_name = spec.dataset_name,
         output_root  = output_root,
         state_file   = state_file,
-        resuming     = bool(args.resume_from) or args.resume_from_hf,
         explicit     = args.run_id,
         hf_repo_id   = hf_repo_id,
         hf_token     = hf_token,
@@ -509,6 +655,35 @@ def main():
     stage2_out = stage_dir(output_root, run_id,
                            str(train_cfg.stage2.get("subdir", "stage2_instruct")))
     # ── Snapshot resolved config into the run dir ────────────────────
     # Every run gets its own self-describing folder so we never have to ask
     # "what config did IU-Xray_run_3 actually use?" — open run_meta.json.
@@ -560,19 +735,41 @@ def main():
         load_checkpoint(model, args.resume_from)
     # Run training stages
-    run_s1 = (args.stage is None or args.stage == 1) and train_cfg.stage1.enabled
-    run_s2 = (args.stage is None or args.stage == 2) and train_cfg.stage2.enabled
     if run_s1:
-        # Only pass resume_from to stage1 if stage1 is the explicit target
-        # (`--stage 1`). If user runs both stages with --resume_from, we assume
-        # the checkpoint is for stage2 and let stage1 run fresh (or finish fast
-        # if projection was already trained).
-        s1_resume = args.resume_from if args.stage == 1 else None
         model = run_stage1(
             model, train_cfg, model_cfg, spec, stage1_out, logger,
             tracker     = tracker,
-            resume_from = s1_resume,
         )
     if run_s2:
@@ -580,15 +777,17 @@ def main():
         # Priority:
         #   1. Just finished stage1 in this run → use stage1_out/stage1_final.pt
         #   2. Not running stage1 but stage1_final.pt exists on disk → load it
-        #   3. Nothing → warn loudly; stage2 starts with random projection.
         stage1_ckpt = Path(stage1_out) / "stage1_final.pt"
         if run_s1:
             load_checkpoint(model, str(stage1_ckpt))
             logger.info(f"Loaded stage1 weights from this run: {stage1_ckpt}")
-        elif stage1_ckpt.exists() and not args.resume_from:
             load_checkpoint(model, str(stage1_ckpt))
             logger.info(f"Auto-loaded existing stage1 weights: {stage1_ckpt}")
-        elif not args.resume_from:
             logger.warning(
                 "⚠ No stage1 weights found and not resuming. Projection layer "
                 "will start RANDOMLY for stage2. Expect degraded convergence. "
@@ -597,7 +796,7 @@ def main():
         model = run_stage2(
             model, train_cfg, model_cfg, spec, stage2_out, logger,
-            resume_from = args.resume_from if not run_s1 else None,
             tracker     = tracker,
         )

     )
     parser.add_argument(
         "--stage", type=int, default=None,
+        help="Run only stage 1 or stage 2 (default: run both). With --mode resume, "
+             "the stage is auto-detected and this flag should be left unset."
+    )
+    parser.add_argument(
+        "--mode", type=str, default=None, choices=["fresh", "resume"],
+        help="Unified resume controller. 'fresh' → new run_N folder. "
+             "'resume' → reuse latest matching run_id (or --run_id), auto-detect "
+             "which stage to continue from based on checkpoints on disk. "
+             "If unset, behaviour is inferred from --resume_from / --run_id (legacy)."
     )
     parser.add_argument(
         "--resume_from", type=str, default=None,
+        help="Path to checkpoint to resume from (legacy; prefer --mode resume)"
     )
     parser.add_argument(
         "--run_id", type=str, default=None,
     return parser.parse_args()
+# ─── Resume-point auto-detection ────────────────────────────────────────────
+def _list_checkpoints(stage_dir):
+    """Return [Path, …] of `checkpoint-NNN` folders sorted ascending by step."""
+    if not stage_dir.is_dir():
+        return []
+    out = []
+    for p in stage_dir.iterdir():
+        if not p.is_dir() or not p.name.startswith("checkpoint-"):
+            continue
+        suffix = p.name.split("-", 1)[1]
+        if suffix.isdigit():
+            out.append((int(suffix), p))
+    return [p for _, p in sorted(out)]
+def detect_resume_point(run_dir_path, stage1_subdir, stage2_subdir):
+    """
+    Inspect the run dir on disk and decide where to pick up training.
+    Returns a tuple `(target_stage, ckpt_path)` where:
+        target_stage  : "stage1" | "stage2" | "done"
+        ckpt_path     : Path to the checkpoint folder to pass to HF Trainer,
+                        or None if the stage should start from scratch.
+    Priority:
+      1. stage 2 final saved   → ("done", None)         everything finished
+      2. stage 2 has ckpts     → ("stage2", latest)     resume mid-stage2
+      3. stage 1 final saved   → ("stage2", None)       stage 1 done; start stage 2
+      4. stage 1 has ckpts     → ("stage1", latest)     resume mid-stage1
+      5. otherwise             → ("stage1", None)       brand-new run
+    """
+    from pathlib import Path as _P
+    run_dir_path = _P(run_dir_path)
+    s1d = run_dir_path / stage1_subdir
+    s2d = run_dir_path / stage2_subdir
+    if (s2d / "stage2_final_projection.pt").exists():
+        return ("done", None)
+    s2_ckpts = _list_checkpoints(s2d)
+    if s2_ckpts:
+        return ("stage2", s2_ckpts[-1])
+    if (s1d / "stage1_final_projection.pt").exists():
+        return ("stage2", None)
+    s1_ckpts = _list_checkpoints(s1d)
+    if s1_ckpts:
+        return ("stage1", s1_ckpts[-1])
+    return ("stage1", None)
+def compute_training_plan(train_cfg, instruct_json_path):
+    """
+    Compute a coarse plan of total optimizer steps across stage 1 + stage 2,
+    derived from the train_config + the train-split sample count in the
+    instruct JSON. Used to print a human-readable summary at startup.
+    Returns a dict (all ints) — gracefully handles missing fields.
+    """
+    import json as _json
+    tr = train_cfg.training
+    try:
+        with open(instruct_json_path, "r", encoding="utf-8") as f:
+            all_samples = _json.load(f)
+        train_count = sum(1 for s in all_samples if s.get("split") == "train")
+    except Exception:
+        train_count = 0
+    bs = int(getattr(tr, "per_device_train_batch_size", 1))
+    ga = int(getattr(tr, "gradient_accumulation_steps", 1))
+    eff = max(1, bs * ga)
+    steps_per_epoch = max(1, (train_count + eff - 1) // eff)
+    s1_enabled = bool(getattr(train_cfg.stage1, "enabled", True))
+    s2_enabled = bool(getattr(train_cfg.stage2, "enabled", True))
+    s1_epochs  = int(getattr(train_cfg.stage1, "num_epochs", 0)) if s1_enabled else 0
+    s2_epochs  = int(getattr(train_cfg.stage2, "num_epochs", 0)) if s2_enabled else 0
+    s1_steps = steps_per_epoch * s1_epochs
+    s2_steps = steps_per_epoch * s2_epochs
+    return {
+        "train_samples":   train_count,
+        "effective_batch": eff,
+        "steps_per_epoch": steps_per_epoch,
+        "stage1_steps":    s1_steps,
+        "stage2_steps":    s2_steps,
+        "total_steps":     s1_steps + s2_steps,
+        "stage1_epochs":   s1_epochs,
+        "stage2_epochs":   s2_epochs,
+    }
+def _fmt_plan_banner(plan, run_id, target_stage, resume_ckpt):
+    s1, s2, tot = plan["stage1_steps"], plan["stage2_steps"], plan["total_steps"]
+    head = f"TRAINING PLAN — {run_id}"
+    sep  = "=" * max(len(head) + 4, 60)
+    cur  = ""
+    if target_stage == "stage1":
+        offset = 0
+        if resume_ckpt and str(resume_ckpt).split("-")[-1].isdigit():
+            offset = int(str(resume_ckpt).split("-")[-1])
+        cur = f"{'Resuming' if offset else 'Starting'} at step {offset} / {tot}  (inside stage 1)"
+    elif target_stage == "stage2":
+        offset = s1
+        if resume_ckpt and str(resume_ckpt).split("-")[-1].isdigit():
+            offset = s1 + int(str(resume_ckpt).split("-")[-1])
+        cur = f"Resuming at step {offset} / {tot}  (inside stage 2)"
+    elif target_stage == "done":
+        cur = f"All {tot} steps already complete — nothing to do"
+    lines = [
+        sep, f"  {head}", sep,
+        f"  Train samples       : {plan['train_samples']:,}",
+        f"  Effective batch     : {plan['effective_batch']}",
+        f"  Steps / epoch       : {plan['steps_per_epoch']}",
+        f"  Stage 1             : {plan['stage1_epochs']} epochs → {s1} steps  (global steps 1–{s1})",
+        f"  Stage 2             : {plan['stage2_epochs']} epochs → {s2} steps  (global steps {s1+1}–{tot})",
+        f"  TOTAL               : {tot} optimizer steps",
+    ]
+    if cur:
+        lines += ["  " + "─" * (len(sep) - 4), f"  {cur}"]
+    lines.append(sep)
+    return "\n".join(lines)
 def get_trainer(
     model,
     train_dataset,
         train_cfg.hf_hub.token_env, os.environ.get("HF_TOKEN")
     ) if train_cfg.hf_hub.enabled else None
     hf_repo_id  = train_cfg.hf_hub.repo_id if train_cfg.hf_hub.enabled else None
+    # Unified --mode controller. Falls back to the legacy inference (any of
+    # --resume_from / --resume_from_hf set ⇒ resuming) when --mode is unset.
+    if args.mode == "resume":
+        resuming = True
+    elif args.mode == "fresh":
+        resuming = False
+    else:
+        resuming = bool(args.resume_from) or args.resume_from_hf
     run_id = resolve_run_id(
         dataset_name = spec.dataset_name,
         output_root  = output_root,
         state_file   = state_file,
+        resuming     = resuming,
         explicit     = args.run_id,
         hf_repo_id   = hf_repo_id,
         hf_token     = hf_token,
     stage2_out = stage_dir(output_root, run_id,
                            str(train_cfg.stage2.get("subdir", "stage2_instruct")))
+    # ── Auto-detect where to resume from (when --mode resume) ─────────
+    # Examines disk state inside {output_root}/{run_id}/ and chooses:
+    #   • stage1 from scratch / stage1 mid-checkpoint
+    #   • stage2 from scratch (stage1 done) / stage2 mid-checkpoint
+    #   • done (both stages finished — skip everything)
+    # If the user passed --stage explicitly, that wins over auto-detect.
+    auto_target_stage = None
+    auto_resume_ckpt  = None
+    if args.mode == "resume" and args.stage is None:
+        auto_target_stage, auto_resume_ckpt = detect_resume_point(
+            run_dir(output_root, run_id),
+            str(train_cfg.stage1.get("subdir", "stage1_projection")),
+            str(train_cfg.stage2.get("subdir", "stage2_instruct")),
+        )
+        logger.info(
+            f"[resume autodetect] target={auto_target_stage} "
+            f"ckpt={auto_resume_ckpt}"
+        )
+    # ── Pretty plan banner (total steps across both stages) ───────────
+    plan = compute_training_plan(train_cfg, spec.instruct_json)
+    logger.info("\n" + _fmt_plan_banner(plan, run_id,
+                                        auto_target_stage or "stage1",
+                                        auto_resume_ckpt))
+    if auto_target_stage == "done":
+        logger.info("Both stages already complete for this run. Exiting cleanly.")
+        return
     # ── Snapshot resolved config into the run dir ────────────────────
     # Every run gets its own self-describing folder so we never have to ask
     # "what config did IU-Xray_run_3 actually use?" — open run_meta.json.
         load_checkpoint(model, args.resume_from)
     # Run training stages
+    #
+    # Stage selection priority:
+    #   1. Explicit --stage from CLI wins.
+    #   2. --mode resume + auto-detect: skip stage1 when its final ckpt exists,
+    #      resume stage1/stage2 from `auto_resume_ckpt` as detected above.
+    #   3. Otherwise: enabled flags from train_cfg drive it (legacy: run both).
+    if args.stage is not None:
+        run_s1 = (args.stage == 1) and train_cfg.stage1.enabled
+        run_s2 = (args.stage == 2) and train_cfg.stage2.enabled
+    elif auto_target_stage == "stage2":
+        # Stage 1 finished previously — skip it entirely.
+        run_s1 = False
+        run_s2 = train_cfg.stage2.enabled
+    else:
+        run_s1 = train_cfg.stage1.enabled
+        run_s2 = train_cfg.stage2.enabled
+    # Decide the resume checkpoint each stage should use.
+    # Manual --resume_from still wins when --stage is given explicitly.
+    s1_resume_path = None
+    s2_resume_path = None
+    if args.stage == 1:
+        s1_resume_path = args.resume_from
+    elif args.stage == 2:
+        s2_resume_path = args.resume_from
+    elif auto_target_stage == "stage1":
+        s1_resume_path = str(auto_resume_ckpt) if auto_resume_ckpt else None
+    elif auto_target_stage == "stage2":
+        s2_resume_path = str(auto_resume_ckpt) if auto_resume_ckpt else None
     if run_s1:
         model = run_stage1(
             model, train_cfg, model_cfg, spec, stage1_out, logger,
             tracker     = tracker,
+            resume_from = s1_resume_path,
         )
     if run_s2:
         # Priority:
         #   1. Just finished stage1 in this run → use stage1_out/stage1_final.pt
         #   2. Not running stage1 but stage1_final.pt exists on disk → load it
+        #   3. s2_resume_path set (we're mid-stage2) → Trainer will reload from
+        #      the checkpoint itself; no need to seed stage1 weights here.
+        #   4. Nothing → warn loudly; stage2 starts with random projection.
         stage1_ckpt = Path(stage1_out) / "stage1_final.pt"
         if run_s1:
             load_checkpoint(model, str(stage1_ckpt))
             logger.info(f"Loaded stage1 weights from this run: {stage1_ckpt}")
+        elif stage1_ckpt.exists() and not s2_resume_path:
             load_checkpoint(model, str(stage1_ckpt))
             logger.info(f"Auto-loaded existing stage1 weights: {stage1_ckpt}")
+        elif not s2_resume_path:
             logger.warning(
                 "⚠ No stage1 weights found and not resuming. Projection layer "
                 "will start RANDOMLY for stage2. Expect degraded convergence. "
         model = run_stage2(
             model, train_cfg, model_cfg, spec, stage2_out, logger,
+            resume_from = s2_resume_path,
             tracker     = tracker,
         )
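
Review note: the priority order documented in `detect_resume_point` can be exercised without a GPU by building empty run directories. The sketch below is an illustrative re-implementation, not an import from `training.train`. It assumes the default subdir names `stage1_projection` / `stage2_instruct` and the marker filenames `stage1_final_projection.pt` / `stage2_final_projection.pt` that appear in the diff; the function names `latest_checkpoint` and `resume_point` are hypothetical.

```python
from pathlib import Path


def latest_checkpoint(stage_dir):
    """Return the checkpoint-<step> folder with the highest numeric step, or None."""
    stage_dir = Path(stage_dir)
    if not stage_dir.is_dir():
        return None
    found = []
    for p in stage_dir.iterdir():
        prefix, _, suffix = p.name.partition("-")
        if p.is_dir() and prefix == "checkpoint" and suffix.isdigit():
            found.append((int(suffix), p))
    if not found:
        return None
    # Compare by integer step so checkpoint-200 beats checkpoint-50.
    return max(found, key=lambda item: item[0])[1]


def resume_point(run_dir, s1="stage1_projection", s2="stage2_instruct"):
    """Return (target_stage, ckpt_path) using the priority order from the diff."""
    run_dir = Path(run_dir)
    s1d, s2d = run_dir / s1, run_dir / s2
    if (s2d / "stage2_final_projection.pt").exists():
        return ("done", None)        # 1. everything finished
    ckpt = latest_checkpoint(s2d)
    if ckpt is not None:
        return ("stage2", ckpt)      # 2. resume mid-stage 2
    if (s1d / "stage1_final_projection.pt").exists():
        return ("stage2", None)      # 3. stage 1 done, start stage 2
    ckpt = latest_checkpoint(s1d)
    if ckpt is not None:
        return ("stage1", ckpt)      # 4. resume mid-stage 1
    return ("stage1", None)          # 5. brand-new run
```

Creating the marker files and checkpoint folders one at a time in a temporary directory and calling `resume_point` after each step walks through all five cases in reverse priority order.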