hieu3636/cxr-vlm-code

convitom committed 8 days ago

Commit

35d4872

1 Parent(s): 320063f

f

Files changed (3)

model/cxr_vlm.py +15 -0
opti.py +4 -0
scripts/resize_and_shard.ipynb +373 -20

model/cxr_vlm.py CHANGED

@@ -554,3 +554,18 @@ class CXRVisionLanguageModel(nn.Module):
         trainable = sum(p.numel() for p in self.parameters() if p.requires_grad)
         print(f"Trainable params: {trainable:,} / {total:,} "
               f"({100 * trainable / total:.2f}%)")
+        # Tensor-count breakdown matching HF Trainer's optimizer param_groups
+        # (group 0 = weight-decay params, group 1 = biases + LayerNorm).
+        # Useful for diagnosing "optimizer state dict size mismatch" on resume.
+        try:
+            from transformers.trainer_pt_utils import get_parameter_names
+            from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
+            decay = set(get_parameter_names(self, ALL_LAYERNORM_LAYERS))
+            decay = {n for n in decay if "bias" not in n}
+            g0 = [n for n, p in self.named_parameters() if n in decay and p.requires_grad]
+            g1 = [n for n, p in self.named_parameters() if n not in decay and p.requires_grad]
+            print(f"  optimizer group 0 (decay):    {len(g0)} tensors")
+            print(f"  optimizer group 1 (no decay): {len(g1)} tensors")
+        except Exception as e:
+            print(f"  [param-group breakdown skipped: {e}]")
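
Reviewer sketch: the grouping rule added above can be exercised in isolation, without torch or transformers. This is a minimal illustration under stated assumptions, not the repository's code: `split_param_groups` and the sample parameter names are hypothetical, and `decay_candidates` stands in for whatever `get_parameter_names(self, ALL_LAYERNORM_LAYERS)` would return for the real model.

```python
def split_param_groups(named_params, decay_candidates):
    """Split trainable parameter names into (decay, no_decay) lists.

    named_params: iterable of (name, requires_grad) pairs.
    decay_candidates: names that do not live in LayerNorm modules
        (hypothetical stand-in for get_parameter_names output).
    """
    # Same rule as the diff: drop anything whose name contains "bias".
    decay = {n for n in decay_candidates if "bias" not in n}
    g0 = [n for n, trainable in named_params if trainable and n in decay]
    g1 = [n for n, trainable in named_params if trainable and n not in decay]
    return g0, g1


# Hypothetical parameter list: one projection layer, one norm, one frozen weight.
params = [
    ("proj.weight", True),
    ("proj.bias", True),
    ("norm.weight", True),
    ("backbone.weight", False),  # frozen, so it lands in neither group
]
candidates = ["proj.weight", "proj.bias", "backbone.weight"]

g0, g1 = split_param_groups(params, candidates)
print(f"group 0 (decay):    {g0}")
print(f"group 1 (no decay): {g1}")
```

Counting names rather than elements is the point here: an optimizer state dict stores one entry per tensor, so a mismatch on resume shows up in these list lengths, not in the parameter totals printed earlier.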

opti.py ADDED

@@ -0,0 +1,4 @@
+import torch
+ckpt = torch.load(r"C:\Users\admin\Downloads\optimizer.pt", map_location="cpu", weights_only=False)
+for i, g in enumerate(ckpt["param_groups"]):
+    print(f"saved group {i}: {len(g['params'])} params")
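
Reviewer sketch: the saved group sizes printed by this script are meant to be compared with the per-group tensor counts that `model/cxr_vlm.py` now prints. A small helper could make that comparison explicit. This is an assumption about how the two outputs are used together; `compare_group_sizes` and the sample numbers are hypothetical, and the input only mimics the `param_groups` layout of an optimizer state dict.

```python
def compare_group_sizes(saved_groups, live_sizes):
    """Return (index, saved_count, live_count) for every group that differs.

    saved_groups: list of dicts with a "params" list, shaped like
        ckpt["param_groups"] in a saved optimizer state dict.
    live_sizes: tensor counts per group from the freshly built model.
    A count of None means the group is missing on that side.
    """
    mismatches = []
    for i in range(max(len(saved_groups), len(live_sizes))):
        saved = len(saved_groups[i]["params"]) if i < len(saved_groups) else None
        live = live_sizes[i] if i < len(live_sizes) else None
        if saved != live:
            mismatches.append((i, saved, live))
    return mismatches


# Hypothetical checkpoint: 120 decay tensors, 80 no-decay tensors.
saved = [{"params": list(range(120))}, {"params": list(range(120, 200))}]
mismatches = compare_group_sizes(saved, [120, 84])
for idx, n_saved, n_live in mismatches:
    print(f"group {idx}: saved {n_saved} tensors, model has {n_live}")
```

An empty result suggests the group layout matches; a non-empty one points at the group whose trainable set changed between the checkpoint and the current configuration.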

scripts/resize_and_shard.ipynb CHANGED

@@ -4,97 +4,450 @@
    "cell_type": "markdown",
    "id": "c00",
    "metadata": {},
-   "source": "# CXR-VLM -- Resize + tar-shard dataset (one-time, offline)\n\nShrinks the original MIMIC-CXR tree (~2-3 MP/image) to RAD-DINO's working\nresolution and packs it into a few tar shards, so cloud training boxes\n(Vast.ai / Lightning.ai / Colab) pull a small, transfer-friendly dataset\ninstead of ~100 GB of huge JPGs read every epoch.\n\n**Source / destination (HF dataset repo `hieu3636/cxr-vlm-data`):**\n- read  : `MIMIC-CXR_processed/shards/*.tar`  (already tar-sharded source)\n- write : `MIMIC-CXR_resized/shards/cxr-NNNN.tar`  (+ `_manifest.json`, `SHARDS.txt`)\n\n**Flow: streaming per shard** -- for each source shard: download one tar ->\nextract -> resize/copy contents into the cumulative `resized/` tree ->\ndelete the tar + extract scratch. Peak disk usage ~10 GB instead of\n~200 GB if you downloaded + extracted everything first.\n\n**Why HF and not Google Drive:** notebook is meant to run on arbitrary\ncloud GPUs. Drive only mounts conveniently on Colab and rate-limits badly\non bulk many-file reads. HF works everywhere with just a token,\n`hf_hub_download` is parallel + resumable, and the data already lives\nthere. Both download and upload go through HF.\n\n**This step does NOT change what the model sees** -- RAD-DINO's processor\nresizes the shortest edge to 518 and center-crops 518x518 anyway; we just\ndo that downscale once, offline, instead of every epoch on full-res\nimages. Reports (`.txt`), CheXpert (`.csv`), and any other non-image\nfiles are copied verbatim so the resized release is a faithful mirror.\n\n**Prerequisite:** an `HF_TOKEN` with **write** access to the repo."
   },
   {
    "cell_type": "markdown",
    "id": "c01",
    "metadata": {},
-   "source": "## 0. Config -- edit here"
   },
   {
    "cell_type": "code",
-   "id": "c02",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "import os\nfrom pathlib import Path\n\nREPO_ID    = \"hieu3636/cxr-vlm-data\"\nREPO_TYPE  = \"dataset\"\nSRC_SUBDIR = \"MIMIC-CXR_processed\"   # source folder on HF (tar-sharded under shards/)\nDST_SUBDIR = \"MIMIC-CXR_resized\"     # output folder on HF (will hold shards/ too)\n\n# Big scratch disk on the VM (Vast/Lightning: /workspace, Colab: /content).\nWORK_DIR = Path(os.environ.get(\"WORK_DIR\", \"/workspace/cxr_resize\"))\n\n# --- resize params -------------------------------------------------------\nTARGET   = 518     # shortest-edge target. MUST be >= 518 (RAD-DINO crops 518).\nSQUARE   = False   # False: keep aspect (518xN), flexible, processor crops at\n                   #        train time. ~20% bigger.\n                   # True : also center-crop to 518x518 here -> file is exactly\n                   #        518x518 and the processor is a true no-op. Smaller,\n                   #        but BAKES the crop (changing backbone/img_size later\n                   #        needs a full rebuild). 
Recommended off for a thesis.\nQUALITY  = 90      # JPEG quality (q90 + 4:4:4 = near-lossless for CXR)\nSHARD_GB = 2.0     # approx GB per tar shard (output)\nWORKERS  = min(32, (os.cpu_count() or 8) * 4)  # I/O-bound; PIL frees the GIL\n\n# Derived local paths -- streaming flow keeps disk small:\n#   one tar at a time in DL_DIR, one tar extracted in SCRATCH, resized\n#   tree accumulates in DST_TREE.\nDL_DIR     = WORK_DIR / \"_dl\"        # per-shard download buffer (~2 GB at a time)\nSCRATCH    = WORK_DIR / \"_extract\"   # per-shard extraction scratch (~2 GB at a time)\nDST_TREE   = WORK_DIR / \"resized\"    # cumulative resized tree (~5-8 GB final)\nSHARDS_DIR = WORK_DIR / \"shards\"     # output tar shards (~5-8 GB final)\nfor p in (WORK_DIR, DL_DIR, SCRATCH, DST_TREE, SHARDS_DIR):\n    p.mkdir(parents=True, exist_ok=True)\n\nassert TARGET >= 518, \"TARGET must be >= 518 (RAD-DINO upscales shortest edge to 518)\"\nprint(\"WORK_DIR:\", WORK_DIR, \"| TARGET:\", TARGET, \"| SQUARE:\", SQUARE, \"| WORKERS:\", WORKERS)"
   },
   {
    "cell_type": "markdown",
    "id": "c03",
    "metadata": {},
-   "source": "## 1. Setup -- deps + HF token\n\nToken resolution: env `HF_TOKEN` -> Colab `userdata` -> Kaggle secret."
   },
   {
    "cell_type": "code",
-   "id": "c04",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "import os, sys, subprocess\nsubprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\",\n                \"huggingface_hub>=0.24,<0.27\", \"Pillow>=10\", \"tqdm\"], check=True)\n\n# Be tolerant of slow/flaky chunks on cloud networks (default is ~10s).\nos.environ.setdefault(\"HF_HUB_DOWNLOAD_TIMEOUT\", \"60\")\n# Optional: faster + more robust large-file downloads via the Rust backend.\n# Set to \"1\" and `pip install hf_transfer` if you keep hitting broken\n# connections; leave off if it causes trouble on your network.\n# os.environ[\"HF_HUB_ENABLE_HF_TRANSFER\"] = \"1\"\n\nif not os.environ.get(\"HF_TOKEN\"):\n    try:\n        from google.colab import userdata\n        os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")\n    except Exception:\n        try:\n            from kaggle_secrets import UserSecretsClient\n            os.environ[\"HF_TOKEN\"] = UserSecretsClient().get_secret(\"HF_TOKEN\")\n        except Exception:\n            pass\n\nHF_TOKEN = os.environ.get(\"HF_TOKEN\")\nassert HF_TOKEN, \"HF_TOKEN missing -- set it via env var or platform secrets (needs WRITE access).\"\nprint(\"HF_TOKEN loaded OK\")"
   },
   {
    "cell_type": "markdown",
    "id": "c05",
    "metadata": {},
-   "source": "## 2. Resize + pack logic (inlined, mirrors `scripts/build_resized_dataset.py`)\n\nUses a thread pool (not processes): PIL releases the GIL during\ndecode/resize/encode, so threads parallelise well and avoid notebook\nmultiprocessing pickling issues.\n\nSplit into two layers so the streaming orchestrator below can call the\nworker repeatedly (once per source shard) and accumulate counts before\nwriting a single final manifest:\n\n- `_walk_and_process(src, dst, ...)` -- walk one tree, resize images,\n  copy non-images, return `(counts, errors, n_img, n_other)`. No I/O\n  beyond reading src and writing dst; no manifest.\n- `resize_tree(...)` -- thin wrapper for standalone use (one src ->\n  one dst -> manifest). Used by the script CLI.\n- `_write_manifest(...)` -- shared manifest writer.\n- `pack_shards(...)` -- bundle the final tree into ~2 GB tar shards."
   },
   {
    "cell_type": "code",
-   "id": "c06",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "import os, json, shutil, tarfile, time\nfrom pathlib import Path\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\nfrom PIL import Image\nfrom tqdm.auto import tqdm\n\nImage.MAX_IMAGE_PIXELS = None          # don't abort on large medical images\nIMG_EXTS = (\".jpg\", \".jpeg\", \".png\")\n\n\ndef _resize_one(src_path, dst_path, target, quality, square):\n    \"\"\"Returns one of: resized | squared | copied | skipped | error:<msg>.\"\"\"\n    try:\n        dst_path = Path(dst_path)\n        if dst_path.exists() and dst_path.stat().st_size > 0:\n            return \"skipped\"                       # resumable\n        dst_path.parent.mkdir(parents=True, exist_ok=True)\n        with Image.open(src_path) as im:\n            w, h = im.size\n            shorter = min(w, h)\n            # Non-square: if shorter side already <= target, downscaling would\n            # push it below 518 -> copy verbatim (lossless, never worsens a\n            # low-res source). Square mode must always emit exactly target^2.\n            if not square and shorter <= target:\n                shutil.copy2(src_path, dst_path)\n                return \"copied\"\n            if im.mode not in (\"L\", \"RGB\"):\n                im = im.convert(\"RGB\")\n            # shorter axis EXACTLY = target; longer scales proportionally\n            if w <= h:\n                new_size = (target, round(h * target / w))\n            else:\n                new_size = (round(w * target / h), target)\n            # square mode reproduces the processor exactly -> bicubic\n            im = im.resize(new_size, Image.BICUBIC if square else Image.LANCZOS)\n            if square:\n                W, H = im.size\n                left, top = (W - target) // 2, (H - target) // 2\n                im = im.crop((left, top, left + target, top + target))\n            im.save(dst_path, \"JPEG\", quality=quality, optimize=True, subsampling=0)\n        return \"squared\" if square else 
\"resized\"\n    except Exception as e:\n        return f\"error:{type(e).__name__}: {e}\"\n\n\ndef _copy_one(src_path, dst_path):\n    \"\"\"Copy non-image files (reports .txt, chexpert .csv, metadata .json, ...)\n    verbatim so the shipped tree mirrors the source exactly.\"\"\"\n    try:\n        dst_path = Path(dst_path)\n        if dst_path.exists() and dst_path.stat().st_size > 0:\n            return \"skipped\"\n        dst_path.parent.mkdir(parents=True, exist_ok=True)\n        shutil.copy2(src_path, dst_path)\n        return \"copied_other\"\n    except Exception as e:\n        return f\"error:{type(e).__name__}: {e}\"\n\n\ndef _walk_and_process(src: Path, dst: Path, target, quality, workers, square):\n    \"\"\"Walk one src tree -> write resized/copied files into dst tree.\n    Returns (counts, errors, n_img, n_other). Does NOT write manifest.\"\"\"\n    img_jobs, other_jobs = [], []\n    for root, _, files in os.walk(src):\n        for fn in files:\n            sp = Path(root) / fn\n            rel = sp.relative_to(src)\n            dp = dst / rel\n            if fn.lower().endswith(IMG_EXTS):\n                img_jobs.append((str(sp), str(dp)))\n            else:\n                other_jobs.append((str(sp), str(dp)))\n    counts = {\"resized\": 0, \"squared\": 0, \"copied\": 0,\n              \"copied_other\": 0, \"skipped\": 0, \"error\": 0}\n    errors = []\n    with ThreadPoolExecutor(max_workers=workers) as ex:\n        futs = {}\n        for s, d in img_jobs:\n            futs[ex.submit(_resize_one, s, d, target, quality, square)] = d\n        for s, d in other_jobs:\n            futs[ex.submit(_copy_one, s, d)] = d\n        for f in as_completed(futs):\n            st = f.result()\n            if st.startswith(\"error:\"):\n                counts[\"error\"] += 1\n                errors.append(f\"{futs[f]}\\t{st}\")\n            else:\n                counts[st] += 1\n    return counts, errors, len(img_jobs), len(other_jobs)\n\n\ndef 
_write_manifest(dst: Path, *, src, target, quality, square,\n                    counts, errors, n_img, n_oth):\n    dst.mkdir(parents=True, exist_ok=True)\n    out_bytes = sum(p.stat().st_size for p in dst.rglob(\"*\") if p.is_file())\n    total = n_img + n_oth\n    (dst / \"_manifest.json\").write_text(json.dumps({\n        \"source\": src, \"target\": target,\n        \"mode\": \"square\" if square else \"shortest_edge\",\n        \"jpeg_quality\": quality, \"subsampling\": \"4:4:4\",\n        \"resampling\": \"BICUBIC\" if square else \"LANCZOS\",\n        \"counts\": counts, \"total\": total,\n        \"images\": n_img, \"non_image\": n_oth,\n        \"output_bytes\": out_bytes,\n        \"built_at\": time.strftime(\"%Y-%m-%dT%H:%M:%S\"),\n    }, indent=2), encoding=\"utf-8\")\n    if errors:\n        (dst / \"_errors.txt\").write_text(\"\\n\".join(errors), encoding=\"utf-8\")\n        print(f\"WARNING: {len(errors)} failures -> {dst/'_errors.txt'}\")\n    print(f\"done: {counts}\")\n    print(f\"output size: {out_bytes/1024**3:.2f} GB \"\n          f\"({out_bytes/max(1,n_img)/1024:.0f} KB/image avg)\")\n\n\ndef resize_tree(src: Path, dst: Path, target, quality, workers, square):\n    \"\"\"Standalone API: one src tree -> resized dst + manifest. 
(Not used by\n    the streaming flow below; kept for parity with the script CLI.)\"\"\"\n    print(f\"[resize] scanning {src} ...\")\n    counts, errors, n_img, n_oth = _walk_and_process(\n        src, dst, target, quality, workers, square)\n    mode = f\"square {target}x{target}\" if square else f\"shortest-edge {target}px\"\n    print(f\"[resize] {n_img:,} images + {n_oth:,} non-image -> {dst}  \"\n          f\"({mode}, q{quality}, {workers} threads)\")\n    _write_manifest(dst, src=str(src), target=target, quality=quality,\n                    square=square, counts=counts, errors=errors,\n                    n_img=n_img, n_oth=n_oth)\n\n\ndef pack_shards(dst: Path, shards_dir: Path, shard_gb, prefix=\"cxr\"):\n    shard_bytes = int(shard_gb * 1024**3)\n    shards_dir.mkdir(parents=True, exist_ok=True)\n    files = sorted(p for p in dst.rglob(\"*\")\n                   if p.is_file() and p.name not in (\"_manifest.json\", \"_errors.txt\"))\n    if not files:\n        raise SystemExit(f\"ERROR: nothing to pack under {dst}\")\n    print(f\"[pack] {len(files):,} files -> tar shards (~{shard_gb} GB each)\")\n    written, idx, cur = [], 0, 0\n\n    def _open(i):\n        path = shards_dir / f\"{prefix}-{i:04d}.tar\"\n        written.append(path)\n        return tarfile.open(path, \"w\")\n\n    tar = _open(0)\n    for fp in tqdm(files, unit=\"file\"):\n        if cur >= shard_bytes:\n            tar.close(); idx += 1; tar = _open(idx); cur = 0\n        tar.add(fp, arcname=str(fp.relative_to(dst)))   # rel path -> tree rebuilt on extract\n        cur += fp.stat().st_size\n    tar.close()\n    man = dst / \"_manifest.json\"\n    if man.exists():\n        shutil.copy2(man, shards_dir / \"_manifest.json\")\n    (shards_dir / \"SHARDS.txt\").write_text(\"\\n\".join(p.name for p in written), encoding=\"utf-8\")\n    print(f\"[pack] wrote {len(written)} shards -> {shards_dir}\")\n    return written\n\nprint(\"functions ready\")"
   },
   {
    "cell_type": "markdown",
    "id": "c07",
    "metadata": {},
-   "source": "## 3. Stream source shards: download -> extract -> resize -> cleanup (per shard)\n\nLoops over every `MIMIC-CXR_processed/shards/*.tar` on HF and processes\nthem one at a time. For each shard: pull it down, extract into a scratch\ndir, run `_walk_and_process` to resize images + copy reports into the\ncumulative `DST_TREE`, then delete the tar and the scratch. Peak disk\nstays around ~10 GB regardless of total source size, and the run is\nfully resumable -- already-resized files are skipped.\n\nTar arcname layout auto-detected on the first shard: handles both\n`files/p10/...` (rooted at content) and `MIMIC-CXR_processed/files/...`\n(rooted at parent).\""
   },
   {
    "cell_type": "code",
    "id": "c08",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "import os, time\nfrom huggingface_hub import HfApi, hf_hub_download\n\napi = HfApi(token=HF_TOKEN)\nall_files = api.list_repo_files(REPO_ID, repo_type=REPO_TYPE)\nsrc_shards = sorted(f for f in all_files\n                    if f.startswith(f\"{SRC_SUBDIR}/shards/\") and f.endswith(\".tar\"))\nif not src_shards:\n    raise SystemExit(\n        f\"ERROR: no .tar shards under {SRC_SUBDIR}/shards/ on {REPO_ID}. \"\n        f\"Check the path / your HF token has read access.\")\nprint(f\"found {len(src_shards)} source shards in {SRC_SUBDIR}/shards/\")\n\n# Per-shard 'done' markers, kept OUTSIDE DST_TREE so they're never packed.\n# A shard is marked done only after full success, so a resumed run skips\n# finished shards WITHOUT re-downloading them (~100 GB saved on a retry).\nDONE_DIR = WORK_DIR / \"_done_shards\"\nDONE_DIR.mkdir(parents=True, exist_ok=True)\n\n\ndef _download_retry(filename, retries=6, base_delay=5):\n    \"\"\"hf_hub_download resumes partial downloads, so retrying after a dropped\n    connection (ChunkedEncodingError / IncompleteRead) continues from where it\n    broke rather than restarting. Linear backoff.\"\"\"\n    for attempt in range(1, retries + 1):\n        try:\n            return hf_hub_download(\n                repo_id=REPO_ID, repo_type=REPO_TYPE, filename=filename,\n                local_dir=str(DL_DIR), token=HF_TOKEN)\n        except Exception as e:\n            if attempt == retries:\n                raise\n            wait = base_delay * attempt\n            print(f\"  [retry {attempt}/{retries}] {filename}: \"\n                  f\"{type(e).__name__}: {e} -> waiting {wait}s\")\n            time.sleep(wait)\n\n\ndef _detect_content_root(extracted):\n    \"\"\"Return the dir under `extracted` that holds `files/`. Handles arcnames\n    rooted at 'files/...' (=> extracted itself) or\n    '{SRC_SUBDIR}/files/...' 
(=> extracted/{SRC_SUBDIR}).\"\"\"\n    if (extracted / \"files\").is_dir():\n        return extracted\n    cand = extracted / SRC_SUBDIR\n    if (cand / \"files\").is_dir():\n        return cand\n    for p in extracted.rglob(\"files\"):\n        if p.is_dir():\n            return p.parent\n    return extracted   # last resort -- process whatever's there\n\n\ncum = {\"resized\": 0, \"squared\": 0, \"copied\": 0,\n       \"copied_other\": 0, \"skipped\": 0, \"error\": 0}\nall_errors = []\ncontent_offset = None\n\nfor shard in tqdm(src_shards, unit=\"shard\", desc=\"shards\"):\n    marker = DONE_DIR / (Path(shard).name + \".done\")\n    if marker.exists():\n        continue                                   # done in a previous run -> skip\n    # 1. Download this single shard (auto-retry + resume on broken connection)\n    local_tar = _download_retry(shard)\n    # 2. Fresh scratch + extract\n    if SCRATCH.exists():\n        shutil.rmtree(SCRATCH)\n    SCRATCH.mkdir(parents=True, exist_ok=True)\n    with tarfile.open(local_tar) as tf:\n        tf.extractall(SCRATCH)\n    # 3. Free the tar bytes\n    os.remove(local_tar)\n    # 4. Locate content root once (assume consistent across shards)\n    content_root = _detect_content_root(SCRATCH)\n    if content_offset is None:\n        content_offset = str(content_root.relative_to(SCRATCH)) or \"<top>\"\n        print(f\"[stream] tar content root: '{content_offset}/' \"\n              f\"(arcnames rooted at {content_offset})\")\n    # 5. Resize + copy this shard's tree into the cumulative DST_TREE\n    counts, errors, n_img, n_oth = _walk_and_process(\n        content_root, DST_TREE, TARGET, QUALITY, WORKERS, SQUARE)\n    for k, v in counts.items():\n        cum[k] += v\n    all_errors.extend(errors)\n    # 6. Free scratch + mark this shard done (only after full success)\n    shutil.rmtree(SCRATCH)\n    marker.write_text(\"ok\")\n\n# 7. 
Final manifest -- recount from the actual tree so totals are correct\n#    even across resumed runs (cum reflects only THIS run's work).\nfinal_img = sum(1 for p in DST_TREE.rglob(\"*\")\n                if p.is_file() and p.suffix.lower() in IMG_EXTS)\nfinal_oth = sum(1 for p in DST_TREE.rglob(\"*\")\n                if p.is_file() and p.suffix.lower() not in IMG_EXTS\n                and p.name not in (\"_manifest.json\", \"_errors.txt\"))\n_write_manifest(\n    DST_TREE,\n    src=f\"{REPO_ID}:{SRC_SUBDIR}/shards ({len(src_shards)} shards)\",\n    target=TARGET, quality=QUALITY, square=SQUARE,\n    counts=cum, errors=all_errors, n_img=final_img, n_oth=final_oth,\n)\nprint(f\"\\nresized tree -> {DST_TREE}  ({final_img:,} images, {final_oth:,} non-image)\")"
   },
   {
    "cell_type": "markdown",
    "id": "c13",
    "metadata": {},
-   "source": "## 4. Pack the resized tree into tar shards (output)"
   },
   {
    "cell_type": "code",
-   "id": "c14",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "shards = pack_shards(DST_TREE, SHARDS_DIR, SHARD_GB)\nprint(\"\\n\".join(p.name for p in shards))"
   },
   {
    "cell_type": "markdown",
    "id": "c15",
    "metadata": {},
-   "source": "## 5. Upload shards to HF (`MIMIC-CXR_resized/shards/`)\n\nMirrors the source layout: output sits at\n`hieu3636/cxr-vlm-data/MIMIC-CXR_resized/shards/cxr-NNNN.tar`.\""
   },
   {
    "cell_type": "code",
-   "id": "c16",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "from huggingface_hub import HfApi\n\nHfApi(token=HF_TOKEN).upload_folder(\n    folder_path  = str(SHARDS_DIR),\n    path_in_repo = f\"{DST_SUBDIR}/shards\",   # mirror source: <subdir>/shards/\n    repo_id      = REPO_ID,\n    repo_type    = REPO_TYPE,\n    token        = HF_TOKEN,\n    commit_message = f\"Add resized+sharded dataset ({DST_SUBDIR}, target={TARGET}, square={SQUARE})\",\n)\nprint(f\"OK: pushed -> https://huggingface.co/datasets/{REPO_ID}/tree/main/{DST_SUBDIR}/shards\")"
   },
   {
    "cell_type": "markdown",
    "id": "c17",
    "metadata": {},
-   "source": "## Done. On the training box, consume it like this\n\n```python\nfrom huggingface_hub import snapshot_download\nimport glob, tarfile, os\n\nDST = \"/workspace/MIMIC-CXR_resized\"\ndl = snapshot_download(\"hieu3636/cxr-vlm-data\", repo_type=\"dataset\",\n                       allow_patterns=\"MIMIC-CXR_resized/shards/*.tar\",\n                       local_dir=\"/workspace/dl\")\nos.makedirs(DST, exist_ok=True)\nfor t in sorted(glob.glob(\"/workspace/dl/MIMIC-CXR_resized/shards/*.tar\")):\n    with tarfile.open(t) as tf:\n        tf.extractall(DST)\n# -> DST now holds files/p10/... (same tree as the original, smaller JPGs)\n```\n\nThen point training at it -- edit `configs/train_config.yaml`:\n\n```yaml\nmimic_cxr_root: /workspace/MIMIC-CXR_resized\n```\n\nNo change to `dataset.py` / `cxr_vlm.py` -- the image tree is identical,\nonly the JPGs are smaller; reports / chexpert.csv come through too so\nauto-build of the instruct JSON works if needed. Extract once per VM\nsession, then train any number of epochs from the extracted tree.\n\n(Equivalent CLI using the repo script: `python scripts/build_resized_dataset.py\n--extract \"/workspace/dl/MIMIC-CXR_resized/shards/*.tar\" /workspace/MIMIC-CXR_resized`.)"
   }
  ],
  "metadata": {
@@ -109,4 +462,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}
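
Reviewer sketch: the geometry inside `_resize_one` (shortest edge scaled to `TARGET`, then an optional center crop when `SQUARE` is on) can be checked without PIL. The two helpers below are hypothetical names that copy only the arithmetic from the notebook cell; the 3056x2544 input size is an illustrative assumption, not a value taken from the dataset.

```python
def shortest_edge_size(w, h, target):
    """New (width, height) with the shorter side exactly equal to target."""
    if w <= h:
        return target, round(h * target / w)
    return round(w * target / h), target


def center_crop_box(W, H, target):
    """(left, top, right, bottom) box for a centered target x target crop."""
    left, top = (W - target) // 2, (H - target) // 2
    return left, top, left + target, top + target


# Illustrative landscape image, resized for TARGET = 518.
size = shortest_edge_size(3056, 2544, 518)
box = center_crop_box(*size, 518)
print("resized:", size)
print("crop box:", box)
```

With `SQUARE = False` only the first step applies and the longer side is kept, which is why the config comment describes that mode as larger on disk but not tied to one crop.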

    "cell_type": "markdown",
    "id": "c00",
    "metadata": {},
+   "source": [
+    "# CXR-VLM -- Resize + tar-shard dataset (one-time, offline)\n",
+    "\n",
+    "Shrinks the original MIMIC-CXR tree (~2-3 MP/image) to RAD-DINO's working\n",
+    "resolution and packs it into a few tar shards, so cloud training boxes\n",
+    "(Vast.ai / Lightning.ai / Colab) pull a small, transfer-friendly dataset\n",
+    "instead of ~100 GB of huge JPGs read every epoch.\n",
+    "\n",
+    "**Source / destination (HF dataset repo `hieu3636/cxr-vlm-data`):**\n",
+    "- read  : `MIMIC-CXR_processed/`  (tree `files/p{10-19}/.../*.jpg`)\n",
+    "- write : `MIMIC-CXR_resized/`    (tar shards `cxr-NNNN.tar` + manifest)\n",
+    "\n",
+    "**Why HF and not Google Drive for the transfer:** this notebook is meant to\n",
+    "run on arbitrary cloud GPUs. Drive only mounts conveniently on Colab and\n",
+    "rate-limits badly on bulk many-file reads. HF works everywhere with just a\n",
+    "token, `snapshot_download` is parallel + resumable, and the data already\n",
+    "lives there. So both download and upload go through HF.\n",
+    "\n",
+    "**This step does NOT change what the model sees** -- RAD-DINO's processor\n",
+    "resizes the shortest edge to 518 and center-crops 518x518 anyway; we just do\n",
+    "that downscale once, offline, instead of every epoch on full-res images.\n",
+    "See `scripts/build_resized_dataset.py` (this notebook mirrors its logic).\n",
+    "\n",
+    "**Prerequisite:** an `HF_TOKEN` with **write** access to the repo.\n",
+    "\n",
+    "**Disk:** needs room for source (~100 GB) + resized tree (~5-8 GB) +\n",
+    "shards (~5-8 GB). Set `DELETE_SOURCE_AFTER_RESIZE=True` to free the ~100 GB\n",
+    "before packing if the box is tight."
+   ]
   },
   {
    "cell_type": "markdown",
    "id": "c01",
    "metadata": {},
+   "source": [
+    "## 0. Config -- edit here"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "c02",
    "metadata": {},
    "outputs": [],
+   "source": [
+    "import os\n",
+    "from pathlib import Path\n",
+    "\n",
+    "REPO_ID    = \"hieu3636/cxr-vlm-data\"\n",
+    "REPO_TYPE  = \"dataset\"\n",
+    "SRC_SUBDIR = \"MIMIC-CXR_processed\"   # folder in the repo holding files/p{10-19}/...\n",
+    "DST_SUBDIR = \"MIMIC-CXR_resized\"     # where the shards get uploaded back\n",
+    "\n",
+    "# Big scratch disk on the VM (Vast/Lightning: /workspace, Colab: /content).\n",
+    "WORK_DIR = Path(os.environ.get(\"WORK_DIR\", \"/content/cxr_resize\"))\n",
+    "\n",
+    "# --- resize params -------------------------------------------------------\n",
+    "TARGET   = 518     # shortest-edge target. MUST be >= 518 (RAD-DINO crops 518).\n",
+    "SQUARE   = True   # False: keep aspect (518xN), flexible, processor crops at\n",
+    "                   #        train time. ~20% bigger.\n",
+    "                   # True : also center-crop to 518x518 here -> file is exactly\n",
+    "                   #        518x518 and the processor is a true no-op. Smaller,\n",
+    "                   #        but BAKES the crop (changing backbone/img_size later\n",
+    "                   #        needs a full rebuild). Recommended off for a thesis.\n",
+    "QUALITY  = 90      # JPEG quality (q90 + 4:4:4 = near-lossless for CXR)\n",
+    "SHARD_GB = 2.0     # approx GB per tar shard\n",
+    "WORKERS  = min(32, (os.cpu_count() or 8) * 4)  # I/O-bound; PIL frees the GIL\n",
+    "\n",
+    "DELETE_SOURCE_AFTER_RESIZE = False  # True to free ~100 GB before packing\n",
+    "\n",
+    "# Derived local paths\n",
+    "DL_DIR     = WORK_DIR / \"download\"                 # snapshot_download target\n",
+    "SRC_TREE   = DL_DIR / SRC_SUBDIR                    # contains files/p10/...\n",
+    "DST_TREE   = WORK_DIR / \"resized\" / SRC_SUBDIR      # mirrors files/p10/...\n",
+    "SHARDS_DIR = WORK_DIR / \"shards\"\n",
+    "for p in (WORK_DIR, DL_DIR, SHARDS_DIR):\n",
+    "    p.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "assert TARGET >= 518, \"TARGET must be >= 518 (RAD-DINO upscales shortest edge to 518)\"\n",
+    "print(\"WORK_DIR:\", WORK_DIR, \"| TARGET:\", TARGET, \"| SQUARE:\", SQUARE, \"| WORKERS:\", WORKERS)"
+   ]
   },
   {
    "cell_type": "markdown",
    "id": "c03",
    "metadata": {},
+   "source": [
+    "## 1. Setup -- deps + HF token\n",
+    "\n",
+    "Token resolution: env `HF_TOKEN` -> Colab `userdata` -> Kaggle secret."
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "c04",
    "metadata": {},
    "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\",\n",
+    "                \"huggingface_hub>=0.24,<0.27\", \"Pillow>=10\", \"tqdm\"], check=True)\n",
+    "\n",
+    "if not os.environ.get(\"HF_TOKEN\"):\n",
+    "    try:\n",
+    "        from google.colab import userdata\n",
+    "        os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")\n",
+    "    except Exception:\n",
+    "        try:\n",
+    "            from kaggle_secrets import UserSecretsClient\n",
+    "            os.environ[\"HF_TOKEN\"] = UserSecretsClient().get_secret(\"HF_TOKEN\")\n",
+    "        except Exception:\n",
+    "            pass\n",
+    "\n",
+    "HF_TOKEN = os.environ.get(\"HF_TOKEN\")\n",
+    "assert HF_TOKEN, \"HF_TOKEN missing -- set it via env var or platform secrets (needs WRITE access).\"\n",
+    "print(\"HF_TOKEN loaded OK\")"
+   ]
   },
   {
    "cell_type": "markdown",
    "id": "c05",
    "metadata": {},
+   "source": [
+    "## 2. Resize + pack logic (inlined, mirrors `scripts/build_resized_dataset.py`)\n",
+    "\n",
+    "Uses a thread pool (not processes): PIL releases the GIL during\n",
+    "decode/resize/encode, so threads parallelise well and avoid any\n",
+    "notebook multiprocessing pickling issues. Self-contained -- safe to\n",
+    "re-run this cell alone."
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "c06",
    "metadata": {},
    "outputs": [],
+   "source": [
+    "import os, json, shutil, tarfile, time\n",
+    "from pathlib import Path\n",
+    "from concurrent.futures import ThreadPoolExecutor, as_completed\n",
+    "from PIL import Image\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "Image.MAX_IMAGE_PIXELS = None          # don't abort on large medical images\n",
+    "IMG_EXTS = (\".jpg\", \".jpeg\", \".png\")\n",
+    "\n",
+    "\n",
+    "def _resize_one(src_path, dst_path, target, quality, square):\n",
+    "    \"\"\"Returns one of: resized | squared | copied | skipped | error:<msg>.\"\"\"\n",
+    "    try:\n",
+    "        dst_path = Path(dst_path)\n",
+    "        if dst_path.exists() and dst_path.stat().st_size > 0:\n",
+    "            return \"skipped\"                       # resumable\n",
+    "        dst_path.parent.mkdir(parents=True, exist_ok=True)\n",
+    "        with Image.open(src_path) as im:\n",
+    "            w, h = im.size\n",
+    "            shorter = min(w, h)\n",
+    "            # Non-square: if shorter side already <= target, downscaling would\n",
+    "            # push it below 518 -> copy verbatim (lossless, never worsens a\n",
+    "            # low-res source). Square mode must always emit exactly target^2.\n",
+    "            if not square and shorter <= target:\n",
+    "                shutil.copy2(src_path, dst_path)\n",
+    "                return \"copied\"\n",
+    "            if im.mode not in (\"L\", \"RGB\"):\n",
+    "                im = im.convert(\"RGB\")\n",
+    "            # shorter axis EXACTLY = target; longer scales proportionally\n",
+    "            if w <= h:\n",
+    "                new_size = (target, round(h * target / w))\n",
+    "            else:\n",
+    "                new_size = (round(w * target / h), target)\n",
+    "            # square mode reproduces the processor exactly -> bicubic\n",
+    "            im = im.resize(new_size, Image.BICUBIC if square else Image.LANCZOS)\n",
+    "            if square:\n",
+    "                W, H = im.size\n",
+    "                left, top = (W - target) // 2, (H - target) // 2\n",
+    "                im = im.crop((left, top, left + target, top + target))\n",
+    "            im.save(dst_path, \"JPEG\", quality=quality, optimize=True, subsampling=0)\n",
+    "        return \"squared\" if square else \"resized\"\n",
+    "    except Exception as e:\n",
+    "        return f\"error:{type(e).__name__}: {e}\"\n",
+    "\n",
+    "\n",
+    "def _copy_one(src_path, dst_path):\n",
+    "    \"\"\"Copy non-image files (reports .txt, chexpert .csv, metadata .json, ...)\n",
+    "    verbatim so the shipped tree mirrors MIMIC-CXR_processed exactly.\"\"\"\n",
+    "    try:\n",
+    "        dst_path = Path(dst_path)\n",
+    "        if dst_path.exists() and dst_path.stat().st_size > 0:\n",
+    "            return \"skipped\"\n",
+    "        dst_path.parent.mkdir(parents=True, exist_ok=True)\n",
+    "        shutil.copy2(src_path, dst_path)\n",
+    "        return \"copied_other\"\n",
+    "    except Exception as e:\n",
+    "        return f\"error:{type(e).__name__}: {e}\"\n",
+    "\n",
+    "\n",
+    "def resize_tree(src: Path, dst: Path, target, quality, workers, square):\n",
+    "    print(f\"[resize] scanning {src} ...\")\n",
+    "    img_jobs, other_jobs = [], []\n",
+    "    for root, _, files in os.walk(src):\n",
+    "        for fn in files:\n",
+    "            sp = Path(root) / fn\n",
+    "            rel = sp.relative_to(src)\n",
+    "            dp = dst / rel\n",
+    "            if fn.lower().endswith(IMG_EXTS):\n",
+    "                img_jobs.append((str(sp), str(dp)))\n",
+    "            else:\n",
+    "                other_jobs.append((str(sp), str(dp)))\n",
+    "    if not img_jobs and not other_jobs:\n",
+    "        raise SystemExit(f\"ERROR: nothing found under {src}\")\n",
+    "    mode = f\"square {target}x{target}\" if square else f\"shortest-edge {target}px\"\n",
+    "    print(f\"[resize] {len(img_jobs):,} images + {len(other_jobs):,} non-image \"\n",
+    "          f\"-> {dst}  ({mode}, q{quality}, {workers} threads)\")\n",
+    "\n",
+    "    counts = {\"resized\": 0, \"squared\": 0, \"copied\": 0,\n",
+    "              \"copied_other\": 0, \"skipped\": 0, \"error\": 0}\n",
+    "    errors = []\n",
+    "    with ThreadPoolExecutor(max_workers=workers) as ex:\n",
+    "        futs = {}\n",
+    "        for s, d in img_jobs:\n",
+    "            futs[ex.submit(_resize_one, s, d, target, quality, square)] = d\n",
+    "        for s, d in other_jobs:\n",
+    "            futs[ex.submit(_copy_one, s, d)] = d\n",
+    "        for f in tqdm(as_completed(futs), total=len(futs), unit=\"file\"):\n",
+    "            st = f.result()\n",
+    "            if st.startswith(\"error:\"):\n",
+    "                counts[\"error\"] += 1\n",
+    "                errors.append(f\"{futs[f]}\\t{st}\")\n",
+    "            else:\n",
+    "                counts[st] += 1\n",
+    "\n",
+    "    dst.mkdir(parents=True, exist_ok=True)\n",
+    "    total = len(img_jobs) + len(other_jobs)\n",
+    "    out_bytes = sum(p.stat().st_size for p in dst.rglob(\"*\") if p.is_file())\n",
+    "    (dst / \"_manifest.json\").write_text(json.dumps({\n",
+    "        \"source\": str(src), \"target\": target,\n",
+    "        \"mode\": \"square\" if square else \"shortest_edge\",\n",
+    "        \"jpeg_quality\": quality, \"subsampling\": \"4:4:4\",\n",
+    "        \"resampling\": \"BICUBIC\" if square else \"LANCZOS\",\n",
+    "        \"counts\": counts, \"total\": total,\n",
+    "        \"images\": len(img_jobs), \"non_image\": len(other_jobs),\n",
+    "        \"output_bytes\": out_bytes,\n",
+    "        \"built_at\": time.strftime(\"%Y-%m-%dT%H:%M:%S\"),\n",
+    "    }, indent=2), encoding=\"utf-8\")\n",
+    "    if errors:\n",
+    "        (dst / \"_errors.txt\").write_text(\"\\n\".join(errors), encoding=\"utf-8\")\n",
+    "        print(f\"[resize] WARNING: {len(errors)} failures -> {dst/'_errors.txt'}\")\n",
+    "    print(f\"[resize] done: {counts}\")\n",
+    "    print(f\"[resize] output size: {out_bytes/1024**3:.2f} GB \"\n",
+    "          f\"({out_bytes/max(1,len(img_jobs))/1024:.0f} KB/image avg)\")\n",
+    "\n",
+    "\n",
+    "def pack_shards(dst: Path, shards_dir: Path, shard_gb, prefix=\"cxr\"):\n",
+    "    shard_bytes = int(shard_gb * 1024**3)\n",
+    "    shards_dir.mkdir(parents=True, exist_ok=True)\n",
+    "    files = sorted(p for p in dst.rglob(\"*\")\n",
+    "                   if p.is_file() and p.name not in (\"_manifest.json\", \"_errors.txt\"))\n",
+    "    if not files:\n",
+    "        raise SystemExit(f\"ERROR: nothing to pack under {dst}\")\n",
+    "    print(f\"[pack] {len(files):,} files -> tar shards (~{shard_gb} GB each)\")\n",
+    "    written, idx, cur = [], 0, 0\n",
+    "\n",
+    "    def _open(i):\n",
+    "        path = shards_dir / f\"{prefix}-{i:04d}.tar\"\n",
+    "        written.append(path)\n",
+    "        return tarfile.open(path, \"w\")\n",
+    "\n",
+    "    tar = _open(0)\n",
+    "    for fp in tqdm(files, unit=\"file\"):\n",
+    "        if cur >= shard_bytes:\n",
+    "            tar.close(); idx += 1; tar = _open(idx); cur = 0\n",
+    "        tar.add(fp, arcname=str(fp.relative_to(dst)))   # rel path -> tree rebuilt on extract\n",
+    "        cur += fp.stat().st_size\n",
+    "    tar.close()\n",
+    "    man = dst / \"_manifest.json\"\n",
+    "    if man.exists():\n",
+    "        shutil.copy2(man, shards_dir / \"_manifest.json\")\n",
+    "    (shards_dir / \"SHARDS.txt\").write_text(\"\\n\".join(p.name for p in written), encoding=\"utf-8\")\n",
+    "    print(f\"[pack] wrote {len(written)} shards -> {shards_dir}\")\n",
+    "    return written\n",
+    "\n",
+    "print(\"functions ready\")"
+   ]
   },
   {
    "cell_type": "markdown",
    "id": "c07",
    "metadata": {},
+   "source": [
+    "## 3. Download source from HF (`MIMIC-CXR_processed/`)\n",
+    "\n",
+    "Parallel + resumable. Re-running skips already-downloaded files. This is\n",
+    "the slow step (~100 GB of full-res JPGs)."
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "c08",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import snapshot_download\n",
+    "\n",
+    "local = snapshot_download(\n",
+    "    repo_id   = REPO_ID,\n",
+    "    repo_type = REPO_TYPE,\n",
+    "    allow_patterns = f\"{SRC_SUBDIR}/**\",     # only the source folder\n",
+    "    local_dir = str(DL_DIR),\n",
+    "    token     = HF_TOKEN,\n",
+    "    max_workers = 16,\n",
+    ")\n",
+    "assert SRC_TREE.is_dir(), f\"expected {SRC_TREE} after download, not found\"\n",
+    "n = sum(1 for p in SRC_TREE.rglob(\"*\") if p.suffix.lower() in IMG_EXTS)\n",
+    "print(f\"downloaded -> {SRC_TREE}  ({n:,} images)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c09",
+   "metadata": {},
+   "source": [
+    "## 4. Resize\n",
+    "\n",
+    "Prints an `output size` line and writes `_manifest.json`, so you see the\n",
+    "real size in GB on your actual data. Resumable -- safe to re-run."
+   ]
+  },
+  {
+   "cell_type": "code",
    "execution_count": null,
+   "id": "c10",
    "metadata": {},
    "outputs": [],
+   "source": [
+    "resize_tree(SRC_TREE, DST_TREE, TARGET, QUALITY, WORKERS, SQUARE)"
+   ]
+  },
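+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c10a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional sketch (an addition, not part of the original pipeline; the cell\n",
+    "# id c10a is made up). It reads back the _manifest.json that resize_tree()\n",
+    "# just wrote, so the per-status counts can be checked before the source is\n",
+    "# deleted in the next step.\n",
+    "manifest = json.loads((DST_TREE / \"_manifest.json\").read_text(encoding=\"utf-8\"))\n",
+    "for status, n_files in manifest[\"counts\"].items():\n",
+    "    print(f\"{status:>12}: {n_files:,}\")\n",
+    "print(f\"output: {manifest['output_bytes'] / 1024**3:.2f} GB, mode={manifest['mode']}\")\n",
+    "if manifest[\"counts\"][\"error\"]:\n",
+    "    print(\"failures are listed in\", DST_TREE / \"_errors.txt\")"
+   ]
+  },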
+  {
+   "cell_type": "markdown",
+   "id": "c11",
+   "metadata": {},
+   "source": [
+    "## 5. (Optional) Free the ~100 GB source before packing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c12",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if DELETE_SOURCE_AFTER_RESIZE:\n",
+    "    shutil.rmtree(DL_DIR, ignore_errors=True)\n",
+    "    print(\"removed source download dir:\", DL_DIR)\n",
+    "else:\n",
+    "    print(\"keeping source (set DELETE_SOURCE_AFTER_RESIZE=True to free disk)\")"
+   ]
   },
   {
    "cell_type": "markdown",
    "id": "c13",
    "metadata": {},
+   "source": [
+    "## 6. Pack into tar shards"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "c14",
    "metadata": {},
    "outputs": [],
+   "source": [
+    "shards = pack_shards(DST_TREE, SHARDS_DIR, SHARD_GB)\n",
+    "print(\"\\n\".join(p.name for p in shards))"
+   ]
   },
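+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c14a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional sketch (an addition; the cell id c14a is made up). It counts the\n",
+    "# regular-file members across all shards and compares that with the files\n",
+    "# pack_shards() was given, to catch a truncated shard before uploading.\n",
+    "n_members = 0\n",
+    "for shard in shards:\n",
+    "    with tarfile.open(shard) as tf:\n",
+    "        n_members += sum(1 for m in tf if m.isfile())\n",
+    "n_files = sum(1 for p in DST_TREE.rglob(\"*\")\n",
+    "              if p.is_file() and p.name not in (\"_manifest.json\", \"_errors.txt\"))\n",
+    "print(f\"{n_members:,} tar members vs {n_files:,} files on disk\")\n",
+    "assert n_members == n_files, \"shard contents do not match the resized tree\""
+   ]
+  },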
   {
    "cell_type": "markdown",
    "id": "c15",
    "metadata": {},
+   "source": [
+    "## 7. Upload shards to HF (`MIMIC-CXR_resized/`)"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "c16",
    "metadata": {},
    "outputs": [],
+   "source": [
+    "from huggingface_hub import HfApi\n",
+    "\n",
+    "HfApi(token=HF_TOKEN).upload_folder(\n",
+    "    folder_path  = str(SHARDS_DIR),\n",
+    "    path_in_repo = DST_SUBDIR,\n",
+    "    repo_id      = REPO_ID,\n",
+    "    repo_type    = REPO_TYPE,\n",
+    "    token        = HF_TOKEN,\n",
+    "    commit_message = f\"Add resized+sharded dataset ({DST_SUBDIR}, target={TARGET}, square={SQUARE})\",\n",
+    ")\n",
+    "print(f\"OK: pushed -> https://huggingface.co/datasets/{REPO_ID}/tree/main/{DST_SUBDIR}\")"
+   ]
   },
   {
    "cell_type": "markdown",
    "id": "c17",
    "metadata": {},
+   "source": [
+    "## Done. On the training box, consume it like this\n",
+    "\n",
+    "```python\n",
+    "from huggingface_hub import snapshot_download\n",
+    "import glob, tarfile, os\n",
+    "\n",
+    "DST = \"/workspace/MIMIC-CXR_resized\"\n",
+    "dl = snapshot_download(\"hieu3636/cxr-vlm-data\", repo_type=\"dataset\",\n",
+    "                       allow_patterns=\"MIMIC-CXR_resized/**\",\n",
+    "                       local_dir=\"/workspace/dl\")\n",
+    "os.makedirs(DST, exist_ok=True)\n",
+    "for t in sorted(glob.glob(\"/workspace/dl/MIMIC-CXR_resized/*.tar\")):\n",
+    "    with tarfile.open(t) as tf:\n",
+    "        tf.extractall(DST)\n",
+    "# -> DST now holds files/p10/... (same tree as the original)\n",
+    "```\n",
+    "\n",
+    "Then point training at it -- edit `configs/train_config.yaml`:\n",
+    "\n",
+    "```yaml\n",
+    "mimic_cxr_root: /workspace/MIMIC-CXR_resized\n",
+    "```\n",
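+    "\n",
+    "Optionally, sanity-check the extracted tree before training. This is a\n",
+    "sketch, not part of the repo: it assumes `_manifest.json` was uploaded next\n",
+    "to the shards (step 6 copies it into the shards folder) and reuses the\n",
+    "paths from the extraction snippet.\n",
+    "\n",
+    "```python\n",
+    "import json\n",
+    "from pathlib import Path\n",
+    "\n",
+    "DL = Path(\"/workspace/dl/MIMIC-CXR_resized\")    # downloaded shards + manifest\n",
+    "DST = Path(\"/workspace/MIMIC-CXR_resized\")      # extracted tree\n",
+    "manifest = json.loads((DL / \"_manifest.json\").read_text(encoding=\"utf-8\"))\n",
+    "n = sum(1 for p in DST.rglob(\"*\") if p.suffix.lower() in (\".jpg\", \".jpeg\", \".png\"))\n",
+    "# Images that failed to resize are in the manifest total but not in the tree,\n",
+    "# so expect n <= manifest[\"images\"].\n",
+    "print(f\"{n:,} images extracted, manifest lists {manifest['images']:,}\")\n",
+    "```\n",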
+    "\n",
+    "No change to `dataset.py` / `cxr_vlm.py` -- the image tree is identical,\n",
+    "only the JPGs are smaller. Extract once per VM session, then train any\n",
+    "number of epochs from the extracted tree.\n",
+    "\n",
+    "(Equivalent CLI using the repo script: `python scripts/build_resized_dataset.py\n",
+    "--extract \"/workspace/dl/MIMIC-CXR_resized/*.tar\" /workspace/MIMIC-CXR_resized`.)"
+   ]
   }
  ],
  "metadata": {
  },
  "nbformat": 4,
  "nbformat_minor": 5
+}