chore(data): WIP EDA notebooks + labeler comparison tooling

Pre-existing in-progress work (not part of the resize tooling): expanded
EDA notebooks, eval_labelers tweak, new img_stat.py and
labeler_comparison.csv. Committed to clean the working tree.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (8)

data/build_subset_colab.ipynb +323 -33
data/build_subset_local.ipynb +0 -0
data/eda_full.ipynb +0 -0
data/eda_p18.ipynb +176 -35
data/eda_reports.ipynb +0 -0
data/img_stat.py +82 -0
dev/eval_labelers.py +5 -5
dev/labeler_comparison.csv +14 -0

data/build_subset_colab.ipynb CHANGED

@@ -3,7 +3,27 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "# Build Subset - PHASE 2 (Google Colab)\n\nRun **after** `build_subset_local.ipynb`. Input = `subset_bundle.zip`, already uploaded to Drive.\n\n**Steps:**\n1. Extract the bundle (manifest + reports + vqa)\n2. Download ~50k JPG images from PhysioNet into their **original paths** `files/pXX/pSUBJ/sSTUDY/<dicom>.jpg` (resumable: if interrupted, rerun to continue)\n3. Copy reports (keeping original paths) + vqa.json into the package\n4. Push to Hugging Face\n\n**Output structure** (`MIMIC-CXR_processed/`):\n```\nfiles/pXX/pSUBJ/sSTUDY/<dicom>.jpg     ← image (original name kept)\nfiles/pXX/pSUBJ/sSTUDY.txt             ← report\nmanifest_{train,val,test}.json/.csv    ← split + labels (cross-check after an interrupted download)\nvqa.json / vqa_val.json / vqa_test.json\n```\n\n> No GPU needed. Images are downloaded straight to Drive → safe to resume when Colab disconnects; cross-check against the manifest to see which studies/images are still missing."
   },
   {
    "cell_type": "markdown",
@@ -16,6 +36,7 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "source": [
     "import sys, os\n",
     "IN_COLAB = \"google.colab\" in sys.modules\n",
@@ -24,15 +45,63 @@
     "    drive.mount(\"/content/drive\")\n",
     "    !pip -q install huggingface_hub tqdm\n",
     "print(\"IN_COLAB =\", IN_COLAB)"
-   ],
-   "outputs": []
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "source": "from pathlib import Path\nimport os, getpass, zipfile, json, time, shutil\n\nDRIVE = Path(\"/content/drive/MyDrive\")\nBUNDLE_ZIP = DRIVE / \"subset_bundle.zip\"\nBUNDLE_DIR = Path(\"/content/subset_bundle\")\nOUT = DRIVE / \"MIMIC-CXR_processed\"\n\nHF_REPO_ID      = \"hieu3636/cxr-vlm-data\"\nHF_REPO_TYPE    = \"dataset\"\nHF_PATH_IN_REPO = \"MIMIC-CXR_processed\"\n\n# ── CREDENTIALS ──────────────────────────────────────────────────────────────\n# Option 1 (RECOMMENDED, safe + no re-prompting): Colab Secrets.\n#   Click the KEY icon 🔑 in the left sidebar -> Add new secret, create 3 secrets:\n#     PHYSIONET_USER , PHYSIONET_PASS , HF_TOKEN  (enable \"Notebook access\")\n#   -> set once, reused in every session, NOT stored in the code.\n#\n# Option 2 (if you prefer): type them directly here. Quick, but they LEAK if you push/share the notebook.\n#   -> fill in the 3 _HARDCODE_* lines below.\n#\n# Option 3: leave everything empty -> you are prompted at run time (as before).\n\n_HARDCODE_USER  = \"\"        # <- PhysioNet username (e.g. \"convitom\")\n_HARDCODE_PASS  = \"\"        # <- PhysioNet password\n_HARDCODE_HFTOK = \"\"        # <- HF write token\n\ndef _get(name, hard):\n    if hard:                                   # hard-coded value takes priority\n        return hard\n    try:                                        # then Colab Secrets\n        from google.colab import userdata\n        v = userdata.get(name)\n        if v:\n            return v\n    except Exception:\n        pass\n    return os.environ.get(name)                 # then environment variables\n\nPHYSIONET_USER = _get(\"PHYSIONET_USER\", _HARDCODE_USER) or input(\"PhysioNet username: \")\nPHYSIONET_PASS = _get(\"PHYSIONET_PASS\", _HARDCODE_PASS) or getpass.getpass(\"PhysioNet password: \")\nHF_TOKEN       = _get(\"HF_TOKEN\",       _HARDCODE_HFTOK) or getpass.getpass(\"HF write token: \")\n\nVQA_OUT = {\"train\":\"vqa.json\",\"val\":\"vqa_val.json\",\"test\":\"vqa_test.json\"}\nOUT.mkdir(parents=True, exist_ok=True)\nprint(\"Credentials OK |\", \"bundle zip exists:\", BUNDLE_ZIP.exists())\nprint(\"⚠️ If you hard-code the password/token: DO NOT commit/push this notebook to git/HF.\")",
-   "outputs": []
   },
   {
    "cell_type": "markdown",
@@ -45,6 +114,7 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "source": [
     "BUNDLE_DIR.mkdir(parents=True, exist_ok=True)\n",
     "with zipfile.ZipFile(BUNDLE_ZIP) as z:\n",
@@ -56,46 +126,196 @@
     "    print(f\"{sp}: {len(r):,} studies\")\n",
     "all_rows = manifests[\"train\"]+manifests[\"val\"]+manifests[\"test\"]\n",
     "print(\"TOTAL ảnh cần tải:\", len(all_rows))"
-   ],
-   "outputs": []
   },
   {
    "cell_type": "markdown",
    "id": "918e0272",
-   "source": "## 2. Download JPG images from PhysioNet (via `wget`) → straight into the package on Drive\n\nPhysioNet rejects `requests` basic-auth but accepts `wget` → use `wget` per file, with 12 parallel threads.\n\nIf interrupted midway → reconnect → rerun this cell; files already downloaded (>10KB) are skipped.",
    "metadata": {},
-   "execution_count": null,
-   "outputs": []
   },
   {
    "cell_type": "code",
-   "metadata": {},
-   "source": "import subprocess, threading, shutil\nfrom pathlib import Path as _P\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\nfrom collections import Counter\nfrom tqdm.auto import tqdm\n\n# The per-file log is written to /content (SSD, ~µs, does not throttle downloads).\n# Every CHECKPOINT_EVERY images -> copy the log to Drive (1 operation, cheap).\n# New session: read the Drive log (instant) instead of os.walk (slow).\nDL_LOG_LOCAL = _P(\"/content/downloaded.txt\")\nDL_LOG_DRIVE = OUT / \"downloaded.txt\"\nCHECKPOINT_EVERY = 500\n\n_log_lk   = threading.Lock()\n_log_cnt  = 0\n\ndef mark_done(relpath):\n    global _log_cnt\n    with _log_lk:\n        with open(DL_LOG_LOCAL, \"a\") as f:\n            f.write(relpath + \"\\n\")\n        _log_cnt += 1\n        if _log_cnt % CHECKPOINT_EVERY == 0:\n            try:\n                shutil.copy(DL_LOG_LOCAL, DL_LOG_DRIVE)   # checkpoint -> Drive\n            except Exception as e:\n                print(\"  [warn] copying log -> Drive failed:\", e)\n\ndef flush_log_to_drive():\n    with _log_lk:\n        if DL_LOG_LOCAL.exists():\n            shutil.copy(DL_LOG_LOCAL, DL_LOG_DRIVE)\n\n# PhysioNet blocks requests basic-auth but works with wget. This cell ONLY defines dl().\ndef dl(row):\n    rp  = row[\"image_relpath\"]\n    out = OUT / rp                                  # files/pXX/pSUBJ/sSTUDY/<dicom>.jpg\n    if out.exists() and out.stat().st_size > 10_000:\n        mark_done(rp)\n        return \"skip\"\n    out.parent.mkdir(parents=True, exist_ok=True)\n    tmp = out.with_suffix(\".part\")\n    cmd = [\"wget\", \"-q\", \"-T\", \"60\", \"-t\", \"3\", \"-O\", str(tmp),\n           \"--user\", PHYSIONET_USER, \"--password\", PHYSIONET_PASS, row[\"jpg_url\"]]\n    rc = subprocess.run(cmd).returncode\n    if rc == 0 and tmp.exists() and tmp.stat().st_size > 10_000:\n        tmp.replace(out)\n        mark_done(rp)\n        return \"ok\"\n    if tmp.exists():\n        tmp.unlink()\n    return f\"fail(rc={rc})\"\n\nprint(f\"dl() ready. Log local={DL_LOG_LOCAL}, checkpoint -> {DL_LOG_DRIVE} every {CHECKPOINT_EVERY} images.\")",
    "execution_count": null,
-   "outputs": []
   },
   {
    "cell_type": "code",
    "id": "a78d23a9",
-   "source": "# ── QUICK TEST + SPEED MEASUREMENT before downloading 50k ─────────────────────\n# First run: actually download 10 images to measure speed + confirm auth.\n# On RESUME (Run all again): images already present -> skipped, no time wasted.\nimport time as _t\nsample = all_rows[:10]\nneed = [r for r in sample\n        if not ((OUT/r[\"image_relpath\"]).exists()\n                and (OUT/r[\"image_relpath\"]).stat().st_size > 10_000)]\n\nif not need:\n    print(f\"{len(sample)}/{len(sample)} test images already present (resume), skipping test.\")\n    print(\"✓ Run the 50k DOWNLOAD cell below.\")\nelse:\n    print(f\"Test-downloading {len(need)} images (measuring speed)...\")\n    t0 = _t.time(); tot = 0; ok = 0\n    for row in need:\n        st = dl(row)\n        p = OUT/row[\"image_relpath\"]\n        if p.exists() and p.stat().st_size > 10_000:\n            tot += p.stat().st_size; ok += 1\n        print(f\"  {row['study_name']:>10s} -> {st}\")\n    dt = _t.time() - t0\n    kbs = (tot/1024)/dt if dt > 0 else 0\n    avg = (tot/1e6)/ok if ok else 0\n    print(f\"\\n{ok}/{len(need)} OK | {tot/1e6:.1f} MB / {dt:.1f}s \"\n          f\"= {kbs:,.0f} KB/s ({kbs/1024:.2f} MB/s) [1 thread]\")\n    if ok:\n        n = len(all_rows)\n        h = (n*avg)/(kbs/1024)/3600\n        print(f\"Image ~{avg:.2f} MB | {n:,} images ≈ {n*avg/1000:.0f} GB\")\n        print(f\"ETA: ~{h:.1f}h (1 thread) → ~{h/12*1.6:.1f}-{h/12*3:.1f}h (12 threads)\")\n    print(\"\\n\" + (\"✓ OK: run the 50k DOWNLOAD cell below.\" if ok == len(need)\n                  else \"✗ Error: check user/pass, DO NOT run the 50k download.\"))",
    "metadata": {},
-   "execution_count": null,
-   "outputs": []
   },
   {
    "cell_type": "code",
    "id": "0f8b3dec",
-   "source": "# ── DOWNLOAD IMAGES (run only after the TEST cell reports ✓ OK) ───────────────\n# Resume priority: local log -> Drive log (copied to local) -> os.walk (fallback).\nif DL_LOG_LOCAL.exists():\n    done_set = set(l.strip() for l in open(DL_LOG_LOCAL) if l.strip())\n    print(f\"[log local] {len(done_set):,} images already downloaded (instant).\")\nelif DL_LOG_DRIVE.exists():\n    shutil.copy(DL_LOG_DRIVE, DL_LOG_LOCAL)        # new session: fetch checkpoint from Drive\n    done_set = set(l.strip() for l in open(DL_LOG_LOCAL) if l.strip())\n    print(f\"[log Drive] new session, read checkpoint: {len(done_set):,} images (instant).\")\nelse:\n    print(\"No log yet -> scanning Drive once to build the log...\")\n    done_set = set()\n    froot = OUT / \"files\"\n    if froot.exists():\n        for dp, _, fns in os.walk(froot):\n            for fn in fns:\n                if fn.endswith(\".jpg\"):\n                    done_set.add(os.path.relpath(os.path.join(dp, fn), OUT).replace(\"\\\\\", \"/\"))\n    with open(DL_LOG_LOCAL, \"w\") as f:\n        f.write(\"\\n\".join(sorted(done_set)) + (\"\\n\" if done_set else \"\"))\n    if done_set:\n        shutil.copy(DL_LOG_LOCAL, DL_LOG_DRIVE)\n    print(f\"Log built: {len(done_set):,} images.\")\n\ntodo = [r for r in all_rows if r[\"image_relpath\"] not in done_set]\nn_done = len(all_rows) - len(todo)\nprint(f\"Present : {n_done:,} / {len(all_rows):,} ({n_done/len(all_rows)*100:.1f}%)\")\nprint(f\"To fetch: {len(todo):,}\")\n\nif not todo:\n    print(\"\\n✓ Everything is downloaded.\")\n    flush_log_to_drive()\nelse:\n    res = Counter({\"skip\": n_done})\n    try:\n        with ThreadPoolExecutor(max_workers=12) as ex:\n            futs = [ex.submit(dl, r) for r in todo]\n            for f in tqdm(as_completed(futs), total=len(todo), desc=\"downloading\"):\n                res[f.result().split(\"(\")[0]] += 1\n    finally:\n        flush_log_to_drive()                       # always save the log to Drive on finish/error\n    print(dict(res))\n    if any(k.startswith(\"fail\") for k in res):\n        print(\"Some downloads failed -> rerun this cell (only the missing ones are fetched).\")",
    "metadata": {},
-   "execution_count": null,
-   "outputs": []
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "source": "# Check whether any images are still missing (cross-check against the manifest). If so, rerun the download cell.\nmiss = {sp: [] for sp in (\"train\",\"val\",\"test\")}\nfor row in all_rows:\n    p = OUT / row[\"image_relpath\"]\n    if not (p.exists() and p.stat().st_size > 0):\n        miss[row[\"split\"]].append(row[\"study_name\"])\nprint(\"missing images:\", {k: len(v) for k, v in miss.items()},\n      \"| total:\", sum(len(v) for v in miss.values()))\n# Print a few missing studies for cross-checking\nfor sp, names in miss.items():\n    if names:\n        print(f\"  [{sp}] missing {len(names)}, e.g.: {names[:5]}\")",
-   "outputs": []
   },
   {
    "cell_type": "markdown",
@@ -108,27 +328,76 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "source": "# Copy reports (keeping original paths) + vqa + manifest into the package\n# reports: bundle/reports/files/pXX/.../sSTUDY.txt  ->  OUT/files/pXX/.../sSTUDY.txt\nshutil.copytree(BUNDLE_DIR/\"reports\", OUT, dirs_exist_ok=True)\n\nfor sp in (\"train\",\"val\",\"test\"):\n    shutil.copy(BUNDLE_DIR/\"vqa\"/VQA_OUT[sp], OUT/VQA_OUT[sp])\n    shutil.copy(BUNDLE_DIR/f\"manifest_{sp}.json\", OUT/f\"manifest_{sp}.json\")\n    src_csv = BUNDLE_DIR/f\"manifest_{sp}.csv\"\n    if src_csv.exists():\n        shutil.copy(src_csv, OUT/f\"manifest_{sp}.csv\")\n    nv = len(json.load(open(OUT/VQA_OUT[sp], encoding=\"utf-8\")))\n    print(f\"{sp}: vqa={nv:,}  manifest copied\")\n\nn_rep = sum(1 for _ in (OUT/'files').rglob('s*.txt'))\nprint(f\"reports in package: {n_rep:,}\")",
-   "outputs": []
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "source": "# Final sanity check: count by manifest (image + report actually exist)\nfor sp in (\"train\",\"val\",\"test\"):\n    rows = manifests[sp]\n    ni = sum(1 for x in rows if (OUT/x[\"image_relpath\"]).exists())\n    nr = sum(1 for x in rows if (OUT/x[\"report_relpath\"]).exists())\n    print(f\"{sp:5s}  manifest={len(rows):,}  images_ok={ni:,}  reports_ok={nr:,}\")\nprint(\"\\nPackage:\", OUT)",
-   "outputs": []
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## 4. Upload to Hugging Face\n\nPush to the **existing** repo `hieu3636/cxr-vlm-data`, under the subfolder `MIMIC-CXR_processed/` (uses `path_in_repo`; no new repo is created)."
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "source": "RUN_HF = False   # ← enable when ready to push\nif RUN_HF:\n    from huggingface_hub import HfApi\n    api = HfApi(token=HF_TOKEN)\n    # the repo already exists -> exist_ok=True is a no-op, nothing new is created\n    api.create_repo(HF_REPO_ID, repo_type=HF_REPO_TYPE, exist_ok=True)\n    # upload_folder supports path_in_repo -> pushes into the MIMIC-CXR_processed subfolder\n    # (rerun if interrupted: files already in the repo are skipped by hash)\n    api.upload_folder(\n        repo_id      = HF_REPO_ID,\n        repo_type    = HF_REPO_TYPE,\n        folder_path  = str(OUT),\n        path_in_repo = HF_PATH_IN_REPO,\n        commit_message = \"Add MIMIC-CXR_processed subset\",\n    )\n    print(\"pushed →\",\n          f\"https://huggingface.co/{HF_REPO_TYPE}s/{HF_REPO_ID}/tree/main/{HF_PATH_IN_REPO}\")\nelse:\n    print(\"RUN_HF=False: set to True to push to\",\n          f\"{HF_REPO_ID}/{HF_PATH_IN_REPO}\")",
-   "outputs": []
   },
   {
    "cell_type": "markdown",
@@ -143,15 +412,36 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "source": "RUN_ZIP = False\nif RUN_ZIP:\n    # zip the whole package (keeping the original structure) into 1 file\n    shutil.make_archive(\"/content/MIMIC-CXR_processed\", \"zip\", OUT)\n    shutil.copy(\"/content/MIMIC-CXR_processed.zip\", DRIVE/\"MIMIC-CXR_processed.zip\")\n    print(\"zipped -> Drive/MIMIC-CXR_processed.zip\")\nelse:\n    print(\"RUN_ZIP=False\")",
-   "outputs": []
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "source": "print(\"=\"*54)\nprint(\"  PHASE 2 (COLAB) DONE\")\nprint(\"=\"*54)\nfor sp in (\"train\",\"val\",\"test\"):\n    rows=manifests[sp]\n    ni=sum(1 for x in rows if (OUT/x[\"image_relpath\"]).exists())\n    print(f\"  {sp:5s}  images={ni:,}/{len(rows):,}\")\nprint(f\"  package: {OUT}\")\nprint(\"  original structure: files/pXX/pSUBJ/sSTUDY/<dicom>.jpg  +  .txt\")\nprint(\"  Flags: RUN_HF / RUN_ZIP (default False)\")\nprint(\"=\"*54)",
-   "outputs": []
   }
  ],
  "metadata": {
@@ -167,4 +457,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}

   {
    "cell_type": "markdown",
    "metadata": {},
+   "source": [
+    "# Build Subset - PHASE 2 (Google Colab)\n",
+    "\n",
+    "Run **after** `build_subset_local.ipynb`. Input = `subset_bundle.zip`, already uploaded to Drive.\n",
+    "\n",
+    "**Steps:**\n",
+    "1. Extract the bundle (manifest + reports + vqa)\n",
+    "2. Download ~50k JPG images from PhysioNet into their **original paths** `files/pXX/pSUBJ/sSTUDY/<dicom>.jpg` (resumable: if interrupted, rerun to continue)\n",
+    "3. Copy reports (keeping original paths) + vqa.json into the package\n",
+    "4. Push to Hugging Face\n",
+    "\n",
+    "**Output structure** (`MIMIC-CXR_processed/`):\n",
+    "```\n",
+    "files/pXX/pSUBJ/sSTUDY/<dicom>.jpg     ← image (original name kept)\n",
+    "files/pXX/pSUBJ/sSTUDY.txt             ← report\n",
+    "manifest_{train,val,test}.json/.csv    ← split + labels (cross-check after an interrupted download)\n",
+    "vqa.json / vqa_val.json / vqa_test.json\n",
+    "```\n",
+    "\n",
+    "> No GPU needed. Images are downloaded straight to Drive → safe to resume when Colab disconnects; cross-check against the manifest to see which studies/images are still missing."
+   ]
   },
   {
    "cell_type": "markdown",
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
+   "outputs": [],
    "source": [
     "import sys, os\n",
     "IN_COLAB = \"google.colab\" in sys.modules\n",
     "    drive.mount(\"/content/drive\")\n",
     "    !pip -q install huggingface_hub tqdm\n",
     "print(\"IN_COLAB =\", IN_COLAB)"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "08b13eef",
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "import os, getpass, zipfile, json, time, shutil\n",
+    "\n",
+    "DRIVE = Path(\"/content/drive/MyDrive\")\n",
+    "BUNDLE_ZIP = DRIVE / \"subset_bundle.zip\"\n",
+    "BUNDLE_DIR = Path(\"/content/subset_bundle\")\n",
+    "OUT = DRIVE / \"MIMIC-CXR_processed\"\n",
+    "\n",
+    "HF_REPO_ID      = \"hieu3636/cxr-vlm-data\"\n",
+    "HF_REPO_TYPE    = \"dataset\"\n",
+    "HF_PATH_IN_REPO = \"MIMIC-CXR_processed\"\n",
+    "\n",
+    "# ── CREDENTIALS ──────────────────────────────────────────────────────────────\n",
+    "# Option 1 (RECOMMENDED, safe + no re-prompting): Colab Secrets.\n",
+    "#   Click the KEY icon 🔑 in the left sidebar -> Add new secret, create 3 secrets:\n",
+    "#     PHYSIONET_USER , PHYSIONET_PASS , HF_TOKEN  (enable \"Notebook access\")\n",
+    "#   -> set once, reused in every session, NOT stored in the code.\n",
+    "#\n",
+    "# Option 2 (if you prefer): type them directly here. Quick, but they LEAK if you push/share the notebook.\n",
+    "#   -> fill in the 3 _HARDCODE_* lines below.\n",
+    "#\n",
+    "# Option 3: leave everything empty -> you are prompted at run time (as before).\n",
+    "\n",
+    "_HARDCODE_USER  = \"\"        # <- PhysioNet username (e.g. \"convitom\")\n",
+    "_HARDCODE_PASS  = \"\"        # <- PhysioNet password\n",
+    "_HARDCODE_HFTOK = \"\"        # <- HF write token\n",
+    "\n",
+    "def _get(name, hard):\n",
+    "    if hard:                                   # hard-coded value takes priority\n",
+    "        return hard\n",
+    "    try:                                        # then Colab Secrets\n",
+    "        from google.colab import userdata\n",
+    "        v = userdata.get(name)\n",
+    "        if v:\n",
+    "            return v\n",
+    "    except Exception:\n",
+    "        pass\n",
+    "    return os.environ.get(name)                 # then environment variables\n",
+    "\n",
+    "PHYSIONET_USER = _get(\"PHYSIONET_USER\", _HARDCODE_USER) or input(\"PhysioNet username: \")\n",
+    "PHYSIONET_PASS = _get(\"PHYSIONET_PASS\", _HARDCODE_PASS) or getpass.getpass(\"PhysioNet password: \")\n",
+    "HF_TOKEN       = _get(\"HF_TOKEN\",       _HARDCODE_HFTOK) or getpass.getpass(\"HF write token: \")\n",
+    "\n",
+    "VQA_OUT = {\"train\":\"vqa.json\",\"val\":\"vqa_val.json\",\"test\":\"vqa_test.json\"}\n",
+    "OUT.mkdir(parents=True, exist_ok=True)\n",
+    "print(\"Credentials OK |\", \"bundle zip exists:\", BUNDLE_ZIP.exists())\n",
+    "print(\"⚠️ If you hard-code the password/token: DO NOT commit/push this notebook to git/HF.\")"
+   ]
   },
   {
    "cell_type": "markdown",
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
+   "outputs": [],
    "source": [
     "BUNDLE_DIR.mkdir(parents=True, exist_ok=True)\n",
     "with zipfile.ZipFile(BUNDLE_ZIP) as z:\n",
     "    print(f\"{sp}: {len(r):,} studies\")\n",
     "all_rows = manifests[\"train\"]+manifests[\"val\"]+manifests[\"test\"]\n",
     "print(\"TOTAL ảnh cần tải:\", len(all_rows))"
+   ]
   },
   {
    "cell_type": "markdown",
    "id": "918e0272",
    "metadata": {},
+   "source": [
+    "## 2. Download JPG images from PhysioNet (via `wget`) → straight into the package on Drive\n",
+    "\n",
+    "PhysioNet rejects `requests` basic-auth but accepts `wget` → use `wget` per file, with 12 parallel threads.\n",
+    "\n",
+    "If interrupted midway → reconnect → rerun this cell; files already downloaded (>10KB) are skipped."
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import subprocess, threading, shutil\n",
+    "from pathlib import Path as _P\n",
+    "from concurrent.futures import ThreadPoolExecutor, as_completed\n",
+    "from collections import Counter\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "# The per-file log is written to /content (SSD, ~µs, does not throttle downloads).\n",
+    "# Every CHECKPOINT_EVERY images -> copy the log to Drive (1 operation, cheap).\n",
+    "# New session: read the Drive log (instant) instead of os.walk (slow).\n",
+    "DL_LOG_LOCAL = _P(\"/content/downloaded.txt\")\n",
+    "DL_LOG_DRIVE = OUT / \"downloaded.txt\"\n",
+    "CHECKPOINT_EVERY = 500\n",
+    "\n",
+    "_log_lk   = threading.Lock()\n",
+    "_log_cnt  = 0\n",
+    "\n",
+    "def mark_done(relpath):\n",
+    "    global _log_cnt\n",
+    "    with _log_lk:\n",
+    "        with open(DL_LOG_LOCAL, \"a\") as f:\n",
+    "            f.write(relpath + \"\\n\")\n",
+    "        _log_cnt += 1\n",
+    "        if _log_cnt % CHECKPOINT_EVERY == 0:\n",
+    "            try:\n",
+    "                shutil.copy(DL_LOG_LOCAL, DL_LOG_DRIVE)   # checkpoint -> Drive\n",
+    "            except Exception as e:\n",
+    "                print(\"  [warn] copying log -> Drive failed:\", e)\n",
+    "\n",
+    "def flush_log_to_drive():\n",
+    "    with _log_lk:\n",
+    "        if DL_LOG_LOCAL.exists():\n",
+    "            shutil.copy(DL_LOG_LOCAL, DL_LOG_DRIVE)\n",
+    "\n",
+    "# PhysioNet blocks requests basic-auth but works with wget. This cell ONLY defines dl().\n",
+    "def dl(row):\n",
+    "    rp  = row[\"image_relpath\"]\n",
+    "    out = OUT / rp                                  # files/pXX/pSUBJ/sSTUDY/<dicom>.jpg\n",
+    "    if out.exists() and out.stat().st_size > 10_000:\n",
+    "        mark_done(rp)\n",
+    "        return \"skip\"\n",
+    "    out.parent.mkdir(parents=True, exist_ok=True)\n",
+    "    tmp = out.with_suffix(\".part\")\n",
+    "    cmd = [\"wget\", \"-q\", \"-T\", \"60\", \"-t\", \"3\", \"-O\", str(tmp),\n",
+    "           \"--user\", PHYSIONET_USER, \"--password\", PHYSIONET_PASS, row[\"jpg_url\"]]\n",
+    "    rc = subprocess.run(cmd).returncode\n",
+    "    if rc == 0 and tmp.exists() and tmp.stat().st_size > 10_000:\n",
+    "        tmp.replace(out)\n",
+    "        mark_done(rp)\n",
+    "        return \"ok\"\n",
+    "    if tmp.exists():\n",
+    "        tmp.unlink()\n",
+    "    return f\"fail(rc={rc})\"\n",
+    "\n",
+    "print(f\"dl() ready. Log local={DL_LOG_LOCAL}, checkpoint -> {DL_LOG_DRIVE} every {CHECKPOINT_EVERY} images.\")"
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "a78d23a9",
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "# ── QUICK TEST + SPEED MEASUREMENT before downloading 50k ─────────────────────\n",
+    "# First run: actually download 10 images to measure speed + confirm auth.\n",
+    "# On RESUME (Run all again): images already present -> skipped, no time wasted.\n",
+    "import time as _t\n",
+    "sample = all_rows[:10]\n",
+    "need = [r for r in sample\n",
+    "        if not ((OUT/r[\"image_relpath\"]).exists()\n",
+    "                and (OUT/r[\"image_relpath\"]).stat().st_size > 10_000)]\n",
+    "\n",
+    "if not need:\n",
+    "    print(f\"{len(sample)}/{len(sample)} test images already present (resume), skipping test.\")\n",
+    "    print(\"✓ Run the 50k DOWNLOAD cell below.\")\n",
+    "else:\n",
+    "    print(f\"Test-downloading {len(need)} images (measuring speed)...\")\n",
+    "    t0 = _t.time(); tot = 0; ok = 0\n",
+    "    for row in need:\n",
+    "        st = dl(row)\n",
+    "        p = OUT/row[\"image_relpath\"]\n",
+    "        if p.exists() and p.stat().st_size > 10_000:\n",
+    "            tot += p.stat().st_size; ok += 1\n",
+    "        print(f\"  {row['study_name']:>10s} -> {st}\")\n",
+    "    dt = _t.time() - t0\n",
+    "    kbs = (tot/1024)/dt if dt > 0 else 0\n",
+    "    avg = (tot/1e6)/ok if ok else 0\n",
+    "    print(f\"\\n{ok}/{len(need)} OK | {tot/1e6:.1f} MB / {dt:.1f}s \"\n",
+    "          f\"= {kbs:,.0f} KB/s ({kbs/1024:.2f} MB/s) [1 thread]\")\n",
+    "    if ok:\n",
+    "        n = len(all_rows)\n",
+    "        h = (n*avg)/(kbs/1024)/3600\n",
+    "        print(f\"Image ~{avg:.2f} MB | {n:,} images ≈ {n*avg/1000:.0f} GB\")\n",
+    "        print(f\"ETA: ~{h:.1f}h (1 thread) → ~{h/12*1.6:.1f}-{h/12*3:.1f}h (12 threads)\")\n",
+    "    print(\"\\n\" + (\"✓ OK: run the 50k DOWNLOAD cell below.\" if ok == len(need)\n",
+    "                  else \"✗ Error: check user/pass, DO NOT run the 50k download.\"))"
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "0f8b3dec",
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "# ── DOWNLOAD IMAGES (run only after the TEST cell reports ✓ OK) ───────────────\n",
+    "# Resume priority: local log -> Drive log (copied to local) -> os.walk (fallback).\n",
+    "if DL_LOG_LOCAL.exists():\n",
+    "    done_set = set(l.strip() for l in open(DL_LOG_LOCAL) if l.strip())\n",
+    "    print(f\"[log local] {len(done_set):,} images already downloaded (instant).\")\n",
+    "elif DL_LOG_DRIVE.exists():\n",
+    "    shutil.copy(DL_LOG_DRIVE, DL_LOG_LOCAL)        # new session: fetch checkpoint from Drive\n",
+    "    done_set = set(l.strip() for l in open(DL_LOG_LOCAL) if l.strip())\n",
+    "    print(f\"[log Drive] new session, read checkpoint: {len(done_set):,} images (instant).\")\n",
+    "else:\n",
+    "    print(\"No log yet -> scanning Drive once to build the log...\")\n",
+    "    done_set = set()\n",
+    "    froot = OUT / \"files\"\n",
+    "    if froot.exists():\n",
+    "        for dp, _, fns in os.walk(froot):\n",
+    "            for fn in fns:\n",
+    "                if fn.endswith(\".jpg\"):\n",
+    "                    done_set.add(os.path.relpath(os.path.join(dp, fn), OUT).replace(\"\\\\\", \"/\"))\n",
+    "    with open(DL_LOG_LOCAL, \"w\") as f:\n",
+    "        f.write(\"\\n\".join(sorted(done_set)) + (\"\\n\" if done_set else \"\"))\n",
+    "    if done_set:\n",
+    "        shutil.copy(DL_LOG_LOCAL, DL_LOG_DRIVE)\n",
+    "    print(f\"Log built: {len(done_set):,} images.\")\n",
+    "\n",
+    "todo = [r for r in all_rows if r[\"image_relpath\"] not in done_set]\n",
+    "n_done = len(all_rows) - len(todo)\n",
+    "print(f\"Present : {n_done:,} / {len(all_rows):,} ({n_done/len(all_rows)*100:.1f}%)\")\n",
+    "print(f\"To fetch: {len(todo):,}\")\n",
+    "\n",
+    "if not todo:\n",
+    "    print(\"\\n✓ Everything is downloaded.\")\n",
+    "    flush_log_to_drive()\n",
+    "else:\n",
+    "    res = Counter({\"skip\": n_done})\n",
+    "    try:\n",
+    "        with ThreadPoolExecutor(max_workers=12) as ex:\n",
+    "            futs = [ex.submit(dl, r) for r in todo]\n",
+    "            for f in tqdm(as_completed(futs), total=len(todo), desc=\"downloading\"):\n",
+    "                res[f.result().split(\"(\")[0]] += 1\n",
+    "    finally:\n",
+    "        flush_log_to_drive()                       # always save the log to Drive on finish/error\n",
+    "    print(dict(res))\n",
+    "    if any(k.startswith(\"fail\") for k in res):\n",
+    "        print(\"Some downloads failed -> rerun this cell (only the missing ones are fetched).\")"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Check whether any images are still missing (cross-check against the manifest). If so, rerun the download cell.\n",
+    "miss = {sp: [] for sp in (\"train\",\"val\",\"test\")}\n",
+    "for row in all_rows:\n",
+    "    p = OUT / row[\"image_relpath\"]\n",
+    "    if not (p.exists() and p.stat().st_size > 0):\n",
+    "        miss[row[\"split\"]].append(row[\"study_name\"])\n",
+    "print(\"missing images:\", {k: len(v) for k, v in miss.items()},\n",
+    "      \"| total:\", sum(len(v) for v in miss.values()))\n",
+    "# Print a few missing studies for cross-checking\n",
+    "for sp, names in miss.items():\n",
+    "    if names:\n",
+    "        print(f\"  [{sp}] missing {len(names)}, e.g.: {names[:5]}\")"
+   ]
   },
   {
    "cell_type": "markdown",
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Copy reports (keeping original paths) + vqa + manifest into the package\n",
+    "# reports: bundle/reports/files/pXX/.../sSTUDY.txt  ->  OUT/files/pXX/.../sSTUDY.txt\n",
+    "shutil.copytree(BUNDLE_DIR/\"reports\", OUT, dirs_exist_ok=True)\n",
+    "\n",
+    "for sp in (\"train\",\"val\",\"test\"):\n",
+    "    shutil.copy(BUNDLE_DIR/\"vqa\"/VQA_OUT[sp], OUT/VQA_OUT[sp])\n",
+    "    shutil.copy(BUNDLE_DIR/f\"manifest_{sp}.json\", OUT/f\"manifest_{sp}.json\")\n",
+    "    src_csv = BUNDLE_DIR/f\"manifest_{sp}.csv\"\n",
+    "    if src_csv.exists():\n",
+    "        shutil.copy(src_csv, OUT/f\"manifest_{sp}.csv\")\n",
+    "    nv = len(json.load(open(OUT/VQA_OUT[sp], encoding=\"utf-8\")))\n",
+    "    print(f\"{sp}: vqa={nv:,}  manifest copied\")\n",
+    "\n",
+    "n_rep = sum(1 for _ in (OUT/'files').rglob('s*.txt'))\n",
+    "print(f\"reports in package: {n_rep:,}\")"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Final sanity check: count by manifest (image + report actually exist)\n",
+    "for sp in (\"train\",\"val\",\"test\"):\n",
+    "    rows = manifests[sp]\n",
+    "    ni = sum(1 for x in rows if (OUT/x[\"image_relpath\"]).exists())\n",
+    "    nr = sum(1 for x in rows if (OUT/x[\"report_relpath\"]).exists())\n",
+    "    print(f\"{sp:5s}  manifest={len(rows):,}  images_ok={ni:,}  reports_ok={nr:,}\")\n",
+    "print(\"\\nPackage:\", OUT)"
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
+   "source": [
+    "## 4. Upload to Hugging Face\n",
+    "\n",
+    "Push to the **existing** repo `hieu3636/cxr-vlm-data`, under the subfolder `MIMIC-CXR_processed/` (uses `path_in_repo`; no new repo is created)."
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "RUN_HF = False   # ← enable when ready to push\n",
+    "if RUN_HF:\n",
+    "    from huggingface_hub import HfApi\n",
+    "    api = HfApi(token=HF_TOKEN)\n",
+    "    # the repo already exists -> exist_ok=True is a no-op, nothing new is created\n",
+    "    api.create_repo(HF_REPO_ID, repo_type=HF_REPO_TYPE, exist_ok=True)\n",
+    "    # upload_folder supports path_in_repo -> pushes into the MIMIC-CXR_processed subfolder\n",
+    "    # (rerun if interrupted: files already in the repo are skipped by hash)\n",
+    "    api.upload_folder(\n",
+    "        repo_id      = HF_REPO_ID,\n",
+    "        repo_type    = HF_REPO_TYPE,\n",
+    "        folder_path  = str(OUT),\n",
+    "        path_in_repo = HF_PATH_IN_REPO,\n",
+    "        commit_message = \"Add MIMIC-CXR_processed subset\",\n",
+    "    )\n",
+    "    print(\"pushed →\",\n",
+    "          f\"https://huggingface.co/{HF_REPO_TYPE}s/{HF_REPO_ID}/tree/main/{HF_PATH_IN_REPO}\")\n",
+    "else:\n",
+    "    print(\"RUN_HF=False: set to True to push to\",\n",
+    "          f\"{HF_REPO_ID}/{HF_PATH_IN_REPO}\")"
+   ]
   },
   {
    "cell_type": "markdown",
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "RUN_ZIP = False\n",
+    "if RUN_ZIP:\n",
+    "    # zip the whole package (keeping the original layout) into a single file\n",
+    "    shutil.make_archive(\"/content/MIMIC-CXR_processed\", \"zip\", OUT)\n",
+    "    shutil.copy(\"/content/MIMIC-CXR_processed.zip\", DRIVE/\"MIMIC-CXR_processed.zip\")\n",
+    "    print(\"zipped -> Drive/MIMIC-CXR_processed.zip\")\n",
+    "else:\n",
+    "    print(\"RUN_ZIP=False\")"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"=\"*54)\n",
+    "print(\"  PHASE 2 (COLAB) DONE\")\n",
+    "print(\"=\"*54)\n",
+    "for sp in (\"train\",\"val\",\"test\"):\n",
+    "    rows=manifests[sp]\n",
+    "    ni=sum(1 for x in rows if (OUT/x[\"image_relpath\"]).exists())\n",
+    "    print(f\"  {sp:5s}  images={ni:,}/{len(rows):,}\")\n",
+    "print(f\"  package: {OUT}\")\n",
+    "print(\"  original layout: files/pXX/pSUBJ/sSTUDY/<dicom>.jpg  +  .txt\")\n",
+    "print(\"  Flags: RUN_HF / RUN_ZIP (default False)\")\n",
+    "print(\"=\"*54)"
+   ]
   }
  ],
  "metadata": {
  },
  "nbformat": 4,
  "nbformat_minor": 5
+}
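
The sanity-check and summary cells in this notebook both count manifest rows whose files exist under `OUT`. A minimal sketch of that check as a reusable helper, which additionally lists the paths that are missing so an interrupted download can be resumed selectively. The helper name `check_manifest` and the toy paths are hypothetical; only the `image_relpath` and `report_relpath` keys are taken from the notebook.

```python
from pathlib import Path
import tempfile


def check_manifest(rows, root):
    """Return (images_ok, reports_ok, missing) for a list of manifest rows.

    Hypothetical helper: each row is assumed to be a dict with
    'image_relpath' and 'report_relpath', both relative to `root`.
    """
    root = Path(root)
    images_ok = reports_ok = 0
    missing = []
    for row in rows:
        if (root / row["image_relpath"]).exists():
            images_ok += 1
        else:
            missing.append(row["image_relpath"])
        if (root / row["report_relpath"]).exists():
            reports_ok += 1
        else:
            missing.append(row["report_relpath"])
    return images_ok, reports_ok, missing


# Toy package: the image exists, the report does not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    img = root / "files/p18/p18000001/s50000001/abc.jpg"
    img.parent.mkdir(parents=True)
    img.write_bytes(b"")
    rows = [{
        "image_relpath": "files/p18/p18000001/s50000001/abc.jpg",
        "report_relpath": "files/p18/p18000001/s50000001.txt",
    }]
    demo_result = check_manifest(rows, root)

print(demo_result)
```

Returning the missing paths, rather than only counts, makes it possible to feed them straight back into the download step.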

data/build_subset_local.ipynb CHANGED Viewed

The diff for this file is too large to render. See raw diff

data/eda_full.ipynb CHANGED Viewed

The diff for this file is too large to render. See raw diff

data/eda_p18.ipynb CHANGED Viewed

@@ -26,13 +26,43 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "from pathlib import Path\n\nDATA_DIR = Path(r\"D:\\USTH\\KLTN\\cxr-vlm-data\")\nCXR_ROOT = DATA_DIR / \"mimic-cxr-reports\"   # files/p10…p19/pXXXXXX/sYYYYYY.txt\n\nSPLIT_CSV    = DATA_DIR / \"mimic-cxr-2.0.0-split.csv\"\nMETA_CSV     = DATA_DIR / \"mimic-cxr-2.0.0-metadata.csv\"\nCHEXPERT_CSV = DATA_DIR / \"mimic-cxr-2.0.0-chexpert.csv\"\n\n_VQA_DIR = (DATA_DIR\n    / \"mimic-ext-mimic-cxr-vqa-a-complex-diverse-and-large-scale-visual-question-answering-dataset-for-chest-x-ray-images-1.0.0\"\n    / \"MIMIC-Ext-MIMIC-CXR-VQA\"\n    / \"dataset\")\nVQA_TRAIN = _VQA_DIR / \"train.json\"\nVQA_VALID = _VQA_DIR / \"valid.json\"\nVQA_TEST  = _VQA_DIR / \"test.json\"\n\n# Kiểm tra nhanh\nfor name, p in [(\"SPLIT_CSV\",    SPLIT_CSV),\n                (\"META_CSV\",     META_CSV),\n                (\"CHEXPERT_CSV\", CHEXPERT_CSV),\n                (\"CXR_ROOT\",     CXR_ROOT),\n                (\"VQA_TRAIN\",    VQA_TRAIN)]:\n    status = \"✓\" if p.exists() else \"✗ NOT FOUND\"\n    print(f\"  {status}  {name}: {p}\")\n\nprint(\"\\nPaths configured.\")"
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -61,6 +91,7 @@
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 1. Load & filter the p18 subset"
@@ -69,6 +100,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -89,6 +121,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -104,6 +137,7 @@
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 2. Overview of image & report counts per split"
@@ -112,6 +146,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -136,6 +171,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -153,6 +189,7 @@
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 3. Images per study (how many images does one study have?)"
@@ -161,6 +198,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -176,6 +214,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -192,6 +231,7 @@
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 4. View Position distribution (AP, PA, Lateral, ...)"
@@ -200,6 +240,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -212,6 +253,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -237,6 +279,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -256,27 +299,104 @@
   {
    "cell_type": "markdown",
    "id": "ae9f3d3c",
-   "source": "## 4b. Frontal-Only Sampling Strategy (AP > PA)\n\nTraining strategy: **1 report + 1 frontal image** per study.\n- Keep only AP or PA; if a study has both, **prefer AP**.\n- Studies with no frontal image → excluded from the training set.",
-   "metadata": {}
   },
   {
    "cell_type": "code",
    "id": "d2ce6beb",
-   "source": "frontal = df[df[\"ViewPosition\"].isin([\"AP\", \"PA\"])].copy()\n\n# Với mỗi study: chọn AP trước, nếu không có thì chọn PA (lấy 1 ảnh duy nhất)\ndef pick_frontal_view(group):\n    ap = group[group[\"ViewPosition\"] == \"AP\"]\n    if len(ap) > 0:\n        return ap.iloc[[0]]\n    return group[group[\"ViewPosition\"] == \"PA\"].iloc[[0]]\n\nfrontal_1img = (\n    frontal.groupby(\"study_id\", group_keys=False)\n    .apply(pick_frontal_view)\n    .reset_index(drop=True)\n)\n\n# Thống kê tổng quan\nn_study_total    = df[\"study_id\"].nunique()\nn_study_frontal  = frontal_1img[\"study_id\"].nunique()\nn_study_no_front = n_study_total - n_study_frontal\n\nprint(\"=== Frontal-Only Sampling (p18) ===\")\nprint(f\"Tổng số study                   : {n_study_total:,}\")\nprint(f\"Study có ảnh frontal (AP/PA)    : {n_study_frontal:,}  ({n_study_frontal/n_study_total*100:.1f}%)\")\nprint(f\"Study bị loại (không có frontal): {n_study_no_front:,}  ({n_study_no_front/n_study_total*100:.1f}%)\")\nprint()\nprint(f\"Ảnh được chọn theo view:\")\nprint(frontal_1img[\"ViewPosition\"].value_counts().to_string())\nprint()\nprint(\"=== Mẫu train sau khi filter (split) ===\")\nsplit_frontal = frontal_1img[\"split\"].value_counts().reindex([\"train\", \"validate\", \"test\"])\nsplit_all     = df.drop_duplicates(\"study_id\")[\"split\"].value_counts().reindex([\"train\", \"validate\", \"test\"])\ncompare = pd.DataFrame({\n    \"All studies\": split_all,\n    \"Frontal-only\": split_frontal,\n    \"Giảm (%)\": ((split_all - split_frontal) / split_all * 100).round(1)\n})\nprint(compare.to_string())",
    "metadata": {},
-   "execution_count": null,
-   "outputs": []
   },
   {
    "cell_type": "code",
    "id": "9d4aaf5c",
-   "source": "fig, axes = plt.subplots(1, 3, figsize=(14, 4))\n\n# 1. All vs Frontal-only (study count)\ncats = [\"All studies\", \"Frontal-only\"]\nvals = [n_study_total, n_study_frontal]\nbars = axes[0].bar(cats, vals, color=[\"#4C72B0\", \"#55A868\"], width=0.5)\naxes[0].bar_label(bars, fmt=\"%d\")\naxes[0].set_title(\"Study count: All vs Frontal-only\")\naxes[0].set_ylabel(\"Số study\")\n\n# 2. View breakdown của ảnh được chọn\nvc = frontal_1img[\"ViewPosition\"].value_counts()\naxes[1].pie(vc.values, labels=vc.index, autopct=\"%1.1f%%\",\n            colors=[\"#4C72B0\", \"#DD8452\"])\naxes[1].set_title(\"View được chọn (AP ưu tiên)\")\n\n# 3. So sánh train/val/test\nx = np.arange(3)\nw = 0.35\nsplits = [\"train\", \"validate\", \"test\"]\naxes[2].bar(x - w/2, split_all.values,     w, label=\"All\",          color=\"#4C72B0\", alpha=0.85)\naxes[2].bar(x + w/2, split_frontal.values, w, label=\"Frontal-only\", color=\"#55A868\", alpha=0.85)\naxes[2].set_xticks(x)\naxes[2].set_xticklabels(splits)\naxes[2].set_title(\"Frontal-only vs All (per split)\")\naxes[2].set_ylabel(\"Số study\")\naxes[2].legend()\n\nplt.suptitle(\"Frontal-Only Sampling Strategy — p18\", fontsize=13)\nplt.tight_layout()\nplt.show()",
    "metadata": {},
-   "execution_count": null,
-   "outputs": []
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 5. CheXpert Labels: 14 pathology labels"
@@ -285,6 +405,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -311,13 +432,33 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "# Headers hành chính — không phải findings\nADMIN_HEADERS = {\n    'EXAMINATION', 'INDICATION', 'CLINICAL INDICATION', 'TECHNIQUE',\n    'COMPARISON', 'HISTORY', 'REASON', 'REASON FOR EXAM',\n    'REASON FOR EXAMINATION', 'PROCEDURE', 'FINAL REPORT',\n    'NOTIFICATION', 'RECOMMENDATION', 'ADDENDUM'\n}\n\n# Detect section header: dòng bắt đầu bằng ALL-CAPS (có thể có space/dấu câu) rồi đến \":\"\nSECTION_RE = re.compile(r'^[ \\t]*([A-Z][A-Z ,/()\\-]{1,70}?):\\s*', re.MULTILINE)\n\ndef parse_report(txt_path: Path) -> dict:\n    \"\"\"\n    Parse report .txt thành dict {'findings': str|None, 'impression': str|None}.\n\n    Quy luật detect section: mọi header đều VIẾT HOA TOÀN BỘ và kết thúc bằng ':',\n    ví dụ: FINDINGS:, IMPRESSION:, FRONTAL AND LATERAL VIEWS OF THE CHEST:\n    → dùng regex bắt pattern đó, không hardcode từng keyword.\n\n    Nếu không có section FINDINGS tường minh, fallback lấy section\n    descriptive đầu tiên (không phải admin header).\n    \"\"\"\n    try:\n        text = txt_path.read_text(encoding=\"utf-8\", errors=\"ignore\")\n    except FileNotFoundError:\n        return {\"findings\": None, \"impression\": None}\n\n    matches = list(SECTION_RE.finditer(text))\n    if not matches:\n        return {\"findings\": None, \"impression\": None}\n\n    # Tách từng section thành (header, content)\n    sections = []\n    for i, m in enumerate(matches):\n        header  = m.group(1).strip()\n        start   = m.end()\n        end     = matches[i + 1].start() if i + 1 < len(matches) else len(text)\n        content = text[start:end].strip()\n        sections.append((header, content))\n\n    findings = impression = None\n    for header, content in sections:\n        h = header.upper()\n        if \"FINDING\" in h and findings is None:\n            findings = content or None\n        elif \"IMPRESSION\" in h and impression is None:\n            impression = content or None\n\n    # Fallback: không có FINDINGS tường minh → lấy section 
descriptive đầu tiên\n    if findings is None:\n        for header, content in sections:\n            h = header.upper()\n            if h not in ADMIN_HEADERS and \"IMPRESSION\" not in h and content:\n                findings = content\n                break\n\n    return {\"findings\": findings, \"impression\": impression}\n\n\n# Lấy danh sách unique studies trong p18\np18_studies = (\n    df[[\"subject_id\", \"study_id\"]]\n    .drop_duplicates(\"study_id\")\n    .reset_index(drop=True)\n)\n\nprint(f\"Số study cần parse: {len(p18_studies):,}\")\nprint(\"Parsing reports...\")\n\nrecords = []\nfor _, row in p18_studies.iterrows():\n    sid  = str(row[\"subject_id\"])\n    stid = str(row[\"study_id\"])\n    txt_path = CXR_ROOT / \"files\" / \"p18\" / f\"p{sid}\" / f\"s{stid}.txt\"\n    parsed = parse_report(txt_path)\n    records.append({\"study_id\": stid, **parsed})\n\nreport_df = pd.DataFrame(records)\nreport_df[\"findings_len\"]   = report_df[\"findings\"].str.split().str.len()\nreport_df[\"impression_len\"] = report_df[\"impression\"].str.split().str.len()\n\ntotal = len(report_df)\nprint(f\"\\nFindings   found : {report_df['findings'].notna().sum():,} / {total:,}  ({report_df['findings'].notna().mean()*100:.1f}%)\")\nprint(f\"Impression found : {report_df['impression'].notna().sum():,} / {total:,}  ({report_df['impression'].notna().mean()*100:.1f}%)\")\nboth    = (report_df['findings'].notna() & report_df['impression'].notna()).sum()\nneither = (report_df['findings'].isna()  & report_df['impression'].isna()).sum()\nprint(f\"Cả hai           : {both:,} / {total:,}  ({both/total*100:.1f}%)\")\nprint(f\"Không có cả hai  : {neither:,} / {total:,}  ({neither/total*100:.1f}%)\")"
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -338,6 +479,7 @@
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 6. Report analysis: Findings & Impression"
@@ -346,6 +488,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -405,6 +548,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -418,6 +562,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -451,6 +596,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -471,6 +617,7 @@
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 7. VQA: question & answer analysis"
@@ -479,6 +626,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -506,6 +654,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -517,6 +666,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -534,6 +684,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -559,31 +710,10 @@
     "plt.show()"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "c313b9c3",
-   "source": "### VQA × View Position: which view each Q&A sample belongs to",
-   "metadata": {}
-  },
-  {
-   "cell_type": "code",
-   "id": "0791482f",
-   "source": "# image_id trong VQA = dicom_id trong metadata\nvqa_view = vqa_p18.merge(\n    p18_meta[[\"dicom_id\", \"ViewPosition\"]],\n    left_on=\"image_id\", right_on=\"dicom_id\",\n    how=\"left\"\n)\n\nmissing_view_vqa = vqa_view[\"ViewPosition\"].isna().sum()\nvqa_view[\"ViewPosition\"] = vqa_view[\"ViewPosition\"].fillna(\"Unknown\")\n\nview_vqa_counts = vqa_view[\"ViewPosition\"].value_counts()\nprint(\"=== VQA samples theo View Position (p18) ===\")\nprint(view_vqa_counts.to_string())\nprint(f\"\\nKhông map được ViewPosition: {missing_view_vqa:,} ({missing_view_vqa/len(vqa_view)*100:.1f}%)\")",
-   "metadata": {},
-   "execution_count": null,
-   "outputs": []
-  },
-  {
-   "cell_type": "code",
-   "id": "049baaef",
-   "source": "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n\n# 1. Bar: số mẫu VQA theo view\nbars = axes[0].bar(view_vqa_counts.index, view_vqa_counts.values,\n                   color=sns.color_palette(\"Set2\", len(view_vqa_counts)))\naxes[0].bar_label(bars, fmt=\"%d\")\naxes[0].set_title(\"Số mẫu VQA theo View Position\")\naxes[0].set_ylabel(\"Số mẫu\")\n\n# 2. Pie\naxes[1].pie(view_vqa_counts.values, labels=view_vqa_counts.index,\n            autopct=\"%1.1f%%\", colors=sns.color_palette(\"Set2\", len(view_vqa_counts)))\naxes[1].set_title(\"Tỉ lệ VQA theo View Position\")\n\n# 3. Semantic type × View (stacked bar)\nsem_view = vqa_view.groupby([\"ViewPosition\", \"semantic_type\"]).size().unstack(fill_value=0)\nsem_view.plot(kind=\"bar\", ax=axes[2], color=sns.color_palette(\"Set1\", sem_view.shape[1]),\n              width=0.7, stacked=True)\naxes[2].set_title(\"Semantic Type × View Position\")\naxes[2].set_xlabel(\"View Position\")\naxes[2].set_ylabel(\"Số mẫu\")\naxes[2].tick_params(axis=\"x\", rotation=30)\naxes[2].legend(title=\"Semantic Type\", fontsize=8)\n\nplt.suptitle(\"VQA × View Position — p18\", fontsize=13)\nplt.tight_layout()\nplt.show()\n\n# Content type × View\nprint(\"\\nContent type theo View Position:\")\nprint(vqa_view.groupby([\"ViewPosition\", \"content_type\"]).size()\n      .unstack(fill_value=0).to_string())",
-   "metadata": {},
-   "execution_count": null,
-   "outputs": []
-  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -602,6 +732,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -627,6 +758,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -657,6 +789,7 @@
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 8. Additional checks: Missing data & Data Quality"
@@ -665,6 +798,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -682,6 +816,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -693,6 +828,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -711,6 +847,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -733,6 +870,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -743,12 +881,14 @@
     "\n",
     "    res_counts = df.groupby([\"Rows\", \"Columns\"]).size().sort_values(ascending=False).head(15)\n",
     "    print(\"\\nTop-15 resolutions:\")\n",
-    "    print(res_counts.to_string())\nelse:\n",
     "    print(\"The Rows/Columns columns are not in the metadata.\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 9. Summary"
@@ -757,6 +897,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -789,9 +930,9 @@
   },
   "language_info": {
    "name": "python",
-   "version": "3.10.0"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}

   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "a4238924",
    "metadata": {},
    "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "\n",
+    "DATA_DIR = Path(r\"D:\\USTH\\KLTN\\cxr-vlm-data\")\n",
+    "CXR_ROOT = DATA_DIR / \"mimic-cxr-reports\"   # files/p10…p19/pXXXXXX/sYYYYYY.txt\n",
+    "\n",
+    "SPLIT_CSV    = DATA_DIR / \"mimic-cxr-2.0.0-split.csv\"\n",
+    "META_CSV     = DATA_DIR / \"mimic-cxr-2.0.0-metadata.csv\"\n",
+    "CHEXPERT_CSV = DATA_DIR / \"mimic-cxr-2.0.0-chexpert.csv\"\n",
+    "\n",
+    "_VQA_DIR = (DATA_DIR\n",
+    "    / \"mimic-ext-mimic-cxr-vqa-a-complex-diverse-and-large-scale-visual-question-answering-dataset-for-chest-x-ray-images-1.0.0\"\n",
+    "    / \"MIMIC-Ext-MIMIC-CXR-VQA\"\n",
+    "    / \"dataset\")\n",
+    "VQA_TRAIN = _VQA_DIR / \"train.json\"\n",
+    "VQA_VALID = _VQA_DIR / \"valid.json\"\n",
+    "VQA_TEST  = _VQA_DIR / \"test.json\"\n",
+    "\n",
+    "# Quick check\n",
+    "for name, p in [(\"SPLIT_CSV\",    SPLIT_CSV),\n",
+    "                (\"META_CSV\",     META_CSV),\n",
+    "                (\"CHEXPERT_CSV\", CHEXPERT_CSV),\n",
+    "                (\"CXR_ROOT\",     CXR_ROOT),\n",
+    "                (\"VQA_TRAIN\",    VQA_TRAIN)]:\n",
+    "    status = \"✓\" if p.exists() else \"✗ NOT FOUND\"\n",
+    "    print(f\"  {status}  {name}: {p}\")\n",
+    "\n",
+    "print(\"\\nPaths configured.\")"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "99828a70",
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "markdown",
+   "id": "4674dd4f",
    "metadata": {},
    "source": [
     "## 1. Load & filter the p18 subset"
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "9f1d59fe",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "6657d6ec",
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "markdown",
+   "id": "5a7bff47",
    "metadata": {},
    "source": [
     "## 2. Overview of image & report counts per split"
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "81be327d",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "80fa39e7",
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "markdown",
+   "id": "4fed2aa0",
    "metadata": {},
    "source": [
     "## 3. Images per study (how many images does one study have?)"
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "39b23ccb",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "b8c6560b",
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "markdown",
+   "id": "0262e14a",
    "metadata": {},
    "source": [
     "## 4. View Position distribution (AP, PA, Lateral, ...)"
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "cad06cc2",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "d86b2102",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "d8f24892",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "markdown",
    "id": "ae9f3d3c",
+   "metadata": {},
+   "source": [
+    "## 4b. Frontal-Only Sampling Strategy (AP > PA)\n",
+    "\n",
+    "Training strategy: **1 report + 1 frontal image** per study.\n",
+    "- Keep only AP or PA; if a study has both, **prefer AP**.\n",
+    "- Studies with no frontal image → excluded from the training set."
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "d2ce6beb",
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "frontal = df[df[\"ViewPosition\"].isin([\"AP\", \"PA\"])].copy()\n",
+    "\n",
+    "# For each study: pick AP first, otherwise fall back to PA (exactly 1 image per study)\n",
+    "def pick_frontal_view(group):\n",
+    "    ap = group[group[\"ViewPosition\"] == \"AP\"]\n",
+    "    if len(ap) > 0:\n",
+    "        return ap.iloc[[0]]\n",
+    "    return group[group[\"ViewPosition\"] == \"PA\"].iloc[[0]]\n",
+    "\n",
+    "frontal_1img = (\n",
+    "    frontal.groupby(\"study_id\", group_keys=False)\n",
+    "    .apply(pick_frontal_view)\n",
+    "    .reset_index(drop=True)\n",
+    ")\n",
+    "\n",
+    "# Summary statistics\n",
+    "n_study_total    = df[\"study_id\"].nunique()\n",
+    "n_study_frontal  = frontal_1img[\"study_id\"].nunique()\n",
+    "n_study_no_front = n_study_total - n_study_frontal\n",
+    "\n",
+    "print(\"=== Frontal-Only Sampling (p18) ===\")\n",
+    "print(f\"Total studies                      : {n_study_total:,}\")\n",
+    "print(f\"Studies with a frontal view (AP/PA): {n_study_frontal:,}  ({n_study_frontal/n_study_total*100:.1f}%)\")\n",
+    "print(f\"Studies dropped (no frontal view)  : {n_study_no_front:,}  ({n_study_no_front/n_study_total*100:.1f}%)\")\n",
+    "print()\n",
+    "print(\"Selected images by view:\")\n",
+    "print(frontal_1img[\"ViewPosition\"].value_counts().to_string())\n",
+    "print()\n",
+    "print(\"=== Samples per split after filtering ===\")\n",
+    "split_frontal = frontal_1img[\"split\"].value_counts().reindex([\"train\", \"validate\", \"test\"])\n",
+    "split_all     = df.drop_duplicates(\"study_id\")[\"split\"].value_counts().reindex([\"train\", \"validate\", \"test\"])\n",
+    "compare = pd.DataFrame({\n",
+    "    \"All studies\": split_all,\n",
+    "    \"Frontal-only\": split_frontal,\n",
+    "    \"Reduction (%)\": ((split_all - split_frontal) / split_all * 100).round(1)\n",
+    "})\n",
+    "print(compare.to_string())"
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "id": "9d4aaf5c",
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, axes = plt.subplots(1, 3, figsize=(14, 4))\n",
+    "\n",
+    "# 1. All vs Frontal-only (study count)\n",
+    "cats = [\"All studies\", \"Frontal-only\"]\n",
+    "vals = [n_study_total, n_study_frontal]\n",
+    "bars = axes[0].bar(cats, vals, color=[\"#4C72B0\", \"#55A868\"], width=0.5)\n",
+    "axes[0].bar_label(bars, fmt=\"%d\")\n",
+    "axes[0].set_title(\"Study count: All vs Frontal-only\")\n",
+    "axes[0].set_ylabel(\"Number of studies\")\n",
+    "\n",
+    "# 2. View breakdown of the selected images\n",
+    "vc = frontal_1img[\"ViewPosition\"].value_counts()\n",
+    "axes[1].pie(vc.values, labels=vc.index, autopct=\"%1.1f%%\",\n",
+    "            colors=[\"#4C72B0\", \"#DD8452\"])\n",
+    "axes[1].set_title(\"Selected view (AP preferred)\")\n",
+    "\n",
+    "# 3. Compare train/val/test\n",
+    "x = np.arange(3)\n",
+    "w = 0.35\n",
+    "splits = [\"train\", \"validate\", \"test\"]\n",
+    "axes[2].bar(x - w/2, split_all.values,     w, label=\"All\",          color=\"#4C72B0\", alpha=0.85)\n",
+    "axes[2].bar(x + w/2, split_frontal.values, w, label=\"Frontal-only\", color=\"#55A868\", alpha=0.85)\n",
+    "axes[2].set_xticks(x)\n",
+    "axes[2].set_xticklabels(splits)\n",
+    "axes[2].set_title(\"Frontal-only vs All (per split)\")\n",
+    "axes[2].set_ylabel(\"Number of studies\")\n",
+    "axes[2].legend()\n",
+    "\n",
+    "plt.suptitle(\"Frontal-Only Sampling Strategy — p18\", fontsize=13)\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
   },
   {
    "cell_type": "markdown",
+   "id": "28847d0b",
    "metadata": {},
    "source": [
     "## 5. CheXpert Labels: 14 pathology labels"
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "410fbdbe",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "50c9a91d",
    "metadata": {},
    "outputs": [],
+   "source": [
+    "fig, ax = plt.subplots(figsize=(12, 5))\n",
+    "x = np.arange(len(label_cols))\n",
+    "w = 0.25\n",
+    "\n",
+    "ordered_labels = label_summary.sort_values(\"Positive\", ascending=False).index.tolist()\n",
+    "\n",
+    "ax.bar(x - w, label_summary.loc[ordered_labels, \"Positive\"],  width=w, label=\"Positive\",   color=\"#e74c3c\")\n",
+    "ax.bar(x,     label_summary.loc[ordered_labels, \"Uncertain\"], width=w, label=\"Uncertain\",  color=\"#f39c12\")\n",
+    "ax.bar(x + w, label_summary.loc[ordered_labels, \"Negative\"],  width=w, label=\"Negative\",   color=\"#2ecc71\")\n",
+    "\n",
+    "ax.set_xticks(x)\n",
+    "ax.set_xticklabels(ordered_labels, rotation=40, ha=\"right\", fontsize=9)\n",
+    "ax.set_ylabel(\"Number of studies\")\n",
+    "ax.set_title(\"CheXpert Labels — Positive / Uncertain / Negative (p18)\")\n",
+    "ax.legend()\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "1e1209c5",
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "markdown",
+   "id": "f0aa1ba8",
    "metadata": {},
    "source": [
     "## 6. Report analysis: Findings & Impression"
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "8b1e562e",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "c49a401a",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "942959d1",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "170b0971",
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "markdown",
+   "id": "5512e3aa",
    "metadata": {},
    "source": [
     "## 7. VQA: question & answer analysis"
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "7caa394c",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "ddb012a8",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "86eec60e",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "4567a110",
    "metadata": {},
    "outputs": [],
    "source": [
     "plt.show()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "f968f772",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "97179573",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "9ffe116e",
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "markdown",
+   "id": "37f8ee29",
    "metadata": {},
    "source": [
     "## 8. Additional checks: Missing data & Data Quality"
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "c0a10b57",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "f2fe0d2e",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "4b3a9176",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "ea4da928",
    "metadata": {},
    "outputs": [],
    "source": [
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "9b990ae5",
    "metadata": {},
    "outputs": [],
    "source": [
     "\n",
     "    res_counts = df.groupby([\"Rows\", \"Columns\"]).size().sort_values(ascending=False).head(15)\n",
     "    print(\"\\nTop-15 resolutions:\")\n",
+    "    print(res_counts.to_string())\n",
+    "else:\n",
     "    print(\"The Rows/Columns columns are not in the metadata.\")"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "a03900eb",
    "metadata": {},
    "source": [
     "## 9. Summary"
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "f8cc6c50",
    "metadata": {},
    "outputs": [],
    "source": [
   },
   "language_info": {
    "name": "python",
+   "version": "3.11.7"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 5
+}
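
Section 4b of this notebook picks one frontal image per study, preferring AP over PA. The same rule can be sketched without pandas as a single pass over image records, which may be handy for unit-testing the selection logic. This is an illustration under assumptions: `pick_frontal` and the toy records are hypothetical, and the field names mirror the `study_id` / `ViewPosition` columns used in the notebook.

```python
def pick_frontal(images):
    """Map study_id -> chosen frontal image record (hypothetical helper).

    Rule, as described in section 4b: keep only AP/PA images; within a
    study take the first AP if one exists, otherwise the first PA.
    """
    chosen = {}
    for img in images:
        view = img["ViewPosition"]
        if view not in ("AP", "PA"):
            continue  # lateral or unknown views are never selected
        current = chosen.get(img["study_id"])
        # Fill an empty slot, or upgrade an earlier PA pick to AP.
        if current is None or (current["ViewPosition"] == "PA" and view == "AP"):
            chosen[img["study_id"]] = img
    return chosen


# Toy records (made up for illustration).
records = [
    {"study_id": 1, "dicom_id": "a", "ViewPosition": "PA"},
    {"study_id": 1, "dicom_id": "b", "ViewPosition": "AP"},
    {"study_id": 1, "dicom_id": "c", "ViewPosition": "AP"},
    {"study_id": 2, "dicom_id": "d", "ViewPosition": "LATERAL"},
    {"study_id": 3, "dicom_id": "e", "ViewPosition": "PA"},
]
selected = pick_frontal(records)
print({sid: img["dicom_id"] for sid, img in selected.items()})
```

Study 2 has no frontal image, so it does not appear in the result, which corresponds to the studies the notebook counts as dropped.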

data/eda_reports.ipynb CHANGED Viewed

The diff for this file is too large to render. See raw diff

data/img_stat.py ADDED Viewed

	@@ -0,0 +1,82 @@

+import pandas as pd
+import matplotlib.pyplot as plt
+# Read the CSV file
+# Replace this path with the location of your own metadata CSV
+df = pd.read_csv(r"D:\USTH\KLTN\cxr-vlm-data\mimic-cxr-2.0.0-metadata.csv")
+# Check that the required columns exist
+required_cols = ["Rows", "Columns"]
+for col in required_cols:
+    if col not in df.columns:
+        raise ValueError(f"Missing column: {col}")
+# Add an image-area column (total pixel count)
+df["TotalPixels"] = df["Rows"] * df["Columns"]
+# Basic statistics
+print("===== IMAGE SIZE STATISTICS =====")
+print(df[["Rows", "Columns", "TotalPixels"]].describe())
+# Aspect ratio (width / height)
+df["AspectRatio"] = df["Columns"] / df["Rows"]
+print("\n===== ASPECT RATIO STATISTICS =====")
+print(df["AspectRatio"].describe())
+# -------------------------
+# Histogram of image heights
+# -------------------------
+plt.figure(figsize=(8, 5))
+plt.hist(df["Rows"], bins=30)
+plt.xlabel("Rows (Height)")
+plt.ylabel("Number of Images")
+plt.title("Distribution of Image Heights")
+plt.grid(True)
+plt.show()
+# -------------------------
+# Histogram of image widths
+# -------------------------
+plt.figure(figsize=(8, 5))
+plt.hist(df["Columns"], bins=30)
+plt.xlabel("Columns (Width)")
+plt.ylabel("Number of Images")
+plt.title("Distribution of Image Widths")
+plt.grid(True)
+plt.show()
+# -------------------------
+# Histogram of total pixel counts
+# -------------------------
+plt.figure(figsize=(8, 5))
+plt.hist(df["TotalPixels"], bins=30)
+plt.xlabel("Total Pixels")
+plt.ylabel("Number of Images")
+plt.title("Distribution of Image Sizes")
+plt.grid(True)
+plt.show()
+# -------------------------
+# Scatter plot Width vs Height
+# -------------------------
+plt.figure(figsize=(7, 7))
+plt.scatter(df["Columns"], df["Rows"], alpha=0.5)
+plt.xlabel("Width (Columns)")
+plt.ylabel("Height (Rows)")
+plt.title("Image Resolution Distribution")
+plt.grid(True)
+plt.show()
+# -------------------------
+# Most common resolutions
+# -------------------------
+resolution_counts = (
+    df.groupby(["Rows", "Columns"])
+      .size()
+      .reset_index(name="Count")
+      .sort_values("Count", ascending=False)
+)
+print("\n===== MOST COMMON RESOLUTIONS =====")
+print(resolution_counts.head(10))

dev/eval_labelers.py CHANGED Viewed

@@ -11,15 +11,15 @@ from sklearn.metrics import (
 )
 # ── Configuration: edit these 4 lines ────────────────────────────────────────
-CHEXPERT_PATH = r"mimic-cxr-2.0.0-chexpert.csv.gz"
-NEGBIO_PATH   = r"mimic-cxr-2.0.0-negbio.csv.gz"
-GT_PATH       = r"mimic-cxr-2.1.0-test-set-labeled.csv"
+CHEXPERT_PATH = r"D:\USTH\KLTN\cxr-vlm-data\mimic-cxr-2.0.0-chexpert.csv"
+NEGBIO_PATH   = r"D:\USTH\KLTN\cxr-vlm-data\mimic-cxr-2.0.0-negbio.csv"
+GT_PATH       = r"D:\USTH\KLTN\cxr-vlm-data\mimic-cxr-2.1.0-test-set-labeled.csv"
 # How to handle uncertain labels (-1):
 #   "positive" → treat as disease present (default, conservative)
 #   "negative" → treat as disease absent
 #   "drop"     → drop every study that has an uncertain label
-UNCERTAIN_STRATEGY = "positive"
+UNCERTAIN_STRATEGY = "negative"
 # ─────────────────────────────────────────────────────────────────────────────
 PATHOLOGIES = [
@@ -133,7 +133,7 @@ def main():
     summary = pd.DataFrame([res_chx, res_neg]).set_index("tool")
     print("=" * 60)
-    print("OVERALL METRICS (uncertain strategy: '{}')".format(args.uncertain))
+    print("OVERALL METRICS (uncertain strategy: '{}')".format(UNCERTAIN_STRATEGY))
     print("=" * 60)
     print(summary.to_string(float_format="{:.4f}".format))
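
The configuration comment in `dev/eval_labelers.py` names three ways to resolve uncertain (-1) labels, but the code that applies them is outside this hunk. The following is a minimal sketch of what each option could mean on a label table with one row per study. The helper name `apply_uncertain_strategy` and the demo frame are hypothetical and are not taken from the script.

```python
import pandas as pd


def apply_uncertain_strategy(labels: pd.DataFrame, strategy: str) -> pd.DataFrame:
    """Resolve uncertain (-1) labels. Hypothetical helper, not from eval_labelers.py."""
    if strategy == "positive":
        # Treat every uncertain mention as disease present.
        return labels.replace(-1, 1)
    if strategy == "negative":
        # Treat every uncertain mention as disease absent.
        return labels.replace(-1, 0)
    if strategy == "drop":
        # Keep only the rows (studies) that have no uncertain label at all.
        has_uncertain = (labels == -1).any(axis=1)
        return labels[~has_uncertain]
    raise ValueError(f"Unknown strategy: {strategy!r}")


# Illustrative data only: three studies, two pathologies.
demo = pd.DataFrame({"Edema": [1, -1, 0], "Pneumonia": [0, 0, -1]})
for name in ("positive", "negative", "drop"):
    print(name)
    print(apply_uncertain_strategy(demo, name))
```

Because "drop" changes the number of evaluated studies while the other two only change label values, metrics computed under different strategies are not directly comparable.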

dev/labeler_comparison.csv ADDED Viewed

	@@ -0,0 +1,14 @@

+,CheXpert_F1,NegBio_F1,Winner,Δ
+Atelectasis,0.9465648854961832,0.9518987341772152,NegBio,-0.0053
+Cardiomegaly,0.8363636363636363,0.8466666666666667,NegBio,-0.0103
+Consolidation,0.832,0.907563025210084,NegBio,-0.0756
+Edema,0.8897058823529411,0.8805970149253731,CheXpert,0.0091
+Enlarged Cardiomediastinum,0.5,0.49019607843137253,CheXpert,0.0098
+Fracture,0.8131868131868132,0.8470588235294118,NegBio,-0.0339
+Lung Lesion,0.7889908256880734,0.7850467289719626,CheXpert,0.0039
+No Finding,0.5376344086021505,0.5252525252525253,CheXpert,0.0124
+Pleural Effusion,0.9265536723163842,0.9274809160305344,NegBio,-0.0009
+Pleural Other,0.4918032786885246,0.5,NegBio,-0.0082
+Pneumonia,0.5362318840579711,0.5801526717557252,NegBio,-0.0439
+Pneumothorax,0.7083333333333334,0.7560975609756098,NegBio,-0.0478
+Support Devices,0.8830645161290323,0.8806584362139918,CheXpert,0.0024
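
Every row of the table above is consistent with Δ = CheXpert_F1 - NegBio_F1 rounded to four decimals, with Winner naming the labeler that has the higher F1. The code that wrote the CSV is not part of this diff, so the sketch below only shows one way such columns could be derived. The function name `add_winner_and_delta` is hypothetical, and sending ties to NegBio is an arbitrary assumption, since the table contains no ties.

```python
import pandas as pd


def add_winner_and_delta(f1: pd.DataFrame) -> pd.DataFrame:
    """Append Winner and Δ columns. Hypothetical helper, not from the repository."""
    out = f1.copy()
    # Positive delta means CheXpert scored higher; negative means NegBio did.
    delta = out["CheXpert_F1"] - out["NegBio_F1"]
    # Assumption: a tie (delta == 0) is assigned to NegBio.
    out["Winner"] = ["CheXpert" if d > 0 else "NegBio" for d in delta]
    out["Δ"] = delta.round(4)
    return out


# Two rows copied from labeler_comparison.csv.
f1 = pd.DataFrame(
    {
        "CheXpert_F1": [0.8897058823529411, 0.832],
        "NegBio_F1": [0.8805970149253731, 0.907563025210084],
    },
    index=["Edema", "Consolidation"],
)
result = add_winner_and_delta(f1)
print(result)
```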