{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# EDA: MIMIC-CXR Full Dataset\n", "\n", "**Datasets used:**\n", "- `MIMIC-CXR-JPG` (v2.1.0): JPG images + CSV metadata\n", "- `MIMIC-CXR` (v2.1.0): `.txt` reports (Findings / Impression)\n", "- `MIMIC-Ext-MIMIC-CXR-VQA` (v1.0.0): VQA question/answer pairs\n", "\n", "**Scope:** the full dataset (all subsets p10–p19).\n", "\n", "> ℹ️ **No JPG download is needed** to run this notebook: the entire EDA relies on the CSV files, the .txt reports, and the VQA .json files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Path configuration" ] }, { "cell_type": "code", "execution_count": 109, "id": "55d00e6c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ✓ SPLIT_CSV: D:\\USTH\\KLTN\\cxr-vlm-data\\mimic-cxr-2.0.0-split.csv\n", " ✓ META_CSV: D:\\USTH\\KLTN\\cxr-vlm-data\\mimic-cxr-2.0.0-metadata.csv\n", " ✓ CHEXPERT_CSV: D:\\USTH\\KLTN\\cxr-vlm-data\\mimic-cxr-2.0.0-chexpert.csv\n", " ✓ CXR_ROOT: D:\\USTH\\KLTN\\cxr-vlm-data\\mimic-cxr-reports\n", " ✓ VQA_TRAIN: D:\\USTH\\KLTN\\cxr-vlm-data\\mimic-ext-mimic-cxr-vqa-a-complex-diverse-and-large-scale-visual-question-answering-dataset-for-chest-x-ray-images-1.0.0\\MIMIC-Ext-MIMIC-CXR-VQA\\dataset\\train.json\n", "\n", "Paths configured.\n" ] } ], "source": [ "from pathlib import Path\n", "\n", "DATA_DIR = Path(r\"D:\\USTH\\KLTN\\cxr-vlm-data\")\n", "CXR_ROOT = DATA_DIR / \"mimic-cxr-reports\" # files/p10…p19/pXXXXXX/sYYYYYY.txt (full dataset)\n", "\n", "SPLIT_CSV = DATA_DIR / \"mimic-cxr-2.0.0-split.csv\"\n", "META_CSV = DATA_DIR / \"mimic-cxr-2.0.0-metadata.csv\"\n", "CHEXPERT_CSV = DATA_DIR / \"mimic-cxr-2.0.0-chexpert.csv\"\n", "\n", "_VQA_DIR = (DATA_DIR\n", "            / \"mimic-ext-mimic-cxr-vqa-a-complex-diverse-and-large-scale-visual-question-answering-dataset-for-chest-x-ray-images-1.0.0\"\n", "            / \"MIMIC-Ext-MIMIC-CXR-VQA\"\n", "            / \"dataset\")\n", "VQA_TRAIN = _VQA_DIR / \"train.json\"\n", "VQA_VALID = _VQA_DIR / \"valid.json\"\n", 
"VQA_TEST = _VQA_DIR / \"test.json\"\n", "\n", "# None = parse every report (~227k studies, takes 10-20 minutes)\n", "# Integer = parse a random sample of that size for a faster run\n", "REPORT_SAMPLE_SIZE = None\n", "\n", "# Quick check that the main paths exist\n", "for name, p in [(\"SPLIT_CSV\", SPLIT_CSV),\n", "                (\"META_CSV\", META_CSV),\n", "                (\"CHEXPERT_CSV\", CHEXPERT_CSV),\n", "                (\"CXR_ROOT\", CXR_ROOT),\n", "                (\"VQA_TRAIN\", VQA_TRAIN)]:\n", "    status = \"✓\" if p.exists() else \"✗ NOT FOUND\"\n", "    print(f\" {status} {name}: {p}\")\n", "\n", "print(\"\\nPaths configured.\")" ] }, { "cell_type": "code", "execution_count": 110, "id": "6705bed1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Libraries imported.\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import json\n", "import re\n", "import matplotlib.pyplot as plt\n", "import matplotlib.ticker as mticker\n", "import seaborn as sns\n", "from collections import Counter\n", "\n", "sns.set_theme(style=\"whitegrid\", palette=\"muted\")\n", "plt.rcParams[\"figure.dpi\"] = 120\n", "plt.rcParams[\"figure.figsize\"] = (11, 4)\n", "\n", "CHEXPERT_LABELS = [\n", "    \"Atelectasis\", \"Cardiomegaly\", \"Consolidation\", \"Edema\",\n", "    \"Enlarged Cardiomediastinum\", \"Fracture\", \"Lung Lesion\",\n", "    \"Lung Opacity\", \"No Finding\", \"Pleural Effusion\",\n", "    \"Pleural Other\", \"Pneumonia\", \"Pneumothorax\", \"Support Devices\"\n", "]\n", "\n", "# Subset folders p10–p19\n", "ALL_SUBSETS = [f\"p{i}\" for i in range(10, 20)]\n", "\n", "print(\"Libraries imported.\")" ] }, { "cell_type": "markdown", "id": "95d72d6d", "metadata": {}, "source": [ "## 1. 
Load CSV files" ] }, { "cell_type": "code", "execution_count": 111, "id": "05afd02e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "split.csv — total images : 377,110\n", "metadata — total images : 377,110\n", "chexpert — total studies : 227,827\n", "\n", "Subsets found in split.csv:\n", "subset\n", "p10 36681\n", "p11 38535\n", "p12 37197\n", "p13 37857\n", "p14 37468\n", "p15 38980\n", "p16 37098\n", "p17 37688\n", "p18 37958\n", "p19 37648\n" ] } ], "source": [ "split_df = pd.read_csv(SPLIT_CSV)\n", "meta_df = pd.read_csv(META_CSV)\n", "chexpert_df = pd.read_csv(CHEXPERT_CSV)\n", "\n", "# Derive the subset folder (p10, p11, ..., p19) from the first two digits of subject_id\n", "def get_subset(subject_id):\n", "    return \"p\" + str(subject_id)[:2]\n", "\n", "# Add the subset column to all three tables\n", "for df_ in [split_df, meta_df, chexpert_df]:\n", "    df_[\"subset\"] = df_[\"subject_id\"].map(get_subset)\n", "\n", "print(f\"split.csv — total images : {len(split_df):,}\")\n", "print(f\"metadata — total images : {len(meta_df):,}\")\n", "print(f\"chexpert — total studies : {len(chexpert_df):,}\")\n", "print(\"\\nSubsets found in split.csv:\")\n", "print(split_df[\"subset\"].value_counts().sort_index().to_string())" ] }, { "cell_type": "code", "execution_count": 112, "id": "2d17213f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Merged shape: (377110, 8)\n" ] }, { "data": { "text/html": [ "
|  | dicom_id | study_id | subject_id | split | subset | ViewPosition | Rows | Columns |
|---|---|---|---|---|---|---|---|---|
| 0 | 02aa804e-bde0afdd-112c0b34-7bc16630-4e384014 | 50414267 | 10000032 | train | p10 | PA | 3056 | 2544 |
| 1 | 174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962 | 50414267 | 10000032 | train | p10 | LATERAL | 3056 | 2544 |
| 2 | 2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab | 53189527 | 10000032 | train | p10 | PA | 3056 | 2544 |