Data Preparation

Overview

The data pipeline operates in three phases, each tuned to a different bottleneck: (1) selection runs on a local SSD where the CSV metadata of the full MIMIC-CXR distribution lives, (2) image download + HF upload runs on Google Colab so PhysioNet's authenticated transfer can saturate a cloud network, and (3) resize + shard runs on the eventual training box so the JPEGs we actually consume are minimal in size and transfer count. The outputs flow into the unified instruction JSON that the trainer reads.

The same subset is reused across all experiments, so this pipeline is run once; downstream changes (report_mode, prompt template variants, etc.) only rebuild the JSON, not the images.

Data Sources

Three primary PhysioNet resources plus three auxiliary CSVs are required (all credentialed access):

Source	Version	Used for
MIMIC-CXR	2.1.0	Radiology report `.txt` files (`files/pXX/pSUBJ/sSTUDY.txt`)
MIMIC-CXR-JPG	2.1.0	Pre-converted JPEGs (`files/pXX/pSUBJ/sSTUDY/<dicom>.jpg`)
MIMIC-Ext-CXR-VQA	1.0.0	1.4 M (image, question, answer) triples
`mimic-cxr-2.0.0-split.csv`	—	Official patient-disjoint train / validate / test split
`mimic-cxr-2.0.0-metadata.csv`	—	DICOM-level metadata; ViewPosition column used to identify frontal views
`mimic-cxr-2.0.0-chexpert.csv`	—	14 pathology labels per study, automatically extracted from the reference reports by the CheXpert labeler

No additional manual annotation is performed; all labels come from the public PhysioNet distribution.

Selection Pipeline (Phase 1, local)

A four-stage filter chain reduces the ~227 k MIMIC-CXR studies to a 50 k working subset.

(a) Frontal-only filtering. Each DICOM is joined with the metadata CSV by dicom_id. Only rows where ViewPosition ∈ {PA, AP} are kept. A study with multiple frontal views (e.g. both PA and AP exposures) is collapsed to a single image, preferring PA over AP (PA is the standard reference projection, AP is reserved for bedside / supine portables). After this step every retained study contributes exactly one image.
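The collapse rule in step (a) can be sketched as follows. This is a minimal illustration, not the thesis code: the function name pick_frontal and the sample rows are hypothetical, and the tie-break between two images of the same view (first one wins) is an assumption the text does not specify.

```python
import csv
import io

# Preference order: lower rank wins, so PA beats AP.
VIEW_RANK = {"PA": 0, "AP": 1}

def pick_frontal(rows):
    """Return {study_id: row}, keeping at most one frontal image per study."""
    best = {}
    for row in rows:
        view = row["ViewPosition"]
        if view not in VIEW_RANK:
            continue  # lateral, blank, or other projections are dropped
        current = best.get(row["study_id"])
        if current is None or VIEW_RANK[view] < VIEW_RANK[current["ViewPosition"]]:
            best[row["study_id"]] = row
    return best

# Made-up excerpt in the shape of the metadata CSV (IDs are illustrative).
sample = io.StringIO(
    "dicom_id,study_id,ViewPosition\n"
    "d1,s1,AP\n"
    "d2,s1,PA\n"
    "d3,s1,LATERAL\n"
    "d4,s2,LATERAL\n"
    "d5,s3,AP\n"
)
selected = pick_frontal(csv.DictReader(sample))
print({sid: row["dicom_id"] for sid, row in selected.items()})
# prints: {'s1': 'd2', 's3': 'd5'}
```

Study s2 disappears because it has no frontal view, and s1 keeps its PA image even though the AP row was seen first.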

(b) Report parsing. Each study's .txt is scanned by a strict regex that recognises section headers in the form ^[A-Z ,/().-]+: and accepts a body only if the header is exactly FINDINGS or IMPRESSION after stripping and upper-casing. We deliberately do not merge synonyms such as CONCLUSION, WET READ, INDICATION, or composite headers like FINDINGS AND IMPRESSION — these vary substantially in style and would dilute the training signal. A study survives this stage only if both sections are present and non-empty.
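A minimal sketch of the strict section parser described in step (b), using the header pattern quoted above. The function name parse_report, the whitespace normalisation of the body, and the sample report text are assumptions made for illustration.

```python
import re

# Header pattern from the text: an upper-case run ending in a colon at line start.
HEADER_RE = re.compile(r"^[A-Z ,/().-]+:", re.MULTILINE)
WANTED = {"FINDINGS", "IMPRESSION"}

def parse_report(text):
    """Return both wanted sections, or None if either is missing or empty."""
    matches = list(HEADER_RE.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        name = m.group(0)[:-1].strip().upper()
        # A section body runs until the next recognised header (or end of file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = " ".join(text[m.end():end].split())
        if name in WANTED and body:
            sections[name] = body
    return sections if WANTED.issubset(sections) else None

report = (
    "                                 FINAL REPORT\n"
    " INDICATION:  Cough.\n"
    " \n"
    " FINDINGS:  The lungs are clear.\n"
    " No pleural effusion.\n"
    " \n"
    " IMPRESSION:  No acute process.\n"
)
print(parse_report(report))
print(parse_report(" FINDINGS AND IMPRESSION:  Clear.\n"))  # composite header is rejected
```

Because the header name must match exactly, the composite header in the second call yields None rather than being split or merged.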

(c) Length-based outlier removal. The word counts of findings and impression are computed; the upper cutoff is set to Q3 + 1.5·IQR per section. Studies above the cutoff (typically multi-paragraph teaching reports) or below the floor (the minimum is 2 words for findings and 1 word for impression) are dropped. This trims the long tail with little effect on the median.
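The upper cutoff in step (c) is the usual Tukey fence. The sketch below shows the arithmetic on made-up word counts; the quartile convention (the "inclusive" method of statistics.quantiles) is an assumption, since the text does not say which one was used, and the helper names are hypothetical.

```python
import statistics

def upper_fence(word_counts):
    """Q3 + 1.5 * IQR for one section's word counts."""
    q1, _median, q3 = statistics.quantiles(word_counts, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)

def keep(n_words, floor, ceiling):
    """A section survives if its length lies between the floor and the fence."""
    return floor <= n_words <= ceiling

# Made-up findings lengths; 200 stands in for a multi-paragraph teaching report.
findings_counts = [10, 12, 14, 15, 16, 18, 20, 22, 200]
fence = upper_fence(findings_counts)
kept = [n for n in findings_counts if keep(n, floor=2, ceiling=fence)]
print(fence, kept)
# prints: 29.0 [10, 12, 14, 15, 16, 18, 20, 22]
```

Here Q1 = 14 and Q3 = 20, so the fence is 20 + 1.5 · 6 = 29 and only the 200-word outlier is removed.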

(d) Stratified sampling with patient-disjoint splits. The remaining "eligible pool" is sampled to 40 000 / 5 000 / 5 000 studies for train / val / test. Each study is first assigned a stratum equal to its rarest positive CheXpert label, where the rarity order is computed from the eligible pool (e.g. Pleural Other, Lung Lesion, Fracture sit at the top); studies with no positive label fall into No Finding or None. Sampling allocates the target count to each stratum proportionally to its prevalence, then draws within each stratum, preferring images that have associated VQA pairs (so the VQA task receives more supervision). The val / test pools are populated first from the official PhysioNet splits; if a target count exceeds the official pool (the eligible filter shrinks them substantially) the remainder is drawn from the train pool, and the affected subject IDs are then removed from train. The final three sets are therefore patient-disjoint: no subject appears in more than one split.
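The stratification logic of step (d) can be sketched in three small functions. This is one possible reading of the description, not the thesis implementation: the function names, the largest-remainder rounding, the fixed seed, and the toy pool are all assumptions, and the patient-disjoint backfilling of val / test is not shown.

```python
import random
from collections import Counter, defaultdict

def assign_strata(pool, labels):
    """Group studies by their rarest positive label (rarity measured on the pool)."""
    counts = Counter(l for s in pool for l in labels if s["chex"].get(l) == 1)
    rarity_order = sorted(labels, key=lambda l: (counts[l], l))  # rarest first
    strata = defaultdict(list)
    for s in pool:
        stratum = next((l for l in rarity_order if s["chex"].get(l) == 1), "None")
        strata[stratum].append(s)
    return dict(strata)

def allocate(strata, target):
    """Split the target count across strata in proportion to stratum size."""
    total = sum(len(v) for v in strata.values())
    exact = {k: target * len(v) / total for k, v in strata.items()}
    quota = {k: int(x) for k, x in exact.items()}
    leftover = target - sum(quota.values())
    # Largest-remainder rounding (an assumption) hands out the missing units.
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_remainder[:leftover]:
        quota[k] += 1
    return quota

def draw(strata, quota, seed=0):
    """Random draw within each stratum, taking studies with VQA pairs first."""
    rng = random.Random(seed)
    picked = []
    for name, members in strata.items():
        shuffled = members[:]
        rng.shuffle(shuffled)
        shuffled.sort(key=lambda s: not s["has_vqa"])  # stable sort keeps the shuffle
        picked.extend(shuffled[:quota[name]])
    return picked

# Toy eligible pool: 2 Fracture, 6 Edema-only, 12 without any positive label.
pool = (
    [{"chex": {"Fracture": 1, "Edema": 1}, "has_vqa": True}]
    + [{"chex": {"Fracture": 1}, "has_vqa": False}]
    + [{"chex": {"Edema": 1}, "has_vqa": i % 2 == 0} for i in range(6)]
    + [{"chex": {}, "has_vqa": i < 6} for i in range(12)]
)
strata = assign_strata(pool, ["Edema", "Fracture"])
quota = allocate(strata, target=10)
picked = draw(strata, quota)
print(quota, len(picked))
```

In the toy pool the study that is positive for both labels lands in the Fracture stratum, because Fracture is the rarer label, and each stratum contributes half of its members to the target of 10.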

Distribution preservation is verified by computing the per-pathology prevalence at three levels (raw full MIMIC, eligible pool, and final subset) and inspecting the absolute deltas. In our run, the absolute difference between subset and eligible-pool prevalence was below 0.7 percentage points on every label, confirming that stratification did not skew clinical coverage.
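The check amounts to a per-label prevalence comparison. A minimal sketch on synthetic counts follows; the helper names and the numbers are illustrative only and are not the figures from our run.

```python
def prevalence(studies, labels):
    """Percentage of studies with a positive (value 1) label, per label."""
    n = len(studies)
    return {l: 100.0 * sum(s["chex"].get(l) == 1 for s in studies) / n for l in labels}

def max_abs_delta(a, b):
    """Largest absolute prevalence gap, in percentage points, over all labels."""
    return max(abs(a[l] - b[l]) for l in a)

# Synthetic example: 100 of 1000 positives in the pool, 21 of 200 in the subset.
eligible = [{"chex": {"Edema": 1}}] * 100 + [{"chex": {}}] * 900
subset = [{"chex": {"Edema": 1}}] * 21 + [{"chex": {}}] * 179
delta = max_abs_delta(prevalence(subset, ["Edema"]), prevalence(eligible, ["Edema"]))
print(delta)
# prints: 0.5
```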

Manifest and Bundle Construction

For each split, a manifest is emitted as both JSON and CSV. Every row captures one image with the following fields:

study_name           Study_1, Study_2, ...        (running ID, 1..50000)
split                train | val | test
subject_id, study_id integers, as in PhysioNet
subset               pXX bucket (p10..p19)
physionet_study_path files/pXX/pSUBJ/sSTUDY      (relative, preserves PhysioNet layout)
dicom_id             primary key for the image
image_filename       <dicom>.jpg
view                 PA | AP
image_relpath        files/pXX/pSUBJ/sSTUDY/<dicom>.jpg
report_relpath       files/pXX/pSUBJ/sSTUDY.txt
jpg_url              fully qualified PhysioNet download URL
has_vqa              boolean
chex_<Pathology>     14 columns, U-MultiClass values {-1, 0, 1, blank}

Preserving the PhysioNet directory layout means a partially downloaded package can be inspected against the manifest by simple path lookup, which simplifies resume after the long image transfer.

The report .txt files are then copied verbatim into bundle/reports/files/..., and the VQA JSONs are filtered to the selected dicom_ids and rewritten so each entry's image_path points at the in-package location. The whole bundle/ directory (manifests + reports + VQA, no images yet) is compressed to a single ~80 MB zip and uploaded to Google Drive. This concludes Phase 1.

Image Download and Distribution (Phase 2, Colab)

The image transfer is the slowest step and runs on Colab so it can use a fast datacenter-grade network. PhysioNet's HTTP server rejects Python requests basic-auth but accepts wget --user --password, so each image is fetched by a sub-shell wget call with a 60-second timeout and three retries. Downloads run through a ThreadPoolExecutor with 12 workers: the bottleneck is per-request TLS setup rather than bandwidth or CPU, so threads overlap the waiting time as well as separate processes would, at lower overhead.
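A hedged sketch of the threaded wget download. The function names, the injectable run parameter (added so the logic can be exercised without network access), and the clean-up of partial files are assumptions. Note that wget's --tries counts total attempts, so mapping "three retries" to --tries=3 is also an assumption. Passing the password on the command line exposes it to other users of the machine, which is tolerable on a single-user Colab VM but not elsewhere.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def wget_cmd(url, dest, user, password):
    """Build the wget call; --tries caps the total number of attempts."""
    return [
        "wget", "--quiet",
        f"--user={user}", f"--password={password}",
        "--timeout=60", "--tries=3",
        "-O", str(dest), url,
    ]

def fetch(row, root, user, password, run=subprocess.run):
    """Download one manifest row; returns (dicom_id, success)."""
    dest = Path(root) / row["image_relpath"]
    dest.parent.mkdir(parents=True, exist_ok=True)
    result = run(wget_cmd(row["jpg_url"], dest, user, password))
    if result.returncode != 0:
        dest.unlink(missing_ok=True)  # do not leave a truncated JPEG behind
        return row["dicom_id"], False
    return row["dicom_id"], True

def download_all(rows, root, user, password, workers=12, run=subprocess.run):
    """Fetch all rows with a thread pool; returns (done_ids, failed_ids)."""
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, r, root, user, password, run) for r in rows]
        for fut in as_completed(futures):
            dicom_id, ok = fut.result()
            (done if ok else failed).append(dicom_id)
    return done, failed
```

Each worker spends nearly all of its time blocked inside the wget subprocess, so the Python threads only coordinate and the GIL is not a limiting factor.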

The pipeline is resume-safe: a per-file log (downloaded.txt) is appended on every success and checkpointed to Drive every 500 images. When the notebook is rerun (for example after a Colab disconnect), the log determines what is already on disk, and the remaining download list is the manifest minus this set. A 10-image timing test runs first to estimate ETA before committing to the full 50 k transfer.
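The resume logic reduces to an append-only log plus a set difference. The sketch below assumes the log stores one dicom_id per line (the text does not say what is logged) and uses a hypothetical DownloadLog class; record is meant to be called from the single thread that collects finished downloads, since it is not thread-safe.

```python
import shutil
from pathlib import Path

class DownloadLog:
    """Append-only success log with periodic backup (e.g. to a mounted Drive path)."""

    def __init__(self, log_path, backup_path=None, checkpoint_every=500):
        self.log_path = Path(log_path)
        self.backup_path = Path(backup_path) if backup_path else None
        self.checkpoint_every = checkpoint_every
        self._since_checkpoint = 0
        self.done = set()
        if self.log_path.exists():
            lines = self.log_path.read_text().splitlines()
            self.done = {ln.strip() for ln in lines if ln.strip()}

    def record(self, dicom_id):
        """Log one successful download and back up the log every N successes."""
        with self.log_path.open("a") as f:
            f.write(dicom_id + "\n")
        self.done.add(dicom_id)
        self._since_checkpoint += 1
        if self.backup_path and self._since_checkpoint >= self.checkpoint_every:
            shutil.copyfile(self.log_path, self.backup_path)
            self._since_checkpoint = 0

    def remaining(self, manifest_rows):
        """Manifest minus everything already logged as downloaded."""
        return [r for r in manifest_rows if r["dicom_id"] not in self.done]
```

On a rerun, constructing DownloadLog on the restored downloaded.txt and calling remaining(manifest_rows) gives exactly the list that still has to be fetched.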

Once images are on disk, the package is published to a private Hugging Face dataset repository, hieu3636/cxr-vlm-data, under the subdirectory MIMIC-CXR_processed/. Uploading 50 k individual files would saturate Drive's FUSE layer and the HF API's per-file overhead; instead, studies are grouped into tar shards of 500 studies each (~750 MB / shard, ~100 shards total). Each shard stores the JPEG plus the matching report; manifests and VQA JSON are uploaded as separate small files. This brings total upload time from days down to a few hours and lets downstream consumers fetch the entire dataset in ~100 large file transfers instead of 50 k small ones.
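The shard count follows directly from the grouping size. As a small sanity check, assuming studies are ordered by their running ID before grouping (the helper plan_shards is hypothetical):

```python
def plan_shards(study_ids, per_shard=500):
    """Split an ordered list of studies into fixed-size groups, one per tar shard."""
    ordered = sorted(study_ids)
    return [ordered[i:i + per_shard] for i in range(0, len(ordered), per_shard)]

shards = plan_shards(range(1, 50001))  # running IDs 1..50000
print(len(shards), len(shards[0]), len(shards[-1]))
# prints: 100 500 500
```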

Resize and Re-shard (Phase 3, any GPU box)

The processed package above stores full-resolution JPEGs (~2–3 MP each, ~100 GB total). RAD-DINO's preprocessor resizes the shortest edge to 518 and centre-crops to 518×518 regardless, so feeding it full-res JPEGs wastes I/O on every training step. We therefore resize once, offline, to produce a smaller distribution-ready tree.

The resize itself is straightforward: open the image, scale so the shortest edge equals 518 with the longer edge proportional, save as JPEG with quality 90 (4:4:4 chroma) — near-lossless for grayscale chest X-rays. Images whose shorter side is already ≤ 518 are copied verbatim rather than upscaled. A ThreadPoolExecutor (PIL releases the GIL during decode/encode) does this in parallel; for the 50 k subset the entire resize takes ~10 minutes and yields ~5–8 GB total.
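A minimal sketch of the resize step using Pillow. The size arithmetic is pure Python; the choice of Lanczos resampling, the worker count, and the function names are assumptions, while quality 90 and 4:4:4 chroma (subsampling=0 in Pillow) come from the description above. Pillow is imported lazily so the size helper can be used without it.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SHORT_EDGE = 518

def target_size(width, height, short_edge=SHORT_EDGE):
    """New (width, height) with the shorter side at short_edge, or None to copy as-is."""
    if min(width, height) <= short_edge:
        return None  # never upscale
    if width <= height:
        return short_edge, round(height * short_edge / width)
    return round(width * short_edge / height), short_edge

def resize_one(src, dst):
    """Resize a single JPEG, or copy it verbatim when it is already small enough."""
    from PIL import Image  # Pillow, imported lazily
    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    with Image.open(src) as im:
        size = target_size(*im.size)
        if size is None:
            shutil.copyfile(src, dst)
        else:
            resized = im.resize(size, Image.LANCZOS)  # resampling filter is an assumption
            resized.save(dst, "JPEG", quality=90, subsampling=0)

def resize_tree(pairs, workers=8):
    """Resize (src, dst) pairs in parallel; list() surfaces worker exceptions."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda pair: resize_one(*pair), pairs))

print(target_size(2544, 3056))  # prints: (518, 622)
print(target_size(512, 512))    # prints: None
```

A 2544×3056 portrait image therefore becomes 518×622, which still covers the 518×518 centre crop applied later by the preprocessor.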

The resized tree is then packed into ~2 GB tar shards named cxr-NNNN.tar, written under MIMIC-CXR_resized/shards/. Shards are produced sequentially with a running byte counter, closing the current tar when the threshold is hit. The shards plus the original manifests (re-uploaded for convenience) and VQA JSONs are pushed to the same HF repository under MIMIC-CXR_resized/. This is the artifact every training box consumes.
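The running byte counter can be sketched with the standard tarfile module. The function name pack_shards, the zero-based shard numbering, and the use of uncompressed file sizes as the counter are assumptions.

```python
import tarfile
from pathlib import Path

def pack_shards(files, root, out_dir, max_bytes=2 * 1024**3):
    """Write files into cxr-NNNN.tar shards, starting a new one once max_bytes is reached."""
    root, out_dir = Path(root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    tar, written = None, 0
    for path in files:
        path = Path(path)
        if tar is None or written >= max_bytes:
            if tar is not None:
                tar.close()
            shard = out_dir / f"cxr-{len(shard_paths):04d}.tar"
            tar = tarfile.open(shard, "w")  # plain tar, JPEGs are already compressed
            shard_paths.append(shard)
            written = 0
        # Store paths relative to the tree root so extraction restores the layout.
        tar.add(path, arcname=path.relative_to(root).as_posix())
        written += path.stat().st_size
    if tar is not None:
        tar.close()
    return shard_paths
```

Because the threshold is checked before each file is added, a shard can exceed max_bytes by at most one image, which is negligible at ~100 KB per resized JPEG.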

Consumption on the Training Box

A training run pulls MIMIC-CXR_resized/ via huggingface_hub.snapshot_download (parallel, resumable), extracts the tar shards into a flat files/pXX/pSUBJ/sSTUDY/<dicom>.jpg tree, and then invokes the unified-JSON builder (data/mimic_cxr_resized_builder.py). The builder walks the manifest CSV, applies the configured report_mode (split_cascade in this work) and image_mode (frontal_only_split), bakes the 14 CheXpert labels into the structured-findings PNU string, attaches the matching VQA pairs, and writes the per-sample JSON. The trainer then reads this JSON only — the images themselves are loaded lazily by the Dataset class.
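The download and extraction step might look like the sketch below. The function names and the allow_patterns filter are assumptions; snapshot_download is imported lazily, and a private repository additionally requires a valid Hugging Face token in the environment. The builder invocation is not shown because its interface is not described here.

```python
import tarfile
from pathlib import Path

def fetch_resized(local_dir, repo_id="hieu3636/cxr-vlm-data"):
    """Download only the MIMIC-CXR_resized/ subtree of the dataset repository."""
    from huggingface_hub import snapshot_download  # imported lazily
    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=["MIMIC-CXR_resized/*"],
        local_dir=local_dir,
    )

def extract_shards(shard_dir, out_root):
    """Unpack every cxr-NNNN.tar under shard_dir into out_root; returns the shard count."""
    shards = sorted(Path(shard_dir).glob("cxr-*.tar"))
    for shard in shards:
        with tarfile.open(shard) as tar:
            tar.extractall(out_root)  # shards are our own artifact, hence trusted
    return len(shards)
```

After extraction, out_root contains the files/pXX/pSUBJ/sSTUDY/<dicom>.jpg tree that the manifest's image_relpath column points into.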

This separation of concerns means experimenting with different prompt formats or task mixes requires nothing more than rebuilding the JSON (seconds); changing the underlying image set (e.g. moving from the 50 k subset to full MIMIC) requires rerunning all three phases, since the newly selected images must also be downloaded and resized. Phase 2 and Phase 3 each ran exactly once for the entire thesis.