# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Medical imaging data engineering pipeline for standardizing diverse datasets (CT, MRI, PET) into a unified NIfTI format with consistent JSON metadata. Each subdirectory handles one dataset (AbdomenAtlas, BRATS, MnM2, OASIS, OAI_ZIB, PSMA, Kaggle OSIC, etc.).

## Running Data Cleaning Scripts

Each dataset has its own `dataclean_*.py` script. Run from the dataset's subdirectory:

```bash
python dataclean_abdomen_atlas.py --target_path /path/to/raw/data --output_dir /path/to/output
```

All scripts follow the same `--target_path` / `--output_dir` argument pattern. Versioned scripts (e.g., `_v2.py`, `_v3.py`) represent iterative improvements; use the highest version unless investigating regressions.

## Dependencies

Python 3 with: `SimpleITK`, `pandas`, `numpy`, `tqdm`, `openpyxl` (for Excel metadata). No requirements.txt exists; install manually.

## Architecture

### Processing Pipeline (per dataset)

1. **Load** raw data (DICOM via `sitk.ImageSeriesReader`, NIfTI via `sitk.ReadImage`, or NRRD)
2. **Extract metadata** from headers, CSV files, or DICOM tags
3. **Resample** to isotropic spacing using minimum voxel spacing (`get_unisize_resampler`)
4. **Clamp intensities** (CT: `[-300, 300]` HU; MRI: varies per dataset)
5. **Process segmentation labels** with identical resampling (nearest-neighbor interpolation)
6. **Validate** image/label dimension alignment via `assert` on `GetSize()`
7. **Write** standardized NIfTI (`.nii.gz`) + append to `nifti_mappings.json`

### Key Shared Components

**`util.py`** (copied into each dataset directory, not a shared import):

- `meta_data` class: validates metadata against `config_format.json` schema, enforces required fields (Modality, OriImg_path, Spacing_mm, Size, Dataset_name), normalizes ambiguous terminology via synonym dictionaries
- `get_unisize_resampler()`: builds a SimpleITK resampler for isotropic spacing; returns `None` if spacing is already isotropic
- `clamp_image()`: HU/intensity clamping via `sitk.ClampImageFilter`
- `get_synonyms_dict()` / `replace_synonyms()`: canonical mapping for ROI names, tissue labels, modalities, and task types
- `load_nifti()`, `load_dicom_images()`, `save_nifti()`: I/O wrappers that embed `FolderPath` metadata in NIfTI headers

**`config_format.json`** (per dataset directory): defines the metadata schema, including field types, required flags, and allowed option values.

### Output Structure

```
{output_dir}/{patient_id}/{patient_id}.nii.gz        # processed image
{output_dir}/{patient_id}/{task}/{tissue}.nii.gz     # segmentation labels
{output_dir}/nifti_mappings.json                     # metadata keyed by output path
{output_dir}/failed_files.json                       # files that failed processing
```

### Dataset-Specific Notes

- **AbdomenAtlas**: 25-organ segmentation labels stored as individual NIfTI files per organ; also has `combined_labels.nii.gz` (values 0-25)
- **BRATS (2019/2020/2021)**: Multi-modal MRI (FLAIR, T1, T1ce, T2); each modality processed as a separate sub-modality entry
- **MnM2/MnMs**: Cardiac MRI with vendor metadata (Siemens, Philips, GE, Canon)
- **OASIS**: Both cross-sectional and longitudinal variants; includes clinical scores (MMSE, CDR)
- **OAI_ZIB**: Knee MRI with 6-structure segmentation and clinical grading (WOMAC)
- **PSMA**: Dual-tracer PET/CT (PSMA & FDG); has longitudinal variant

## Important Conventions

- Resampling uses the **minimum** of the original spacing values to create isotropic voxels
- Labels are resampled with **nearest-neighbor** interpolation; images use **linear**
- The `meta_data` class normalizes terminology automatically, e.g., "chest" maps to "thorax", "seg" maps to "segmentation"
- `util.py` is duplicated across directories (not shared via import); changes must be propagated manually
- Code comments and docstrings are frequently in Chinese
- Log files (`*.log`) in each directory contain processing run history; these can be large (up to 23 MB)
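
The minimum-spacing resampling convention can be illustrated with a small, dependency-free sketch. It only computes the target spacing and grid size that a resampler would be configured with; it is not the code in `util.py`. The function name `plan_isotropic_resample`, the tolerance, and the rounding rule are assumptions made for illustration, and the real `get_unisize_resampler()` may compute the output size differently.

```python
from typing import Optional, Sequence, Tuple


def plan_isotropic_resample(
    size: Sequence[int],
    spacing: Sequence[float],
    tol: float = 1e-6,  # assumed tolerance for "already isotropic"
) -> Optional[Tuple[Tuple[int, ...], Tuple[float, ...]]]:
    """Return (new_size, new_spacing), or None if spacing is already isotropic.

    Hypothetical helper: mirrors the documented convention (use the minimum
    spacing on every axis), not the repository's actual implementation.
    """
    target = min(spacing)
    if all(abs(s - target) <= tol for s in spacing):
        return None  # nothing to do, matching the documented None return

    new_spacing = tuple(target for _ in spacing)
    # Keep the physical extent (voxels * spacing) roughly constant per axis.
    new_size = tuple(
        max(1, int(round(n * s / target))) for n, s in zip(size, spacing)
    )
    return new_size, new_spacing


if __name__ == "__main__":
    # Size and spacing in (x, y, z) order, as SimpleITK reports them.
    print(plan_isotropic_resample((512, 512, 100), (0.75, 0.75, 3.0)))
```

For a 512 x 512 x 100 volume with spacing (0.75, 0.75, 3.0) mm, this plan yields a 512 x 512 x 400 grid at 0.75 mm. The same plan would be applied to an image and its label; only the interpolator differs (linear for images, nearest-neighbor for labels), which is what keeps their `GetSize()` values aligned.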
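
The terminology normalization mentioned above can be pictured as a reverse lookup from synonym to canonical term. The sketch below is a minimal illustration, assuming a dictionary of canonical terms mapped to synonym lists; the names `SYNONYMS`, `build_lookup`, and `normalize_term` are hypothetical, and the actual `get_synonyms_dict()` / `replace_synonyms()` in `util.py` may be structured differently. Only the two mappings shown ("chest" to "thorax", "seg" to "segmentation") come from this document.

```python
from typing import Dict, List

# Hypothetical synonym table: canonical term -> accepted synonyms.
SYNONYMS: Dict[str, List[str]] = {
    "thorax": ["chest"],
    "segmentation": ["seg"],
}


def build_lookup(synonyms: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the table so every synonym (and canonical term) maps to its canonical form."""
    lookup: Dict[str, str] = {}
    for canonical, aliases in synonyms.items():
        lookup[canonical.lower()] = canonical
        for alias in aliases:
            lookup[alias.lower()] = canonical
    return lookup


def normalize_term(value: str, lookup: Dict[str, str]) -> str:
    """Return the canonical term, or the input unchanged if it is not known."""
    return lookup.get(value.strip().lower(), value)


if __name__ == "__main__":
    lookup = build_lookup(SYNONYMS)
    for term in ("Chest", "seg", "liver"):
        print(term, "->", normalize_term(term, lookup))
```

Unknown terms pass through unchanged in this sketch; whether the real class rejects them against `config_format.json` instead is not stated here.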
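
As a rough picture of how a metadata record might be checked and added to `nifti_mappings.json`, the sketch below enforces the required fields listed under Key Shared Components and stores the record keyed by output path. The helper name `append_mapping`, the read-modify-write strategy, and the example values are assumptions; the real scripts and the `meta_data` class may validate types and allowed options as well, and may write the file differently.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Required fields as listed in this document.
REQUIRED_FIELDS = ("Modality", "OriImg_path", "Spacing_mm", "Size", "Dataset_name")


def append_mapping(
    mapping_file: str, output_path: str, metadata: Dict[str, Any]
) -> Dict[str, Any]:
    """Validate required fields, then store metadata under its output path.

    Hypothetical helper: rewrites the whole JSON file on each call.
    """
    missing = [name for name in REQUIRED_FIELDS if name not in metadata]
    if missing:
        raise ValueError(f"missing required metadata fields: {missing}")

    path = Path(mapping_file)
    mappings = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    mappings[output_path] = metadata
    path.write_text(
        json.dumps(mappings, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return mappings


if __name__ == "__main__":
    record = {
        "Modality": "CT",  # example values only
        "OriImg_path": "/path/to/raw/data/case_001",
        "Spacing_mm": [0.75, 0.75, 0.75],
        "Size": [512, 512, 400],
        "Dataset_name": "AbdomenAtlas",
    }
    append_mapping("nifti_mappings.json", "case_001/case_001.nii.gz", record)
```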