| # CLAUDE.md |
|
|
| This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
|
|
| ## Project Overview |
|
|
| Medical imaging data engineering pipeline for standardizing diverse datasets (CT, MRI, PET) into a unified NIfTI format with consistent JSON metadata. Each subdirectory handles one dataset (AbdomenAtlas, BRATS, MnM2, OASIS, OAI_ZIB, PSMA, Kaggle OSIC, etc.). |
| |
| ## Running Data Cleaning Scripts |
| |
| Each dataset has its own `dataclean_*.py` script. Run from the dataset's subdirectory: |
| |
| ```bash |
| python dataclean_abdomen_atlas.py --target_path /path/to/raw/data --output_dir /path/to/output |
| ``` |
| |
| All scripts follow the same `--target_path` / `--output_dir` argument pattern. Versioned scripts (e.g., `_v2.py`, `_v3.py`) represent iterative improvements; use the highest version unless investigating regressions. |
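
As a minimal sketch of that shared argument pattern, a script's command-line interface might look like the following. The function name `build_parser` and the choice to make both arguments required are assumptions for illustration; the actual scripts may define their parsers differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two arguments the dataclean scripts share."""
    parser = argparse.ArgumentParser(
        description="Standardize one raw dataset into NIfTI plus JSON metadata."
    )
    # Marking both arguments as required is an assumption of this sketch.
    parser.add_argument("--target_path", required=True,
                        help="Directory containing the raw dataset.")
    parser.add_argument("--output_dir", required=True,
                        help="Directory that receives the processed files.")
    return parser


# Parse an explicit argument list instead of sys.argv to show the result.
args = build_parser().parse_args(
    ["--target_path", "/path/to/raw/data", "--output_dir", "/path/to/output"]
)
print(args.target_path, args.output_dir)
```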
| |
| ## Dependencies |
| |
| Python 3 with: `SimpleITK`, `pandas`, `numpy`, `tqdm`, `openpyxl` (for Excel metadata). No `requirements.txt` exists; install these packages manually. |
| |
| ## Architecture |
| |
| ### Processing Pipeline (per dataset) |
| |
| 1. **Load** raw data (DICOM via `sitk.ImageSeriesReader`, NIfTI via `sitk.ReadImage`, or NRRD) |
| 2. **Extract metadata** from headers, CSV files, or DICOM tags |
| 3. **Resample** to isotropic spacing using minimum voxel spacing (`get_unisize_resampler`) |
| 4. **Clamp intensities** (CT: `[-300, 300]` HU; MRI: varies per dataset) |
| 5. **Process segmentation labels** with identical resampling (nearest-neighbor interpolation) |
| 6. **Validate** image/label dimension alignment via `assert` on `GetSize()` |
| 7. **Write** standardized NIfTI (`.nii.gz`) + append to `nifti_mappings.json` |
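
The arithmetic behind steps 3 and 4 can be sketched in plain Python, without SimpleITK. This is an illustration under stated assumptions, not the code in `util.py`: the function names `isotropic_target` and `clamp`, the rounding rule for the new size, and the isotropy tolerance are all hypothetical.

```python
import math


def isotropic_target(spacing, size, rel_tol=1e-6):
    """Return (new_spacing, new_size) for isotropic resampling, or None.

    The target spacing is the smallest input spacing. None signals that the
    input is already isotropic and needs no resampling.
    """
    target = min(spacing)
    if all(math.isclose(s, target, rel_tol=rel_tol) for s in spacing):
        return None
    # Keep the physical extent (size * spacing) roughly constant on each axis.
    # Rounding to the nearest voxel count is an assumption of this sketch.
    new_size = tuple(max(1, round(n * s / target)) for n, s in zip(size, spacing))
    return (target,) * len(spacing), new_size


def clamp(values, low=-300, high=300):
    """Clamp each intensity into [low, high]; the defaults are the CT window."""
    return [min(high, max(low, v)) for v in values]


print(isotropic_target((0.5, 0.5, 2.0), (512, 512, 60)))
# ((0.5, 0.5, 0.5), (512, 512, 240))
print(clamp([-1000, -50, 120, 1500]))
# [-300, -50, 120, 300]
```

The same target spacing and size must be applied to the image and its labels so that the `GetSize()` check in step 6 passes.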
| |
| ### Key Shared Components |
| |
| **`util.py`** (copied into each dataset directory, not a shared import): |
| - `meta_data` class: validates metadata against the `config_format.json` schema, enforces required fields (Modality, OriImg_path, Spacing_mm, Size, Dataset_name), and normalizes ambiguous terminology via synonym dictionaries |
| - `get_unisize_resampler()`: builds a SimpleITK resampler for isotropic spacing; returns `None` if spacing is already isotropic |
| - `clamp_image()`: HU/intensity clamping via `sitk.ClampImageFilter` |
| - `get_synonyms_dict()` / `replace_synonyms()`: canonical mapping for ROI names, tissue labels, modalities, and task types |
| - `load_nifti()`, `load_dicom_images()`, `save_nifti()`: I/O wrappers that embed `FolderPath` metadata in NIfTI headers |
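
The synonym normalization can be pictured as a dictionary lookup. The sketch below is hypothetical: `normalize_term` is not the real `replace_synonyms()`, the dictionary holds only the two mappings named in this file, and case-insensitive matching is an assumption.

```python
# Only these two mappings are stated in this file; the real dictionaries are larger.
SYNONYMS = {"chest": "thorax", "seg": "segmentation"}


def normalize_term(term, synonyms=SYNONYMS):
    """Map a term to its canonical form; unknown terms pass through unchanged."""
    key = term.strip().lower()  # case-insensitive matching is an assumption
    return synonyms.get(key, term)


print(normalize_term("chest"))   # thorax
print(normalize_term(" Seg "))   # segmentation
print(normalize_term("liver"))   # liver
```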
| |
| **`config_format.json`** (per dataset directory): defines the metadata schema, including field types, required flags, and allowed option values. |
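
To make the idea of schema-driven validation concrete, here is a small sketch. The schema layout (`type`, `required`, `options` keys), the type names, the `Vendor` field, and the `validate` function are all hypothetical; the real `config_format.json` and `meta_data` class may be organized differently. Only the five required field names come from this file.

```python
# Hypothetical schema layout; the real config_format.json may differ.
SCHEMA = {
    "Modality": {"type": "str", "required": True, "options": ["CT", "MRI", "PET"]},
    "OriImg_path": {"type": "str", "required": True},
    "Spacing_mm": {"type": "float", "required": True},
    "Size": {"type": "list", "required": True},
    "Dataset_name": {"type": "str", "required": True},
    "Vendor": {"type": "str", "required": False},  # illustrative optional field
}

# Map the schema's type names to Python types.
TYPES = {"str": str, "float": (int, float), "list": list}


def validate(record, schema=SCHEMA):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, rule in schema.items():
        if field not in record:
            if rule.get("required"):
                errors.append(f"missing required field: {field}")
            continue
        value = record[field]
        if not isinstance(value, TYPES[rule["type"]]):
            errors.append(f"{field}: expected {rule['type']}, got {type(value).__name__}")
        elif "options" in rule and value not in rule["options"]:
            errors.append(f"{field}: {value!r} is not an allowed option")
    return errors


record = {"Modality": "XRAY", "OriImg_path": "/raw/p001", "Spacing_mm": 0.5,
          "Size": [512, 512, 240]}
for problem in validate(record):
    print(problem)
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue with a record in a single pass.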
| |
| ### Output Structure |
| |
| ``` |
| {output_dir}/{patient_id}/{patient_id}.nii.gz # processed image |
| {output_dir}/{patient_id}/{task}/{tissue}.nii.gz # segmentation labels |
| {output_dir}/nifti_mappings.json # metadata keyed by output path |
| {output_dir}/failed_files.json # files that failed processing |
| ``` |
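
The layout above can be expressed with `pathlib`. This is a sketch under assumptions: the helper names are hypothetical, and treating `nifti_mappings.json` as a single JSON object that is read, updated, and rewritten on each append is a guess at how the scripts maintain it.

```python
import json
from pathlib import Path


def image_path(output_dir, patient_id):
    """Path of the processed image for one patient."""
    return Path(output_dir) / patient_id / f"{patient_id}.nii.gz"


def label_path(output_dir, patient_id, task, tissue):
    """Path of one segmentation label for one patient."""
    return Path(output_dir) / patient_id / task / f"{tissue}.nii.gz"


def append_mapping(output_dir, out_path, metadata):
    """Add one entry to nifti_mappings.json, keyed by the output path."""
    mapping_file = Path(output_dir) / "nifti_mappings.json"
    if mapping_file.exists():
        mappings = json.loads(mapping_file.read_text(encoding="utf-8"))
    else:
        mappings = {}
    mappings[str(out_path)] = metadata
    mapping_file.write_text(
        json.dumps(mappings, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return mappings
```

For example, `label_path(out, "p001", "segmentation", "liver")` yields `{out}/p001/segmentation/liver.nii.gz`, matching the second line of the layout.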
| |
| ### Dataset-Specific Notes |
| |
| - **AbdomenAtlas**: 25-organ segmentation labels stored as individual NIfTI files per organ; also has `combined_labels.nii.gz` (values 0-25) |
| - **BRATS (2019/2020/2021)**: Multi-modal MRI (FLAIR, T1, T1ce, T2); each modality is processed as a separate sub-modality entry |
| - **MnM2/MnMs**: Cardiac MRI with vendor metadata (Siemens, Philips, GE, Canon) |
| - **OASIS**: Both cross-sectional and longitudinal variants; includes clinical scores (MMSE, CDR) |
| - **OAI_ZIB**: Knee MRI with 6-structure segmentation and clinical grading (WOMAC) |
| - **PSMA**: Dual-tracer PET/CT (PSMA & FDG); has longitudinal variant |
| |
| ## Important Conventions |
| |
| - Resampling uses the **minimum** of the original spacing values to create isotropic voxels |
| - Labels are resampled with **nearest-neighbor** interpolation; images use **linear** |
| - The `meta_data` class normalizes terminology automatically, e.g., "chest" maps to "thorax" and "seg" maps to "segmentation" |
| - `util.py` is duplicated across directories (not shared via import), so changes must be propagated manually |
| - Code comments and docstrings are frequently in Chinese |
| - Log files (`*.log`) in each directory contain processing run history; these can be large (up to 23 MB) |
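
Because `util.py` is copied rather than imported, one way to check whether the copies have drifted apart is to compare content hashes. The helper below is a hypothetical sketch, not part of the repository, and it assumes each dataset directory sits directly under the repository root.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_util_copies(repo_root):
    """Group every <dataset>/util.py under repo_root by SHA-256 of its content."""
    groups = defaultdict(list)
    for path in sorted(Path(repo_root).glob("*/util.py")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path)
    return dict(groups)


def copies_in_sync(repo_root):
    """True when all util.py copies are byte-identical (or none exist)."""
    return len(group_util_copies(repo_root)) <= 1
```

If `copies_in_sync(".")` returns `False`, the groups returned by `group_util_copies(".")` show which directories share a version and which have diverged.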
|
|