# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Medical imaging data engineering pipeline for standardizing diverse datasets (CT, MRI, PET) into a unified NIfTI format with consistent JSON metadata. Each subdirectory handles one dataset (AbdomenAtlas, BRATS, MnM2, OASIS, OAI_ZIB, PSMA, Kaggle OSIC, etc.).
## Running Data Cleaning Scripts
Each dataset has its own `dataclean_*.py` script. Run from the dataset's subdirectory:
```bash
python dataclean_abdomen_atlas.py --target_path /path/to/raw/data --output_dir /path/to/output
```
All scripts follow the same `--target_path` / `--output_dir` argument pattern. Versioned scripts (e.g., `_v2.py`, `_v3.py`) represent iterative improvements; use the highest version unless investigating regressions.
## Dependencies
Python 3 with: `SimpleITK`, `pandas`, `numpy`, `tqdm`, `openpyxl` (for Excel metadata). No requirements.txt exists; install dependencies manually.
## Architecture
### Processing Pipeline (per dataset)
1. **Load** raw data (DICOM via `sitk.ImageSeriesReader`, NIfTI via `sitk.ReadImage`, or NRRD)
2. **Extract metadata** from headers, CSV files, or DICOM tags
3. **Resample** to isotropic spacing using minimum voxel spacing (`get_unisize_resampler`)
4. **Clamp intensities**: CT to `[-300, 300]` HU; MRI ranges vary per dataset
5. **Process segmentation labels** with identical resampling (nearest-neighbor interpolation)
6. **Validate** image/label dimension alignment via `assert` on `GetSize()`
7. **Write** standardized NIfTI (`.nii.gz`) + append to `nifti_mappings.json`
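The grid arithmetic behind steps 3 and 4 can be sketched in plain Python. The helper names below are illustrative stand-ins, not the repo's actual `get_unisize_resampler`/`clamp_image`, which wrap SimpleITK; what they show is the convention itself: target spacing is the minimum per-axis spacing, and physical extent (size × spacing) is preserved.

```python
def isotropic_size(size, spacing):
    """Compute the output grid for isotropic resampling.

    Target spacing is the minimum of the original per-axis spacings;
    the physical extent (size * spacing) is preserved on each axis.
    """
    target = min(spacing)
    new_size = [int(round(sz * sp / target)) for sz, sp in zip(size, spacing)]
    return new_size, (target,) * len(size)

def clamp_hu(value, lo=-300, hi=300):
    """Clamp a CT intensity to the pipeline's HU window."""
    return max(lo, min(hi, value))

# A 100x100x50 volume with 1x1x2 mm voxels becomes 1 mm isotropic:
isotropic_size((100, 100, 50), (1.0, 1.0, 2.0))
# -> ([100, 100, 100], (1.0, 1.0, 1.0))
```

Note that the z-axis gains slices rather than losing them: resampling to the *minimum* spacing upsamples the coarser axes.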
### Key Shared Components
**`util.py`** (copied into each dataset directory, not a shared import):
- `meta_data` class: validates metadata against the `config_format.json` schema, enforces required fields (Modality, OriImg_path, Spacing_mm, Size, Dataset_name), and normalizes ambiguous terminology via synonym dictionaries
- `get_unisize_resampler()`: builds a SimpleITK resampler for isotropic spacing; returns `None` if spacing is already isotropic
- `clamp_image()`: HU/intensity clamping via `sitk.ClampImageFilter`
- `get_synonyms_dict()` / `replace_synonyms()`: canonical mapping for ROI names, tissue labels, modalities, and task types
- `load_nifti()`, `load_dicom_images()`, `save_nifti()`: I/O wrappers that embed `FolderPath` metadata in NIfTI headers
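A minimal sketch of the synonym normalization idea. The dictionary contents and function signature here are assumptions; only the "chest" → "thorax" and "seg" → "segmentation" mappings are documented in this file (see Important Conventions).

```python
# Hypothetical stand-in for get_synonyms_dict() / replace_synonyms().
# Only chest->thorax and seg->segmentation are documented mappings;
# the real dictionaries also cover ROI names, tissue labels,
# modalities, and task types.
SYNONYMS = {
    "chest": "thorax",
    "seg": "segmentation",
}

def replace_synonyms(term, synonyms=SYNONYMS):
    """Map a free-form term to its canonical form (case-insensitive)."""
    key = term.strip().lower()
    return synonyms.get(key, key)
```

Normalizing at ingest time means downstream code can match on a single canonical vocabulary instead of per-dataset spellings.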
**`config_format.json`** (per dataset directory): defines the metadata schema (field types, required flags, and allowed option values).
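A hedged illustration of what such a schema might look like. The field names come from the required-fields list above; the exact key names (`type`, `required`, `options`) and structure are assumptions, so check the actual `config_format.json` in each dataset directory.

```json
{
  "Modality": {"type": "str", "required": true, "options": ["CT", "MRI", "PET"]},
  "OriImg_path": {"type": "str", "required": true},
  "Spacing_mm": {"type": "list", "required": true},
  "Size": {"type": "list", "required": true},
  "Dataset_name": {"type": "str", "required": true}
}
```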
### Output Structure
```
{output_dir}/{patient_id}/{patient_id}.nii.gz # processed image
{output_dir}/{patient_id}/{task}/{tissue}.nii.gz # segmentation labels
{output_dir}/nifti_mappings.json # metadata keyed by output path
{output_dir}/failed_files.json # files that failed processing
```
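For orientation, one `nifti_mappings.json` entry might look like the sketch below. The keys are the required metadata fields enforced by `meta_data`; the values and the exact nesting are illustrative assumptions, not a dump from a real run.

```json
{
  "{output_dir}/patient_001/patient_001.nii.gz": {
    "Modality": "CT",
    "OriImg_path": "/path/to/raw/data/patient_001",
    "Spacing_mm": [0.8, 0.8, 0.8],
    "Size": [512, 512, 313],
    "Dataset_name": "AbdomenAtlas"
  }
}
```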
### Dataset-Specific Notes
- **AbdomenAtlas**: 25-organ segmentation labels stored as individual NIfTI files per organ; also has `combined_labels.nii.gz` (values 0-25)
- **BRATS (2019/2020/2021)**: Multi-modal MRI (FLAIR, T1, T1ce, T2); each modality is processed as a separate sub-modality entry
- **MnM2/MnMs**: Cardiac MRI with vendor metadata (Siemens, Philips, GE, Canon)
- **OASIS**: Both cross-sectional and longitudinal variants; includes clinical scores (MMSE, CDR)
- **OAI_ZIB**: Knee MRI with 6-structure segmentation and clinical grading (WOMAC)
- **PSMA**: Dual-tracer PET/CT (PSMA & FDG); has longitudinal variant
## Important Conventions
- Resampling uses the **minimum** of the original spacing values to create isotropic voxels
- Labels are resampled with **nearest-neighbor** interpolation; images use **linear**
- The `meta_data` class normalizes terminology automatically (e.g., "chest" maps to "thorax", "seg" to "segmentation")
- `util.py` is duplicated across directories (not shared via import); changes must be propagated manually to every copy
- Code comments and docstrings are frequently in Chinese
- Log files (`*.log`) in each directory contain processing run history; these can be large (up to 23 MB)
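Why nearest-neighbor for labels: interpolating linearly between integer label values invents fractional, meaningless classes at boundaries. A toy 1-D illustration (the label numbers are made up for this example):

```python
def interp_linear(a, b, t):
    """Linear interpolation: fine for image intensities."""
    return a + (b - a) * t

def interp_nearest(a, b, t):
    """Nearest-neighbor: picks an existing value, so labels stay valid."""
    return a if t < 0.5 else b

# Halfway between two hypothetical segmentation labels 2 and 5:
interp_linear(2, 5, 0.5)   # 3.5 -> fractional "label" that exists in no legend
interp_nearest(2, 5, 0.4)  # 2   -> still a valid label
```

This is exactly why the pipeline resamples images with linear interpolation but segmentation masks with nearest-neighbor.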