# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Medical imaging data engineering pipeline for standardizing diverse datasets (CT, MRI, PET) into a unified NIfTI format with consistent JSON metadata. Each subdirectory handles one dataset (AbdomenAtlas, BRATS, MnM2, OASIS, OAI_ZIB, PSMA, Kaggle OSIC, etc.).
## Running Data Cleaning Scripts
Each dataset has its own `dataclean_*.py` script. Run from the dataset's subdirectory:
```bash
python dataclean_abdomen_atlas.py --target_path /path/to/raw/data --output_dir /path/to/output
```
All scripts follow the same `--target_path` / `--output_dir` argument pattern. Versioned scripts (e.g., `_v2.py`, `_v3.py`) represent iterative improvements; use the highest version unless investigating regressions.
## Dependencies
Python 3 with: `SimpleITK`, `pandas`, `numpy`, `tqdm`, `openpyxl` (for Excel metadata). No requirements.txt exists; install dependencies manually.
## Architecture
### Processing Pipeline (per dataset)
1. **Load** raw data (DICOM via `sitk.ImageSeriesReader`, NIfTI via `sitk.ReadImage`, or NRRD)
2. **Extract metadata** from headers, CSV files, or DICOM tags
3. **Resample** to isotropic spacing using minimum voxel spacing (`get_unisize_resampler`)
4. **Clamp intensities**: CT to `[-300, 300]` HU; MRI ranges vary per dataset
5. **Process segmentation labels** with identical resampling (nearest-neighbor interpolation)
6. **Validate** image/label dimension alignment via `assert` on `GetSize()`
7. **Write** standardized NIfTI (`.nii.gz`) + append to `nifti_mappings.json`
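The grid arithmetic behind steps 3 and 4 can be sketched in plain Python. The helper names below are illustrative stand-ins, not the repo's actual `get_unisize_resampler`/`clamp_image`, which wrap SimpleITK; what they show is the convention itself: target spacing is the minimum per-axis spacing, and physical extent (size × spacing) is preserved.

```python
def isotropic_size(size, spacing):
    """Compute the output grid for isotropic resampling.

    Target spacing is the minimum of the original per-axis spacings;
    the physical extent (size * spacing) is preserved on each axis.
    """
    target = min(spacing)
    new_size = [int(round(sz * sp / target)) for sz, sp in zip(size, spacing)]
    return new_size, (target,) * len(size)

def clamp_hu(value, lo=-300, hi=300):
    """Clamp a CT intensity to the pipeline's HU window."""
    return max(lo, min(hi, value))

# A 100x100x50 volume with 1x1x2 mm voxels becomes 1 mm isotropic:
isotropic_size((100, 100, 50), (1.0, 1.0, 2.0))
# -> ([100, 100, 100], (1.0, 1.0, 1.0))
```

Note that the z-axis gains slices rather than losing them: resampling to the *minimum* spacing upsamples the coarser axes.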
### Key Shared Components
**`util.py`** (copied into each dataset directory, not a shared import):
- `meta_data` class: validates metadata against the `config_format.json` schema, enforces required fields (Modality, OriImg_path, Spacing_mm, Size, Dataset_name), and normalizes ambiguous terminology via synonym dictionaries
- `get_unisize_resampler()`: builds a SimpleITK resampler for isotropic spacing; returns `None` if spacing is already isotropic
- `clamp_image()`: HU/intensity clamping via `sitk.ClampImageFilter`
- `get_synonyms_dict()` / `replace_synonyms()`: canonical mapping for ROI names, tissue labels, modalities, and task types
- `load_nifti()`, `load_dicom_images()`, `save_nifti()`: I/O wrappers that embed `FolderPath` metadata in NIfTI headers
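A minimal sketch of the synonym normalization idea. The dictionary contents and function signature here are assumptions; only the "chest" → "thorax" and "seg" → "segmentation" mappings are documented in this file (see Important Conventions).

```python
# Hypothetical stand-in for get_synonyms_dict() / replace_synonyms().
# Only chest->thorax and seg->segmentation are documented mappings;
# the real dictionaries also cover ROI names, tissue labels,
# modalities, and task types.
SYNONYMS = {
    "chest": "thorax",
    "seg": "segmentation",
}

def replace_synonyms(term, synonyms=SYNONYMS):
    """Map a free-form term to its canonical form (case-insensitive)."""
    key = term.strip().lower()
    return synonyms.get(key, key)
```

Normalizing at ingest time means downstream code can match on a single canonical vocabulary instead of per-dataset spellings.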
**`config_format.json`** (per dataset directory): defines the metadata schema (field types, required flags, and allowed option values).
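A hedged illustration of what such a schema might look like. The field names come from the required-fields list above; the exact key names (`type`, `required`, `options`) and structure are assumptions, so check the actual `config_format.json` in each dataset directory.

```json
{
  "Modality": {"type": "str", "required": true, "options": ["CT", "MRI", "PET"]},
  "OriImg_path": {"type": "str", "required": true},
  "Spacing_mm": {"type": "list", "required": true},
  "Size": {"type": "list", "required": true},
  "Dataset_name": {"type": "str", "required": true}
}
```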
### Output Structure
```
{output_dir}/{patient_id}/{patient_id}.nii.gz # processed image
{output_dir}/{patient_id}/{task}/{tissue}.nii.gz # segmentation labels
{output_dir}/nifti_mappings.json # metadata keyed by output path
{output_dir}/failed_files.json # files that failed processing
```
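For orientation, one `nifti_mappings.json` entry might look like the sketch below. The keys are the required metadata fields enforced by `meta_data`; the values and the exact nesting are illustrative assumptions, not a dump from a real run.

```json
{
  "{output_dir}/patient_001/patient_001.nii.gz": {
    "Modality": "CT",
    "OriImg_path": "/path/to/raw/data/patient_001",
    "Spacing_mm": [0.8, 0.8, 0.8],
    "Size": [512, 512, 313],
    "Dataset_name": "AbdomenAtlas"
  }
}
```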
### Dataset-Specific Notes
- **AbdomenAtlas**: 25-organ segmentation labels stored as individual NIfTI files per organ; also has `combined_labels.nii.gz` (values 0-25)
- **BRATS (2019/2020/2021)**: Multi-modal MRI (FLAIR, T1, T1ce, T2); each modality is processed as a separate sub-modality entry
- **MnM2/MnMs**: Cardiac MRI with vendor metadata (Siemens, Philips, GE, Canon)
- **OASIS**: Both cross-sectional and longitudinal variants; includes clinical scores (MMSE, CDR)
- **OAI_ZIB**: Knee MRI with 6-structure segmentation and clinical grading (WOMAC)
- **PSMA**: Dual-tracer PET/CT (PSMA & FDG); has longitudinal variant
## Important Conventions
- Resampling uses the **minimum** of the original spacing values to create isotropic voxels
- Labels are resampled with **nearest-neighbor** interpolation; images use **linear**
- The `meta_data` class normalizes terminology automatically (e.g., "chest" maps to "thorax", "seg" to "segmentation")
- `util.py` is duplicated across directories (not shared via import); changes must be propagated manually to every copy
- Code comments and docstrings are frequently in Chinese
- Log files (`*.log`) in each directory contain processing run history; these can be large (up to 23 MB)
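Why nearest-neighbor for labels: interpolating linearly between integer label values invents fractional, meaningless classes at boundaries. A toy 1-D illustration (the label numbers are made up for this example):

```python
def interp_linear(a, b, t):
    """Linear interpolation: fine for image intensities."""
    return a + (b - a) * t

def interp_nearest(a, b, t):
    """Nearest-neighbor: picks an existing value, so labels stay valid."""
    return a if t < 0.5 else b

# Halfway between two hypothetical segmentation labels 2 and 5:
interp_linear(2, 5, 0.5)   # 3.5 -> fractional "label" that exists in no legend
interp_nearest(2, 5, 0.4)  # 2   -> still a valid label
```

This is exactly why the pipeline resamples images with linear interpolation but segmentation masks with nearest-neighbor.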