# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Medical imaging data engineering pipeline for standardizing diverse datasets (CT, MRI, PET) into a unified NIfTI format with consistent JSON metadata. Each subdirectory handles one dataset (AbdomenAtlas, BRATS, MnM2, OASIS, OAI_ZIB, PSMA, Kaggle OSIC, etc.).
## Running Data Cleaning Scripts
Each dataset has its own `dataclean_*.py` script. Run from the dataset's subdirectory:
```bash
python dataclean_abdomen_atlas.py --target_path /path/to/raw/data --output_dir /path/to/output
```
All scripts follow the same `--target_path` / `--output_dir` argument pattern. Versioned scripts (e.g., `_v2.py`, `_v3.py`) represent iterative improvements; use the highest version unless investigating regressions.
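
The shared argument pattern can be sketched with `argparse`. This is a minimal illustration, not the scripts' actual code: the `build_parser` helper, the help strings, and marking both flags as required are assumptions; only the `--target_path` and `--output_dir` flag names come from this file.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the --target_path / --output_dir pattern."""
    parser = argparse.ArgumentParser(description="Standardize one raw dataset into NIfTI.")
    # Flag names come from this document; required=True is an assumption.
    parser.add_argument("--target_path", required=True, help="Directory holding the raw dataset")
    parser.add_argument("--output_dir", required=True, help="Directory for processed output")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--target_path", "/path/to/raw/data", "--output_dir", "/path/to/output"]
)
print(args.target_path, args.output_dir)
```
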
## Dependencies
Python 3 with: `SimpleITK`, `pandas`, `numpy`, `tqdm`, `openpyxl` (for Excel metadata). No `requirements.txt` exists, so install them manually.
## Architecture
### Processing Pipeline (per dataset)
1. **Load** raw data (DICOM via `sitk.ImageSeriesReader`, NIfTI via `sitk.ReadImage`, or NRRD)
2. **Extract metadata** from headers, CSV files, or DICOM tags
3. **Resample** to isotropic spacing using minimum voxel spacing (`get_unisize_resampler`)
4. **Clamp intensities**: CT to `[-300, 300]` HU; the MRI range varies per dataset
5. **Process segmentation labels** with identical resampling (nearest-neighbor interpolation)
6. **Validate** image/label dimension alignment via `assert` on `GetSize()`
7. **Write** standardized NIfTI (`.nii.gz`) + append to `nifti_mappings.json`
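
Steps 3 and 6 can be illustrated without SimpleITK by computing the target grid directly. The sketch below is an assumption about the arithmetic, not the code in `get_unisize_resampler`: the helper names `isotropic_grid` and `check_aligned`, the rounding rule, and the tolerance are hypothetical. It takes the minimum spacing as the isotropic target and scales each axis so the physical extent is roughly preserved.

```python
from typing import Optional, Sequence, Tuple

Grid = Tuple[Tuple[int, ...], Tuple[float, ...]]


def isotropic_grid(
    size: Sequence[int], spacing: Sequence[float], tol: float = 1e-6
) -> Optional[Grid]:
    """Return (new_size, new_spacing), or None if spacing is already isotropic."""
    target = min(spacing)  # the minimum voxel spacing becomes the isotropic spacing
    if max(spacing) - target <= tol:
        return None  # mirrors the documented "returns None" behaviour
    # Keep the physical extent: voxels * spacing stays approximately constant per axis.
    new_size = tuple(max(1, round(n * s / target)) for n, s in zip(size, spacing))
    return new_size, (target,) * len(spacing)


def check_aligned(image_size: Sequence[int], label_size: Sequence[int]) -> None:
    """Step 6: fail fast when the image and label grids differ."""
    assert tuple(image_size) == tuple(label_size), (
        f"size mismatch: {image_size} vs {label_size}"
    )


# A 512 x 512 x 100 volume with 0.8 x 0.8 x 2.0 mm voxels.
grid = isotropic_grid((512, 512, 100), (0.8, 0.8, 2.0))
print(grid)  # ((512, 512, 250), (0.8, 0.8, 0.8))
```

If the label volume is resampled onto the same grid, its size should match the image and `check_aligned` passes silently.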
### Key Shared Components
**`util.py`** (copied into each dataset directory, not a shared import):
- `meta_data` class: validates metadata against the `config_format.json` schema, enforces required fields (`Modality`, `OriImg_path`, `Spacing_mm`, `Size`, `Dataset_name`), and normalizes ambiguous terminology via synonym dictionaries
- `get_unisize_resampler()`: builds a SimpleITK resampler for isotropic spacing; returns `None` if spacing is already isotropic
- `clamp_image()`: HU/intensity clamping via `sitk.ClampImageFilter`
- `get_synonyms_dict()` / `replace_synonyms()`: canonical mapping for ROI names, tissue labels, modalities, and task types
- `load_nifti()`, `load_dicom_images()`, `save_nifti()`: I/O wrappers that embed `FolderPath` metadata in NIfTI headers

**`config_format.json`** (per dataset directory): defines the metadata schema, including field types, required flags, and allowed option values.
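
As a rough illustration of what the `meta_data` class is described as doing, the sketch below checks the required fields and applies a synonym table. It does not reproduce `util.py`: the `normalize_term` and `check_required` helpers are hypothetical, the synonym table holds only the two mappings mentioned in this file, and the real schema in `config_format.json` is not shown here.

```python
from typing import Any, Dict, List

# Required field names as listed in this document.
REQUIRED_FIELDS = ("Modality", "OriImg_path", "Spacing_mm", "Size", "Dataset_name")

# Hypothetical synonym table; only these two mappings are documented.
SYNONYMS = {"chest": "thorax", "seg": "segmentation"}


def normalize_term(term: str) -> str:
    """Map a term to its canonical form, leaving unknown terms unchanged."""
    return SYNONYMS.get(term.strip().lower(), term)


def check_required(record: Dict[str, Any]) -> List[str]:
    """Return the required fields that are missing or empty in the record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "", [])]


record = {
    "Modality": "CT",
    "OriImg_path": "/path/to/raw/data/case_0001",  # placeholder path
    "Spacing_mm": [0.8, 0.8, 0.8],
    "Size": [512, 512, 250],
}
print(check_required(record))   # ['Dataset_name']
print(normalize_term("Chest"))  # thorax
```
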
### Output Structure
```
{output_dir}/{patient_id}/{patient_id}.nii.gz # processed image
{output_dir}/{patient_id}/{task}/{tissue}.nii.gz # segmentation labels
{output_dir}/nifti_mappings.json # metadata keyed by output path
{output_dir}/failed_files.json # files that failed processing
```
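
The layout above can be built with `pathlib`, and the mapping file updated with the standard `json` module. This is a sketch under assumptions: the helper names are hypothetical, and whether the real scripts key `nifti_mappings.json` by absolute or relative path, or rewrite the whole file on every append, is not stated in this file.

```python
import json
from pathlib import Path
from typing import Any, Dict


def image_path(output_dir: str, patient_id: str) -> Path:
    """{output_dir}/{patient_id}/{patient_id}.nii.gz"""
    return Path(output_dir) / patient_id / f"{patient_id}.nii.gz"


def label_path(output_dir: str, patient_id: str, task: str, tissue: str) -> Path:
    """{output_dir}/{patient_id}/{task}/{tissue}.nii.gz"""
    return Path(output_dir) / patient_id / task / f"{tissue}.nii.gz"


def append_mapping(output_dir: str, nifti_path: Path, metadata: Dict[str, Any]) -> None:
    """Add one entry to nifti_mappings.json, keyed by the output path."""
    mapping_file = Path(output_dir) / "nifti_mappings.json"
    if mapping_file.exists():
        mappings = json.loads(mapping_file.read_text(encoding="utf-8"))
    else:
        mappings = {}
    mappings[str(nifti_path)] = metadata
    mapping_file.parent.mkdir(parents=True, exist_ok=True)
    mapping_file.write_text(json.dumps(mappings, indent=2), encoding="utf-8")


print(label_path("out", "case_0001", "segmentation", "liver").as_posix())
# out/case_0001/segmentation/liver.nii.gz
```
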
### Dataset-Specific Notes
- **AbdomenAtlas**: 25-organ segmentation labels stored as individual NIfTI files per organ; also has `combined_labels.nii.gz` (values 0-25)
- **BRATS (2019/2020/2021)**: Multi-modal MRI (FLAIR, T1, T1ce, T2); each modality is processed as a separate sub-modality entry
- **MnM2/MnMs**: Cardiac MRI with vendor metadata (Siemens, Philips, GE, Canon)
- **OASIS**: Both cross-sectional and longitudinal variants; includes clinical scores (MMSE, CDR)
- **OAI_ZIB**: Knee MRI with 6-structure segmentation and clinical grading (WOMAC)
- **PSMA**: Dual-tracer PET/CT (PSMA & FDG); has longitudinal variant
## Important Conventions
- Resampling uses the **minimum** of the original spacing values to create isotropic voxels
- Labels are resampled with **nearest-neighbor** interpolation; images use **linear**
- The `meta_data` class normalizes terminology automatically; for example, "chest" maps to "thorax" and "seg" maps to "segmentation"
- `util.py` is duplicated across directories (not shared via import), so changes must be propagated manually
- Code comments and docstrings are frequently in Chinese
- Log files (`*.log`) in each directory contain processing run history; these can be large (up to 23 MB)
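
Since `util.py` is copied rather than imported, one way to notice copies that have drifted apart is to hash them. The sketch below is a suggestion, not part of the repository: the `group_util_copies` helper is hypothetical, and it assumes each dataset directory sits directly under the repository root.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def group_util_copies(repo_root: str) -> Dict[str, List[Path]]:
    """Group every */util.py under repo_root by the SHA-256 of its contents."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(Path(repo_root).glob("*/util.py")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path)
    return dict(groups)


# More than one group means at least one copy differs from the others.
groups = group_util_copies(".")
if len(groups) > 1:
    print(f"{len(groups)} distinct versions of util.py found:")
    for digest, paths in groups.items():
        print(digest[:12], [p.as_posix() for p in paths])
```
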