# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Medical imaging data engineering pipeline for standardizing diverse datasets (CT, MRI, PET) into a unified NIfTI format with consistent JSON metadata. Each subdirectory handles one dataset (AbdomenAtlas, BRATS, MnM2, OASIS, OAI_ZIB, PSMA, Kaggle OSIC, etc.).

## Running Data Cleaning Scripts

Each dataset has its own `dataclean_*.py` script. Run from the dataset's subdirectory:

```bash
python dataclean_abdomen_atlas.py --target_path /path/to/raw/data --output_dir /path/to/output
```

All scripts follow the same `--target_path` / `--output_dir` argument pattern. Versioned scripts (e.g., `_v2.py`, `_v3.py`) represent iterative improvements; use the highest version unless investigating regressions.
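
A minimal sketch of that shared argument pattern, for orientation when writing a new `dataclean_*.py` script. Only the two flag names come from this document; the `build_parser` helper, the help text, and `required=True` are assumptions, and the existing scripts may define their arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two flags every dataclean_*.py script accepts."""
    parser = argparse.ArgumentParser(
        description="Standardize one raw dataset into NIfTI plus JSON metadata."
    )
    # Flag names are from this document; everything else here is assumed.
    parser.add_argument("--target_path", required=True,
                        help="Root directory of the raw dataset")
    parser.add_argument("--output_dir", required=True,
                        help="Directory that receives the processed output")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(
    ["--target_path", "/path/to/raw/data", "--output_dir", "/path/to/output"]
)
print(args.target_path, args.output_dir)
```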

## Dependencies

Python 3 with: `SimpleITK`, `pandas`, `numpy`, `tqdm`, `openpyxl` (for Excel metadata). No `requirements.txt` exists; install manually.

## Architecture

### Processing Pipeline (per dataset)

1. **Load** raw data (DICOM via `sitk.ImageSeriesReader`, NIfTI via `sitk.ReadImage`, or NRRD)
2. **Extract metadata** from headers, CSV files, or DICOM tags
3. **Resample** to isotropic spacing using minimum voxel spacing (`get_unisize_resampler`)
4. **Clamp intensities**: CT to `[-300, 300]` HU; MRI range varies per dataset
5. **Process segmentation labels** with identical resampling (nearest-neighbor interpolation)
6. **Validate** image/label dimension alignment via `assert` on `GetSize()`
7. **Write** standardized NIfTI (`.nii.gz`) + append to `nifti_mappings.json`
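
The arithmetic behind steps 3 and 4 can be sketched in plain Python. This is not the code in `util.py`: the helper names `isotropic_geometry` and `clamp`, the rounding rule for the new size, and the tolerance are assumptions chosen for illustration. The real pipeline performs these operations on SimpleITK images.

```python
from typing import Optional, Sequence, Tuple


def isotropic_geometry(
    spacing: Sequence[float], size: Sequence[int], tol: float = 1e-6
) -> Optional[Tuple[Tuple[float, ...], Tuple[int, ...]]]:
    """Return (new_spacing, new_size) for isotropic resampling, or None if not needed."""
    target = min(spacing)  # documented convention: use the minimum spacing
    if all(abs(s - target) <= tol for s in spacing):
        return None  # mirrors "returns None if spacing is already isotropic"
    # Keep the physical extent (size * spacing) roughly constant on each axis.
    # Rounding to the nearest voxel is an assumption.
    new_size = tuple(int(round(n * s / target)) for n, s in zip(size, spacing))
    new_spacing = tuple(target for _ in spacing)
    return new_spacing, new_size


def clamp(value: float, low: float = -300.0, high: float = 300.0) -> float:
    """Clamp one intensity value; the defaults are the documented CT window in HU."""
    return max(low, min(high, value))


# A 512 x 512 x 100 volume with 0.5 x 0.5 x 2.0 mm voxels.
print(isotropic_geometry((0.5, 0.5, 2.0), (512, 512, 100)))
print([clamp(v) for v in (-1000.0, 50.0, 1200.0)])
```

Because images and labels go through the same geometry, their sizes should match afterwards, which is what the `assert` in step 6 checks.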

### Key Shared Components

**`util.py`** (copied into each dataset directory, not a shared import):
- `meta_data` class: validates metadata against the `config_format.json` schema, enforces required fields (Modality, OriImg_path, Spacing_mm, Size, Dataset_name), and normalizes ambiguous terminology via synonym dictionaries
- `get_unisize_resampler()`: builds a SimpleITK resampler for isotropic spacing; returns `None` if spacing is already isotropic
- `clamp_image()`: HU/intensity clamping via `sitk.ClampImageFilter`
- `get_synonyms_dict()` / `replace_synonyms()`: canonical mapping for ROI names, tissue labels, modalities, and task types
- `load_nifti()`, `load_dicom_images()`, `save_nifti()`: I/O wrappers that embed `FolderPath` metadata in NIfTI headers
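
As a rough illustration of the synonym normalization, the sketch below inverts a canonical-to-aliases table into a lookup. The function names `build_synonyms` and `normalize_term`, the table layout, and the case-insensitive matching are all hypothetical; the actual signatures of `get_synonyms_dict()` and `replace_synonyms()` are not shown in this document. The two example mappings are the ones listed under Important Conventions.

```python
from typing import Dict, Iterable


def build_synonyms(groups: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Invert {canonical: [aliases]} into a flat alias -> canonical lookup."""
    lookup: Dict[str, str] = {}
    for canonical, aliases in groups.items():
        lookup[canonical.lower()] = canonical
        for alias in aliases:
            lookup[alias.lower()] = canonical
    return lookup


def normalize_term(term: str, lookup: Dict[str, str]) -> str:
    """Return the canonical term, or the stripped input if it is unknown."""
    cleaned = term.strip()
    # Case-insensitive matching is an assumption of this sketch.
    return lookup.get(cleaned.lower(), cleaned)


SYNONYMS = build_synonyms({"thorax": ["chest"], "segmentation": ["seg"]})

print(normalize_term("Chest", SYNONYMS))
print(normalize_term("seg", SYNONYMS))
print(normalize_term("liver", SYNONYMS))
```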

**`config_format.json`** (per dataset directory): defines the metadata schema, including field types, required flags, and allowed option values.
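
A sketch of what schema-driven validation along these lines could look like. The schema layout below (`type`, `required`, `options` keys and the type names) is a guess for illustration, not the real `config_format.json` format, and `validate` is a hypothetical stand-in for what the `meta_data` class does. Only the five required field names are taken from this document.

```python
from typing import Any, Dict, List

# Hypothetical schema shape; the real config_format.json may be laid out differently.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "Modality": {"type": "string", "required": True, "options": ["CT", "MRI", "PET"]},
    "OriImg_path": {"type": "string", "required": True},
    "Spacing_mm": {"type": "list", "required": True},
    "Size": {"type": "list", "required": True},
    "Dataset_name": {"type": "string", "required": True},
}

# Map the assumed schema type names onto Python types.
TYPES = {"string": str, "number": (int, float), "list": list}


def validate(record: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors: List[str] = []
    for field, rule in schema.items():
        if field not in record:
            if rule.get("required", False):
                errors.append(f"missing required field: {field}")
            continue
        value = record[field]
        if not isinstance(value, TYPES[rule["type"]]):
            errors.append(f"{field}: expected {rule['type']}, got {type(value).__name__}")
        elif "options" in rule and value not in rule["options"]:
            errors.append(f"{field}: {value!r} not in allowed options")
    return errors


EXAMPLE_RECORD = {
    "Modality": "CT",
    "OriImg_path": "/path/to/raw/data/case_001",
    "Spacing_mm": [0.5, 0.5, 0.5],
    "Size": [512, 512, 400],
    "Dataset_name": "AbdomenAtlas",
}

print(validate(EXAMPLE_RECORD, SCHEMA))
```

Collecting every problem in a list, rather than raising on the first one, makes it easier to record a complete failure reason for a file.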

### Output Structure

```
{output_dir}/{patient_id}/{patient_id}.nii.gz          # processed image
{output_dir}/{patient_id}/{task}/{tissue}.nii.gz        # segmentation labels
{output_dir}/nifti_mappings.json                        # metadata keyed by output path
{output_dir}/failed_files.json                          # files that failed processing
```
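
The layout above can be reproduced with `pathlib`, as in the sketch below. The helper names (`image_path`, `label_path`, `append_mapping`) are hypothetical, and the read-modify-write handling of `nifti_mappings.json` and the use of the path string as the key are assumptions about how "append" and "keyed by output path" are implemented.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Union

PathLike = Union[str, Path]


def image_path(output_dir: PathLike, patient_id: str) -> Path:
    """Location of the processed image for one patient."""
    return Path(output_dir) / patient_id / f"{patient_id}.nii.gz"


def label_path(output_dir: PathLike, patient_id: str, task: str, tissue: str) -> Path:
    """Location of one segmentation label for one patient."""
    return Path(output_dir) / patient_id / task / f"{tissue}.nii.gz"


def append_mapping(output_dir: PathLike, out_path: Path, metadata: Dict[str, Any]) -> None:
    """Add one entry to nifti_mappings.json, creating the file if needed."""
    mapping_file = Path(output_dir) / "nifti_mappings.json"
    if mapping_file.exists():
        mappings = json.loads(mapping_file.read_text(encoding="utf-8"))
    else:
        mappings = {}
    mappings[str(out_path)] = metadata  # key format is an assumption
    mapping_file.parent.mkdir(parents=True, exist_ok=True)
    mapping_file.write_text(
        json.dumps(mappings, indent=2, ensure_ascii=False), encoding="utf-8"
    )


# Demonstrate against a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    img = image_path(tmp, "case_001")
    append_mapping(tmp, img, {"Modality": "CT"})
    saved = json.loads((Path(tmp) / "nifti_mappings.json").read_text(encoding="utf-8"))

print(list(saved.values()))
```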

### Dataset-Specific Notes

- **AbdomenAtlas**: 25-organ segmentation labels stored as individual NIfTI files per organ; also has `combined_labels.nii.gz` (values 0-25)
- **BRATS (2019/2020/2021)**: Multi-modal MRI (FLAIR, T1, T1ce, T2); each modality is processed as a separate sub-modality entry
- **MnM2/MnMs**: Cardiac MRI with vendor metadata (Siemens, Philips, GE, Canon)
- **OASIS**: Both cross-sectional and longitudinal variants; includes clinical scores (MMSE, CDR)
- **OAI_ZIB**: Knee MRI with 6-structure segmentation and clinical grading (WOMAC)
- **PSMA**: Dual-tracer PET/CT (PSMA & FDG); has longitudinal variant

## Important Conventions

- Resampling uses the **minimum** of the original spacing values to create isotropic voxels
- Labels are resampled with **nearest-neighbor** interpolation; images use **linear**
- The `meta_data` class normalizes terminology automatically, e.g., "chest" maps to "thorax" and "seg" maps to "segmentation"
- `util.py` is duplicated across directories (not shared via import), so changes must be propagated manually
- Code comments and docstrings are frequently in Chinese
- Log files (`*.log`) in each directory contain processing run history; these can be large (up to 23 MB)