| --- |
| license: mit |
| tags: |
| - medical-imaging |
| - data-engineering |
| - preprocessing |
| - nifti |
| - dicom |
| - simpleitk |
| library_name: simpleitk |
| --- |
| |
| # Data_Engineering: Medical Imaging Cleanup Pipeline |
| |
| Standardize diverse medical imaging datasets (CT, MRI, PET) into a unified **NIfTI** format with consistent JSON metadata. Each subdirectory targets one dataset. |
| |
| > Companion repo to [`DRDMsig/Omini3D`](https://huggingface.co/DRDMsig/Omini3D); this repo produces the standardized data that OmniMorph trains on. |
| |
| ## Supported Datasets |
| |
| | Subdirectory | Dataset | Modality | |
| |---|---|---| |
| | `AbdomenAtlas/` | AbdomenAtlas | CT | |
| | `AbdomenCT1k/` | AbdomenCT-1K | CT | |
| | `brats2019_clean/` | BraTS 2019 | MRI (multi-sequence) | |
| | `brats2020_clean/` | BraTS 2020 | MRI (multi-sequence) | |
| | `brats2021_clean/` | BraTS 2021 | MRI (multi-sequence) | |
| | `kaggle_osic_clean/` | Kaggle OSIC Pulmonary Fibrosis | CT | |
| | `MnM2_clean/` | M&Ms-2 | Cardiac MRI | |
| | `MnMs_clean/` | M&Ms | Cardiac MRI | |
| | `OAISIS_clean/` | OASIS-1 / OASIS-2 | Brain MRI | |
| | `OAI_ZIB_clean/` | OAI-ZIB (knee) | MRI | |
| | `PSMA_clean/` | PSMA-FDG PET-CT (longitudinal) | PET + CT | |
| `all/` | Cross-dataset utilities (artifact plane removal) | N/A |
|
|
| Each cleaned dataset writes: |
|
|
| - Resampled & clamped `.nii.gz` images / segmentations |
| - Per-dataset `nifti_mappings.json` |
| - `failed_files.json` listing files the cleaner could not process |
|
|
| ## Repository Layout |
|
|
| ``` |
| <dataset>_clean/ |
| ├── dataclean_<dataset>.py # main cleanup script (use highest version: _v2.py, _v3.py, ...) |
| ├── util.py # shared helpers (copied per dir, not imported) |
| ├── config_format.json # metadata schema for `meta_data` validation |
| └── (optional) sample/, demo/ # tiny example NIfTI files for sanity checks |
| ``` |
|
|
| ## Usage |
|
|
| ```bash |
| cd AbdomenAtlas/ |
| python dataclean_abdomen_atlas_v2.py \ |
| --target_path /path/to/raw/AbdomenAtlas \ |
| --output_dir /path/to/output/AbdomenAtlas_clean |
| ``` |
|
|
| All scripts share the `--target_path` / `--output_dir` interface. Versioned scripts (`_v2.py`, `_v3.py`) supersede older versions; use the highest version unless investigating regressions. |
|
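The "use the highest version" rule can be automated. The sketch below is illustrative and not part of the repo: `script_version` and `latest_script` are hypothetical helper names, and it assumes that a script without a `_vN` suffix counts as version 1 and that each directory holds a single script family.

```python
import re
from pathlib import Path
from typing import Optional

# Matches dataclean_<name>.py and dataclean_<name>_vN.py (assumed naming scheme).
_SCRIPT_RE = re.compile(r"^dataclean_(?P<name>.+?)(?:_v(?P<ver>\d+))?\.py$")


def script_version(filename: str) -> Optional[int]:
    """Return the version encoded in a cleanup script name, or None if it is not one."""
    match = _SCRIPT_RE.match(filename)
    if match is None:
        return None
    # Assumption: an unsuffixed script is the first version.
    return int(match.group("ver")) if match.group("ver") else 1


def latest_script(dataset_dir: str) -> Optional[Path]:
    """Pick the highest-versioned dataclean_*.py script in a dataset directory."""
    candidates = []
    for path in Path(dataset_dir).glob("dataclean_*.py"):
        version = script_version(path.name)
        if version is not None:
            candidates.append((version, path))
    if not candidates:
        return None
    # Compare numerically so that _v10 sorts above _v2.
    return max(candidates, key=lambda item: item[0])[1]
```

For example, `latest_script("AbdomenAtlas/")` would return the path to run with `--target_path` and `--output_dir`.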
|
| ### Pipeline (per dataset) |
|
|
| 1. **Load** raw data (DICOM via `sitk.ImageSeriesReader`, NIfTI via `sitk.ReadImage`, NRRD). |
| 2. **Extract metadata** from headers, CSV files, or DICOM tags. |
| 3. **Resample** to isotropic spacing (`get_unisize_resampler` in `util.py`). |
| 4. **Clamp intensities**: CT to `[-300, 300]` HU; MRI to per-dataset windows. |
| 5. **Process segmentation labels** with identical resampling (nearest-neighbor). |
| 6. **Validate** image/label dimensions agree (`assert image.GetSize() == label.GetSize()`). |
| 7. **Write** standardized `.nii.gz` and append to `nifti_mappings.json`. |
|
|
| ### Shared `util.py` API |
|
|
| | Function / class | Purpose | |
| |---|---| |
| | `meta_data` | Validates metadata against `config_format.json`; required fields: `Modality`, `OriImg_path`, `Spacing_mm`, `Size`, `Dataset_name`. Normalizes ambiguous terminology via synonym dictionaries. | |
| | `get_unisize_resampler(image)` | Builds a SimpleITK resampler for isotropic spacing; returns `None` if already isotropic. | |
| | `clamp_image(image, lo, hi)` | HU/intensity clamping via `sitk.ClampImageFilter`. | |
|
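To make the `meta_data` contract concrete, here is a rough stand-in for the required-field check and synonym normalization. It is a sketch under assumptions, not the class in `util.py`: `validate_metadata` and `MODALITY_SYNONYMS` are hypothetical, and the synonym entries are made up for illustration. Only the five required field names come from the table above.

```python
REQUIRED_FIELDS = ("Modality", "OriImg_path", "Spacing_mm", "Size", "Dataset_name")

# Hypothetical synonym table; the real dictionaries live in util.py.
MODALITY_SYNONYMS = {
    "ct": "CT",
    "computed tomography": "CT",
    "mr": "MRI",
    "mri": "MRI",
    "pt": "PET",
    "pet": "PET",
}


def validate_metadata(record: dict) -> dict:
    """Check required fields and normalize the modality name (illustrative only)."""
    missing = [
        field
        for field in REQUIRED_FIELDS
        if field not in record or record[field] in (None, "")
    ]
    if missing:
        raise ValueError(f"missing required metadata fields: {missing}")

    cleaned = dict(record)  # do not mutate the caller's dict
    key = str(record["Modality"]).strip().lower()
    # Unknown modality strings are passed through unchanged.
    cleaned["Modality"] = MODALITY_SYNONYMS.get(key, record["Modality"])
    return cleaned
```

A record such as `{"Modality": "mr", "OriImg_path": "case_001.nii.gz", "Spacing_mm": [1.0, 1.0, 1.0], "Size": [240, 240, 155], "Dataset_name": "BraTS 2021"}` would come back with `Modality` set to `"MRI"`, while a record lacking `Size` would raise `ValueError`.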
|
| ## Dependencies |
|
|
| ```bash |
| pip install SimpleITK pandas numpy tqdm openpyxl |
| ``` |
|
|
| (No `requirements.txt`; install manually.) |
|
|
| ## What's Included / Excluded |
|
|
| - ✅ Cleanup scripts, `util.py`, `config_format.json`, demographic CSVs. |
| - ✅ A handful of tiny demo / sample `.nii.gz` files in `PSMA_clean/{sample,demo}/`. |
| - ❌ Raw datasets (download from each dataset's official source). |
| - ❌ Run logs from prior cleanup runs (`*.log`). |
| - ❌ Intermediate test outputs (`MnM2_clean/test/`). |
|
|
| ## License |
|
|
| MIT; see project root. |
|
|