---
license: mit
tags:
- medical-imaging
- data-engineering
- preprocessing
- nifti
- dicom
- simpleitk
library_name: simpleitk
---
# Data_Engineering: Medical Imaging Cleanup Pipeline
Standardize diverse medical imaging datasets (CT, MRI, PET) into a unified **NIfTI** format with consistent JSON metadata. Each subdirectory targets one dataset.
> Companion repo to [`DRDMsig/Omini3D`](https://huggingface.co/DRDMsig/Omini3D); it produces the standardized data that OmniMorph trains on.
## Supported Datasets
| Subdirectory | Dataset | Modality |
|---|---|---|
| `AbdomenAtlas/` | AbdomenAtlas | CT |
| `AbdomenCT1k/` | AbdomenCT-1K | CT |
| `brats2019_clean/` | BraTS 2019 | MRI (multi-sequence) |
| `brats2020_clean/` | BraTS 2020 | MRI (multi-sequence) |
| `brats2021_clean/` | BraTS 2021 | MRI (multi-sequence) |
| `kaggle_osic_clean/` | Kaggle OSIC Pulmonary Fibrosis | CT |
| `MnM2_clean/` | M&Ms-2 | Cardiac MRI |
| `MnMs_clean/` | M&Ms | Cardiac MRI |
| `OAISIS_clean/` | OASIS-1 / OASIS-2 | Brain MRI |
| `OAI_ZIB_clean/` | OAI-ZIB (knee) | MRI |
| `PSMA_clean/` | PSMA-FDG PET-CT (longitudinal) | PET + CT |
| `all/` | Cross-dataset utilities (artifact plane removal) | N/A |
Each cleaned dataset writes:
- Resampled & clamped `.nii.gz` images / segmentations
- Per-dataset `nifti_mappings.json`
- `failed_files.json` listing files the cleaner could not process
## Repository Layout
```
<dataset>_clean/
├── dataclean_<dataset>.py # main cleanup script (use highest version: _v2.py, _v3.py, ...)
├── util.py # shared helpers (copied per dir, not imported)
├── config_format.json # metadata schema for `meta_data` validation
└── (optional) sample/, demo/ # tiny example NIfTI files for sanity checks
```
## Usage
```bash
cd AbdomenAtlas/
python dataclean_abdomen_atlas_v2.py \
--target_path /path/to/raw/AbdomenAtlas \
--output_dir /path/to/output/AbdomenAtlas_clean
```
All scripts share the `--target_path` / `--output_dir` interface. Versioned scripts (`_v2.py`, `_v3.py`) supersede older versions; use the highest version unless investigating regressions.
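The "use the highest version" rule can be automated. The following is a minimal sketch, not part of the repository: the helper names `script_version` and `latest_script` are hypothetical, and it assumes that a script without a `_vN` suffix counts as version 1.

```python
import re
from typing import Iterable, Optional

# Matches dataclean_<dataset>.py and dataclean_<dataset>_vN.py.
_VERSION_RE = re.compile(r"^dataclean_.+?(?:_v(\d+))?\.py$")


def script_version(name: str) -> Optional[int]:
    """Return the version number of a cleanup script, or None for other files."""
    match = _VERSION_RE.match(name)
    if match is None:
        return None
    # Assumption: an unversioned script is treated as version 1.
    return int(match.group(1)) if match.group(1) else 1


def latest_script(names: Iterable[str]) -> Optional[str]:
    """Pick the file name with the highest version, or None if no script is found."""
    candidates = [(script_version(n), n) for n in names]
    candidates = [(v, n) for v, n in candidates if v is not None]
    return max(candidates)[1] if candidates else None
```

For example, `latest_script(p.name for p in pathlib.Path("AbdomenAtlas").iterdir())` would return the script to run in that directory. Versions are compared numerically, so `_v10.py` ranks above `_v2.py`.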
### Pipeline (per dataset)
1. **Load** raw data (DICOM via `sitk.ImageSeriesReader`, NIfTI via `sitk.ReadImage`, NRRD).
2. **Extract metadata** from headers, CSV files, or DICOM tags.
3. **Resample** to isotropic spacing (`get_unisize_resampler` in `util.py`).
4. **Clamp intensities**: CT to `[-300, 300]` HU; MRI to per-dataset windows.
5. **Process segmentation labels** with identical resampling (nearest-neighbor).
6. **Validate** image/label dimensions agree (`assert image.GetSize() == label.GetSize()`).
7. **Write** standardized `.nii.gz` and append to `nifti_mappings.json`.
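Steps 3 to 5 can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's `get_unisize_resampler` or `clamp_image`: the function names, the 1.0 mm target spacing, linear interpolation for images, and the fill value of 0 are all assumptions that the README does not confirm.

```python
from typing import Sequence, Tuple


def isotropic_geometry(
    size: Sequence[int],
    spacing: Sequence[float],
    target_mm: float = 1.0,  # assumed target; the README does not state one
) -> Tuple[Tuple[int, ...], Tuple[float, ...]]:
    """Return (new_size, new_spacing) that keeps the physical extent roughly fixed."""
    new_size = tuple(
        max(1, int(round(n * s / target_mm))) for n, s in zip(size, spacing)
    )
    return new_size, (float(target_mm),) * len(new_size)


def resample_and_clamp(image, lo: float, hi: float, is_label: bool = False):
    """Resample a SimpleITK image to isotropic spacing; clamp unless it is a label map."""
    import SimpleITK as sitk  # lazy import: the geometry helper above needs no SimpleITK

    new_size, new_spacing = isotropic_geometry(image.GetSize(), image.GetSpacing())
    # Nearest-neighbor keeps label values intact; linear is an assumed choice for images.
    interpolator = sitk.sitkNearestNeighbor if is_label else sitk.sitkLinear
    identity = sitk.Transform(image.GetDimension(), sitk.sitkIdentity)
    out = sitk.Resample(
        image, new_size, identity, interpolator,
        image.GetOrigin(), new_spacing, image.GetDirection(),
        0.0,  # assumed fill value for voxels outside the input
        image.GetPixelID(),
    )
    if not is_label:
        out = sitk.Clamp(out, out.GetPixelID(), lo, hi)
    return out
```

A CT volume would then go through `resample_and_clamp(sitk.ReadImage(path), -300, 300)` and its segmentation through the same call with `is_label=True`. Because both use the same geometry, the size check in step 6 should hold whenever the raw image and label share size and spacing.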
### Shared `util.py` API
| Function / class | Purpose |
|---|---|
| `meta_data` | Validates metadata against `config_format.json`; required fields: `Modality`, `OriImg_path`, `Spacing_mm`, `Size`, `Dataset_name`. Normalizes ambiguous terminology via synonym dictionaries. |
| `get_unisize_resampler(image)` | Builds a SimpleITK resampler for isotropic spacing; returns `None` if already isotropic. |
| `clamp_image(image, lo, hi)` | HU/intensity clamping via `sitk.ClampImageFilter`. |
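To make the `meta_data` row concrete, here is a minimal sketch of a required-field check with synonym normalization. It is not the class from `util.py`: the function name `validate_metadata` and the contents of `MODALITY_SYNONYMS` are hypothetical, and the real schema lives in `config_format.json`. Only the five required field names come from the table above.

```python
from typing import Any, Dict

REQUIRED_FIELDS = ("Modality", "OriImg_path", "Spacing_mm", "Size", "Dataset_name")

# Hypothetical synonym table; the repository's dictionaries may differ.
MODALITY_SYNONYMS = {"ct": "CT", "mr": "MRI", "mri": "MRI", "pt": "PET", "pet": "PET"}


def validate_metadata(record: Dict[str, Any]) -> Dict[str, Any]:
    """Check required fields and return a copy with a normalized Modality."""
    missing = [key for key in REQUIRED_FIELDS if key not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    cleaned = dict(record)
    raw = str(record["Modality"]).strip()
    # Unknown modality strings pass through unchanged rather than being guessed.
    cleaned["Modality"] = MODALITY_SYNONYMS.get(raw.lower(), raw)
    return cleaned
```

Returning a copy leaves the caller's record untouched, which keeps the raw header values available for debugging.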
## Dependencies
```bash
pip install SimpleITK pandas numpy tqdm openpyxl
```
(No `requirements.txt` is provided; install manually.)
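Since there is no pinned requirements file, a quick import check before a long cleanup run may save time. This is a small sketch using only the standard library; `missing_packages` is a hypothetical helper name, and the list mirrors the `pip install` line above.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names for the packages listed in the pip command above.
REQUIRED = ["SimpleITK", "pandas", "numpy", "tqdm", "openpyxl"]


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level packages that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```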
## What's Included / Excluded
- ✅ Cleanup scripts, `util.py`, `config_format.json`, demographic CSVs.
- ✅ A handful of tiny demo / sample `.nii.gz` files in `PSMA_clean/{sample,demo}/`.
- ❌ Raw datasets (download from each dataset's official source).
- ❌ Run logs from prior cleanup runs (`*.log`).
- ❌ Intermediate test outputs (`MnM2_clean/test/`).
## License
MIT; see project root.