	---
	license: cc-by-nc-sa-4.0
	library_name: moozy
	pipeline_tag: feature-extraction
	base_model: 1aurent/vit_small_patch8_224.lunit_dino
	tags:
	- pathology
	- computational-pathology
	- digital-pathology
	- foundation-model
	- whole-slide-image
	- vision-transformer
	- self-supervised-learning
	- slide-encoder
	- case-encoder
	- histopathology
	- medical-imaging
	- multiple-instance-learning
	- slide-level-representation
	- patient-level-representation
	- multi-task-learning
	- survival-analysis
	- cancer
	- oncology
	- tissue-classification
	- mutation-prediction
	- TCGA
	- CPTAC
	- pytorch
	- transformer
	datasets:
	- MahmoodLab/Patho-Bench
	metrics:
	- f1
	- roc_auc
	- accuracy
	language:
	- en
	model-index:
	- name: MOOZY
	  results:
	  - task:
	      type: image-classification
	      name: Residual Cancer Burden Classification
	    dataset:
	      type: bc_therapy
	      name: BC Therapy
	    metrics:
	    - type: f1
	      value: 0.56
	      name: Weighted F1
	    - type: roc_auc
	      value: 0.74
	      name: Weighted ROC-AUC
	    - type: accuracy
	      value: 0.51
	      name: Balanced Accuracy
	  - task:
	      type: image-classification
	      name: TP53 Mutation Prediction
	    dataset:
	      type: cptac_brca
	      name: CPTAC-BRCA
	    metrics:
	    - type: f1
	      value: 0.87
	      name: Weighted F1
	    - type: roc_auc
	      value: 0.86
	      name: Weighted ROC-AUC
	    - type: accuracy
	      value: 0.86
	      name: Balanced Accuracy
	  - task:
	      type: image-classification
	      name: BAP1 Mutation Prediction
	    dataset:
	      type: cptac_ccrcc
	      name: CPTAC-CCRCC
	    metrics:
	    - type: f1
	      value: 0.89
	      name: Weighted F1
	    - type: roc_auc
	      value: 0.79
	      name: Weighted ROC-AUC
	    - type: accuracy
	      value: 0.78
	      name: Balanced Accuracy
	  - task:
	      type: image-classification
	      name: ACVR2A Mutation Prediction
	    dataset:
	      type: cptac_coad
	      name: CPTAC-COAD
	    metrics:
	    - type: f1
	      value: 0.91
	      name: Weighted F1
	    - type: roc_auc
	      value: 0.91
	      name: Weighted ROC-AUC
	    - type: accuracy
	      value: 0.90
	      name: Balanced Accuracy
	  - task:
	      type: image-classification
	      name: Histologic Grade Classification
	    dataset:
	      type: cptac_lscc
	      name: CPTAC-LSCC
	    metrics:
	    - type: f1
	      value: 0.78
	      name: Weighted F1
	    - type: roc_auc
	      value: 0.75
	      name: Weighted ROC-AUC
	    - type: accuracy
	      value: 0.77
	      name: Balanced Accuracy
	  - task:
	      type: image-classification
	      name: KRAS Mutation Prediction
	    dataset:
	      type: cptac_luad
	      name: CPTAC-LUAD
	    metrics:
	    - type: f1
	      value: 0.85
	      name: Weighted F1
	    - type: roc_auc
	      value: 0.80
	      name: Weighted ROC-AUC
	    - type: accuracy
	      value: 0.79
	      name: Balanced Accuracy
	  - task:
	      type: image-classification
	      name: IDH Status Classification
	    dataset:
	      type: ebrains
	      name: EBRAINS
	    metrics:
	    - type: f1
	      value: 0.97
	      name: Weighted F1
	    - type: roc_auc
	      value: 0.99
	      name: Weighted ROC-AUC
	    - type: accuracy
	      value: 0.97
	      name: Balanced Accuracy
	  - task:
	      type: image-classification
	      name: Treatment Response Prediction
	    dataset:
	      type: mbc
	      name: MBC
	    metrics:
	    - type: f1
	      value: 0.58
	      name: Weighted F1
	    - type: roc_auc
	      value: 0.68
	      name: Weighted ROC-AUC
	    - type: accuracy
	      value: 0.48
	      name: Balanced Accuracy
	---

	# MOOZY: A Patient-First Foundation Model for Computational Pathology

	<p align="center">
	<a href="https://atlasanalyticslab.github.io/MOOZY/"><img src="https://img.shields.io/badge/Project-Page-4285F4?logo=googlechrome&logoColor=white" alt="Project Page"></a>
	<a href="https://arxiv.org/abs/2603.27048"><img src="https://img.shields.io/badge/arXiv-2603.27048-B31B1B?logo=arxiv" alt="arXiv"></a>
	<a href="https://github.com/AtlasAnalyticsLab/MOOZY"><img src="https://img.shields.io/badge/GitHub-Repository-181717?logo=github" alt="GitHub"></a>
	<!-- TODO: update PyPI badge once first version is published -->
	<a href="https://pypi.org/project/moozy/"><img src="https://img.shields.io/pypi/v/moozy?logo=pypi&logoColor=white&label=PyPI" alt="PyPI"></a>
	<a href="https://github.com/AtlasAnalyticsLab/MOOZY/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey" alt="License"></a>
	<a href="https://www.python.org/"><img src="https://img.shields.io/badge/Python-3.10%2B-blue?logo=python&logoColor=white" alt="Python 3.10+"></a>
	</p>

	MOOZY is a slide- and patient-level foundation model for computational pathology in which the patient case, not the individual slide, is the core unit of representation. A vision-only slide encoder, pretrained with masked self-distillation on 77,134 public slides, is aligned with clinical semantics through multi-task supervision over 333 tasks (205 classification, 128 survival) from 56 public datasets spanning 23 anatomical sites. A case transformer explicitly models dependencies across all slides from the same patient, replacing the naive early/late fusion used by prior methods. The full model has 85.77M parameters and is trained entirely on public data.

	![MOOZY data scale](assets/data_scale_overview.png)

	## Table of Contents

	- [Installation](#installation)
	- [Usage](#usage)
	  - [From pre-computed H5 feature files](#from-pre-computed-h5-feature-files)
	  - [From raw whole-slide images](#from-raw-whole-slide-images)
	  - [Python API](#python-api)
	  - [Arguments](#arguments)
	  - [Output format](#output-format)
	- [Architecture](#architecture)
	- [Tasks](#tasks)
	- [Citation](#citation)
	- [License](#license)

	## Installation

	```bash
	pip install moozy
	```

	The checkpoint and task definitions are downloaded automatically from this repository on first use.

	## Usage

	### From pre-computed H5 feature files

	This is the faster path. Pass `.h5` files containing patch features extracted with `lunit_vit_small_patch8_dino` at a patch size of 224x224. The expected format is compatible with [AtlasPatch](https://github.com/AtlasAnalyticsLab/AtlasPatch) and [TRIDENT](https://github.com/mahmoodlab/TRIDENT) outputs.

	```bash
	moozy encode slide_1.h5 slide_2.h5 --output case_embedding.h5
	```
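
When many cases need encoding, the command above can be generated once per case. The following is a minimal sketch, not part of MOOZY itself: it assumes a hypothetical file naming scheme in which the case ID is the part of the file name before the first underscore, and it uses only the `moozy encode ... --output` form shown above. The helper names `group_by_case` and `build_commands` are illustrative.

```python
from collections import defaultdict
from pathlib import Path


def group_by_case(h5_paths):
    """Group feature files by case ID.

    Assumption (hypothetical naming scheme): the case ID is the part of the
    file name before the first underscore, e.g. "caseA_slide1.h5" -> "caseA".
    """
    cases = defaultdict(list)
    for path in sorted(Path(p) for p in h5_paths):
        case_id = path.stem.split("_", 1)[0]
        cases[case_id].append(path)
    return dict(cases)


def build_commands(cases, output_dir):
    """Return one `moozy encode` argument list per case."""
    commands = []
    for case_id, slides in sorted(cases.items()):
        output = Path(output_dir) / f"{case_id}.h5"
        commands.append(
            ["moozy", "encode", *map(str, slides), "--output", str(output)]
        )
    return commands


if __name__ == "__main__":
    files = ["caseA_slide1.h5", "caseA_slide2.h5", "caseB_slide1.h5"]
    for cmd in build_commands(group_by_case(files), "embeddings"):
        # To execute instead of printing: subprocess.run(cmd, check=True)
        print(" ".join(cmd))
```

Each argument list holds all slides of one case, so every invocation produces a single case embedding.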

	### From raw whole-slide images

	Pass slide files directly (`.svs`, `.tiff`, `.ndpi`, `.mrxs`, etc.). MOOZY calls [AtlasPatch](https://github.com/AtlasAnalyticsLab/AtlasPatch) under the hood to segment tissue, extract patches, and compute features. This path requires `atlas-patch`, `sam2`, and the OpenSlide system library (see the [AtlasPatch installation guide](https://github.com/AtlasAnalyticsLab/AtlasPatch#installation)).

	```bash
	moozy encode slide_1.svs slide_2.svs --output case_embedding.h5 --target_mag 20
	```

	### Python API

	```python
	from moozy.encoding import run_encoding

	# From H5 feature files
	run_encoding(
	    slide_paths=["slide_1.h5", "slide_2.h5"],
	    output_path="case_embedding.h5",
	)

	# From raw slides
	run_encoding(
	    slide_paths=["slide_1.svs", "slide_2.svs"],
	    output_path="case_embedding.h5",
	    target_mag=20,
	)
	```
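
For a cohort with several cases, the call above can be wrapped in a loop. The sketch below is an illustration under stated assumptions rather than a MOOZY utility: `check_case` and `encode_cases` are hypothetical helpers, and the only MOOZY call used is `run_encoding` with the `slide_paths`, `output_path`, and `target_mag` arguments shown above. It also checks up front that a case does not mix H5 feature files with raw slides.

```python
from pathlib import Path


def check_case(slide_paths):
    """Return "h5" or "raw" for a case, rejecting mixed input types."""
    if not slide_paths:
        raise ValueError("a case needs at least one slide")
    is_h5 = {Path(p).suffix.lower() == ".h5" for p in slide_paths}
    if len(is_h5) > 1:
        raise ValueError("a case cannot mix H5 feature files and raw slides")
    return "h5" if is_h5.pop() else "raw"


def encode_cases(cases, output_dir, encode_fn=None, target_mag=20):
    """Encode each case in `cases` (a dict: case ID -> list of slide paths).

    Returns a dict mapping each case ID to its output file path.
    """
    if encode_fn is None:
        # Imported lazily so the validation logic can be used without moozy.
        from moozy.encoding import run_encoding as encode_fn
    outputs = {}
    for case_id, slide_paths in cases.items():
        kind = check_case(slide_paths)
        kwargs = {
            "slide_paths": [str(p) for p in slide_paths],
            "output_path": str(Path(output_dir) / f"{case_id}.h5"),
        }
        if kind == "raw":
            kwargs["target_mag"] = target_mag  # only passed for raw slides
        encode_fn(**kwargs)
        outputs[case_id] = kwargs["output_path"]
    return outputs
```

Passing a stand-in for `encode_fn` makes it possible to dry-run the case layout before starting a long encoding job.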

	### Arguments

	| Argument | Default | Description |
	|----------|---------|-------------|
	| `SLIDES` | (required) | One or more H5 feature files or raw slide files forming a single case. Cannot mix the two types. |
	| `--output`, `-o` | (required) | Output H5 file path. |
	| `--mixed_precision` | off | Enable bfloat16 mixed precision. |
	| `--target_mag` | 20 | Magnification for patch extraction from raw slides. Ignored for H5. |
	| `--step_size` | 224 | Stride between patch centers in pixels. Set < 224 for overlap. Ignored for H5. |
	| `--mpp_csv` | - | CSV with `wsi,mpp` columns for microns-per-pixel overrides. Ignored for H5. |
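
Two of the raw-slide options above lend themselves to a short illustration. The sketch below writes a CSV with the `wsi,mpp` header expected by `--mpp_csv` and computes how much adjacent patches overlap for a given `--step_size`, assuming 224-pixel patches. The slide names and resolutions are made up, and whether the `wsi` column should hold a file name, a stem, or a full path is not stated here, so treat that choice as an assumption.

```python
import csv
import os
import tempfile


def write_mpp_csv(path, mpp_by_slide):
    """Write a CSV with a `wsi,mpp` header and one row per slide."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["wsi", "mpp"])
        for wsi, mpp in sorted(mpp_by_slide.items()):
            writer.writerow([wsi, mpp])


def overlap_fraction(step_size, patch_size=224):
    """Fraction of a patch shared with its neighbor along one axis."""
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    # Strides larger than the patch leave gaps, which means no overlap.
    return max(patch_size - step_size, 0) / patch_size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "mpp_overrides.csv")
        # Hypothetical slide names and resolutions, for illustration only.
        write_mpp_csv(csv_path, {"slide_1.svs": 0.25, "slide_2.svs": 0.5})
        with open(csv_path) as handle:
            print(handle.read())
    print(overlap_fraction(224))  # default stride: patches tile without overlap
    print(overlap_fraction(112))  # half-patch stride
```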

	### Output format

	The output H5 file contains a `features` dataset (768-D float32 case embedding) and a `coords` dataset with slide metadata.
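
Reading the result back might look like the sketch below, which relies on `h5py` and NumPy (neither is mentioned above, so both are assumptions about the reader's environment). The exact stored shape of `features` and the layout of `coords` are not specified here, so the helper flattens `features`, checks that 768 values remain, and returns `coords` untouched. `load_case_embedding` is a hypothetical helper name.

```python
def load_case_embedding(path):
    """Return (embedding, coords) from a MOOZY output file."""
    # Imported inside the function so this module loads even without h5py.
    import h5py
    import numpy as np

    with h5py.File(path, "r") as handle:
        features = np.asarray(handle["features"][()])
        coords = handle["coords"][()]
    # Assumption: the file stores exactly one 768-D vector, possibly with
    # leading singleton dimensions, so flattening leaves 768 values.
    embedding = features.reshape(-1)
    if embedding.shape != (768,):
        raise ValueError(f"expected 768 values, got shape {features.shape}")
    return embedding, coords
```

The returned vector can then be passed to any downstream classifier or survival model that accepts a fixed-length feature vector.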

	## Architecture

	| Component | Architecture | Params | Output dim |
	|-----------|-------------|--------|------------|
	| Patch encoder | ViT-S/8 (Lunit DINO) | 21.67M | 384 |
	| Slide encoder | ViT, 6 layers, 768-D, 12 heads, 2D ALiBi | 42.8M | 768 |
	| Case transformer | 3 layers, 12 heads | 21.3M | 768 |
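
As a quick consistency check, the per-component counts in the table can be summed and compared with the 85.77M total quoted in the introduction. The snippet only does arithmetic on the numbers above; the dictionary keys are illustrative names.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_M = {
    "patch_encoder": 21.67,
    "slide_encoder": 42.8,
    "case_transformer": 21.3,
}

# Round to two decimals to absorb floating-point noise in the sum.
total = round(sum(PARAMS_M.values()), 2)

# Share of the total held by each component, in percent.
shares = {name: round(100 * count / total, 1) for name, count in PARAMS_M.items()}

print(total)  # 85.77
for name, share in shares.items():
    print(f"{name}: {share}%")
```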

	## Tasks

	This repository includes 333 task definitions in the `tasks/` directory. Each task has a `config.yaml` (task type, organ, label mapping) and a `task.csv` (annotations and splits). The tasks cover 205 classification and 128 survival endpoints across all 32 TCGA cohorts, all 10 CPTAC cohorts, REG, BC-Therapy, BRACS, CAMELYON17, DHMC Kidney, DHMC LUAD, EBRAINS, IMP Colorectum, IMP Cervix, MBC, MUT-HET-RCC, NADT Prostate, NAT-BRCA, and PANDA.
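
A local copy of the task definitions can be sanity-checked with the standard library alone. The sketch below assumes that each task sits in its own immediate subdirectory of `tasks/`, which is not stated above, and it checks only that both files exist, without parsing them. `find_tasks` is a hypothetical helper name.

```python
from pathlib import Path


def find_tasks(tasks_dir):
    """Return (complete, incomplete) task directory names under `tasks_dir`.

    A task counts as complete when it holds both `config.yaml` and `task.csv`.
    Assumption: each task lives in its own immediate subdirectory.
    """
    complete, incomplete = [], []
    for entry in sorted(Path(tasks_dir).iterdir()):
        if not entry.is_dir():
            continue
        has_both = (entry / "config.yaml").is_file() and (entry / "task.csv").is_file()
        (complete if has_both else incomplete).append(entry.name)
    return complete, incomplete


if __name__ == "__main__":
    tasks_dir = Path("tasks")
    if tasks_dir.is_dir():
        complete, incomplete = find_tasks(tasks_dir)
        print(f"{len(complete)} complete tasks, {len(incomplete)} incomplete")
```

If the directory layout matches the assumption, the number of complete tasks should equal 333, that is, 205 classification plus 128 survival tasks.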

	## Citation

	```bibtex
	@misc{kotp2026moozypatientfirstfoundationmodel,
	  title={MOOZY: A Patient-First Foundation Model for Computational Pathology},
	  author={Yousef Kotp and Vincent Quoc-Huy Trinh and Christopher Pal and Mahdi S. Hosseini},
	  year={2026},
	  eprint={2603.27048},
	  archivePrefix={arXiv},
	  primaryClass={cs.CV},
	  url={https://arxiv.org/abs/2603.27048},
	}
	```

	## License

	[CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Research and non-commercial use only.