Commit 27b04b3 · Parent(s): a795080

docs: add model card and data scale figure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- .gitattributes +1 -0
- README.md +282 -0
- assets/data_scale_overview.png +3 -0
.gitattributes CHANGED

```diff
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
```
README.md ADDED

@@ -0,0 +1,282 @@
---
license: cc-by-nc-sa-4.0
library_name: moozy
pipeline_tag: feature-extraction
base_model: 1aurent/vit_small_patch8_224.lunit_dino
tags:
- pathology
- computational-pathology
- digital-pathology
- foundation-model
- whole-slide-image
- vision-transformer
- self-supervised-learning
- slide-encoder
- case-encoder
- histopathology
- medical-imaging
- multiple-instance-learning
- slide-level-representation
- patient-level-representation
- multi-task-learning
- survival-analysis
- cancer
- oncology
- tissue-classification
- mutation-prediction
- TCGA
- CPTAC
- pytorch
- transformer
datasets:
- MahmoodLab/Patho-Bench
metrics:
- f1
- roc_auc
- accuracy
language:
- en
model-index:
- name: MOOZY
  results:
  - task:
      type: image-classification
      name: Residual Cancer Burden Classification
    dataset:
      type: bc_therapy
      name: BC Therapy
    metrics:
    - type: f1
      value: 0.56
      name: Weighted F1
    - type: roc_auc
      value: 0.74
      name: Weighted ROC-AUC
    - type: accuracy
      value: 0.51
      name: Balanced Accuracy
  - task:
      type: image-classification
      name: TP53 Mutation Prediction
    dataset:
      type: cptac_brca
      name: CPTAC-BRCA
    metrics:
    - type: f1
      value: 0.87
      name: Weighted F1
    - type: roc_auc
      value: 0.86
      name: Weighted ROC-AUC
    - type: accuracy
      value: 0.86
      name: Balanced Accuracy
  - task:
      type: image-classification
      name: BAP1 Mutation Prediction
    dataset:
      type: cptac_ccrcc
      name: CPTAC-CCRCC
    metrics:
    - type: f1
      value: 0.89
      name: Weighted F1
    - type: roc_auc
      value: 0.79
      name: Weighted ROC-AUC
    - type: accuracy
      value: 0.78
      name: Balanced Accuracy
  - task:
      type: image-classification
      name: ACVR2A Mutation Prediction
    dataset:
      type: cptac_coad
      name: CPTAC-COAD
    metrics:
    - type: f1
      value: 0.91
      name: Weighted F1
    - type: roc_auc
      value: 0.91
      name: Weighted ROC-AUC
    - type: accuracy
      value: 0.90
      name: Balanced Accuracy
  - task:
      type: image-classification
      name: Histologic Grade Classification
    dataset:
      type: cptac_lscc
      name: CPTAC-LSCC
    metrics:
    - type: f1
      value: 0.78
      name: Weighted F1
    - type: roc_auc
      value: 0.75
      name: Weighted ROC-AUC
    - type: accuracy
      value: 0.77
      name: Balanced Accuracy
  - task:
      type: image-classification
      name: KRAS Mutation Prediction
    dataset:
      type: cptac_luad
      name: CPTAC-LUAD
    metrics:
    - type: f1
      value: 0.85
      name: Weighted F1
    - type: roc_auc
      value: 0.80
      name: Weighted ROC-AUC
    - type: accuracy
      value: 0.79
      name: Balanced Accuracy
  - task:
      type: image-classification
      name: IDH Status Classification
    dataset:
      type: ebrains
      name: EBRAINS
    metrics:
    - type: f1
      value: 0.97
      name: Weighted F1
    - type: roc_auc
      value: 0.99
      name: Weighted ROC-AUC
    - type: accuracy
      value: 0.97
      name: Balanced Accuracy
  - task:
      type: image-classification
      name: Treatment Response Prediction
    dataset:
      type: mbc
      name: MBC
    metrics:
    - type: f1
      value: 0.58
      name: Weighted F1
    - type: roc_auc
      value: 0.68
      name: Weighted ROC-AUC
    - type: accuracy
      value: 0.48
      name: Balanced Accuracy
---

# MOOZY: A Patient-First Foundation Model for Computational Pathology

<p align="center">
<a href="https://github.com/AtlasAnalyticsLab/MOOZY"><img src="https://img.shields.io/badge/GitHub-Repository-181717?logo=github" alt="GitHub"></a>
<a href="https://pypi.org/project/moozy/"><img src="https://img.shields.io/pypi/v/moozy?logo=pypi&logoColor=white&label=PyPI" alt="PyPI"></a>
<a href="#citation"><img src="https://img.shields.io/badge/Paper-Coming%20Soon-B31B1B" alt="Paper"></a>
</p>

MOOZY is a slide- and patient-level foundation model for computational pathology in which the patient case, not the individual slide, is the core unit of representation. A vision-only slide encoder pretrained with masked self-distillation on 77,134 public slides is aligned with clinical semantics through multi-task supervision over 333 tasks (205 classification, 128 survival) from 56 public datasets spanning 23 anatomical sites. A case transformer explicitly models dependencies across all slides from the same patient, replacing the naive early/late fusion used by prior methods. The full model has 85.77M parameters and was trained entirely on public data.



## Table of Contents

- [Installation](#installation)
- [Usage](#usage)
  - [From pre-computed H5 feature files](#from-pre-computed-h5-feature-files)
  - [From raw whole-slide images](#from-raw-whole-slide-images)
  - [Python API](#python-api)
  - [Arguments](#arguments)
  - [Output format](#output-format)
- [Architecture](#architecture)
- [Tasks](#tasks)
- [Citation](#citation)
- [License](#license)

## Installation

```bash
pip install moozy
```

The checkpoint and task definitions are downloaded automatically from this repository on first use.

## Usage

### From pre-computed H5 feature files

This is the faster path. Pass `.h5` files containing patch features extracted with `lunit_vit_small_patch8_dino` at 224x224 patch size. Compatible with [AtlasPatch](https://github.com/AtlasAnalyticsLab/AtlasPatch) and [TRIDENT](https://github.com/mahmoodlab/TRIDENT) outputs.

```bash
moozy encode slide_1.h5 slide_2.h5 --output case_embedding.h5
```
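A compatible input file can be sketched by writing a dummy patch-feature H5 with `h5py`. The 384-D feature size follows from the ViT-S/8 patch encoder; the `features`/`coords` dataset names are an assumption based on common AtlasPatch/TRIDENT-style layouts, so verify them against your extractor's actual output:

```python
import h5py
import numpy as np

# Hypothetical sketch of one slide's patch-feature file: N patches, each with a
# 384-D ViT-S/8 feature and an (x, y) pixel coordinate. The dataset names
# "features" and "coords" are assumptions, not a documented contract.
n_patches, feat_dim = 1000, 384
with h5py.File("slide_1.h5", "w") as f:
    f.create_dataset(
        "features", data=np.random.rand(n_patches, feat_dim).astype(np.float32)
    )
    f.create_dataset(
        "coords", data=np.random.randint(0, 50_000, size=(n_patches, 2))
    )
```

A file shaped like this (one per slide in the case) is what the `moozy encode` command above consumes.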

### From raw whole-slide images

Pass slide files directly (`.svs`, `.tiff`, `.ndpi`, `.mrxs`, etc.). MOOZY calls [AtlasPatch](https://github.com/AtlasAnalyticsLab/AtlasPatch) under the hood to segment tissue, extract patches, and compute features. This path requires `atlas-patch`, `sam2`, and the OpenSlide system library (see the [AtlasPatch installation guide](https://github.com/AtlasAnalyticsLab/AtlasPatch#installation)).

```bash
moozy encode slide_1.svs slide_2.svs --output case_embedding.h5 --target_mag 20
```

### Python API

```python
from moozy.encoding import run_encoding

# From H5 feature files
run_encoding(
    slide_paths=["slide_1.h5", "slide_2.h5"],
    output_path="case_embedding.h5",
)

# From raw slides
run_encoding(
    slide_paths=["slide_1.svs", "slide_2.svs"],
    output_path="case_embedding.h5",
    target_mag=20,
)
```

### Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `SLIDES` | (required) | One or more H5 feature files or raw slide files forming a single case. The two types cannot be mixed. |
| `--output`, `-o` | (required) | Output H5 file path. |
| `--mixed_precision` | off | Enable bfloat16 mixed precision. |
| `--target_mag` | 20 | Magnification for patch extraction from raw slides. Ignored for H5 input. |
| `--step_size` | 224 | Stride between patch centers in pixels. Set below 224 for overlapping patches. Ignored for H5 input. |
| `--mpp_csv` | - | CSV with `wsi,mpp` columns for microns-per-pixel overrides. Ignored for H5 input. |
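For scanners that omit resolution metadata, an `--mpp_csv` override file can be produced with the stdlib `csv` module. The `wsi` and `mpp` column names come from the table above; the file paths and MPP values below are purely illustrative:

```python
import csv

# Illustrative microns-per-pixel overrides. Roughly, 0.25 mpp corresponds to a
# 40x scan and 0.50 mpp to a 20x scan; the slide names here are made up.
rows = [
    {"wsi": "slide_1.svs", "mpp": 0.25},
    {"wsi": "slide_2.svs", "mpp": 0.50},
]
with open("mpp_overrides.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["wsi", "mpp"])
    writer.writeheader()
    writer.writerows(rows)
```

The resulting file is then passed as `moozy encode slide_1.svs slide_2.svs -o case_embedding.h5 --mpp_csv mpp_overrides.csv`.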

### Output format

The output H5 file contains a `features` dataset (768-D float32 case embedding) and a `coords` dataset with slide metadata.
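Reading the embedding back needs only `h5py`. The `features` dataset name and 768-D size come from the description above; the reader function and the synthetic file used to exercise it are a sketch, not part of the `moozy` API:

```python
import h5py
import numpy as np

def load_case_embedding(path):
    """Return the 768-D case embedding stored in a MOOZY output file."""
    with h5py.File(path, "r") as f:
        emb = np.asarray(f["features"])
    return emb.squeeze()  # tolerate a (1, 768) vs (768,) layout

# Synthetic stand-in for a real MOOZY output, just to exercise the reader.
with h5py.File("case_embedding.h5", "w") as f:
    f.create_dataset("features", data=np.zeros((1, 768), dtype=np.float32))

emb = load_case_embedding("case_embedding.h5")
```

`emb` is then a plain `(768,)` float32 vector, ready to feed a downstream linear probe or survival head.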

## Architecture

| Component | Architecture | Params | Output dim |
|-----------|-------------|--------|------------|
| Patch encoder | ViT-S/8 (Lunit DINO) | 21.67M | 384 |
| Slide encoder | ViT, 6 layers, 768-D, 12 heads, 2D ALiBi | 42.8M | 768 |
| Case transformer | 3 layers, 12 heads | 21.3M | 768 |
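The dimension flow through the three stages can be illustrated with random projections standing in for the real networks. Only the tensor shapes match the table above; everything else is a toy stand-in, not the MOOZY forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three stages; only the shapes mirror MOOZY.
def patch_encoder(n_patches):
    # ViT-S/8: one 384-D feature per tissue patch.
    return rng.standard_normal((n_patches, 384))

def slide_encoder(patch_feats):
    # Slide ViT: variable-length patch set -> one 768-D slide embedding.
    proj = rng.standard_normal((384, 768))
    return (patch_feats @ proj).mean(axis=0)

def case_transformer(slide_embs):
    # Case transformer: all slide embeddings of a patient -> one 768-D case embedding.
    return np.stack(slide_embs).mean(axis=0)

# A two-slide case with 1,000 and 1,500 tissue patches respectively.
slides = [patch_encoder(1000), patch_encoder(1500)]
case = case_transformer([slide_encoder(s) for s in slides])
```

The point of the sketch is the shape contract: per-patch 384-D features collapse to one 768-D vector per slide, and the case stage attends across slides rather than concatenating or averaging features early.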

## Tasks

This repository includes 333 task definitions in the `tasks/` directory. Each task has a `config.yaml` (task type, organ, label mapping) and a `task.csv` (annotations and splits). The tasks cover 205 classification and 128 survival endpoints across 32 TCGA cohorts, 14 CPTAC cohorts, the REG dataset, and other public sources.

## Citation

```bibtex
@article{moozy,
  title  = {MOOZY: A Patient-First Foundation Model for Computational Pathology},
  author = {TODO},
  year   = {TODO},
}
```

## License

[CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Research and non-commercial use only.
assets/data_scale_overview.png ADDED (stored with Git LFS)