---
license: mit
language:
- en
library_name: colipri
pipeline_tag: zero-shot-image-classification
---

# COLIPRI

<!-- Provide a quick summary of what the model is/does. -->

COLIPRI is a 3D vision&ndash;language transformer model trained to encode chest CT scans and reports.

## Model description

<!-- Provide a longer summary of what this model is. -->

COLIPRI was trained using tens of thousands of chest CT scans and reports, without any annotations, using multiple objectives to learn strong joint representations of 3D images and text.
The procedure is described in detail in our manuscript, [_Comprehensive language-image pre-training for 3D medical image understanding_](https://arxiv.org/abs/2510.15042) (Wald et al. 2026).

The weights shared here correspond to our best-performing model, COLIPRI-CRM.

- **Developed by:** Microsoft Health Futures
- **Model type:** 3D vision&ndash;language encoder
- **License:** [MIT](./LICENSE)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

COLIPRI is shared for research purposes only.
It is **not meant to be used for clinical practice**.

The encoders can be plugged into other models, or used independently or jointly for many downstream tasks, such as:

- Image classification with text prompts
- Image clustering
- Text clustering
- Text-to-image retrieval
- Image-to-image retrieval
- Image-to-text retrieval
- Text-to-text retrieval
- Image classification with a classifier
- Text classification with a classifier
- Image segmentation with a decoder
- Report generation with a language decoder

Fine-tuning COLIPRI is typically not necessary to obtain good performance in downstream tasks.

## Getting started

### Installation

```shell
pip install colipri
```

### Usage examples

Below we share some usage snippets to get started with COLIPRI.
A more complete [Jupyter notebook](./COLIPRI_demo.ipynb) is also available.

First, let's get a 3D chest CT we can use for demonstration.
The plotted slices intersect a lung nodule near the heart.

```python
>>> from colipri import load_sample_ct
>>> image = load_sample_ct()
>>> image
ScalarImage(shape: (1, 512, 512, 139); spacing: (0.76, 0.76, 2.50); orientation: LPS+; dtype: torch.IntTensor; memory: 139.0 MiB)
```

The image looks like this:

![Input CT](assets/input.png)

Now, let's instantiate the model and processor.

```python
>>> from colipri import get_model
>>> from colipri import get_processor
>>> model = get_model().cuda()
>>> processor = get_processor()
```

#### Zero-shot classification

```python
>>> from colipri import ZeroShotImageClassificationPipeline
>>> pipeline = ZeroShotImageClassificationPipeline(model, processor)
>>> pipeline(image, ["No lung nodules", "Lung nodules"])
[
    {'score': 0.005, 'label': 'No lung nodules'},
    {'score': 0.995, 'label': 'Lung nodules'}
]
```
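
For intuition, the scores above can be thought of as a softmax over image&ndash;text similarities. The sketch below assumes the common CLIP-style recipe (cosine similarity scaled by a temperature); it is not taken from the `colipri` source. The function name `zero_shot_scores`, the temperature value, and the random stand-in embeddings are all hypothetical.

```python
import numpy as np


def zero_shot_scores(image_embedding, text_embeddings, temperature=0.07):
    """Turn one image embedding and N prompt embeddings into N probabilities."""
    # L2-normalize so that the dot product equals the cosine similarity.
    image = image_embedding / np.linalg.norm(image_embedding)
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    # One similarity per prompt, sharpened by an assumed temperature.
    logits = texts @ image / temperature
    # Numerically stable softmax over the prompts.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


# Random stand-ins for pooled 768-dimensional embeddings (not real model outputs).
rng = np.random.default_rng(0)
image_embedding = rng.normal(size=768)
text_embeddings = rng.normal(size=(2, 768))
labels = ["No lung nodules", "Lung nodules"]

scores = zero_shot_scores(image_embedding, text_embeddings)
for label, score in zip(labels, scores):
    print(f"{label}: {score:.3f}")
```

The scores always sum to one across the prompts, so adding or rewording prompts changes every score, not only the one for the edited prompt.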

#### Feature extraction

```python
>>> import torch
>>> preprocessed_images = processor.process_images(image)
>>> preprocessed_images[0]
ScalarImage(shape: (1, 192, 192, 192); spacing: (2.00, 2.00, 2.00); orientation: SAR+; dtype: torch.FloatTensor; memory: 27.0 MiB)
>>> images_batch = processor.to_images_batch(preprocessed_images)
>>> images_batch.shape
torch.Size([1, 1, 192, 192, 192])
>>> with torch.no_grad():
...     patch_embeddings = model.encode_image(images_batch)
>>> patch_embeddings.shape
torch.Size([1, 768, 24, 24, 24])
>>> with torch.no_grad():
...     pooled_embeddings = model.encode_image(images_batch, pool=True, project=True)
>>> pooled_embeddings.shape
torch.Size([1, 768])
```
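
Pooled embeddings like the one above can be compared directly, for example for image-to-image retrieval. The following is a minimal sketch under our own assumptions: it uses NumPy and random stand-in vectors rather than real COLIPRI outputs, and `rank_by_cosine_similarity` is a hypothetical helper, not part of `colipri`. Real embeddings could be converted first with `pooled_embeddings.cpu().numpy()`.

```python
import numpy as np


def rank_by_cosine_similarity(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    # Normalize so that dot products are cosine similarities.
    query = query / np.linalg.norm(query)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    similarities = gallery @ query
    # Negate to sort in descending order of similarity.
    order = np.argsort(-similarities)
    return order, similarities[order]


# Stand-in data: five random "pooled embeddings" of size 768.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 768))
# The query is a slightly perturbed copy of gallery item 3,
# so item 3 is expected to be ranked first.
query = gallery[3] + 0.1 * rng.normal(size=768)

order, similarities = rank_by_cosine_similarity(query, gallery)
print("Best match:", order[0])
```

The same ranking function would apply to text-to-image or image-to-text retrieval, provided both embeddings live in the shared projected space.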

## Biases, risks, and limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

COLIPRI was trained with data from Turkey and the USA only, so it might be biased towards the populations represented in the training data.
The underlying biases of the training datasets may not be well characterized.

## Environmental impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

- **Hardware type:** NVIDIA A100 GPUs
- **Hours used:** 72 hours × 4 GPUs = 288 GPU-hours
- **Cloud provider:** Azure
- **Compute region:** West US 2
- **Carbon emitted:** 21.6 kg CO₂ eq.

### Compute infrastructure

COLIPRI was trained on [Azure Machine Learning](https://azure.microsoft.com/en-us/products/machine-learning).

#### Hardware

| Stage | Node type | Num. nodes | GPU type | GPUs per node |
| --- | --- | --- | --- | --- |
| Pre-training | [`Standard_NC96ads_A100_v4`](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nca100v4-series?tabs=sizeaccelerators) | 1 | NVIDIA A100 (80 GB) | 4 |
| Evaluation | [`Standard_NC24ads_A100_v4`](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nca100v4-series?tabs=sizeaccelerators) | 1 | NVIDIA A100 (80 GB) | 1 |

#### Software

The main software libraries used in this work were [nnSSL](https://github.com/MIC-DKFZ/nnssl) for training, [TorchIO](https://torchio.org/) for preprocessing and augmentation, [`nifti-zarr-py`](https://github.com/neuroscales/nifti-zarr-py) for data loading, and [nnU-Net](https://github.com/MIC-DKFZ/nnUNet) for segmentation evaluation.

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

### BibTeX

```bibtex
@misc{
    wald2026_colipri,
    title={Comprehensive language-image pre-training for 3D medical image understanding},
    author={Tassilo Wald and Ibrahim Ethem Hamamci and Yuan Gao and Sam Bond-Taylor and Harshita Sharma and Maximilian Ilse and Cynthia Lo and Olesya Melnichenko and Anton Schwaighofer and Noel C. F. Codella and Maria Teodora Wetscherek and Klaus H. Maier-Hein and Panagiotis Korfiatis and Valentina Salvatelli and Javier Alvarez-Valle and Fernando P{\'e}rez-Garc{\'i}a},
    year={2026},
    eprint={2510.15042},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2510.15042},
}
```

### APA

> Wald, T., Hamamci, I. E., Gao, Y., Bond-Taylor, S., Sharma, H., Ilse, M., Lo, C., Melnichenko, O., Schwaighofer, A., Codella, N. C. F., Wetscherek, M. T., Maier-Hein, K. H., Korfiatis, P., Salvatelli, V., Alvarez-Valle, J., & Pérez-García, F. (2026). Comprehensive language-image pre-training for 3D medical image understanding. arXiv. <https://doi.org/10.48550/ARXIV.2510.15042>

## Model card contact

Fernando Pérez-García ([`fperezgarcia@microsoft.com`](mailto:fperezgarcia@microsoft.com)).