---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-to-3d
tags:
- Segmentation
- Text
- Prompt
- Medical
- Vision-Language
- CT
- MRI
- PET
- Radiology
---

# VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv-2511.11450-B31B1B.svg)](https://arxiv.org/abs/2511.11450)&#160;
[![GitHub](https://img.shields.io/badge/GitHub-VoxTell-181717?logo=github&logoColor=white)](https://github.com/MIC-DKFZ/VoxTell)&#160;
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Model-VoxTell-yellow)](https://huggingface.co/MIC-DKFZ/VoxTell)&#160;
[![napari](https://img.shields.io/badge/napari-plugin-80d1ff)](https://github.com/MIC-DKFZ/napari-voxtell)

</div>

<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellLogo.png" alt="VoxTell Logo"/>

## Model Description

VoxTell is a state-of-the-art 3D vision-language segmentation model that directly maps free-form text prompts to volumetric masks in medical imaging. Unlike traditional segmentation models that require predefined class labels, VoxTell accepts natural language descriptions ranging from single words to full clinical sentences, enabling intuitive and flexible medical image analysis.

The model covers both anatomical and pathological structures across multiple imaging modalities (CT, PET, MRI). It was trained on 1,000+ concepts and, through its multi-stage vision-language fusion architecture, maintains strong generalization to related unseen classes.

## Key Features

- **Free-text prompting**: Generate 3D segmentation masks using natural language descriptions
- **Multi-modality support**: Works across CT, PET, and MRI imaging modalities
- **Comprehensive anatomy coverage**: Brain, thorax, abdomen, pelvis, musculoskeletal system, and extremities
- **Flexible granularity**: From coarse anatomical labels to fine-grained pathological findings

## Versions

We release multiple VoxTell versions (continuously updated) to enable both reproducible research and high-performance downstream applications.

#### **VoxTell v1.1 (Recommended)**

- **Info**: This is the current default version
- **Training Data**: Trained on **all datasets** from the paper and additional sources (190 datasets, ~68,500 volumes)
- **Split**: Includes the test sets from the paper in the training corpus
- **Sampling Strategy**: 
  - 95% probability: Semantic datasets corpus
  - 5% probability: Image-text-mask triplets from instance-focused datasets
- **Use Case**: Recommended for general application, inference, and fine-tuning. This version maximizes supervision and concept coverage for stronger general-purpose performance

#### **VoxTell v1.0 (Deprecated)**

- **Info**: This version was used for the experiments in the paper but contains known issues that have been fixed in v1.1. It is **not recommended** for general use.
- **Training Data**: Trained on 158 datasets (~62,000 volumes)
- **Split**: Maintains strict train/test separation as described in the [paper](https://arxiv.org/abs/2511.11450)
- **Use Case**: Reproducibility of the results reported in the paper

<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellConcepts.png" alt="Concept Coverage"/>

## How to Download

You can download VoxTell checkpoints using the Hugging Face `huggingface_hub` library:

```python
from huggingface_hub import snapshot_download

MODEL_NAME = "voxtell_v1.1"       # Updated models may be available in the future
DOWNLOAD_DIR = "/home/user/temp"  # Optionally specify the download directory

download_path = snapshot_download(
    repo_id="mrokuss/VoxTell",
    allow_patterns=[f"{MODEL_NAME}/*", "*.json"],
    local_dir=DOWNLOAD_DIR,
)
```

## 🛠 Installation

### 1. Create a Virtual Environment

VoxTell supports Python 3.10+ and works with Conda, pip, or any other virtual environment manager. Here's an example using Conda:

```bash
conda create -n voxtell python=3.12
conda activate voxtell
```

### 2. Install PyTorch

> [!WARNING]
> **Temporary Compatibility Warning**  
> There is a known issue with **PyTorch 2.9.0** causing **OOM errors during inference** in `VoxTell` (related to 3D convolutions — see the PyTorch issue [here](https://github.com/pytorch/pytorch/issues/166122)).  
> **Until this is resolved, please use PyTorch 2.8.0 or earlier.**

Install PyTorch compatible with your CUDA version. For example, for Ubuntu with a modern NVIDIA GPU:

```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu126
```

*For other configurations (macOS, CPU, different CUDA versions), please refer to the [PyTorch Get Started](https://pytorch.org/get-started/previous-versions/) page.*
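If you want to guard against the PyTorch 2.9.0 issue in your own scripts, a small stdlib-only check against `torch.__version__` might look like this (the `torch_version_ok` helper is my own sketch, not part of VoxTell):

```python
def torch_version_ok(version_str: str) -> bool:
    """Return True if this PyTorch version predates the known 2.9.0 3D-conv OOM issue."""
    # Strip local build suffixes such as "+cu126" before comparing numerically.
    numeric = version_str.split("+")[0].split(".")[:3]
    return tuple(int(p) for p in numeric) < (2, 9, 0)

print(torch_version_ok("2.8.0+cu126"))  # True
print(torch_version_ok("2.9.0"))        # False
```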

Then install VoxTell itself via pip (you can also use [uv](https://docs.astral.sh/uv/)):

```bash
pip install voxtell
```

or install directly from the GitHub repository:

```bash
git clone https://github.com/MIC-DKFZ/VoxTell
cd VoxTell
pip install -e .
```

### 3. Python API

For more control or integration into Python workflows, use the Python API:

```python
import torch
from voxtell.inference.predictor import VoxTellPredictor
from nnunetv2.imageio.nibabel_reader_writer import NibabelIOWithReorient

# Select device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load image
image_path = "/path/to/your/image.nii.gz"
img, _ = NibabelIOWithReorient().read_images([image_path])

# Define text prompts
text_prompts = ["liver", "right kidney", "left kidney", "spleen"]

# Initialize predictor
predictor = VoxTellPredictor(
    model_dir="/path/to/voxtell_model_directory",
    device=device,
)

# Run prediction
# Output shape: (num_prompts, x, y, z)
voxtell_seg = predictor.predict_single_image(img, text_prompts)
```
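Since the output is one binary mask per prompt, a common follow-up is computing the volume of each predicted structure. A minimal sketch (the predictor output is mocked with dummy data here so it runs standalone; with real output, take the voxel spacing from your image header):

```python
import numpy as np

# Stand-ins for the predictor output above: one binary mask per prompt.
text_prompts = ["liver", "spleen"]
voxtell_seg = np.zeros((2, 8, 8, 8), dtype=np.uint8)
voxtell_seg[0, :4, :4, :4] = 1  # pretend the "liver" mask covers 64 voxels

spacing_mm = (1.5, 1.5, 1.5)  # in practice, read this from your image header
voxel_volume_ml = float(np.prod(spacing_mm)) / 1000.0  # mm^3 -> mL

for mask, prompt in zip(voxtell_seg, text_prompts):
    volume_ml = mask.sum() * voxel_volume_ml
    print(f"{prompt}: {mask.sum()} voxels, {volume_ml:.2f} mL")
```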

### 4. Optional: Visualize Results

You can visualize the segmentation results using [napari](https://napari.org/):

```bash
pip install "napari[all]"
```

```python
import napari
import numpy as np

# Create a napari viewer and add the original image
viewer = napari.Viewer()
viewer.add_image(img, name='Image')

# Add segmentation results as label layers, one per prompt
for i, prompt in enumerate(text_prompts):
    viewer.add_labels(voxtell_seg[i].astype(np.uint8), name=prompt)

# Run napari
napari.run()
```

## Important: Image Orientation and Spacing

- ⚠️ **Image Orientation (Critical)**: For correct anatomical localization (e.g., distinguishing left from right), images **must be in RAS orientation**. VoxTell was trained on data reoriented using [this specific reader](https://github.com/MIC-DKFZ/nnUNet/blob/86606c53ef9f556d6f024a304b52a48378453641/nnunetv2/imageio/nibabel_reader_writer.py#L101). Orientation mismatches are a common source of error: a telltale symptom is a simple prompt like "liver" failing and segmenting parts of the spleen instead. Make sure your image metadata is correct.

- **Image Spacing**: For faster inference, the model does not resample images to a standardized spacing. Performance may degrade on images with very uncommon voxel spacings (e.g., ultra-high-resolution brain MRI). In such cases, consider resampling the image to a more typical clinical spacing (e.g., 1.5×1.5×1.5 mm³) before segmentation.

---

## Architecture

VoxTell employs a multi-stage vision-language fusion approach:

- **Image Encoder**: Processes 3D volumetric input into latent feature representations
- **Prompt Encoder**: We use the frozen [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) model to embed text prompts
- **Prompt Decoder**: Transforms text queries and image latents into multi-scale text features
- **Image Decoder**: Fuses visual and textual information at multiple resolutions using MaskFormer-style query-image fusion with deep supervision

<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellArchitecture.png" alt="Architecture Diagram"/>

## Intended Use

#### Primary Use Cases

- Research in vision-language models for medical image analysis
- Text-promptable or automated segmentation of anatomical structures in medical imaging
- Identification and delineation of pathological findings

#### Out-of-Scope Use

- Clinical diagnosis without expert radiologist review
- Real-time emergency medical decision-making
- Commercial use

## Performance

VoxTell achieves state-of-the-art performance on anatomical and pathological segmentation tasks across multiple medical imaging benchmarks. Detailed performance metrics and comparisons are available in the [paper](https://arxiv.org/abs/2511.11450).

Tip: Experiment with different prompts tailored to your use case. For example, the plural prompt `lesions` is known to over-segment compared to the singular `lesion`.


## Limitations / Known Issues

- Performance may vary on imaging modalities or anatomical regions underrepresented in training data
- Prompting for structures that are absent from the image, or that were never seen in that modality during training (e.g., "liver" in a brain MRI), may produce undesired results
- Text prompt quality and specificity affect segmentation accuracy
- Not validated for direct clinical use without expert review

## Citation

```bibtex
@misc{rokuss2025voxtell,
      title={VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation}, 
      author={Maximilian Rokuss and Moritz Langenberg and Yannick Kirchhoff and Fabian Isensee and Benjamin Hamm and Constantin Ulrich and Sebastian Regnery and Lukas Bauer and Efthimios Katsigiannopulos and Tobias Norajitra and Klaus Maier-Hein},
      year={2025},
      eprint={2511.11450},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.11450}, 
}
```

---

## 📬 Contact

For questions, issues, or collaborations, please contact:

📧 maximilian.rokuss@dkfz-heidelberg.de / moritz.langenberg@dkfz-heidelberg.de