Upload README.md with huggingface_hub
README.md CHANGED

@@ -25,7 +25,7 @@ tags:

</div>

-<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellLogo.png" alt="VoxTell Logo"
+<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellLogo.png" alt="VoxTell Logo"/>

## Model Description

@@ -40,6 +40,30 @@ The model is designed for both anatomical and pathological structures across mul
- **Comprehensive anatomy coverage**: Brain, thorax, abdomen, pelvis, musculoskeletal system, and extremities
- **Flexible granularity**: From coarse anatomical labels to fine-grained pathological findings

+## Versions

+We release multiple VoxTell versions (continuously updated) to enable both reproducible research and high-performance downstream applications.

+### **VoxTell v1.1 (Recommended)**

+- **Info**: This is the current default version
+- **Training Data**: Trained on **all datasets** from the paper and additional sources (190 datasets, ~68,500 volumes)
+- **Split**: Includes the test sets from the paper in the training corpus
+- **Sampling Strategy**:
+  - 95% probability: Semantic dataset corpus
+  - 5% probability: Image-text-mask triplets from instance-focused datasets
+- **Use Case**: Recommended for general application, inference, and fine-tuning. This version maximizes supervision and concept coverage for stronger general-purpose performance

+### **VoxTell v1.0 (Deprecated)**

+- **Info**: This version was used for the experiments in the paper but contains known issues that have been fixed in v1.1. It is **not recommended** for general use.
+- **Training Data**: Trained on 158 datasets (~62,000 volumes)
+- **Split**: Maintains the strict train/test separation described in the [paper](https://arxiv.org/abs/2511.11450)
+- **Use Case**: Reproducing the results reported in the paper

+<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellConcepts.png" alt="Concept Coverage"/>

## Architecture

VoxTell employs a multi-stage vision-language fusion approach:
@@ -49,7 +73,7 @@ VoxTell employs a multi-stage vision-language fusion approach:
- **Prompt Decoder**: Transforms text queries and image latents into multi-scale text features
- **Image Decoder**: Fuses visual and textual information at multiple resolutions using MaskFormer-style query-image fusion with deep supervision

-<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellArchitecture.png" alt="Architecture Diagram"
+<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellArchitecture.png" alt="Architecture Diagram"/>

## Intended Use

@@ -66,27 +90,16 @@ VoxTell employs a multi-stage vision-language fusion approach:
- Real-time emergency medical decision-making
- Standalone clinical decision support without human oversight

-## Training Data

-This checkpoint of VoxTell is trained on an **extended version** of the dataset described in the paper:

-- **190 public 3D medical imaging datasets**.
-- Approximately **68,500 volumetric images**.
-- Brain, head & neck, thorax, abdomen, pelvis, and extremities.
-- Major organs, muscles, vasculature, substructures, pathologies and lesions
-- Multiple imaging modalities (CT, PET, MRI)

-While the paper reports training on a subset of datasets with dedicated train/test splits, this checkpoint is trained on **all available datasets (train + test) used in the paper**. During training, the corpus of semantic datasets is sampled with a probability of 95%, while the image-text-mask triplets from the instance-focussed dataset are sampled with the remaining 5%. For more information about semantic and instance based datasets see the [paper](https://arxiv.org/abs/2511.11450).

-<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellConcepts.png" alt="Concept Coverage" width="600"/>

## Performance

VoxTell achieves state-of-the-art performance on anatomical and pathological segmentation tasks across multiple medical imaging benchmarks. Detailed performance metrics and comparisons are available in the [paper](https://arxiv.org/abs/2511.11450).

+## Limitations / Known Issues

- Performance may vary on imaging modalities or anatomical regions underrepresented in the training data
+- Prompting for structures that are absent from the image and were never seen in this modality during training (e.g., "liver" in a brain MRI) may lead to undesired results
- Text prompt quality and specificity affect segmentation accuracy
- Not validated for direct clinical use without expert review
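The Versions section added above specifies that v1.1 draws training cases from the semantic dataset corpus with 95% probability and from the instance-focused image-text-mask triplets with the remaining 5%. The snippet below is a minimal sketch of such a two-corpus sampling scheme under those stated probabilities; the corpus contents and function name are illustrative assumptions, not VoxTell's training code.

```python
import random

# Illustrative stand-ins for the two corpora named in the Versions section;
# the real corpora are large collections of 3D volumes, not strings.
semantic_corpus = ["semantic case 1", "semantic case 2", "semantic case 3"]
instance_corpus = ["image-text-mask triplet A", "image-text-mask triplet B"]

def sample_training_case(p_semantic: float = 0.95) -> str:
    """Draw from the semantic corpus with probability p_semantic,
    otherwise from the instance-focused corpus (0.05 in the card)."""
    corpus = semantic_corpus if random.random() < p_semantic else instance_corpus
    return random.choice(corpus)

if __name__ == "__main__":
    draws = [sample_training_case() for _ in range(10_000)]
    semantic_share = sum(d in semantic_corpus for d in draws) / len(draws)
    print(f"semantic share over 10,000 draws: {semantic_share:.3f}")  # ~0.95
```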
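The Architecture hunk describes an Image Decoder that fuses visual and textual information with MaskFormer-style query-image fusion, where a prompt-derived query is matched against per-voxel features to produce a mask. The PyTorch snippet below illustrates that general mechanism at a single scale; the shapes, module names, and the omission of multi-scale fusion and deep supervision are simplifying assumptions, not VoxTell's implementation.

```python
import torch
import torch.nn as nn

class QueryImageFusion(nn.Module):
    """Toy, single-scale illustration of MaskFormer-style fusion: a text-derived
    query attends to image tokens, then a mask is read out as the dot product
    between the refined query and per-voxel embeddings."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_query: torch.Tensor, voxel_feats: torch.Tensor) -> torch.Tensor:
        # text_query: (B, 1, C), one query per prompt; voxel_feats: (B, N, C), flattened D*H*W voxels
        refined, _ = self.cross_attn(text_query, voxel_feats, voxel_feats)
        refined = self.proj(refined)                          # (B, 1, C)
        mask_logits = torch.einsum("bqc,bnc->bqn", refined, voxel_feats)
        return mask_logits                                    # (B, 1, N) per-voxel logits for the prompt

if __name__ == "__main__":
    fusion = QueryImageFusion()
    logits = fusion(torch.randn(2, 1, 64), torch.randn(2, 8 * 8 * 8, 64))
    print(logits.shape)  # torch.Size([2, 1, 512])
```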
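This README is itself uploaded to a Hugging Face model repository (see the commit title), so the checkpoints can presumably be fetched from the Hub with huggingface_hub. The sketch below shows only the generic download call; the repository id is a placeholder assumption, since the card does not state it.

```python
from huggingface_hub import snapshot_download

# "MIC-DKFZ/VoxTell" is a guessed placeholder repo id; replace it with the
# actual Hub repository this model card belongs to.
checkpoint_dir = snapshot_download(repo_id="MIC-DKFZ/VoxTell")
print(f"VoxTell checkpoint files downloaded to: {checkpoint_dir}")
```

Running inference would then go through the VoxTell code base referenced by the asset URLs above (MIC-DKFZ/VoxTell on GitHub); this card does not document its Python API, so none is shown here.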