Upload README.md with huggingface_hub
README.md CHANGED

@@ -25,7 +25,7 @@ tags:

</div>

-<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellLogo.png" alt="VoxTell Logo"
+<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellLogo.png" alt="VoxTell Logo"/>

## Model Description

@@ -40,6 +40,30 @@ The model is designed for both anatomical and pathological structures across mul
- **Comprehensive anatomy coverage**: Brain, thorax, abdomen, pelvis, musculoskeletal system, and extremities
- **Flexible granularity**: From coarse anatomical labels to fine-grained pathological findings

+## Versions

+We release multiple VoxTell versions (continuously updated) to enable both reproducible research and high-performance downstream applications.

+### **VoxTell v1.1 (Recommended)**

+- **Info**: This is the current default version
+- **Training Data**: Trained on **all datasets** from the paper and additional sources (190 datasets, ~68,500 volumes)
+- **Split**: Includes the test sets from the paper in the training corpus
+- **Sampling Strategy**:
+  - 95% probability: Semantic dataset corpus
+  - 5% probability: Image-text-mask triplets from instance-focused datasets
+- **Use Case**: Recommended for general application, inference, and fine-tuning. This version maximizes supervision and concept coverage for stronger general-purpose performance

+### **VoxTell v1.0 (Deprecated)**

+- **Info**: This version was used for the experiments in the paper but contains known issues that have been fixed in v1.1. It is **not recommended** for general use.
+- **Training Data**: Trained on 158 datasets (~62,000 volumes)
+- **Split**: Maintains the strict train/test separation described in the [paper](https://arxiv.org/abs/2511.11450)
+- **Use Case**: Reproducing the results reported in the paper

+<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellConcepts.png" alt="Concept Coverage"/>

## Architecture

VoxTell employs a multi-stage vision-language fusion approach:
@@ -49,7 +73,7 @@ VoxTell employs a multi-stage vision-language fusion approach:
- **Prompt Decoder**: Transforms text queries and image latents into multi-scale text features
- **Image Decoder**: Fuses visual and textual information at multiple resolutions using MaskFormer-style query-image fusion with deep supervision

-<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellArchitecture.png" alt="Architecture Diagram"
+<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellArchitecture.png" alt="Architecture Diagram"/>

## Intended Use

@@ -66,27 +90,16 @@ VoxTell employs a multi-stage vision-language fusion approach:
- Real-time emergency medical decision-making
- Standalone clinical decision support without human oversight

-## Training Data

-This checkpoint of VoxTell is trained on an **extended version** of the dataset described in the paper:

-- **190 public 3D medical imaging datasets**.
-- Approximately **68,500 volumetric images**.
-- Brain, head & neck, thorax, abdomen, pelvis, and extremities.
-- Major organs, muscles, vasculature, substructures, pathologies and lesions
-- Multiple imaging modalities (CT, PET, MRI)

-While the paper reports training on a subset of datasets with dedicated train/test splits, this checkpoint is trained on **all available datasets (train + test) used in the paper**. During training, the corpus of semantic datasets is sampled with a probability of 95%, while the image-text-mask triplets from the instance-focussed dataset are sampled with the remaining 5%. For more information about semantic and instance based datasets see the [paper](https://arxiv.org/abs/2511.11450).

-<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellConcepts.png" alt="Concept Coverage" width="600"/>

## Performance

VoxTell achieves state-of-the-art performance on anatomical and pathological segmentation tasks across multiple medical imaging benchmarks. Detailed performance metrics and comparisons are available in the [paper](https://arxiv.org/abs/2511.11450).

+## Limitations / Known Issues

- Performance may vary on imaging modalities or anatomical regions underrepresented in the training data
+- Prompting for structures that are absent from the image and were never seen in this modality during training (e.g., "liver" in a brain MRI) may lead to undesired results
- Text prompt quality and specificity affect segmentation accuracy
- Not validated for direct clinical use without expert review
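The Versions section added above specifies that v1.1 draws training cases from the semantic dataset corpus with 95% probability and from the instance-focused image-text-mask triplets with the remaining 5%. The snippet below is a minimal sketch of such a two-corpus sampling scheme under those stated probabilities; the corpus contents and function name are illustrative assumptions, not VoxTell's training code.

```python
import random

# Illustrative stand-ins for the two corpora named in the Versions section;
# the real corpora are large collections of 3D volumes, not strings.
semantic_corpus = ["semantic case 1", "semantic case 2", "semantic case 3"]
instance_corpus = ["image-text-mask triplet A", "image-text-mask triplet B"]

def sample_training_case(p_semantic: float = 0.95) -> str:
    """Draw from the semantic corpus with probability p_semantic,
    otherwise from the instance-focused corpus (0.05 in the card)."""
    corpus = semantic_corpus if random.random() < p_semantic else instance_corpus
    return random.choice(corpus)

if __name__ == "__main__":
    draws = [sample_training_case() for _ in range(10_000)]
    semantic_share = sum(d in semantic_corpus for d in draws) / len(draws)
    print(f"semantic share over 10,000 draws: {semantic_share:.3f}")  # ~0.95
```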
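The Architecture hunk describes an Image Decoder that fuses visual and textual information with MaskFormer-style query-image fusion, where a prompt-derived query is matched against per-voxel features to produce a mask. The PyTorch snippet below illustrates that general mechanism at a single scale; the shapes, module names, and the omission of multi-scale fusion and deep supervision are simplifying assumptions, not VoxTell's implementation.

```python
import torch
import torch.nn as nn

class QueryImageFusion(nn.Module):
    """Toy, single-scale illustration of MaskFormer-style fusion: a text-derived
    query attends to image tokens, then a mask is read out as the dot product
    between the refined query and per-voxel embeddings."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_query: torch.Tensor, voxel_feats: torch.Tensor) -> torch.Tensor:
        # text_query: (B, 1, C), one query per prompt; voxel_feats: (B, N, C), flattened D*H*W voxels
        refined, _ = self.cross_attn(text_query, voxel_feats, voxel_feats)
        refined = self.proj(refined)                          # (B, 1, C)
        mask_logits = torch.einsum("bqc,bnc->bqn", refined, voxel_feats)
        return mask_logits                                    # (B, 1, N) per-voxel logits for the prompt

if __name__ == "__main__":
    fusion = QueryImageFusion()
    logits = fusion(torch.randn(2, 1, 64), torch.randn(2, 8 * 8 * 8, 64))
    print(logits.shape)  # torch.Size([2, 1, 512])
```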
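This README is itself uploaded to a Hugging Face model repository (see the commit title), so the checkpoints can presumably be fetched from the Hub with huggingface_hub. The sketch below shows only the generic download call; the repository id is a placeholder assumption, since the card does not state it.

```python
from huggingface_hub import snapshot_download

# "MIC-DKFZ/VoxTell" is a guessed placeholder repo id; replace it with the
# actual Hub repository this model card belongs to.
checkpoint_dir = snapshot_download(repo_id="MIC-DKFZ/VoxTell")
print(f"VoxTell checkpoint files downloaded to: {checkpoint_dir}")
```

Running inference would then go through the VoxTell code base referenced by the asset URLs above (MIC-DKFZ/VoxTell on GitHub); this card does not document its Python API, so none is shown here.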