mrokuss committed on
Commit ebc0d63 · verified · 1 Parent(s): ac6e562

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +30 -17
README.md CHANGED
@@ -25,7 +25,7 @@ tags:
25
 
26
  </div>
27
 
28
- <img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellLogo.png" alt="VoxTell Logo" width="600"/>
29
 
30
  ## Model Description
31
 
@@ -40,6 +40,30 @@ The model is designed for both anatomical and pathological structures across mul
40
  - **Comprehensive anatomy coverage**: Brain, thorax, abdomen, pelvis, musculoskeletal system, and extremities
41
  - **Flexible granularity**: From coarse anatomical labels to fine-grained pathological findings
42
 
43
  ## Architecture
44
 
45
  VoxTell employs a multi-stage vision-language fusion approach:
@@ -49,7 +73,7 @@ VoxTell employs a multi-stage vision-language fusion approach:
49
  - **Prompt Decoder**: Transforms text queries and image latents into multi-scale text features
50
  - **Image Decoder**: Fuses visual and textual information at multiple resolutions using MaskFormer-style query-image fusion with deep supervision
51
 
52
- <img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellArchitecture.png" alt="Architecture Diagram" width="600"/>
53
 
54
  ## Intended Use
55
 
@@ -66,27 +90,16 @@ VoxTell employs a multi-stage vision-language fusion approach:
66
  - Real-time emergency medical decision-making
67
  - Standalone clinical decision support without human oversight
68
 
69
- ## Training Data
70
-
71
- This checkpoint of VoxTell is trained on an **extended version** of the dataset described in the paper:
72
-
73
- - **190 public 3D medical imaging datasets**.
74
- - Approximately **68,500 volumetric images**.
75
- - Brain, head & neck, thorax, abdomen, pelvis, and extremities.
76
- - Major organs, muscles, vasculature, substructures, pathologies and lesions
77
- - Multiple imaging modalities (CT, PET, MRI)
78
-
79
- While the paper reports training on a subset of datasets with dedicated train/test splits, this checkpoint is trained on **all available datasets (train + test) used in the paper**. During training, the corpus of semantic datasets is sampled with a probability of 95%, while the image-text-mask triplets from the instance-focussed dataset are sampled with the remaining 5%. For more information about semantic and instance based datasets see the [paper](https://arxiv.org/abs/2511.11450).
80
-
81
- <img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellConcepts.png" alt="Concept Coverage" width="600"/>
82
-
83
  ## Performance
84
 
85
  VoxTell achieves state-of-the-art performance on anatomical and pathological segmentation tasks across multiple medical imaging benchmarks. Detailed performance metrics and comparisons are available in the [paper](https://arxiv.org/abs/2511.11450).
86
 
87
- ## Limitations
 
 
88
 
89
  - Performance may vary on imaging modalities or anatomical regions underrepresented in training data
 
90
  - Text prompt quality and specificity affects segmentation accuracy
91
  - Not validated for direct clinical use without expert review
92
 
 
25
 
26
  </div>
27
 
28
+ <img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellLogo.png" alt="VoxTell Logo"/>
29
 
30
  ## Model Description
31
 
 
40
  - **Comprehensive anatomy coverage**: Brain, thorax, abdomen, pelvis, musculoskeletal system, and extremities
41
  - **Flexible granularity**: From coarse anatomical labels to fine-grained pathological findings
42
 
43
+ ## Versions
44
+
45
+ We release multiple VoxTell versions (continuously updated) to enable both reproducible research and high-performance downstream applications.
46
+
47
+ ### **VoxTell v1.1 (Recommended)**
48
+
49
+ - **Info**: This is the current default version
50
+ - **Training Data**: Trained on **all datasets** from the paper and additional sources (190 datasets, ~68,500 volumes)
51
+ - **Split**: Includes the test sets from the paper in the training corpus
52
+ - **Sampling Strategy**:
53
+ - 95% probability: Semantic datasets corpus
54
+ - 5% probability: Image-text-mask triplets from instance-focused datasets
55
+ - **Use Case**: Recommended for general application, inference, and fine-tuning. This version maximizes supervision and concept coverage for stronger general-purpose performance
56
+
57
+ ### **VoxTell v1.0 (Deprecated)**
58
+
59
+ - **Info**: This version was used for the experiments in the paper but contains known issues that have been fixed in v1.1. It is **not recommended** for general use.
60
+ - **Training Data**: Trained on 158 datasets (~62,000 volumes)
61
+ - **Split**: Maintains strict train/test separation as described in the [paper](https://arxiv.org/abs/2511.11450)
62
+ - **Use Case**: Reproducibility of the results reported in the paper
63
+
64
+ <img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellConcepts.png" alt="Concept Coverage"/>
65
+
66
+
67
  ## Architecture
68
 
69
  VoxTell employs a multi-stage vision-language fusion approach:
 
73
  - **Prompt Decoder**: Transforms text queries and image latents into multi-scale text features
74
  - **Image Decoder**: Fuses visual and textual information at multiple resolutions using MaskFormer-style query-image fusion with deep supervision
75
 
76
+ <img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellArchitecture.png" alt="Architecture Diagram"/>
77
 
78
  ## Intended Use
79
 
 
90
  - Real-time emergency medical decision-making
91
  - Standalone clinical decision support without human oversight
92
 
93
  ## Performance
94
 
95
  VoxTell achieves state-of-the-art performance on anatomical and pathological segmentation tasks across multiple medical imaging benchmarks. Detailed performance metrics and comparisons are available in the [paper](https://arxiv.org/abs/2511.11450).
96
 
97
+
98
+
99
+ ## Limitations / Known Issues
100
 
101
  - Performance may vary on imaging modalities or anatomical regions underrepresented in training data
102
+ - Prompting structures absent from the image and never seen on this modality (e.g., "liver" in a brain MRI) may lead to undesired results
103
  - Text prompt quality and specificity affects segmentation accuracy
104
  - Not validated for direct clinical use without expert review
105
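
The v1.1 sampling strategy listed under "Versions" (95% semantic datasets corpus, 5% image-text-mask triplets from instance-focused datasets) can be illustrated with a minimal sketch. This is not the VoxTell training code: the function name `sample_source`, the labels `"semantic"` and `"instance"`, and the fixed seed are assumptions made for illustration only. It shows just the per-example choice of corpus, not how an example is then drawn from within that corpus.

```python
import random


def sample_source(rng: random.Random, p_semantic: float = 0.95) -> str:
    """Pick the corpus for the next training example (hypothetical helper).

    Returns "semantic" with probability p_semantic, otherwise "instance".
    """
    return "semantic" if rng.random() < p_semantic else "instance"


# Fixed seed so the sketch is reproducible; the seed value is arbitrary.
rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]

# Empirical share of draws taken from the semantic corpus (close to 0.95).
semantic_fraction = draws.count("semantic") / len(draws)
print(f"semantic fraction: {semantic_fraction:.3f}")
```

Choosing the corpus per example, rather than concatenating both corpora, keeps the 95/5 ratio independent of how many volumes each corpus contains.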