---
language:
- en
- zh
license: apache-2.0
metrics:
- accuracy
pipeline_tag: any-to-any
library_name: transformers
tags:
- medical
- vision-language
- multimodal
- unified-model
- medical-vqa
- text-to-image
- image-to-text
- medical-understanding
- report-generation
- interleaved-multimodal
- modality-transfer
---

<p align="center">
  <img src="./images/logo.png" width="50px" height="50px"/>
</p>

<div align="center">
  <a href="https://uni-medical.github.io/UniMedVL_Web/" target="_blank">
    <img alt="Project Page" src="https://img.shields.io/badge/🌐_Project-Page-blue" />
  </a>
  <a href="https://huggingface.co/uni-medical/UniMedVL" target="_blank">
    <img alt="Hugging Face Model" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-UniMedVL-ffc107?color=ffc107&logoColor=white" />
  </a>
  <a href="https://huggingface.co/datasets/General-Medical-AI/UniMed-5M" target="_blank">
    <img alt="Hugging Face Dataset" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-ffc107?color=ffc107&logoColor=white" />
  </a>
</div>

<div align="center">
  <a href="https://github.com/uni-medical/UniMedVL" target="_blank">
    <img alt="GitHub Stars" src="https://img.shields.io/github/stars/uni-medical/UniMedVL?style=social" />
  </a>
  <a href="https://arxiv.org/abs/2510.15710" target="_blank">
    <img alt="arXiv" src="https://img.shields.io/badge/arXiv-2510.15710-b31b1b.svg" />
  </a>
</div>

<p align="center">
  <a href="https://github.com/uni-medical/UniMedVL"><b>🌟 GitHub</b></a> |
  <a href="https://huggingface.co/General-Medical-AI/UniMedVL/tree/main"><b>📥 Model Download</b></a> |
  <a href="https://huggingface.co/datasets/General-Medical-AI/UniMed-5M"><b>📚 Dataset</b></a> |
  <a href="https://arxiv.org/pdf/2510.15710"><b>📄 Paper</b></a> |
  <a href="https://uni-medical.github.io/UniMedVL_Web/"><b>🌐 Project Page</b></a>
</p>

<h1>
<p align="center">
  UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis
</p>
</h1>

<p align="center">
A unified medical foundation model enabling both understanding and generation capabilities within a single architecture.
</p>

<div align="center">
  <img src="./assets/teaser.png" width="95%"/>
</div>

## Paper Abstract

Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.

## 📚 Introduction

We introduce **UniMedVL**, a unified medical foundation model for seamless multimodal understanding and generation. Four key innovations distinguish UniMedVL:

- **Unified Observation-Knowledge-Analysis Architecture:** UniMedVL sets itself apart from prior medical AI models by following a clinically inspired three-level framework that mirrors how physicians process medical information, enabling both understanding and generation within a single architecture.

- **Versatile Medical Multimodal Capabilities:** UniMedVL supports a broad spectrum of medical tasks, including visual question answering, medical report generation, text-to-medical-image synthesis, cross-modal translation, and virtual staining across eight imaging modalities.

- **Large-Scale Medical Dataset:** We present UniMed-5M, a comprehensive medical multimodal dataset containing 5.6M+ high-quality samples with three-stage quality verification and expert validation, covering understanding, generation, and interleaved tasks.

- **Superior Performance:** UniMedVL achieves state-of-the-art performance on multiple evaluation datasets, with 75.4% accuracy on SLAKE VQA, 53.5% on PathVQA, and competitive generation quality (96.29 average gFID), setting a new standard in unified medical AI.

<div align="center">
  <img src="images/overview_ver3.png" alt="UniMedVL Architecture" width="100%">
</div>

## 🔬 Methodology

### 📋 OKA Framework: Observation-Knowledge-Analysis

UniMedVL follows a workflow-guided three-level framework that mirrors how physicians process medical information:

```mermaid
flowchart TD
    A[Observation Level] --> B[Knowledge Level] --> C[Analysis Level]

    A1[UniMed-5M Dataset<br/>5.6M samples<br/>8 imaging modalities] --> A
    A --> A2[Quality Control<br/>Three-stage verification<br/>Expert validation]

    B1[Progressive Curriculum<br/>Foundation → Instruction → Unified] --> B
    B --> B2[Cross-modal Knowledge Fusion<br/>Understanding ↔ Generation]

    C1[Unified Architecture<br/>Dual encoders + MOT] --> C
    C --> C2[Multimodal Outputs<br/>Reports + Images + Annotations]
```

### 🎯 Training Strategy

**Three-Stage Progressive Curriculum Learning** (a configuration sketch follows the list):

1. **🔧 Stage 1 - Foundation Training** (85K steps)
   - Basic medical pattern recognition
   - Visual-language alignment
   - Data ratio: 75% I2T, 25% T2I

2. **📚 Stage 2 - Instruction Tuning** (120K steps)
   - Cross-modal understanding enhancement
   - Medical expertise development
   - Data ratio: 40% I2T, 45% T2I, 10% Interleaved

3. **🚀 Stage 3 - Unified Training** (70K steps)
   - Advanced multimodal synthesis
   - Interleaved task mastery
   - Data ratio: 37% I2T, 35% T2I, 25% Interleaved
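
The following minimal sketch shows one way the stage schedule above could be encoded. The `Stage` dataclass and `sample_task` helper are hypothetical illustrations built only from the numbers on this card, not the released training code; note that the listed ratios for Stages 2 and 3 do not sum to 100%, so the sampler simply normalizes them.

```python
# Hypothetical encoding of the three-stage curriculum above; step counts and
# task ratios are taken from this card, everything else is illustrative.
import random
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    steps: int
    ratios: dict  # task type -> sampling weight (normalized by random.choices)

CURRICULUM = [
    Stage("foundation",  85_000,  {"i2t": 0.75, "t2i": 0.25}),
    Stage("instruction", 120_000, {"i2t": 0.40, "t2i": 0.45, "interleaved": 0.10}),
    Stage("unified",     70_000,  {"i2t": 0.37, "t2i": 0.35, "interleaved": 0.25}),
]

def sample_task(stage: Stage) -> str:
    """Draw a task type according to the stage's data-mixture ratios."""
    tasks, weights = zip(*stage.ratios.items())
    return random.choices(tasks, weights=weights, k=1)[0]

print(sample_task(CURRICULUM[1]))  # e.g. "t2i"
```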

---

## Model Details

### Model Description

UniMedVL unifies medical multimodal understanding and generation within a single 14B-parameter architecture. The model supports visual question answering, medical report generation, text-to-medical-image synthesis, cross-modal translation, and virtual staining across eight imaging modalities (CXR, CT, MRI, Ultrasound, Histopathology, Retinal Fundus, OCT, and Endoscopy).

- **License:** Apache License 2.0
- **Model Size:** 14B parameters

### Model Sources

- **Repository:** https://github.com/uni-medical/UniMedVL
- **Project Page:** https://uni-medical.github.io/UniMedVL_Web/
- **Paper:** https://arxiv.org/abs/2510.15710

## Uses

### Direct Use

The model can be directly used for:
- **Medical Visual Question Answering**: Answer clinical questions about medical images
- **Medical Report Generation**: Generate radiology reports from medical images
- **Text-to-Medical-Image Synthesis**: Generate medical images from textual descriptions
- **Cross-Modal Translation**: Convert between different medical imaging modalities
- **Virtual Staining**: Transform H&E images to IHC staining

### Out-of-Scope Use (Caution!)

- **Clinical Decision Making**: This model is for research purposes only and should NOT be used for actual clinical diagnosis or treatment decisions

## 💬 Qualitative Results

Here we present visualization results demonstrating UniMedVL's capabilities. **For additional visualization results and comparisons, please see our [Project Page](https://uni-medical.github.io/UniMedVL_Web/).**

<details open>
  <summary>Performance Across Training Stages</summary>
  <div align="center">
    <img src="images/topline_performance.png" alt="Performance Comparison" width="100%">
    <p><em>Comprehensive performance comparison across training stages and modalities</em></p>
  </div>
</details>

<details open>
  <summary>Multimodal Tasks Demonstration</summary>
  <div align="center">
    <img src="images/fig_results_ver2.png" alt="Multimodal Task Results" width="100%">
    <p><em>Comprehensive visualization of UniMedVL's multimodal capabilities across diverse medical tasks</em></p>
  </div>
</details>

<details open>
  <summary>Medical Visual Question Answering</summary>
  <div align="center">
    <img src="images/visual_question_answering.png" alt="Medical VQA Examples" width="60%">
    <p><em>Medical Visual Question Answering examples showing model's diagnostic reasoning capabilities</em></p>
  </div>
</details>

<details open>
  <summary>Medical Report Generation</summary>
  <div align="center">
    <img src="images/reportgeneration.png" alt="Medical Report Generation" width="60%">
    <p><em>Automated medical report generation examples across different imaging modalities</em></p>
  </div>
</details>

<details open>
  <summary>Text-to-Medical-Image Generation</summary>
  <div align="center">
    <img src="images/text2img1.png" alt="Text-to-Image Generation Examples 1" width="60%">
    <p><em>Text-to-medical-image generation results showing high-quality synthesis</em></p>
  </div>
  <div align="center">
    <img src="images/text2img2.png" alt="Text-to-Image Generation Examples 2" width="60%">
    <p><em>Additional text-to-medical-image generation examples across modalities</em></p>
  </div>
</details>

<details open>
  <summary>Medical Image Generation across 8 Modalities</summary>

### Chest X-Ray (CXR)
<div align="center">
  <img src="images/cxr.png" alt="Chest X-Ray" width="60%">
</div>

### Computed Tomography (CT)
<div align="center">
  <img src="images/ct.png" alt="CT Scan" width="60%">
</div>

### Magnetic Resonance Imaging (MRI)
<div align="center">
  <img src="images/mri.png" alt="MRI Scan" width="60%">
</div>

### Ultrasound
<div align="center">
  <img src="images/ultrasound.png" alt="Ultrasound" width="60%">
</div>

### Histopathology (HIS)
<div align="center">
  <img src="images/his.png" alt="Histopathology" width="60%">
</div>

### Retinal Fundus Photography (CFP)
<div align="center">
  <img src="images/retinal.png" alt="Retinal Fundus" width="60%">
</div>

### Optical Coherence Tomography (OCT)
<div align="center">
  <img src="images/oct.png" alt="OCT" width="60%">
</div>

### Endoscopy
<div align="center">
  <img src="images/endoscopy.png" alt="Endoscopy" width="60%">
</div>

</details>

## 📊 Quantitative Performance

<details open>
  <summary>Medical Visual Question Answering Performance</summary>

| Model | Params | Type | VQA-RAD | SLAKE | PathVQA | OmniMedVQA | GMAI-MMBench |
|-------|--------|------|---------|-------|---------|------------|--------------|
| GMAI-VL | 7B | Medical-specific | 66.3 | 72.9 | 39.8 | 88.5 | 61.74 |
| HuatuoGPT-Vision | 7B | Medical-specific | 53.0 | 49.1 | 32.0 | 50.0 | 50.22 |
| Bagel | 7B | Unified | 60.09 | 58.91 | 39.05 | 71.13 | 48.11 |
| HealthGPT-L14 | 14B | Unified | 58.3 | 64.5 | 44.4 | 74.4 | 43.1 |
| **UniMedVL** | **14B** | **Unified** | **61.9** | **75.4** | **53.5** | **85.8** | **60.75** |

</details>


<details open>
  <summary>Medical Image Generation Performance</summary>

*Text-to-image generation performance across 8 medical imaging modalities. Metrics: gFID ↓ (lower is better) / BioMedCLIP Score ↑ (higher is better)*

| Model | CFP | CXR | CT | HIS | MRI | OCT | Ultrasound | Endoscopy | Average |
|-------|-----|-----|----|----|-----|-----|------------|-----------|---------|
| Bagel (7B) | 217.19/0.650 | 182.80/0.662 | 163.78/0.652 | 206.18/0.643 | 175.74/0.639 | 307.80/0.719 | 255.78/0.672 | 214.61/0.668 | 215.49/0.660 |
| **UniMedVL (14B)** | **53.20/0.708** | **73.04/0.702** | **73.04/0.696** | **149.01/0.704** | **90.36/0.706** | **99.27/0.721** | **95.38/0.706** | **133.11/0.707** | **96.29/0.706** |

</details>
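
As a point of reference, here is a minimal sketch of a standard FID computation using `torchmetrics` with Inception-v3 features. The gFID above follows the paper's evaluation protocol (and the BioMedCLIP Score uses a medical CLIP model), so this only illustrates the mechanics of the metric; random tensors stand in for real image batches.

```python
# Illustrative FID computation with torchmetrics (Inception-v3 features).
# The paper's gFID protocol may differ in feature extractor and preprocessing.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # 2048-dim Inception pool features

# uint8 batches in [N, 3, H, W]; random data stands in for real/generated images.
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)   # accumulate statistics for real images
fid.update(fake_images, real=False)  # accumulate statistics for generated images
print(f"FID: {fid.compute().item():.2f}")
```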

<details open>
  <summary>Interleaved Multimodal Tasks Performance</summary>

**Virtual Immunohistochemistry Staining (H&E → IHC)**

| Method | Type | PSNR ↑ | SSIM ↑ |
|--------|------|--------|--------|
| Pyramid Pix2pix | Specialized | 21.16 | 0.477 |
| HealthGPT-M3 | Unified | 15.81 | 0.242 |
| **UniMedVL** | **Unified** | **20.27** | **0.456** |

**MRI Super-Resolution (4× upsampling)**

| Method | Type | PSNR ↑ | SSIM ↑ |
|--------|------|--------|--------|
| AMIR | Specialized | 31.99 | 0.939 |
| HealthGPT-M3 | Unified | 18.37 | 0.580 |
| **UniMedVL** | **Unified** | **27.29** | **0.890** |

**Cross-Modal Synthesis (T2 ↔ FLAIR MRI)**

| Method | Type | Average PSNR ↑ | Average SSIM ↑ |
|--------|------|----------------|----------------|
| ResViT | Specialized | 25.38 | 0.889 |
| HealthGPT-M3 | Unified | 19.09 | 0.748 |
| **UniMedVL** | **Unified** | **25.07** | **0.882** |

</details>
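
PSNR and SSIM in the tables above are standard full-reference image-quality metrics; a minimal sketch with `scikit-image` follows, assuming placeholder file names for a generated/ground-truth pair.

```python
# Sketch of PSNR/SSIM evaluation with scikit-image; file names are placeholders.
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

pred = np.asarray(Image.open("generated_ihc.png").convert("RGB"))
ref = np.asarray(Image.open("ground_truth_ihc.png").convert("RGB"))

# data_range=255 for 8-bit images; channel_axis=-1 handles the RGB channels.
psnr = peak_signal_noise_ratio(ref, pred, data_range=255)
ssim = structural_similarity(ref, pred, channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")
```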

<details open>
  <summary>Counterfactual Medical Image Generation</summary>

*Performance on counterfactual chest X-ray generation with explanatory text. † indicates unified fine-tuning variant.*

| Method | gFID ↓ | AUROC ↑ | F1 ↑ | BLEU-3 ↑ | METEOR ↑ | ROUGE-L ↑ |
|--------|--------|---------|------|----------|----------|-----------|
| ProgEmu | 29.21 | 0.792 | 0.891 | 0.124 | 0.410 | 0.261 |
| **UniMedVL†** | **27.17** | **0.797** | **0.873** | **0.264** | **0.449** | **0.465** |

</details>
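
For the text metrics above, BLEU-3 is BLEU with uniform weights over 1- to 3-grams; a minimal NLTK sketch follows, assuming toy report strings and whitespace tokenization (the paper's exact setup may differ).

```python
# Illustrative BLEU-3 (uniform weights over 1-3 grams) with NLTK.
# Toy strings and whitespace tokenization; the paper's setup may differ.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = "no focal consolidation pleural effusion or pneumothorax".split()
candidate = "no focal consolidation or pleural effusion is seen".split()

bleu3 = sentence_bleu(
    [reference],                 # list of reference token lists
    candidate,                   # hypothesis tokens
    weights=(1 / 3, 1 / 3, 1 / 3),
    smoothing_function=SmoothingFunction().method1,  # avoid zero n-gram counts
)
print(f"BLEU-3: {bleu3:.3f}")
```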

---

## 📝 Open-Source Plan

- [x] **📄 Paper & Evaluations** - Research documentation and evaluation results
- [x] **🖼️ Visualizations** - Result figures and model demonstrations
- [x] **💾 Model Checkpoints** - Pre-trained UniMedVL weights (14B parameters)
- [x] **🔧 Inference Code** - Model loading and inference examples
- [ ] **🏋️ Training Code** - Full training pipeline and configuration files
- [ ] **📁 UniMed-5M Dataset** - Training dataset with quality control

## 🚀 Getting Started

### Installation
```bash
conda env create -f codes/environment.yaml
conda activate unimedvl
```

### Inference Scripts
Two interactive inference scripts are provided in the `codes/` directory:

1. **Medical Visual Question Answering** (`interactive_vqa_inferencer.py`)

2. **Medical Image Generation** (`interactive_image_generator.py`)

### Quick Usage
1. Download the UniMedVL checkpoint (sketched below)
2. Set `model_path` and `ROOT` in the script configuration
3. Run the script: `python codes/interactive_vqa_inferencer.py` or `python codes/interactive_image_generator.py`
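
A minimal sketch of steps 1-2, assuming the Hugging Face repo id from the Model Download link above; the configuration variables ultimately live inside the inference scripts and may be organized differently there.

```python
# Sketch of the checkpoint download (step 1), assuming the repo id from the
# "Model Download" badge above; requires `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

model_path = snapshot_download(repo_id="General-Medical-AI/UniMedVL")
print("Checkpoint downloaded to:", model_path)

# Step 2: point `model_path` and `ROOT` inside
# codes/interactive_vqa_inferencer.py (or interactive_image_generator.py)
# at this directory and the repository root before running the script.
```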

---

## 📜 License

This project is licensed under the **Apache License 2.0**. See the [LICENSE](LICENSE) file for details.

---

## 📚 Citations

If you use this project in your research or work, please cite it as:

```bibtex
@misc{ning2025unimedvlunifyingmedicalmultimodal,
      title={UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis},
      author={Junzhi Ning and Wei Li and Cheng Tang and Jiashi Lin and Chenglong Ma and Chaoyang Zhang and Jiyao Liu and Ying Chen and Shujian Gao and Lihao Liu and Yuandong Pu and Huihui Xu and Chenhui Gou and Ziyan Huang and Yi Xin and Qi Qin and Zhongying Deng and Diping Song and Bin Fu and Guang Yang and Yuanfeng Ji and Tianbin Li and Yanzhou Su and Jin Ye and Shixiang Tang and Ming Hu and Junjun He},
      year={2025},
      eprint={2510.15710},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.15710},
}
```

---

## 🙏 Acknowledgments

We sincerely thank the following projects and their contributors for their invaluable open-source contributions that made this research possible:

- **[Bagel](https://github.com/ByteDance-Seed/Bagel)** - Foundation model architecture and training methodology inspiration
- **[HealthGPT](https://github.com/DCDmllm/HealthGPT)** - Medical domain adaptation and evaluation framework
- **[VLMEvalKit](https://github.com/open-compass/VLMEvalKit)** - Comprehensive evaluation toolkit for vision-language models