| --- |
| license: apache-2.0 |
| language: |
| - en |
| - zh |
| base_model: |
| - Qwen/Qwen3-VL-4B-Instruct |
| pipeline_tag: image-text-to-text |
| tags: |
| - medical |
| - multimodal |
| - MLLM |
| --- |
| |
| <h2 align="center"><b>🩺 HealthGPT-Pro: A High-Performance Multimodal Large Language Model for Medical Understanding and Analysis</b></h2> |
|
|
| <p align="center"> |
| <a href="https://huggingface.co/lintw/HealthGPT-Pro-4B" target="_blank">🤗 HealthGPT-Pro-4B</a> |
| |
| <a href="https://huggingface.co/lintw/HealthGPT-Pro-8B" target="_blank">🤗 HealthGPT-Pro-8B</a> |
| |
| <a href="https://modelscope.cn/models/TianweiLin/HealthGPT-Pro-4B" target="_blank">🤖 HealthGPT-Pro-4B (ModelScope)</a> |
| |
| <a href="https://modelscope.cn/models/TianweiLin/HealthGPT-Pro-8B" target="_blank">🤖 HealthGPT-Pro-8B (ModelScope)</a> |
| |
| <a href="https://lin-tianwei.github.io/healthgpt-pro.github.io/" target="_blank">🚀 Project</a> |
| </p> |
|
|
|
|
| ## 🧭 0. Overview |
|
|
| **HealthGPT-Pro** is a state-of-the-art (SoTA) medical multimodal large language model (Med-MLLM) built on Qwen3-VL. It is designed to understand and analyze **medical text**, **2D medical images**, and **3D medical volumes**, and it performs strongly across a broad range of text-only and vision-language medical tasks. |
|
|
| **✨ Core features:** |
|
|
| - **Multimodal input support:** HealthGPT-Pro can process text, 2D images, and 3D volumetric data. |
| - **Efficient training:** HealthGPT-Pro reaches SoTA performance with a **two-stage training recipe**, using **3M** samples for alignment and **10M** samples for supervised fine-tuning (SFT). |
| - **Instruction following:** Unlike many Med-MLLMs tuned only on medical-domain data, HealthGPT-Pro keeps a substantial proportion of general-domain data in its training mix to preserve instruction-following ability. |
| - **Comprehensive modality coverage:** **(1)** Computed Tomography **(2)** Digital Photography **(3)** Fundus Photography **(4)** Infrared Reflectance Imaging **(5)** Magnetic Resonance Imaging **(6)** Optical Coherence Tomography **(7)** Dermoscopy **(8)** Endoscopy **(9)** Microscopy **(10)** X-ray Imaging **(11)** Ultrasound Imaging **(12)** Histopathology **(13)** Colposcopy **(14)** Text. |
| - **Comprehensive task coverage:** HealthGPT-Pro is trained on a diverse mix of medical and general tasks. |
|
|
| This model is intended for research use. It should not be used as a substitute for professional clinical judgment, diagnosis, or treatment. |
|
|
| ## 📊 1. Performance Comparison |
|
|
| ### 📝 Medical Text Benchmarks |
|
|
| | **Model** | **MMLU-Med** | **MMLU-Pro-Med** | **MMedBench** | **MedBullets** | **MedMCQA** | **MedQA** | **MedXpertQA-Text** | **PubMedQA** | **SuperGPQA-Medical** | **Avg.** | |
| |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| |
| | Qwen3-VL-4B | 74.3 | 50.7 | 60.5 | 46.4 | 56.0 | 60.5 | 12.6 | 75.6 | 29.6 | 51.8 | |
| | Qwen3-VL-8B | 79.8 | 57.4 | 65.9 | 51.3 | 61.1 | 65.9 | 12.8 | 76.2 | 30.2 | 55.6 | |
| | Lingshu-7B | 75.8 | 53.5 | 64.5 | 57.8 | 56.6 | 64.4 | 16.9 | 76.8 | 29.9 | 55.1 | |
| | HealthGPT-14B | 80.2 | <u>63.4</u> | 63.2 | 39.8 | 63.4 | 66.2 | 11.3 | 68.0 | 25.7 | 53.5 | |
| | HuatuoGPT-V-34B | 74.7 | 51.8 | 60.7 | 42.7 | 54.7 | 58.8 | 11.4 | 54.7 | 26.5 | 48.4 | |
| | Hulu-Med-4B | 78.6 | 58.6 | 66.7 | 59.4 | 64.8 | <u>71.9</u> | 16.8 | 77.6 | 29.5 | 58.2 | |
| | Hulu-Med-7B | 79.5 | 60.6 | **72.8** | **61.5** | <u>67.6</u> | **73.5** | **19.6** | 77.4 | 31.1 | <u>60.4</u> | |
| | **HealthGPT-Pro-4B** | <u>80.4</u> | 58.4 | <u>71.6</u> | 58.0 | 64.4 | 71.5 | 16.2 | <u>78.4</u> | <u>31.4</u> | 58.9 | |
| | **HealthGPT-Pro-8B** | **83.1** | **64.1** | 71.4 | <u>60.6</u> | **68.5** | 71.3 | <u>18.3</u> | **79.2** | **35.4** | **61.3** | |
|
|
| ### 🖼️ Medical Multimodal Benchmarks |
|
|
| | **Model** | **MMMU-Med** | **VQA-RAD** | **SLAKE** | **PathVQA** | **MedXpertQA-Multimodal** | **MedFrameQA** | **OmniMedVQA-Mini** | **PMC-VQA** | **M3D-MCQ** | **CT-RATE-MCQ** | **AMOS-MM-MCQ** | **Avg.** | |
| |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| |
| | Qwen3-VL-4B | 44.3 | 59.9 | 77.0 | 53.0 | 13.4 | 40.6 | 74.7 | 53.0 | 57.2 | 58.8 | 49.2 | 52.8 | |
| | Qwen3-VL-8B | 46.5 | 63.4 | 80.2 | 58.3 | 18.7 | 46.4 | 73.0 | 55.6 | 59.5 | 61.6 | 51.2 | 55.9 | |
| | Lingshu-7B | 47.3 | 66.7 | 81.9 | 61.0 | <u>25.5</u> | 52.6 | **82.4** | 57.2 | 64.1 | 68.3 | 62.7 | 60.9 | |
| | HealthGPT-14B | 45.5 | 62.6 | 64.2 | 56.0 | 24.1 | 45.3 | 70.2 | 56.4 | 55.2 | 57.3 | 46.5 | 53.0 | |
| | HuatuoGPT-V-34B | 50.1 | 60.3 | 68.3 | 47.7 | 21.5 | 49.6 | 69.7 | 56.6 | 50.1 | 54.9 | 48.7 | 52.5 | |
| | Hulu-Med-4B | 45.8 | 72.6 | 81.7 | 59.7 | 24.6 | 54.2 | 75.1 | 53.1 | 76.0 | 70.1 | 69.1 | 62.0 | |
| | Hulu-Med-7B | 50.5 | <u>77.2</u> | **85.8** | 64.2 | **28.3** | 57.4 | 77.7 | 57.3 | 80.4 | 76.2 | 70.5 | 66.0 | |
| | **HealthGPT-Pro-4B** | <u>52.0</u> | 76.6 | 83.9 | <u>66.7</u> | 20.8 | <u>61.4</u> | 78.2 | <u>60.0</u> | <u>81.0</u> | **86.2** | <u>71.1</u> | <u>67.1</u> | |
| | **HealthGPT-Pro-8B** | **54.7** | **78.4** | <u>85.0</u> | **70.7** | 25.3 | **63.6** | <u>80.2</u> | **61.1** | **81.6** | <u>86.0</u> | **72.2** | **69.0** | |
|
|
| ## ⚙️ 2. Environment Setup |
|
|
| The recommended environment can be set up as follows: |
|
|
| ```bash |
| # Create and activate a clean Python 3.12 environment |
| conda create -n healthgpt-pro python=3.12 -y |
| conda activate healthgpt-pro |
| |
| # Install PyTorch with CUDA support |
| # If your CUDA version is lower than 12.8, install a matching PyTorch build instead (e.g., cu126); check the PyTorch install page for the builds published for this release. |
| pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128 |
| |
| # Install FlashAttention for faster attention |
| pip install flash-attn==2.8.3 --no-build-isolation --upgrade |
| |
| # Install other dependencies |
| pip install transformers==4.57.1 accelerate==1.11.0 deepspeed==0.16.9 numpy==1.26.4 peft==0.17.1 |
| pip install qwen-vl-utils pillow |
| ``` |
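After installation, it can help to confirm that the pinned versions were actually picked up. The following is a minimal sketch that uses only the standard library. The `PINNED` mapping and the `check_versions` helper are illustrative names, not part of the repository, and the version strings are copied from the install commands above.

```python
from importlib import metadata

# Versions copied from the install commands above; the mapping name is illustrative.
PINNED = {
    "torch": "2.8.0",
    "torchvision": "0.23.0",
    "flash-attn": "2.8.3",
    "transformers": "4.57.1",
    "accelerate": "1.11.0",
    "numpy": "1.26.4",
}


def check_versions(pinned=PINNED):
    """Return {package: (expected, installed_or_None, matches)} for each pinned package."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "2.8.0+cu128" still count as a match.
        matches = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, matches)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:<14} expected {expected:<8} installed {installed}  [{status}]")
```

A mismatch is not necessarily fatal, but it is a useful first thing to rule out when the inference snippets fail to import or load.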
|
|
| ## 🚀 3. Inference |
|
|
| ### 🧩 Load Model and Processor |
|
|
| ```python |
| import numpy as np |
| import torch |
| from PIL import Image |
| from transformers import AutoProcessor, Qwen3VLForConditionalGeneration |
| from qwen_vl_utils import process_vision_info |
| |
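| model_id = "lintw/HealthGPT-Pro-4B"  # Hub repo ID from the links above; a local checkpoint path also works |
| |
| model = Qwen3VLForConditionalGeneration.from_pretrained( |
|     model_id, |
|     dtype=torch.bfloat16,  # load weights in bfloat16 |
|     attn_implementation="flash_attention_2",  # requires the flash-attn package installed above |
|     device_map="auto",  # let accelerate place the weights on the available device(s) |
| ) |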
| processor = AutoProcessor.from_pretrained(model_id) |
| ``` |
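If `flash-attn` cannot be installed on a given machine, one option is to fall back to PyTorch's built-in scaled-dot-product attention. The sketch below only chooses the string passed as `attn_implementation`. The helper name `pick_attn_implementation` is hypothetical, and whether the fallback matches the speed or numerics of the FlashAttention path has not been verified here.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn module is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # PyTorch scaled-dot-product attention


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The result can then be passed as `attn_implementation=pick_attn_implementation()` in the `from_pretrained` call.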
|
|
| ### 💬 Text-Only Inference |
|
|
| ```python |
| messages = [ |
|     { |
|         "role": "user", |
|         "content": [ |
|             {"type": "text", "text": "Explain the key symptoms and common risk factors of pneumonia."}, |
|         ], |
|     } |
| ] |
| |
| inputs = processor.apply_chat_template( |
|     messages, |
|     tokenize=True, |
|     add_generation_prompt=True, |
|     return_dict=True, |
|     return_tensors="pt", |
| ).to(model.device) |
| |
| generated_ids = model.generate(**inputs, max_new_tokens=256) |
| generated_ids_trimmed = [ |
|     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
| ] |
| output_text = processor.batch_decode( |
|     generated_ids_trimmed, |
|     skip_special_tokens=True, |
|     clean_up_tokenization_spaces=False, |
| ) |
| print(output_text[0]) |
| ``` |
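`model.generate` returns each prompt followed by the newly generated tokens, which is why the snippet above slices every output by the length of its input before decoding. The toy example below illustrates that slicing with plain Python lists. The token IDs are made up, and `trim_prompt` is an illustrative name.

```python
def trim_prompt(input_ids, generated_ids):
    """Strip the prompt prefix from each generated sequence (same slicing as above)."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: each output starts with a copy of its prompt.
input_ids = [[101, 7, 8], [101, 9, 3]]
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 3, 55, 56]]

print(trim_prompt(input_ids, generated_ids))  # [[42, 43], [55, 56]]
```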
|
|
| ### 🩻 Single-Image Inference |
|
|
| ```python |
| messages = [ |
|     { |
|         "role": "user", |
|         "content": [ |
|             {"type": "image", "image": "examples/chest_xray.png"}, |
|             {"type": "text", "text": "Describe the main radiological findings in this image."}, |
|         ], |
|     } |
| ] |
| |
| inputs = processor.apply_chat_template( |
|     messages, |
|     tokenize=True, |
|     add_generation_prompt=True, |
|     return_dict=True, |
|     return_tensors="pt", |
| ).to(model.device) |
| |
| generated_ids = model.generate(**inputs, max_new_tokens=256) |
| generated_ids_trimmed = [ |
|     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
| ] |
| output_text = processor.batch_decode( |
|     generated_ids_trimmed, |
|     skip_special_tokens=True, |
|     clean_up_tokenization_spaces=False, |
| ) |
| print(output_text[0]) |
| ``` |
|
|
| ### 🖼️ Multi-Image Inference |
|
|
| ```python |
| messages = [ |
|     { |
|         "role": "user", |
|         "content": [ |
|             {"type": "image", "image": "examples/image_1.png"}, |
|             {"type": "image", "image": "examples/image_2.png"}, |
|             {"type": "text", "text": "Compare these two medical images and summarize the key differences."}, |
|         ], |
|     } |
| ] |
| |
| inputs = processor.apply_chat_template( |
|     messages, |
|     tokenize=True, |
|     add_generation_prompt=True, |
|     return_dict=True, |
|     return_tensors="pt", |
| ).to(model.device) |
| |
| generated_ids = model.generate(**inputs, max_new_tokens=256) |
| generated_ids_trimmed = [ |
|     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
| ] |
| output_text = processor.batch_decode( |
|     generated_ids_trimmed, |
|     skip_special_tokens=True, |
|     clean_up_tokenization_spaces=False, |
| ) |
| print(output_text[0]) |
| ``` |
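For more than two images, the `content` list can be built programmatically instead of written out by hand. The helper below is a sketch. `build_image_messages` is an illustrative name, and the file paths are placeholders.

```python
def build_image_messages(image_paths, prompt):
    """Build a single-turn chat message with one image entry per path, then the text prompt."""
    content = [{"type": "image", "image": path} for path in image_paths]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_image_messages(
    ["examples/image_1.png", "examples/image_2.png", "examples/image_3.png"],  # placeholder paths
    "Compare these medical images and summarize the key differences.",
)
print(len(messages[0]["content"]))  # 4 entries: three images and one text prompt
```

The returned `messages` has the same structure as the hand-written example, so it can be passed to `processor.apply_chat_template` unchanged.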
|
|
| ### 🧠 3D Volume Inference |
|
|
| The repository's inference path converts a 3D `.npy` volume into a sequence of 2D slices and passes them to the model as a video-style input. |
|
|
| ```python |
| def ct_to_video(ct_path: str): |
|     """Load a 3D volume from a .npy file and return 10 evenly spaced slices as RGB frames.""" |
|     # The scaling below implies intensities already normalized to [0, 1], with slices along axis 0 |
|     ct_pixels = np.load(ct_path) |
|     ct_u8 = np.clip(ct_pixels * 255, 0, 255).astype(np.uint8) |
| |
|     frames = [] |
|     # Skip the first and last slice, then sample 10 evenly spaced indices |
|     idx = np.linspace(1, len(ct_u8) - 2, 10, dtype=int) |
|     for i in idx: |
|         rgb = np.stack([ct_u8[i]] * 3, axis=-1)  # replicate the grayscale slice into 3 channels |
|         frames.append(Image.fromarray(rgb))  # uint8 (H, W, 3) arrays are read as RGB |
|     return frames |
| |
| volume_frames = ct_to_video("examples/ct_volume.npy") |
| messages = [ |
|     { |
|         "role": "user", |
|         "content": [ |
|             {"type": "video", "video": volume_frames, "sample_fps": 2.0}, |
|             {"type": "text", "text": "Analyze this CT volume and summarize the main findings."}, |
|         ], |
|     } |
| ] |
| |
| text = processor.apply_chat_template( |
|     messages, |
|     tokenize=False, |
|     add_generation_prompt=True, |
| ) |
| images, videos, video_kwargs = process_vision_info( |
|     messages, |
|     image_patch_size=16, |
|     return_video_kwargs=True, |
|     return_video_metadata=True, |
| ) |
| # Each video entry is a (frames, metadata) pair; split them for the processor |
| if videos is not None: |
|     videos, video_metadatas = zip(*videos) |
|     videos, video_metadatas = list(videos), list(video_metadatas) |
| else: |
|     video_metadatas = None |
| |
| inputs = processor( |
|     text=text, |
|     images=images, |
|     videos=videos, |
|     video_metadata=video_metadatas, |
|     return_tensors="pt", |
|     do_resize=False, |
|     **video_kwargs, |
| ).to(model.device) |
| |
| generated_ids = model.generate(**inputs, max_new_tokens=256) |
| generated_ids_trimmed = [ |
|     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
| ] |
| output_text = processor.batch_decode( |
|     generated_ids_trimmed, |
|     skip_special_tokens=True, |
|     clean_up_tokenization_spaces=False, |
| ) |
| print(output_text[0]) |
| ``` |
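The slice selection inside `ct_to_video` can be previewed without loading a volume. The pure-Python sketch below is intended to mirror `np.linspace(1, depth - 2, num_frames, dtype=int)`. The function name `sample_slice_indices` is illustrative, and the depth of 64 is just an example value.

```python
def sample_slice_indices(depth: int, num_frames: int = 10):
    """Evenly spaced slice indices between 1 and depth - 2, truncated to integers."""
    if depth < 3:
        raise ValueError("need at least 3 slices to skip the first and last one")
    if num_frames == 1:
        return [1]
    start, stop = 1, depth - 2
    step = (stop - start) / (num_frames - 1)
    indices = [int(start + i * step) for i in range(num_frames - 1)]
    indices.append(stop)  # the last index is exactly the second-to-last slice
    return indices


print(sample_slice_indices(64))  # [1, 7, 14, 21, 28, 34, 41, 48, 55, 62]
```

With the default of 10 frames, volumes with fewer than 12 slices produce repeated indices, so very shallow volumes will contain duplicate frames.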
|
|
| ## 📚 4. Citation |
|
|
| If you find this model useful for your research, please cite: |
|
|
| ```bibtex |
| @misc{lin2025healthgptmedicallargevisionlanguage, |
| title={HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation}, |
| author={Tianwei Lin and Wenqiao Zhang and Sijing Li and Yuqian Yuan and Binhe Yu and Haoyuan Li and Wanggui He and Hao Jiang and Mengze Li and Xiaohui Song and Siliang Tang and Jun Xiao and Hui Lin and Yueting Zhuang and Beng Chin Ooi}, |
| year={2025}, |
| eprint={2502.09838}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CV}, |
| url={https://arxiv.org/abs/2502.09838}, |
| } |
| ``` |
|
|