---
license: gemma
datasets:
- General-Medical-AI/GMAI-Reasoning10K
base_model:
- google/gemma-3-4b-it
---

# IntrinSight: A Large Vision-Language Model for Medical Insight

**IntrinSight** is a cutting-edge **Large Vision-Language Model (LVLM)**, fine-tuned for advanced reasoning and analysis in the medical domain. It is designed to act as a "wisdom mirror," capable of directly interpreting **medical images (such as X-rays, CT scans, and MRIs)** and synthesizing this visual information with associated textual data (such as clinical notes or questions) to help healthcare professionals make more precise judgments.

Unlike traditional language models that only process text, IntrinSight can "see." It grounds its reasoning in visual evidence, making it a powerful tool for tasks such as anomaly detection in scans, image-based diagnosis assistance, and generating descriptive reports from visual data.

## Model Overview

**Base Model**: [Gemma-3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)

**Training Dataset**: **[GMAI-Reasoning10K](https://huggingface.co/datasets/General-Medical-AI/GMAI-Reasoning10K)**, a high-quality medical image reasoning dataset containing 10,000 carefully selected samples. The data was collected from 95 medical datasets from reliable sources such as Kaggle, GrandChallenge, and Open-Release, covering 12 imaging modalities including X-ray, CT, and MRI. Data preprocessing followed the standardization methods of SAMed-20M: individual slices were extracted from 3D data (CT/MRI) with pixel values normalized to the 0-255 range, while key frames were extracted from video data. For each sample, key metadata was used with GPT to construct an informative multiple-choice question with a single correct answer. Strict quality control and rejection-sampling strategies were employed to ensure the high quality and reliability of the final dataset.
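The slice-level normalization described above can be sketched as follows. This is an illustrative reimplementation of min-max rescaling to 0-255, not the released preprocessing code:

```python
import numpy as np

def normalize_slice(slice_2d: np.ndarray) -> np.ndarray:
    """Rescale a single CT/MRI slice to the 0-255 uint8 range."""
    s = slice_2d.astype(np.float64)
    lo, hi = s.min(), s.max()
    if hi == lo:  # constant slice: avoid division by zero
        return np.zeros_like(s, dtype=np.uint8)
    return ((s - lo) / (hi - lo) * 255.0).round().astype(np.uint8)

# Extract and normalize each axial slice of a (fake) 3D volume.
volume = np.random.default_rng(0).normal(size=(4, 8, 8)) * 1000
slices = [normalize_slice(volume[i]) for i in range(volume.shape[0])]
```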

**Training Framework**: [VeRL](https://github.com/volcengine/verl)

## Training Process

The model was trained for **3 epochs** using the **[Dr. GRPO](https://arxiv.org/abs/2503.20783)** algorithm, a bias-corrected variant of Group Relative Policy Optimization (GRPO). The core of the training was to teach the model to anchor its textual reasoning in the visual evidence from the images.
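As a sketch of how this family of algorithms turns rewards into advantages (an illustration, not the VeRL implementation): rewards for a group of rollouts of the same question are mean-centered, and Dr. GRPO drops the per-group standard-deviation normalization that vanilla GRPO applies.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Vanilla GRPO: mean-center and divide by the group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dr_grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Dr. GRPO: mean-center only, removing the std-based difficulty bias."""
    return rewards - rewards.mean()

group = np.array([1.0, 0.0, 1.0, 0.0])  # e.g. accuracy rewards for 4 rollouts
print(dr_grpo_advantages(group))        # [ 0.5 -0.5  0.5 -0.5]
```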

We use three reward functions: **Format Reward**, **Accuracy Reward**, and **Repetition Penalty**.
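The exact reward implementations are not released with this card, but their shapes can be sketched as follows. The tag format matches the recommended system prompt; the answer-extraction pattern and the n-gram repetition penalty are assumptions:

```python
import re
from collections import Counter

def format_reward(text: str) -> float:
    """1.0 if the response follows the <think>...</think><answer>...</answer> format."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, text.strip(), re.DOTALL) else 0.0

def accuracy_reward(text: str, gold: str) -> float:
    """1.0 if the extracted answer letter matches the gold multiple-choice option."""
    m = re.search(r"<answer>\s*([A-D])", text)
    return 1.0 if m and m.group(1) == gold else 0.0

def repetition_penalty(text: str, n: int = 3) -> float:
    """Negative reward proportional to the fraction of repeated word n-grams."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    repeated = sum(c - 1 for c in Counter(grams).values() if c > 1)
    return -repeated / len(grams)

resp = "<think>Lesion in left lung.</think><answer>B</answer>"
total = format_reward(resp) + accuracy_reward(resp, "B") + repetition_penalty(resp)
```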

The entire training pipeline was constructed and managed using the **VeRL** framework, which provides a robust and efficient environment for reinforcement-learning-based model training.

## How to use

We recommend using a system prompt to enable the reasoning mode. The following prompt is one example:

```python
SYSTEM_PROMPT = (
    "A conversation between user and assistant. The user asks a question, and the assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process solves the problem step by step, so think about it sincerely. "
    "The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>."
)
```

A larger token budget may enhance the model's performance; try a larger `max_tokens`, for example, 16384.

## Disclaimer

**For Research and Assisting Purposes Only.** This model is an experimental tool developed for academic and research purposes. It is **not a medical device** and is **not intended to replace the professional judgment of a qualified healthcare provider**. Any output from IntrinSight, including its interpretation of images, should be carefully reviewed and verified by a medical professional before being used for clinical decision-making. The developers assume no liability for any actions taken based on the model's output.