MedVision-V0-7B

MedVision-V0-7B is a vision-language model (VLM) for quantitative medical image analysis. It is fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct on the MedVision dataset to perform three clinically relevant quantitative tasks end-to-end, without relying on external tools or specialist software:

  1. Detection — localization and identification of anatomical structures and abnormalities (bounding boxes).
  2. Tumor/Lesion (T/L) size estimation — bidirectional (major/minor axis) measurements.
  3. Angle/Distance (A/D) measurement — e.g. joint angles and inter-structure distances.

A distinguishing feature is that the model reasons about physical units (e.g. mm): it estimates landmark/endpoint coordinates, then converts them to real-world measurements using the pixel size and image size provided in the prompt. Its internal reasoning appears inside <think>...</think> tags, and the final answer inside <answer>...</answer> tags.


1. Base Model

| Property | Value |
|---|---|
| Backbone | Qwen/Qwen2.5-VL-7B-Instruct |
| Parameters | ~7B (8.3B including the vision encoder) |
| Modality | Image + text → text (visual question answering) |
| Frameworks | TRL (SFT), verl (RFT/GRPO) |

The base model's own license and usage terms (Qwen2.5-VL) also apply; see License & Intended Use for details.


2. Training Data

The model is trained on the MedVision dataset (v1.0.0), a large-scale, multi-anatomy, multi-modality medical imaging dataset with quantitative annotations:

  • 30.8 million image–annotation pairs aggregated from 22 public datasets.
  • Modalities: CT, MRI, X-ray, ultrasound (US), PET — restricted to modalities that carry physical spacing (pixel size) information in their file headers, which is essential for generating ground-truth real-world measurements.
  • Anatomies: abdomen, brain, heart, kidney, knee, head & neck, tooth, fetal brain, whole body, and more.
  • Annotation types: bounding boxes, T/L size (major and minor axis lengths of a fitted ellipse), and angle/distance (derived from human-annotated landmarks).
  • All measurements are in clinically relevant real-world units (e.g. mm) rather than pixels.
  • Medical volumes follow standard RAS+ orientation and support axial, coronal, and sagittal views.
  • Subject-level split: 70% train / 30% test.

Training subset used for MedVision-V0: A multi-task subset of 121K samples drawn from the MedVision training split:

| Task | Samples |
|---|---|
| Detection | 110K |
| T/L size estimation | 5.5K |
| A/D measurement | 5.5K |
| Total | 121K |

Only axial slices were used for training; coronal and sagittal slices are deliberately held out to test generalization to unseen imaging planes. Since detection accounts for the vast majority of samples, a weighted sampler ensures the model sees a balanced mix of all three tasks during training. Each training example is a 512×512 image paired with a question and expected answer.
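The balancing idea behind the weighted sampler can be sketched with inverse-frequency weights. This is a minimal illustration and not the project's sampler: the function name `task_balanced_weights`, the toy label list, and the use of `random.choices` are assumptions, and the actual implementation may weight the tasks differently.

```python
import random
from collections import Counter

def task_balanced_weights(task_labels):
    """Give each example weight 1 / (size of its task), so every task
    carries the same total sampling probability."""
    counts = Counter(task_labels)
    return [1.0 / counts[task] for task in task_labels]

# Toy label list with the same 20:1:1 task ratio as the 110K / 5.5K / 5.5K subset.
labels = ["detect"] * 220 + ["TL"] * 11 + ["AD"] * 11
weights = task_balanced_weights(labels)

rng = random.Random(0)  # fixed seed so the draw is repeatable
drawn = rng.choices(labels, weights=weights, k=3000)  # sampling with replacement
print(Counter(drawn))  # each task should appear roughly 1000 times
```

Sampling with replacement means minority-task examples are revisited many times per epoch, which is the usual trade-off of oversampling.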


3. Training Recipe

MedVision-V0 is trained in two stages: supervised fine-tuning (SFT) with step-by-step reasoning, followed by reinforcement fine-tuning (RFT) using the GRPO algorithm.

Stage 1 — Supervised Fine-Tuning (SFT) with Chain-of-Thought Reasoning

The model learns the required answer formats and reasoning patterns. Each training answer includes a step-by-step reasoning trace inside <think>...</think> followed by the final result inside <answer>...</answer>. The reasoning text is generated by inserting known correct intermediate values (e.g. landmark coordinates) into structured templates, so the model learns to first localize, then compute.

| Setting | Value |
|---|---|
| Method | Full fine-tuning (all parameters) |
| Data | 121K multi-task CoT samples (110K detect / 5.5K T/L / 5.5K A/D) |
| Image size | 512×512 |
| Epochs | 3 |
| Per-device batch size | 8 |
| Gradient accumulation | 8 |
| GPUs | 4 |
| Effective batch size | 256 |
| Precision | bf16 mixed precision (FSDP FULL_SHARD) |
| Optimizations | Flash-Attention 2, gradient checkpointing |
| Sampler | Custom weighted random sampler (oversamples minority tasks) |

Stage 2 — Reinforcement Fine-Tuning (RFT) via GRPO

The fine-tuned model is further trained with the GRPO reinforcement learning algorithm (implemented in https://github.com/YongchengYAO/verl/tree/medvision-rl). The same 121K samples are reused, but the step-by-step reasoning is removed — the model now learns by receiving scores on its outputs. Tasks are trained sequentially: A/D → T/L → Detection.

In addition to the standard GRPO format and answer scores, intermediate accuracy scores are designed for T/L and A/D tasks to reward correct intermediate steps (e.g. accurate landmark coordinates). All scores are computed as exp(-x), where x is the prediction error. The final score combines them as:

r = r_format + r_process * r_answer

This coupling means the answer score only contributes meaningfully when the intermediate localization step is also correct.
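The coupling can be illustrated with a small sketch. The function names are hypothetical, the format score is assumed to be binary (1 for well-formed output, 0 otherwise), and how the error x is measured or normalized per task is not specified here, so the inputs below are placeholder values.

```python
import math

def accuracy_score(error):
    """Map a non-negative prediction error x to a score in (0, 1] via exp(-x)."""
    return math.exp(-error)

def combined_reward(format_ok, process_error, answer_error):
    """r = r_format + r_process * r_answer."""
    r_format = 1.0 if format_ok else 0.0      # assumption: binary format score
    r_process = accuracy_score(process_error)  # e.g. landmark coordinate error
    r_answer = accuracy_score(answer_error)    # e.g. final measurement error
    return r_format + r_process * r_answer

# Perfect landmarks and perfect answer: the product term contributes fully.
print(combined_reward(True, 0.0, 0.0))
# Perfect answer but poor landmarks: the product term almost vanishes.
print(combined_reward(True, 5.0, 0.0))
```

Because the two accuracy scores are multiplied rather than added, a correct final number reached from badly placed landmarks earns little beyond the format score.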

SFT yields large gains over the base model on all three tasks; the additional RFT stage produces further consistent improvements, including on unseen imaging planes (plane generalization) and unseen anatomical targets (target generalization).


4. Usage

MedVision-V0-7B is built on Qwen2.5-VL-7B-Instruct and loads with the standard Qwen2_5_VLForConditionalGeneration / AutoProcessor API. What is specific to this model is the prompt and output format it was trained on. The sections below cover the required system prompt, task prompts, and how to read the output.

4.1 Required output format (shared by all three tasks)

Use this system prompt for every request (it is the same one used at benchmark time, via the --use_system_prompt flag):

A conversation between a User and an Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks through the reasoning process internally, then provides the User with the answer. The reasoning process and the final answer must be enclosed within <think> </think> and <answer> </answer> tags, respectively. For example: <think> reasoning process here </think> <answer> answer here </answer>. Within the <think> </think> tags, report the reasoning process for each step inside <step-k-reasoning> </step-k-reasoning> tags, followed by the intermediate results in <step-k-answer> </step-k-answer> tags. For example: <think> <step-1-reasoning> reasoning for step 1 </step-1-reasoning> <step-1-answer> intermediate result from step 1 </step-1-answer> </think>.
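For downstream inspection, the tag structure requested by this system prompt can be parsed with regular expressions. The sketch below is an illustration only: `parse_trace` is a hypothetical helper, and the sample string is hand-written to match the format, not real model output.

```python
import re

# Matches <step-k-answer> ... </step-k-answer> with the same k in both tags.
STEP_ANSWER = re.compile(r"<step-(\d+)-answer>(.*?)</step-\1-answer>", re.DOTALL)

def parse_trace(output):
    """Return ({step index: intermediate result}, final answer text or None)."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    steps = {}
    if think:
        for k, body in STEP_ANSWER.findall(think.group(1)):
            steps[int(k)] = body.strip()
    return steps, (answer.group(1).strip() if answer else None)

# Hand-written sample in the expected format (not real model output).
sample = (
    "<think> <step-1-reasoning> locate landmark 1 </step-1-reasoning> "
    "<step-1-answer> (0.40, 0.55) </step-1-answer> "
    "<step-2-reasoning> locate landmark 2 </step-2-reasoning> "
    "<step-2-answer> (0.62, 0.48) </step-2-answer> </think> "
    "<answer> 15.73 </answer>"
)
steps, final = parse_trace(sample)
print(steps)
print(final)
```

A missing or malformed tag yields an empty step dictionary or a `None` answer instead of raising, which makes it easy to count format failures.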

4.2 The three tasks — prompt and answer formats

The released model was trained and benchmarked with the chain-of-thought (CoT) prompts below. Each prompt has up to four blocks — Task: / Additional information: / Format requirement: / Reasoning steps:. The exact templates come from medvision_utils.py (doc_to_text_*_CoT) and sft_prompts.py.

Quick reference (what the <answer> tag contains for each task):

| Task | Additional information: block? | What goes inside <answer> | Example answer |
|---|---|---|---|
| Detection | no | 4 comma-separated decimals x0,y0,x1,y1: relative coordinates in [0,1], origin at the image's lower-left corner (lower-left corner first, then upper-right). No units. | <answer>0.31,0.42,0.55,0.68</answer> |
| T/L size | yes | 2 numbers: major axis, then minor axis, in real-world units. | <answer>(24.13, 11.07)</answer> |
| A/D measurement | yes | a single number (angle in degrees, or distance in mm). | <answer>3.42</answer> |

Detection (doc_to_text_BoxCoordinate_CoT) — no Additional information: block:

Task:
Given the input medical image: <image_description>, return the coordinates of the lower-left and upper-right corners of the bounding box for the <label>.
Format requirement:
The reasoning process and the final answer must be enclosed within <think> </think> and <answer> </answer> tags, respectively. For example: <think> reasoning process here </think> <answer> answer here </answer>. The answer should be four decimal numbers separated by commas without any units or additional text. The first two numbers are the coordinates of the lower-left corner and the last two numbers are the coordinates of the upper-right corner of the bounding box. Use relative coordinates in the image space, where the origin is at the lower-left corner of the image. Relative coordinates should be values between 0 and 1, representing the relative positions in the image.
Reasoning steps:
Step 1: Identify the relative coordinates of the bounding box. The relative coordinates must be written as (x, y), where x is the relative position in width and y is the relative position in height. Report the reasoning process and final answer within <think> </think> and <answer> </answer> tags, respectively. Inside <think> </think>, include reasoning and step results using <step-k-reasoning> </step-k-reasoning> and <step-k-answer> </step-k-answer> tags.
Follow the reasoning steps to get the final answer in the required format.
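Most imaging libraries place the origin at the top-left corner, so a predicted box usually needs a vertical flip before drawing or cropping. The sketch below shows one way to do this; `box_to_pixels` is a hypothetical helper, and treating the relative coordinates as continuous positions across the full image extent is an assumption.

```python
def box_to_pixels(answer, image_width, image_height):
    """Convert 'x0,y0,x1,y1' (relative, lower-left origin) into
    (left, top, right, bottom) pixel coordinates with a top-left origin."""
    x0, y0, x1, y1 = (float(v) for v in answer.split(","))
    left = x0 * image_width
    right = x1 * image_width
    top = (1.0 - y1) * image_height     # upper edge: flip the larger y
    bottom = (1.0 - y0) * image_height  # lower edge: flip the smaller y
    return left, top, right, bottom

box = box_to_pixels("0.31,0.42,0.55,0.68", 504, 504)
print([round(v, 1) for v in box])  # [156.2, 161.3, 277.2, 292.3]
```

The resulting tuple follows the (left, top, right, bottom) order that PIL's `ImageDraw.rectangle` and `Image.crop` accept.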

T/L size (doc_to_text_TumorLesionSize_CoT):

Task:
Given the input medical image: <image_description>, estimate the major and minor axis lengths of the ellipse enclosing the <label>, in <unit>.
Additional information:
The image size is <W> pixels (width) x <H> pixels (height).
The pixel size for this image is <pw> <unit> (width) x <ph> <unit> (height).
Format requirement:
The final answer must be enclosed within <answer> </answer> tags. The answer should consist of two decimal numbers separated by a comma, without units or extra text. The first number is the major axis length, and the second is the minor axis length.
Reasoning steps:
Step 1: Identify the major axis (the longest diameter) of the ellipse enclosing the target region. Find its two endpoints and record their relative coordinates in the format (x, y) = (relative position in width direction, relative position in height direction). Denote the endpoints as (x1_major, y1_major) and (x2_major, y2_major). Step 2: Identify the minor axis (the shortest diameter) of the ellipse. Find its two endpoints and record their relative coordinates in the same (x, y) format. Denote them as (x1_minor, y1_minor) and (x2_minor, y2_minor). Step 3: Given the pixel dimensions (pixel_width, pixel_height) and image size (image_width, image_height), compute the physical length of the major axis using: major_axis_length = sqrt(((x2_major - x1_major) * image_width * pixel_width)^2 + ((y2_major - y1_major) * image_height * pixel_height)^2). Step 4: Similarly, compute the physical length of the minor axis using: minor_axis_length = sqrt(((x2_minor - x1_minor) * image_width * pixel_width)^2 + ((y2_minor - y1_minor) * image_height * pixel_height)^2). Report the reasoning process and final answer within <think> </think> and <answer> </answer> tags, respectively. Inside <think> </think>, include reasoning and step results using <step-k-reasoning> </step-k-reasoning> and <step-k-answer> </step-k-answer> tags.
Follow the reasoning steps to get the final answer in the required format.
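The Step 3 and Step 4 formula from this template can be written as a short helper, which is handy for checking the model's arithmetic against the endpoints it reports in its step answers. `physical_length` is a hypothetical name and the endpoint values are made up for illustration.

```python
import math

def physical_length(p1, p2, image_size, pixel_size):
    """Physical distance between two points given in relative (x, y) coordinates.

    image_size is (image_width, image_height) in pixels;
    pixel_size is (pixel_width, pixel_height) in physical units, e.g. mm.
    """
    (x1, y1), (x2, y2) = p1, p2
    image_width, image_height = image_size
    pixel_width, pixel_height = pixel_size
    dx = (x2 - x1) * image_width * pixel_width
    dy = (y2 - y1) * image_height * pixel_height
    return math.hypot(dx, dy)  # sqrt(dx^2 + dy^2)

# Made-up endpoints on a 504x504 image with 0.7 mm isotropic pixels.
major = physical_length((0.25, 0.5), (0.75, 0.5), (504, 504), (0.7, 0.7))
minor = physical_length((0.5, 0.4), (0.5, 0.6), (504, 504), (0.7, 0.7))
print(round(major, 2), round(minor, 2))
```

The A/D distance template uses the same expression, so the helper applies to landmark distances as well.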

A/D measurement (doc_to_text_BiometricsFromLandmarks_CoT) — the Task: line and Reasoning steps: differ by metric type (distance vs. angle):

Task:
Given the input medical image: <image_description>, <task line>
Additional information:
The image size is <W> pixels (width) x <H> pixels (height).
The pixel size for this image is <pw> <unit> (width) x <ph> <unit> (height).
Format requirement:
The final answer must be enclosed within <answer> </answer> tags. The answer should be a single decimal number without units or extra text.
Reasoning steps:
<reasoning steps>
Follow the reasoning steps to get the final answer in the required format.

The <task line> and <reasoning steps> placeholders are filled in per metric type:

  • Distance

    • <task line>: estimate the distance of <name> in <unit>, which is the distance between 2 landmark points: (landmark 1) <p1>, (landmark 2) <p2>.
    • <reasoning steps>: Step 1: Identify the landmark 1 and record its relative coordinates in the format (x, y) = (relative position in width direction, relative position in height direction). Denote the coordinates as (x1, y1). Step 2: Identify the landmark 2 and record its relative coordinates in the same (x, y) format. Denote the coordinates as (x2, y2). Step 3: Given the pixel dimensions (pixel_width, pixel_height) and image size (image_width, image_height), compute the physical distance between the two landmarks using: distance = sqrt(((x2 - x1) * image_width * pixel_width)^2 + ((y2 - y1) * image_height * pixel_height)^2). Report the reasoning process and final answer within <think> </think> and <answer> </answer> tags, respectively. Inside <think> </think>, include reasoning and step results using <step-k-reasoning> </step-k-reasoning> and <step-k-answer> </step-k-answer> tags.
  • Angle

    • <task line>: estimate the angle of <name> in <unit>, which is the angle between 2 lines: (line 1) the line connecting <l1p1> and <l1p2>, (line 2) the line connecting <l2p1> and <l2p2>.
    • <reasoning steps>: Step 1: Identify line 1 and record the relative coordinates of its two endpoints in the format (x, y) = (relative position in width direction, relative position in height direction). Denote the endpoints as (x1_line1, y1_line1) and (x2_line1, y2_line1). Step 2: Identify line 2 and record the relative coordinates of its two endpoints in the same (x, y) format. Denote them as (x1_line2, y1_line2) and (x2_line2, y2_line2). Step 3: Given the pixel dimensions (pixel_width, pixel_height) and image size (image_width, image_height), compute the angle between the two lines using the formula: angle = arccos(|A · B| / (||A|| ||B||)), where A and B are the vectors of the two lines computed from the physical coordinates of their endpoints. A = ((x2_line1 - x1_line1) * image_width * pixel_width, (y2_line1 - y1_line1) * image_height * pixel_height) and B = ((x2_line2 - x1_line2) * image_width * pixel_width, (y2_line2 - y1_line2) * image_height * pixel_height). Denote A=(Ax, Ay) and B=(Bx, By). Then, angle = arccos(|Ax*Bx + Ay*By| / (sqrt(Ax^2 + Ay^2) * sqrt(Bx^2 + By^2))). Report the reasoning process and final answer within <think> </think> and <answer> </answer> tags, respectively. Inside <think> </think>, include reasoning and step results using <step-k-reasoning> </step-k-reasoning> and <step-k-answer> </step-k-answer> tags.
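The angle formula in the template can be sketched as follows. `line_angle_degrees` is a hypothetical helper; converting the arccos result to degrees is an assumption based on the quick-reference table, which lists angles in degrees. Because of the absolute value in the numerator, the result always lies between 0 and 90 degrees.

```python
import math

def line_angle_degrees(line1, line2, image_size, pixel_size):
    """Angle between two lines, each given as ((x1, y1), (x2, y2)) in
    relative coordinates, computed from their physical direction vectors."""
    image_width, image_height = image_size
    pixel_width, pixel_height = pixel_size

    def physical_vector(line):
        (x1, y1), (x2, y2) = line
        return ((x2 - x1) * image_width * pixel_width,
                (y2 - y1) * image_height * pixel_height)

    ax, ay = physical_vector(line1)
    bx, by = physical_vector(line2)
    cosine = abs(ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    cosine = min(1.0, cosine)  # guard against rounding slightly above 1
    return math.degrees(math.acos(cosine))  # assumption: answer in degrees

horizontal = ((0.2, 0.5), (0.8, 0.5))
diagonal = ((0.3, 0.3), (0.6, 0.6))
iso = line_angle_degrees(horizontal, diagonal, (504, 504), (0.7, 0.7))
aniso = line_angle_degrees(horizontal, diagonal, (504, 504), (0.7, 1.4))
print(round(iso, 2), round(aniso, 2))  # 45.0 63.43
```

The second call shows why the pixel size matters: with pixels twice as tall as they are wide, the same relative coordinates describe a steeper physical line. A zero-length line raises `ZeroDivisionError` in this sketch.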

⚠️ State the image size and pixel size as the model sees them. Qwen2.5-VL resizes images internally so that each side is a multiple of 28; for example, a 512×512 input becomes 504×504. Always provide the image size and pixel size after this resize, not the values from the original file. The simplest approach is to resize to 504×504 before inference and rescale the pixel size so that the physical field of view is unchanged (new pixel size = original pixel size × original side length / 504); §4.3 uses a 504×504 input. Detection is exempt: it uses unitless relative coordinates and needs no pixel size.
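The adjustment can be sketched as below. Both function names are hypothetical, and the sketch only models rounding each side to the nearest multiple of 28; the actual processor also applies minimum and maximum pixel budgets, which are ignored here, so check the processed size for unusually small or large images.

```python
def round_to_factor(size, factor=28):
    """Round a side length to the nearest multiple of `factor` (at least one factor)."""
    return max(factor, round(size / factor) * factor)

def resized_geometry(image_size, pixel_size, factor=28):
    """Return the (width, height) after rounding and the pixel size rescaled
    so that the physical extent of the image stays the same."""
    width, height = image_size
    pixel_width, pixel_height = pixel_size
    new_width = round_to_factor(width, factor)
    new_height = round_to_factor(height, factor)
    # Physical extent is preserved: width * pixel_width == new_width * new_pixel_width
    new_pixel_size = (pixel_width * width / new_width,
                      pixel_height * height / new_height)
    return (new_width, new_height), new_pixel_size

size, spacing = resized_geometry((512, 512), (0.7, 0.7))
print(size, [round(s, 4) for s in spacing])  # (504, 504) [0.7111, 0.7111]
```

Shrinking the image makes each pixel cover slightly more tissue, so the pixel size grows by the factor 512/504 in this example.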

4.3 Quick start (direct inference)

import re
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "YongchengYAO/MedVision-V0-7B"
SYSTEM_PROMPT = (
    "A conversation between a User and an Assistant. The User asks a question, and the Assistant solves it. "
    "The Assistant first thinks through the reasoning process internally, then provides the User with the answer. "
    "The reasoning process and the final answer must be enclosed within <think> </think> and <answer> </answer> tags, respectively. "
    "For example: <think> reasoning process here </think> <answer> answer here </answer>. "
    "Within the <think> </think> tags, report the reasoning process for each step inside <step-k-reasoning> </step-k-reasoning> tags, "
    "followed by the intermediate results in <step-k-answer> </step-k-answer> tags. "
    "For example: <think> <step-1-reasoning> reasoning for step 1 </step-1-reasoning> <step-1-answer> intermediate result from step 1 </step-1-answer> </think>."
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="bfloat16", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A T/L size example. `image` should be a 504x504 RGB PIL.Image. We use 504 (= 18 x 28)
# because Qwen2.5-VL resizes images to a multiple of 28, so a 504x504 input is processed
# as-is and the image size / pixel size stated below match exactly what the model sees.
# (A 512x512 input would instead be processed at 504x504; see the pixel-size note above.)
question = (
    "Task:\n"
    "Given the input medical image, estimate the major and minor axis lengths of the "
    "ellipse enclosing the tumor, in millimeters.\n"
    "Additional information:\n"
    "The image size is 504 pixels (width) x 504 pixels (height).\n"
    "The pixel size for this image is 0.700 millimeters (width) x 0.700 millimeters (height).\n"
    "Format requirement:\n"
    "The final answer must be enclosed within <answer> </answer> tags. "
    "The answer should consist of two decimal numbers separated by a comma, without units or extra text. "
    "The first number is the major axis length, and the second is the minor axis length.\n"
    "Reasoning steps:\n"
    "Step 1: Identify the major axis (the longest diameter) of the ellipse enclosing the target region. "
    "Find its two endpoints and record their relative coordinates in the format (x, y) = (relative position in width direction, relative position in height direction). "
    "Denote the endpoints as (x1_major, y1_major) and (x2_major, y2_major). "
    "Step 2: Identify the minor axis (the shortest diameter) of the ellipse. "
    "Find its two endpoints and record their relative coordinates in the same (x, y) format. "
    "Denote them as (x1_minor, y1_minor) and (x2_minor, y2_minor). "
    "Step 3: Given the pixel dimensions (pixel_width, pixel_height) and image size (image_width, image_height), compute the physical length of the major axis using: "
    "major_axis_length = sqrt(((x2_major - x1_major) * image_width * pixel_width)^2 + ((y2_major - y1_major) * image_height * pixel_height)^2). "
    "Step 4: Similarly, compute the physical length of the minor axis using: "
    "minor_axis_length = sqrt(((x2_minor - x1_minor) * image_width * pixel_width)^2 + ((y2_minor - y1_minor) * image_height * pixel_height)^2). "
    "Report the reasoning process and final answer within <think> </think> and <answer> </answer> tags, respectively. "
    "Inside <think> </think>, include reasoning and step results using "
    "<step-k-reasoning> </step-k-reasoning> and <step-k-answer> </step-k-answer> tags.\n"
    "Follow the reasoning steps to get the final answer in the required format."
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image", "image": image},   # 504x504 PIL.Image
        {"type": "text", "text": question},
    ]},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=4096)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
output = processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# Parse the final values from <answer>...</answer>, using the same strategy as the
# benchmark (medvision_bm.benchmark.parse_outputs -> extract_last_k_nums_within_answer_tag):
# pull every number inside the <answer> tag and keep the LAST k of them (k=2 for T/L:
# major, minor). 
EXPECTED_NUMS = 2  # T/L: major, minor. Use 1 for A/D, 4 for Detection.
m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
numbers = re.findall(r"-?\d+\.?\d*", m.group(1)) if m else []
values = [float(x) for x in numbers[-EXPECTED_NUMS:]] if len(numbers) >= EXPECTED_NUMS else None
print(output)            # full <think>…</think><answer>…</answer> trace
print("major, minor:", values)

To switch tasks, swap in the corresponding template from §4.2: the Task: line, the Format requirement:, and the Reasoning steps: block all change per task (set EXPECTED_NUMS accordingly: 4 for Detection, 1 for A/D). Detection omits the Additional information: block and returns the four bounding-box coordinates; A/D keeps the Additional information: block and returns a single number.

4.4 Reproducing the MedVision benchmark

To reproduce the benchmark, use the medvision_bm.benchmark.eval__medvision-model-rft entry point (vLLM backend, vllm_qwen25vl). Ready-to-run scripts are in script/benchmark-{AD,TL,detect}/:

| Task | Script | tasks_list JSON |
|---|---|---|
| A/D | script/benchmark-AD/eval__MedVision-V0-7B__AD.sh | tasks_MedVision-AD-CoT.json |
| T/L | script/benchmark-TL/eval__MedVision-V0-7B__TL.sh | tasks_MedVision-TL-CoT.json |
| Detection | script/benchmark-detect/eval__MedVision-V0-7B__detect.sh | tasks_MedVision-detect-CoT.json |

The scripts are identical apart from the task tag and tasks-list JSON; the shared invocation is:

export MedVision_PLANNER_VERSION='1.0.0'   # MedVision dataset v1.0.0

python -m medvision_bm.benchmark.eval__medvision-model-rft \
  --model_hf_id YongchengYAO/MedVision-V0-7B \
  --model_name MedVision-V0-7B \
  --results_dir <results_dir> \
  --data_dir <data_dir> \
  --tasks_list_json_path <tasks_list_json> \
  --task_status_json_path <status_json> \
  --batch_size_per_gpu 10 \
  --gpu_memory_utilization 0.9 \
  --sample_limit 1000 \
  --reshape_image_hw 512x512 \
  --use_system_prompt          # injects the §4.1 system prompt — required for this model

Then parse and summarize the outputs with medvision_bm.benchmark.parse_outputs and the summarize_{AD,TL,detection}_task modules. See the code repository for the full pipeline.


5. Performance

📊 Detailed benchmark results are available on the project page.


License & Intended Use

License. MedVision-V0-7B is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the model for any purpose, including commercially, provided appropriate credit is given. The base model (Qwen2.5-VL-7B-Instruct) is subject to its own license terms, which also apply.

⚠️ Not for clinical use. Current state-of-the-art VLMs are not yet capable of accurate, robust medical image detection and measurement. While MedVision-V0 substantially improves over off-the-shelf models, it remains far from the accuracy and robustness required for clinical application and must not be used to drive any medical diagnosis or clinical decision-making.

Data privacy. All source imaging datasets were publicly released in anonymized form by their respective curators. MedVision's added annotations (bounding boxes, size, and angle/distance measurements) are purely geometric descriptors and contain no subject-identifying information.


Citation

@article{yao2025medvision,
  title   = {MedVision: Benchmarking Quantitative Medical Image Analysis},
  author  = {Yao, Yongcheng and Zong, Yongshuo and Dutt, Raman and Yang, Yongxin and Tsaftaris, Sotirios A and Hospedales, Timothy},
  journal = {arXiv preprint arXiv:2511.18676},
  year    = {2025}
}