metadata
license: apache-2.0
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - medical
  - multimodal
  - vqa
  - visual-grounding
  - chain-of-thought
  - reinforcement-learning
  - grpo
  - qwen2_5_vl
language:
  - en
datasets:
  - IQuestLab/UniReason-Med-Data

UniReason-Med

UniReason-Med is a medical multimodal model that accompanies the paper "UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA".

It studies whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both modalities share a common reasoning interface. A single checkpoint processes either a 2D image or a slice-serialized 3D volume, generating interleaved textual reasoning and localized visual evidence through shared bounding-box syntax and region-token injection under a common grounded reasoning policy.

Model Description

UniReason-Med is trained to interleave free-form reasoning with localized visual evidence. During reasoning, the model emits bounding boxes over the input image; the referenced region is cropped and re-injected as additional visual context for the next reasoning step (a grounded chain-of-thought, GCoT, interface). The same shared interface is applied to 2D images and to 3D volumes serialized as ordered slice sequences, which allows grounded supervision collected on plentiful 2D data to transfer to 3D reasoning.
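To make "3D volumes serialized as ordered slice sequences" concrete, here is a minimal sketch of one way a volume could be turned into an ordered list of 2D images. It is not the released preprocessing: the function name `volume_to_slices`, the slice count, the sampling axis, and the min-max intensity normalization are all assumptions chosen for illustration.

```python
import numpy as np
from PIL import Image


def volume_to_slices(volume, num_slices=16, axis=0):
    """Sample evenly spaced slices along `axis`; return them as ordered RGB images.

    Hypothetical helper: slice count, axis, and normalization are illustrative
    assumptions, not the settings used to train UniReason-Med.
    """
    volume = np.moveaxis(np.asarray(volume, dtype=np.float32), axis, 0)
    depth = volume.shape[0]
    # Evenly spaced slice indices, kept in their original spatial order.
    indices = np.linspace(0, depth - 1, num=min(num_slices, depth)).round().astype(int)
    # Min-max normalize over the whole volume so all slices share one intensity scale.
    lo, hi = float(volume.min()), float(volume.max())
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    slices = []
    for i in indices:
        arr = ((volume[i] - lo) * scale).astype(np.uint8)
        slices.append(Image.fromarray(arr).convert("RGB"))
    return slices
```

Under these assumptions, each returned image could be passed to the processor as one entry of a multi-image prompt, in slice order. The released code remains the reference for how volumes are actually serialized.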

A central result of the paper is that joint 2D+3D grounded supervision improves 3D reasoning compared with 3D-only training under matched schedules, while the shared grounding interface also benefits 2D tasks.

Training

The model is built with a two-stage recipe:

  1. Supervised fine-tuning (SFT) on the UniMed-CoT dataset: 220K grounded chain-of-thought samples (170K 2D + 50K 3D) with interleaved textual reasoning and grounded visual evidence. The vision tower and the multimodal projector are frozen; the language model is fully fine-tuned.
  2. Reinforcement learning (GRPO) with outcome-level rewards. RL uses answer-correctness and format rewards rather than ground-truth localization-overlap rewards such as IoU or Dice.
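The outcome-level reward in stage 2 can be illustrated with a short sketch. It is not the released reward function: the `<answer>...</answer>` tag format, the exact-match comparison, and the `format_weight` value are assumptions. The second function shows the group-relative normalization that GRPO applies to the rewards of rollouts sampled for the same prompt.

```python
import re
import statistics

# Hypothetical response format: free-form reasoning, then "<answer>...</answer>".
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def outcome_reward(response, gold, format_weight=0.1):
    """Format bonus plus answer-correctness reward; no IoU or Dice term is used."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # malformed output earns neither reward
    correct = match.group(1).strip().lower() == gold.strip().lower()
    return format_weight + (1.0 if correct else 0.0)


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the reward depends only on the final answer and the output format, predicted boxes are shaped indirectly, through their effect on answer correctness, rather than through direct comparison with ground-truth regions.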

This checkpoint is the merged Hugging Face model exported from the GRPO stage.

Training code (LLaMA-Factory for SFT, verl for GRPO) and configs are released at: https://github.com/IQuestLab/unireason-med.

Usage

from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "IQuestLab/UniReason-Med"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# Convert to RGB so grayscale or palette images are handled consistently.
image = Image.open("medical_image.png").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the most likely diagnosis? Reason step by step."},
        ],
    }
]

# Render the chat template to text, then tokenize it together with the image.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = output[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])

The model produces interleaved reasoning with bounding boxes over the input image. Reproducing the full grounded crop-and-continue loop (crop the predicted region and feed it back as visual input) follows the agent/rollout logic in the released training code.
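As a rough illustration of the crop half of that loop, the sketch below extracts boxes from generated text and crops the matching regions with Pillow. The box syntax `[x1, y1, x2, y2]` in pixel coordinates and the helper name `crop_predicted_regions` are assumptions; the model's actual box format and the region-token injection step are defined in the released training code.

```python
import re

from PIL import Image

# Hypothetical box syntax: "[x1, y1, x2, y2]" in pixel coordinates of the input image.
BOX_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def crop_predicted_regions(image, text):
    """Return one crop per well-formed box found in `text`, clamped to the image bounds."""
    width, height = image.size
    crops = []
    for match in BOX_RE.finditer(text):
        x1, y1, x2, y2 = (int(v) for v in match.groups())
        x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
        y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
        if x2 > x1 and y2 > y1:  # skip empty or inverted boxes
            crops.append(image.crop((x1, y1, x2, y2)))
    return crops
```

In a full loop, each crop would be appended to the conversation as additional visual input before generation resumes. That step is omitted here because it depends on the rollout logic in the repository.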

Intended Use and Limitations

  • Intended use: research on medical multimodal reasoning, visual grounding, and 2D-to-3D transfer. Suitable for academic benchmarking and method development.
  • Out of scope: UniReason-Med is a research artifact and is not a medical device. It must not be used for clinical diagnosis, treatment decisions, or any real patient care.
  • Limitations: outputs may be incorrect, incomplete, or biased; performance depends on imaging modality, anatomy, and distribution shift from the training data. Predicted bounding boxes are reasoning aids, not validated localization. Always involve qualified medical professionals for any health-related decision.

Data Notice

The public training-data release keeps 3D examples text-only and does not redistribute 3D image data, because those samples are derived from M3D, whose underlying image sources include Radiopaedia and may require separate authorization. See the dataset card for details.

License

Released under the Apache License 2.0, consistent with the base model Qwen2.5-VL-7B-Instruct. Note the research-only intended use and the medical-use limitations above.

Citation

If you use this model, please cite the UniReason-Med paper:

@article{unireasonmed,
  title  = {UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA},
  author = {UniReason-Med Team},
  year   = {2025}
}