MedGPT-oss

MedGPT-oss is an open-weight ~20B-parameter vision–language model for biomedicine, built on GPT-oss-20B with a CLIP-ViT-L/14 @ 336 px visual encoder and a two-layer MLP projector. It is trained with a three-stage curriculum (alignment → long-context mid-training → instruction tuning) and is positioned as a research foundation for medical AI work that needs on-premises, privacy-preserving inference.

📄 Paper: arXiv:2603.00842


⚠️ License and usage restrictions (please read before requesting access)

MedGPT-oss was trained in part on data subject to use restrictions. The weights are released under CC BY-NC 4.0 for research use only. By using this model you agree:

  • No commercial use of any kind (products, services, paid APIs, or internal commercial workflows).
  • No clinical decision-making. The model must not inform, assist, or replace any real-world diagnosis, treatment plan, or patient-facing service. Outputs are not medical advice.
  • No re-identification of any individual whose data may have contributed to training.
  • Use of the Python source files in this repo is permitted under MIT (derived from InternVL, Copyright 2024 OpenGVLab); the source-code MIT license does not override the non-commercial restriction on the weights.

See LICENSE for the full terms.


Intended use

MedGPT-oss is intended as a research foundation for biomedical vision–language work, not as a drop-in clinical tool. Typical uses:

  • Studying multimodal medical reasoning in a fully open-weight setting.
  • Benchmarking on medical VQA, multiple-choice QA, and radiology report generation tasks.
  • Fine-tuning on institutional research datasets where on-premises inference is required.
  • Comparing against other medical VLMs in a controlled research setting.

Researchers are expected to perform their own validation and adaptation before any use that goes beyond reading benchmark numbers. The model should be treated as a starting point that requires meaningful modification and external validation for any downstream use case.

Out-of-scope use

The following are explicitly prohibited:

  • Clinical decision-making. Outputs must not inform diagnoses, treatment plans, triage, or any patient-facing decision.
  • Deployed clinical systems. No integration into EHR systems, clinical workflows, or patient-facing software.
  • Commercial use of any kind.
  • Re-identification of individuals from the training distribution.
  • Multi-turn / agentic applications — the model was trained and evaluated as a single-turn predictor; multi-turn behavior is unvalidated.
  • Unsupervised consumer use. Outputs are not medical advice and must not be presented as such.

Requirements

pip install "transformers>=4.55" torch torchvision pillow

Weights ship in bf16 (~40 GB on disk). A single GPU with ≥48 GB of memory is comfortable; two 24 GB GPUs with device_map="auto" also work.


Quickstart

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "UFNLP/MedGPT-oss"

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)

answer = model.chat(
    tokenizer,
    pixel_values=None,
    question="What is the first-line outpatient antibiotic for community-acquired pneumonia in healthy adults?",
    generation_config=dict(max_new_tokens=384, do_sample=False, repetition_penalty=1.05),
)
print(answer)
# -> "The first-line outpatient antibiotic for community-acquired pneumonia in healthy adults is Amoxicillin."

model.chat() automatically:

  • formats the prompt with the MedGPT-oss Harmony system message;
  • generates with the supplied generation_config;
  • strips the <|channel|>final<|message|> … <|end|> envelope;
  • removes repeated sentences (regex-based dedup).

Pass postprocess=False to get the raw decoded string.
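
For example, to inspect the raw Harmony output yourself (a minimal sketch; the question text is illustrative):

raw = model.chat(
    tokenizer,
    pixel_values=None,
    question="Define tachycardia in one sentence.",
    generation_config=dict(max_new_tokens=64, do_sample=False),
    postprocess=False,
)
print(raw)  # still wrapped in <|channel|>final<|message|> ... <|end|>, with no dedup applied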


Image + text

How image input works

MedGPT-oss uses InternVL-style dynamic tiling: an input image is split into up to max_num non-overlapping 336×336 tiles whose aspect ratio best matches the original image, plus one optional 336×336 thumbnail of the full image. Each tile is encoded by the CLIP ViT-L/14 vision tower and fed to the language model as a span of image tokens. Tiling preserves resolution on high-detail medical images (radiographs, pathology slides) without down-sampling to a single 336×336 crop.

To run the model on an image you:

  1. Call load_image(path, max_num=N) to get a (n_tiles, 3, 336, 336) bf16 tensor.
  2. Put one <image> placeholder in your question string for each image. If you pass pixel_values but forget the placeholder, model.chat will prepend one automatically (single-image case only).
  3. Pass the tensor to model.chat(..., pixel_values=...).

max_num is a budget. max_num=12 (the default) lets a 4032×3024 X-ray become 4×3 tiles + 1 thumbnail (13 tiles), each 336². Larger values preserve more detail at the cost of more image tokens (~256 LLM tokens per tile). Use max_num=1 to force a single global view.

The helper below is self-contained — no extra packages.

import torch, torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from PIL import Image

IM_MEAN, IM_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

def _transform(sz=336):
    return T.Compose([
        T.Lambda(lambda i: i.convert("RGB") if i.mode != "RGB" else i),
        T.Resize((sz, sz), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(IM_MEAN, IM_STD),
    ])

def _closest_ratio(ar, ratios, w, h, sz):
    best, best_diff = (1, 1), float("inf")
    for r in ratios:
        d = abs(ar - r[0] / r[1])
        if d < best_diff:
            best, best_diff = r, d
        elif d == best_diff and w * h > 0.5 * sz * sz * r[0] * r[1]:
            best = r
    return best

def _tile(img, max_num=12, sz=336, thumb=True):
    w, h = img.size
    ratios = sorted(
        {(i, j) for n in range(1, max_num + 1)
         for i in range(1, n + 1) for j in range(1, n + 1) if 1 <= i * j <= max_num},
        key=lambda x: x[0] * x[1],
    )
    r = _closest_ratio(w / h, ratios, w, h, sz)
    resized = img.resize((sz * r[0], sz * r[1]))
    tiles = [
        resized.crop(((i % r[0]) * sz, (i // r[0]) * sz,
                     (i % r[0] + 1) * sz, (i // r[0] + 1) * sz))
        for i in range(r[0] * r[1])
    ]
    if thumb and len(tiles) > 1:
        tiles.append(img.resize((sz, sz)))
    return tiles

def load_image(path, max_num=12, sz=336):
    img = Image.open(path).convert("RGB")
    tf = _transform(sz)
    return torch.stack([tf(t) for t in _tile(img, max_num, sz)])

pixel_values = load_image("chest_xray.png").to(torch.bfloat16).to(model.device)

answer = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question="<image>\nWrite the Findings section of this chest X-ray. Be concise and clinical.",
    generation_config=dict(max_new_tokens=384, do_sample=False, repetition_penalty=1.05),
)
print(answer)
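
To see how the max_num budget translates into image tokens, inspect the shape of the tensor returned by load_image (the 13-tile shape assumes a 4:3 image like the 4032×3024 example above; actual counts depend on aspect ratio):

pv_default = load_image("chest_xray.png", max_num=12)  # e.g. torch.Size([13, 3, 336, 336]) for a 4:3 image
pv_single = load_image("chest_xray.png", max_num=1)    # torch.Size([1, 3, 336, 336]): one global view, no thumbnail
print(pv_default.shape, pv_single.shape)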

Multiple images per turn

For prompts referencing more than one image (e.g. a current and a prior study), tile each image separately and concatenate the tensors. Put one <image> placeholder per image, in the same order:

pv_current = load_image("current.jpg", max_num=8)
pv_prior   = load_image("prior.jpg",   max_num=8)
pixel_values = torch.cat([pv_current, pv_prior], dim=0).to(torch.bfloat16).to(model.device)

answer = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question=(
        "Given the current AP image <image> and the prior AP image <image>, "
        "describe interval changes."
    ),
    generation_config=dict(max_new_tokens=384, do_sample=False, repetition_penalty=1.05),
)
print(answer)

Multiple-choice helper

For MCQ benchmarks the model emits a single letter:

import re
ans = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question="<image>\nIs there evidence of an aortic aneurysm?\n"
             "Answer Choices: (A) Yes, (B) No.\nPlease select the correct option.",
    generation_config=dict(max_new_tokens=8, do_sample=False),
)
match = re.search(r"\b([A-E])\b", ans)
letter = match.group(1) if match else None  # guard against outputs without a bare option letter

Evaluation

All numbers below are from the MedGPT-oss paper. We compare MedGPT-oss-20B against contemporaneous open-weight medical foundation models. Bold = best in row.

Medical Visual Question Answering (accuracy)

| Dataset | MedGPT-oss-20B | OctoMed-7B | Hulu-Med-32B | Lingshu-32B | MedGemma-27B | QoQ-Med-32B |
|---|---|---|---|---|---|---|
| MedXQA (multimodal) | **49.23** | 25.60 | 34.35 | 31.43 | 30.90 | 29.64 |
| SLAKE | 71.53 | 65.07 | 69.14 | **72.24** | 55.98 | 46.53 |
| MedFrameQA | **63.01** | 42.82 | 62.82 | 61.01 | 47.63 | 55.73 |
| MMMU-Med (dev) | **61.49** | 47.65 | 57.71 | 59.43 | 47.43 | 51.84 |
| MMMU-Med-Pro (4 opt.) | 52.34 | 44.62 | 52.45 | **52.67** | 45.80 | 46.93 |
| MMMU-Med-Pro (10 opt.) | 39.94 | 23.07 | 37.41 | **43.45** | 36.71 | 38.12 |

Text-only medical QA (accuracy)

| Dataset | MedGPT-oss-20B | OctoMed-7B | Hulu-Med-32B | Lingshu-32B | MedGemma-27B | QoQ-Med-32B |
|---|---|---|---|---|---|---|
| MedQA | 70.11 | 44.97 | **77.18** | 70.71 | 66.75 | 55.25 |
| PubMedQA | 57.81 | 48.31 | 61.00 | **62.44** | 55.80 | 42.80 |
| MedMCQA | 62.53 | 55.32 | **72.75** | 65.27 | 65.48 | 51.42 |
| MedXQA (text) | **25.38** | 10.86 | 23.47 | 21.47 | 14.37 | 8.78 |
| MMLU-Med | 72.59 | 61.65 | **87.10** | 82.68 | 80.65 | 74.98 |
| MedBullets | **68.71** | 32.21 | 67.45 | 58.69 | 51.34 | 37.25 |

Chest X-ray report generation (MIMIC-CXR)

| Metric | MedGPT-oss-20B | OctoMed-7B | Hulu-Med-32B | Lingshu-32B | MedGemma-27B | QoQ-Med-32B |
|---|---|---|---|---|---|---|
| RadGraph-F1 | 0.189 | 0.129 | 0.267 | **0.278** | 0.095 | 0.166 |
| RaTEScore | 0.522 | 0.402 | 0.575 | **0.590** | 0.291 | 0.482 |
| 1 / RadCliQ-v1 | 0.803 | 0.615 | 0.867 | **0.881** | 0.505 | 0.748 |

Clinical in-context learning

Scores are accuracy (%) for Patient–trial matching and RaTEScore for Impression generation.

| Task | Metric | Setting | MedGPT-oss-20B | OctoMed-7B | Hulu-Med-32B | Lingshu-32B | MedGemma-27B | QoQ-Med-32B |
|---|---|---|---|---|---|---|---|---|
| Patient–trial | Acc. (%) | 0-shot | 48.81 | 40.96 | 51.01 | **52.07** | 31.03 | 45.20 |
| Patient–trial | Acc. (%) | 1-shot | **55.60** | 40.02 | 47.00 | 48.91 | 52.24 | 47.41 |
| Impression | RaTEScore | 0-shot | **47.22** | 31.04 | 43.14 | 43.80 | 38.42 | 41.44 |
| Impression | RaTEScore | 1-shot | **47.25** | 30.91 | 41.52 | 40.27 | 38.29 | 40.71 |

Notes

  • Do not use AutoProcessor. This repo does not register an image processor.
  • EOS / stop token: <|return|> (id 200002). model.chat sets it automatically.
  • postprocess=False returns the raw <|channel|>final<|message|>…<|end|> envelope — useful if you want to inspect Harmony channels yourself.
  • Continued training / fine-tuning: set model.config.llm_config.output_router_logits = True before .forward() so the MoE load-balancing aux loss is computed. The shipped config defaults this to False for inference; see the sketch below.
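
A minimal sketch of that toggle (assumes you fine-tune through the remote-code modeling class; the exact training loop and forward() arguments are up to you):

model.config.llm_config.output_router_logits = True  # router logits on, so the MoE aux loss is computed
model.train()
# ... run your usual fine-tuning loop; with the flag left at the shipped default (False),
# forward passes skip the load-balancing auxiliary loss.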

Citation

@article{zhang2026medgptoss,
  title   = {MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine},
  author  = {Zhang, Kai and Yuan, Zhengqing and Peng, Cheng and Zhao, Songlin and
             Lyu, Mengxian and Chen, Ziyi and Ye, Yanfang and Liu, Wei and
             Zhang, Ying and Smith, Kaleb E. and He, Lifang and Sun, Lichao and Wu, Yonghui},
  journal = {arXiv preprint arXiv:2603.00842},
  year    = {2026}
}

Contact

Lichao Sun (lis221@lehigh.edu) · Yonghui Wu (yonghui.wu@ufl.edu)
