MedGPT-oss
MedGPT-oss is an open-weight ~20B-parameter vision–language model for biomedicine, built on GPT-oss-20B with a CLIP-ViT-L/14 @ 336 px visual encoder and a two-layer MLP projector. It is trained with a three-stage curriculum (alignment → long-context mid-training → instruction tuning) and is positioned as a research foundation for medical AI work that needs on-premises, privacy-preserving inference.
📄 Paper: arXiv:2603.00842
⚠️ License and usage restrictions (please read before requesting access)
MedGPT-oss was trained in part on data subject to use restrictions. The weights are released under CC BY-NC 4.0 for research use only. By using this model you agree:
- No commercial use of any kind (products, services, paid APIs, or internal commercial workflows).
- No clinical decision-making. The model must not inform, assist, or replace any real-world diagnosis, treatment plan, or patient-facing service. Outputs are not medical advice.
- No re-identification of any individual whose data may have contributed to training.
- Use of the Python source files in this repo is permitted under MIT (derived from InternVL, Copyright 2024 OpenGVLab); the source-code MIT license does not override the non-commercial restriction on the weights.
See LICENSE for the full terms.
Intended use
MedGPT-oss is intended as a research foundation for biomedical vision–language work, not as a drop-in clinical tool. Typical uses:
- Studying multimodal medical reasoning in a fully open-weight setting.
- Benchmarking on medical VQA, multiple-choice QA, and radiology report generation tasks.
- Fine-tuning on institutional research datasets where on-premises inference is required.
- Comparing against other medical VLMs in a controlled research setting.
Researchers are expected to perform their own validation and adaptation: the model is a starting point that requires meaningful modification and external validation for any downstream use beyond reproducing benchmark numbers.
Out-of-scope use
The following are explicitly prohibited:
- Clinical decision-making. Outputs must not inform diagnoses, treatment plans, triage, or any patient-facing decision.
- Deployed clinical systems. No integration into EHR systems, clinical workflows, or patient-facing software.
- Commercial use of any kind.
- Re-identification of individuals from the training distribution.
- Multi-turn / agentic applications — the model was trained and evaluated as a single-turn predictor; multi-turn behavior is unvalidated.
- Unsupervised consumer use. Outputs are not medical advice and must not be presented as such.
Requirements
```bash
pip install "transformers>=4.55" torch torchvision pillow
```
Weights ship in bf16 (~40 GB on disk). One ≥48 GB GPU, or two 24 GB GPUs with `device_map="auto"`, is comfortable.
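If you split the model across two 24 GB cards, you can also cap per-device memory explicitly. This is a minimal sketch using standard `transformers`/`accelerate` loading arguments; the 22 GiB headroom figure is an assumption, so tune it to your sequence lengths and tile counts:

```python
import torch
from transformers import AutoModel

# Assumption: two 24 GB GPUs; leave headroom for activations and image tiles.
model = AutoModel.from_pretrained(
    "UFNLP/MedGPT-oss",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB"},
    trust_remote_code=True,
).eval()
```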
Quickstart
```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "UFNLP/MedGPT-oss"

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)

answer = model.chat(
    tokenizer,
    pixel_values=None,
    question="What is the first-line outpatient antibiotic for community-acquired pneumonia in healthy adults?",
    generation_config=dict(max_new_tokens=384, do_sample=False, repetition_penalty=1.05),
)
print(answer)
# -> "The first-line outpatient antibiotic for community-acquired pneumonia in healthy adults is Amoxicillin."
```
`model.chat()` automatically:
- formats the prompt with the MedGPT-oss Harmony system message;
- generates with the supplied `generation_config`;
- strips the `<|channel|>final<|message|> … <|end|>` envelope;
- removes repeated sentences (regex-based dedup).

Pass `postprocess=False` to get the raw decoded string.
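If you want to inspect the Harmony channels yourself, disable postprocessing and strip the final-channel envelope by hand. A minimal sketch, assuming the envelope markers shown above appear verbatim in the decoded string:

```python
import re

raw = model.chat(
    tokenizer,
    pixel_values=None,
    question="Name two common causes of microcytic anemia.",
    generation_config=dict(max_new_tokens=256, do_sample=False),
    postprocess=False,  # return the raw decoded string, envelope included
)

# Pull out the final channel; fall back to the raw string if the markers are absent.
m = re.search(r"<\|channel\|>final<\|message\|>(.*?)<\|end\|>", raw, flags=re.S)
print(m.group(1).strip() if m else raw)
```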
Image + text
How image input works
MedGPT-oss uses InternVL-style dynamic tiling: an input image is split into up to `max_num` non-overlapping 336×336 tiles whose aspect ratio best matches the original image, plus one optional 336×336 thumbnail of the full image. Each tile is encoded by the CLIP ViT-L/14 vision tower and fed to the language model as a span of image tokens. Tiling preserves resolution on high-detail medical images (radiographs, pathology slides) without down-sampling to a single 336×336 crop.
To run the model on an image you:
- Call `load_image(path, max_num=N)` to get an `(n_tiles, 3, 336, 336)` tensor (cast it to bf16 before passing it to the model).
- Put one `<image>` placeholder in your `question` string for each image. If you pass `pixel_values` but forget the placeholder, `model.chat` will prepend one automatically (single-image case only).
- Pass the tensor to `model.chat(..., pixel_values=...)`.
`max_num` is a budget. `max_num=12` (the default) lets a 4032×3024 X-ray become 4×3 tiles + 1 thumbnail (13 tiles), each 336². Larger values preserve more detail at the cost of more image tokens (~256 LLM tokens per tile). Use `max_num=1` to force a single global view.
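As a back-of-the-envelope check on prompt length, you can estimate the image-token cost of a tiling budget. The sketch below simply multiplies tile count by the ~256-tokens-per-tile figure quoted above; the exact per-tile count depends on the projector configuration:

```python
def image_token_budget(n_cols: int, n_rows: int, thumbnail: bool = True,
                       tokens_per_tile: int = 256) -> int:
    """Rough LLM-token cost of one tiled image (grid tiles + optional thumbnail)."""
    n_tiles = n_cols * n_rows
    if thumbnail and n_tiles > 1:
        n_tiles += 1
    return n_tiles * tokens_per_tile

# A 4032x3024 X-ray under max_num=12 becomes a 4x3 grid + thumbnail = 13 tiles.
print(image_token_budget(4, 3))  # -> 3328
print(image_token_budget(1, 1))  # max_num=1: one global view -> 256
```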
The helper below is self-contained and needs only the packages installed above (torchvision, Pillow).
```python
import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from PIL import Image

# ImageNet normalization used by the CLIP ViT-L/14 tower.
IM_MEAN, IM_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

def _transform(sz=336):
    return T.Compose([
        T.Lambda(lambda i: i.convert("RGB") if i.mode != "RGB" else i),
        T.Resize((sz, sz), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(IM_MEAN, IM_STD),
    ])

def _closest_ratio(ar, ratios, w, h, sz):
    # Pick the tile grid (cols, rows) whose aspect ratio is closest to the image's.
    best, best_diff = (1, 1), float("inf")
    for r in ratios:
        d = abs(ar - r[0] / r[1])
        if d < best_diff:
            best, best_diff = r, d
        elif d == best_diff and w * h > 0.5 * sz * sz * r[0] * r[1]:
            best = r
    return best

def _tile(img, max_num=12, sz=336, thumb=True):
    w, h = img.size
    ratios = sorted(
        {(i, j) for n in range(1, max_num + 1)
         for i in range(1, n + 1) for j in range(1, n + 1) if 1 <= i * j <= max_num},
        key=lambda x: x[0] * x[1],
    )
    r = _closest_ratio(w / h, ratios, w, h, sz)
    resized = img.resize((sz * r[0], sz * r[1]))
    tiles = [
        resized.crop(((i % r[0]) * sz, (i // r[0]) * sz,
                      (i % r[0] + 1) * sz, (i // r[0] + 1) * sz))
        for i in range(r[0] * r[1])
    ]
    if thumb and len(tiles) > 1:
        tiles.append(img.resize((sz, sz)))  # global thumbnail view
    return tiles

def load_image(path, max_num=12, sz=336):
    img = Image.open(path).convert("RGB")
    tf = _transform(sz)
    return torch.stack([tf(t) for t in _tile(img, max_num, sz)])
```
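A quick sanity check with a synthetic image (no file needed) shows how the tile budget changes the output shape; the exact grid depends on the image's aspect ratio:

```python
from PIL import Image
import torch

demo = Image.new("RGB", (4032, 3024))  # 4:3 landscape, like many radiographs
tf = _transform()

full = torch.stack([tf(t) for t in _tile(demo, max_num=12)])
single = torch.stack([tf(t) for t in _tile(demo, max_num=1)])
print(full.shape)    # torch.Size([13, 3, 336, 336]) -> 4x3 grid + thumbnail
print(single.shape)  # torch.Size([1, 3, 336, 336])  -> one global view
```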
```python
pixel_values = load_image("chest_xray.png").to(torch.bfloat16).to(model.device)

answer = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question="<image>\nWrite the Findings section of this chest X-ray. Be concise and clinical.",
    generation_config=dict(max_new_tokens=384, do_sample=False, repetition_penalty=1.05),
)
print(answer)
```
Multiple images per turn
For prompts referencing more than one image (e.g. a current and a prior study), tile each image separately and concatenate the tensors. Put one `<image>` placeholder per image, in the same order:
```python
pv_current = load_image("current.jpg", max_num=8)
pv_prior = load_image("prior.jpg", max_num=8)
pixel_values = torch.cat([pv_current, pv_prior], dim=0).to(torch.bfloat16).to(model.device)

answer = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question=(
        "Given the current AP image <image> and the prior AP image <image>, "
        "describe interval changes."
    ),
    generation_config=dict(max_new_tokens=384, do_sample=False, repetition_penalty=1.05),
)
print(answer)
```
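When stitching several studies into one prompt, it is easy for the number of `<image>` tags to drift from the number of tensors you concatenated. A tiny guard like the hypothetical helper below (not part of the repo) catches that before generation:

```python
def check_image_tags(question: str, image_tensors: list) -> None:
    """Hypothetical sanity check: one <image> tag per tiled image tensor."""
    n_tags = question.count("<image>")
    if n_tags != len(image_tensors):
        raise ValueError(f"{n_tags} <image> tags but {len(image_tensors)} image tensors")

check_image_tags(
    "Given the current AP image <image> and the prior AP image <image>, "
    "describe interval changes.",
    [pv_current, pv_prior],
)
```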
Multiple-choice helper
For MCQ benchmarks the model emits a single letter:
```python
import re

ans = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question="<image>\nIs there evidence of an aortic aneurysm?\n"
             "Answer Choices: (A) Yes, (B) No.\nPlease select the correct option.",
    generation_config=dict(max_new_tokens=8, do_sample=False),
)
m = re.search(r"\b([A-E])\b", ans)
letter = m.group(1) if m else None  # None if no bare option letter was emitted
```
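For a full benchmark run you would wrap this in a loop. A minimal sketch; the `items` list and its field names (`pixel_values`, `question`, `answer`) are a placeholder schema, not something the repo provides:

```python
import re

def mcq_accuracy(items):
    """items: dicts with 'pixel_values', 'question', and a gold 'answer' letter (placeholder schema)."""
    correct = 0
    for ex in items:
        pred = model.chat(
            tokenizer,
            pixel_values=ex["pixel_values"],
            question=ex["question"],
            generation_config=dict(max_new_tokens=8, do_sample=False),
        )
        m = re.search(r"\b([A-E])\b", pred)
        correct += int(m is not None and m.group(1) == ex["answer"])
    return correct / max(len(items), 1)
```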
Evaluation
All numbers below are from the MedGPT-oss paper. We compare MedGPT-oss-20B against contemporaneous open-weight medical foundation models. Bold = best in row.
Medical Visual Question Answering (accuracy)
| Dataset | MedGPT-oss-20B | OctoMed-7B | Hulu-Med-32B | Lingshu-32B | MedGemma-27B | QoQ-Med-32B |
|---|---|---|---|---|---|---|
| MedXQA (multimodal) | **49.23** | 25.60 | 34.35 | 31.43 | 30.90 | 29.64 |
| SLAKE | 71.53 | 65.07 | 69.14 | **72.24** | 55.98 | 46.53 |
| MedFrameQA | **63.01** | 42.82 | 62.82 | 61.01 | 47.63 | 55.73 |
| MMMU-Med (dev) | **61.49** | 47.65 | 57.71 | 59.43 | 47.43 | 51.84 |
| MMMU-Med-Pro (4 opt.) | 52.34 | 44.62 | 52.45 | **52.67** | 45.80 | 46.93 |
| MMMU-Med-Pro (10 opt.) | 39.94 | 23.07 | 37.41 | **43.45** | 36.71 | 38.12 |
Text-only medical QA (accuracy)
| Dataset | MedGPT-oss-20B | OctoMed-7B | Hulu-Med-32B | Lingshu-32B | MedGemma-27B | QoQ-Med-32B |
|---|---|---|---|---|---|---|
| MedQA | 70.11 | 44.97 | **77.18** | 70.71 | 66.75 | 55.25 |
| PubMedQA | 57.81 | 48.31 | 61.00 | **62.44** | 55.80 | 42.80 |
| MedMCQA | 62.53 | 55.32 | **72.75** | 65.27 | 65.48 | 51.42 |
| MedXQA (text) | **25.38** | 10.86 | 23.47 | 21.47 | 14.37 | 8.78 |
| MMLU-Med | 72.59 | 61.65 | **87.10** | 82.68 | 80.65 | 74.98 |
| MedBullets | **68.71** | 32.21 | 67.45 | 58.69 | 51.34 | 37.25 |
Chest X-ray report generation (MIMIC-CXR)
| Metric | MedGPT-oss-20B | OctoMed-7B | Hulu-Med-32B | Lingshu-32B | MedGemma-27B | QoQ-Med-32B |
|---|---|---|---|---|---|---|
| RadGraph-F1 | 0.189 | 0.129 | 0.267 | **0.278** | 0.095 | 0.166 |
| RaTEScore | 0.522 | 0.402 | 0.575 | **0.590** | 0.291 | 0.482 |
| 1 / RadCliQ-v1 | 0.803 | 0.615 | 0.867 | **0.881** | 0.505 | 0.748 |
Clinical in-context learning
Scores are accuracy (%) for Patient–trial matching and RaTEScore for Impression generation.
| Task | Metric | Setting | MedGPT-oss-20B | OctoMed-7B | Hulu-Med-32B | Lingshu-32B | MedGemma-27B | QoQ-Med-32B |
|---|---|---|---|---|---|---|---|---|
| Patient–trial | Acc. (%) | 0-shot | 48.81 | 40.96 | 51.01 | **52.07** | 31.03 | 45.20 |
| Patient–trial | Acc. (%) | 1-shot | **55.60** | 40.02 | 47.00 | 48.91 | 52.24 | 47.41 |
| Impression | RaTEScore | 0-shot | **47.22** | 31.04 | 43.14 | 43.80 | 38.42 | 41.44 |
| Impression | RaTEScore | 1-shot | **47.25** | 30.91 | 41.52 | 40.27 | 38.29 | 40.71 |
Notes
- Do not use `AutoProcessor`. This repo does not register an image processor.
- EOS / stop token: `<|return|>` (id 200002). `model.chat` sets it automatically.
- `postprocess=False` returns the raw `<|channel|>final<|message|>…<|end|>` envelope, useful if you want to inspect Harmony channels yourself.
- Continued training / fine-tuning: set `model.config.llm_config.output_router_logits = True` before `.forward()` so the MoE load-balancing aux loss is computed. The shipped config defaults this to `False` for inference.
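For continued training, that flag is the only non-obvious switch. A minimal sketch of where it goes; the forward call and its keyword arguments are an assumption based on the usual `transformers` vision-language causal-LM pattern, not repo-verified code:

```python
# Assumed kwargs (pixel_values / input_ids / labels) follow the standard transformers VLM pattern.
model.config.llm_config.output_router_logits = True  # enable the MoE load-balancing aux loss
model.train()

out = model(
    pixel_values=pixel_values,  # tiled image tensor, as above
    input_ids=input_ids,        # tokenized prompt + target
    labels=labels,              # -100 on prompt positions
)
out.loss.backward()             # loss now includes the router aux term
```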
Citation
```bibtex
@article{zhang2026medgptoss,
  title   = {MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine},
  author  = {Zhang, Kai and Yuan, Zhengqing and Peng, Cheng and Zhao, Songlin and
             Lyu, Mengxian and Chen, Ziyi and Ye, Yanfang and Liu, Wei and
             Zhang, Ying and Smith, Kaleb E. and He, Lifang and Sun, Lichao and Wu, Yonghui},
  journal = {arXiv preprint arXiv:2603.00842},
  year    = {2026}
}
```
Contact
Lichao Sun (lis221@lehigh.edu) · Yonghui Wu (yonghui.wu@ufl.edu)