MedGPT-oss

MedGPT-oss is an open-weight ~20B-parameter vision–language model for biomedicine, built on GPT-oss-20B with a CLIP-ViT-L/14 @ 336 px visual encoder and a two-layer MLP projector. It is trained with a three-stage curriculum (alignment → long-context mid-training → instruction tuning) and is positioned as a research foundation for medical AI work that needs on-premises, privacy-preserving inference.

📄 Paper: arXiv:2603.00842


⚠️ License and usage restrictions (please read before requesting access)

MedGPT-oss was trained in part on data subject to use restrictions. The weights are released under CC BY-NC 4.0 for research use only. By using this model you agree:

  • No commercial use of any kind (products, services, paid APIs, or internal commercial workflows).
  • No clinical decision-making. The model must not inform, assist, or replace any real-world diagnosis, treatment plan, or patient-facing service. Outputs are not medical advice.
  • No re-identification of any individual whose data may have contributed to training.
  • Use of the Python source files in this repo is permitted under MIT (derived from InternVL, Copyright 2024 OpenGVLab); the source-code MIT license does not override the non-commercial restriction on the weights.

See LICENSE for the full terms.


Intended use

MedGPT-oss is intended as a research foundation for biomedical vision–language work, not as a drop-in clinical tool. Typical uses:

  • Studying multimodal medical reasoning in a fully open-weight setting.
  • Benchmarking on medical VQA, multiple-choice QA, and radiology report generation tasks.
  • Fine-tuning on institutional research datasets where on-premises inference is required.
  • Comparing against other medical VLMs in a controlled research setting.

Researchers are expected to perform their own validation and adaptation before any use that goes beyond reading benchmark numbers. The model should be treated as a starting point that requires meaningful modification and external validation for any downstream use case.

Out-of-scope use

The following are explicitly prohibited:

  • Clinical decision-making. Outputs must not inform diagnoses, treatment plans, triage, or any patient-facing decision.
  • Deployed clinical systems. No integration into EHR systems, clinical workflows, or patient-facing software.
  • Commercial use of any kind.
  • Re-identification of individuals from the training distribution.
  • Multi-turn / agentic applications — the model was trained and evaluated as a single-turn predictor; multi-turn behavior is unvalidated.
  • Unsupervised consumer use. Outputs are not medical advice and must not be presented as such.

Requirements

pip install "transformers>=4.55" torch torchvision pillow

Weights ship in bf16 (~40 GB on disk). A single GPU with ≥48 GB of memory is comfortable; two 24 GB GPUs with device_map="auto" also work.


Quickstart

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "UFNLP/MedGPT-oss"

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)

answer = model.chat(
    tokenizer,
    pixel_values=None,
    question="What is the first-line outpatient antibiotic for community-acquired pneumonia in healthy adults?",
    generation_config=dict(max_new_tokens=384, do_sample=False, repetition_penalty=1.05),
)
print(answer)
# -> "The first-line outpatient antibiotic for community-acquired pneumonia in healthy adults is Amoxicillin."

model.chat() automatically:

  • formats the prompt with the MedGPT-oss Harmony system message;
  • generates with the supplied generation_config;
  • strips the <|channel|>final<|message|> … <|end|> envelope;
  • removes repeated sentences (regex-based dedup).

Pass postprocess=False to get the raw decoded string.
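
For example, to inspect the raw Harmony output yourself (a minimal sketch; the question text is illustrative):

raw = model.chat(
    tokenizer,
    pixel_values=None,
    question="Define tachycardia in one sentence.",
    generation_config=dict(max_new_tokens=64, do_sample=False),
    postprocess=False,
)
print(raw)  # still wrapped in <|channel|>final<|message|> ... <|end|>, with no dedup applied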


Image + text

How image input works

MedGPT-oss uses InternVL-style dynamic tiling: an input image is split into up to max_num non-overlapping 336×336 tiles whose aspect ratio best matches the original image, plus one optional 336×336 thumbnail of the full image. Each tile is encoded by the CLIP ViT-L/14 vision tower and fed to the language model as a span of image tokens. Tiling preserves resolution on high-detail medical images (radiographs, pathology slides) without down-sampling to a single 336×336 crop.

To run the model on an image you:

  1. Call load_image(path, max_num=N) to get a (n_tiles, 3, 336, 336) bf16 tensor.
  2. Put one <image> placeholder in your question string for each image. If you pass pixel_values but forget the placeholder, model.chat will prepend one automatically (single-image case only).
  3. Pass the tensor to model.chat(..., pixel_values=...).

max_num is a budget. max_num=12 (the default) lets a 4032×3024 X-ray become 4×3 tiles + 1 thumbnail (13 tiles), each 336². Larger values preserve more detail at the cost of more image tokens (~256 LLM tokens per tile). Use max_num=1 to force a single global view.

The helper below is self-contained — no extra packages.

import torch, torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from PIL import Image

IM_MEAN, IM_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

def _transform(sz=336):
    return T.Compose([
        T.Lambda(lambda i: i.convert("RGB") if i.mode != "RGB" else i),
        T.Resize((sz, sz), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(IM_MEAN, IM_STD),
    ])

def _closest_ratio(ar, ratios, w, h, sz):
    best, best_diff = (1, 1), float("inf")
    for r in ratios:
        d = abs(ar - r[0] / r[1])
        if d < best_diff:
            best, best_diff = r, d
        elif d == best_diff and w * h > 0.5 * sz * sz * r[0] * r[1]:
            best = r
    return best

def _tile(img, max_num=12, sz=336, thumb=True):
    w, h = img.size
    ratios = sorted(
        {(i, j) for n in range(1, max_num + 1)
         for i in range(1, n + 1) for j in range(1, n + 1) if 1 <= i * j <= max_num},
        key=lambda x: x[0] * x[1],
    )
    r = _closest_ratio(w / h, ratios, w, h, sz)
    resized = img.resize((sz * r[0], sz * r[1]))
    tiles = [
        resized.crop(((i % r[0]) * sz, (i // r[0]) * sz,
                     (i % r[0] + 1) * sz, (i // r[0] + 1) * sz))
        for i in range(r[0] * r[1])
    ]
    if thumb and len(tiles) > 1:
        tiles.append(img.resize((sz, sz)))
    return tiles

def load_image(path, max_num=12, sz=336):
    img = Image.open(path).convert("RGB")
    tf = _transform(sz)
    return torch.stack([tf(t) for t in _tile(img, max_num, sz)])

pixel_values = load_image("chest_xray.png").to(torch.bfloat16).to(model.device)

answer = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question="<image>\nWrite the Findings section of this chest X-ray. Be concise and clinical.",
    generation_config=dict(max_new_tokens=384, do_sample=False, repetition_penalty=1.05),
)
print(answer)
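
To see how the max_num budget translates into image tokens, inspect the shape of the tensor returned by load_image (the 13-tile shape assumes a 4:3 image like the 4032×3024 example above; actual counts depend on aspect ratio):

pv_default = load_image("chest_xray.png", max_num=12)  # e.g. torch.Size([13, 3, 336, 336]) for a 4:3 image
pv_single = load_image("chest_xray.png", max_num=1)    # torch.Size([1, 3, 336, 336]): one global view, no thumbnail
print(pv_default.shape, pv_single.shape)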

Multiple images per turn

For prompts referencing more than one image (e.g. a current and a prior study), tile each image separately and concatenate the tensors. Put one <image> placeholder per image, in the same order:

pv_current = load_image("current.jpg", max_num=8)
pv_prior   = load_image("prior.jpg",   max_num=8)
pixel_values = torch.cat([pv_current, pv_prior], dim=0).to(torch.bfloat16).to(model.device)

answer = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question=(
        "Given the current AP image <image> and the prior AP image <image>, "
        "describe interval changes."
    ),
    generation_config=dict(max_new_tokens=384, do_sample=False, repetition_penalty=1.05),
)
print(answer)

Multiple-choice helper

For MCQ benchmarks the model emits a single letter:

import re
ans = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question="<image>\nIs there evidence of an aortic aneurysm?\n"
             "Answer Choices: (A) Yes, (B) No.\nPlease select the correct option.",
    generation_config=dict(max_new_tokens=8, do_sample=False),
)
match = re.search(r"\b([A-E])\b", ans)
letter = match.group(1) if match else None  # guard against outputs without a bare option letter

Evaluation

All numbers below are from the MedGPT-oss paper. We compare MedGPT-oss-20B against contemporaneous open-weight medical foundation models. Bold = best in row.

Medical Visual Question Answering (accuracy)

| Dataset | MedGPT-oss-20B | OctoMed-7B | Hulu-Med-32B | Lingshu-32B | MedGemma-27B | QoQ-Med-32B |
|---|---|---|---|---|---|---|
| MedXQA (multimodal) | **49.23** | 25.60 | 34.35 | 31.43 | 30.90 | 29.64 |
| SLAKE | 71.53 | 65.07 | 69.14 | **72.24** | 55.98 | 46.53 |
| MedFrameQA | **63.01** | 42.82 | 62.82 | 61.01 | 47.63 | 55.73 |
| MMMU-Med (dev) | **61.49** | 47.65 | 57.71 | 59.43 | 47.43 | 51.84 |
| MMMU-Med-Pro (4 opt.) | 52.34 | 44.62 | 52.45 | **52.67** | 45.80 | 46.93 |
| MMMU-Med-Pro (10 opt.) | 39.94 | 23.07 | 37.41 | **43.45** | 36.71 | 38.12 |

Text-only medical QA (accuracy)

| Dataset | MedGPT-oss-20B | OctoMed-7B | Hulu-Med-32B | Lingshu-32B | MedGemma-27B | QoQ-Med-32B |
|---|---|---|---|---|---|---|
| MedQA | 70.11 | 44.97 | **77.18** | 70.71 | 66.75 | 55.25 |
| PubMedQA | 57.81 | 48.31 | 61.00 | **62.44** | 55.80 | 42.80 |
| MedMCQA | 62.53 | 55.32 | **72.75** | 65.27 | 65.48 | 51.42 |
| MedXQA (text) | **25.38** | 10.86 | 23.47 | 21.47 | 14.37 | 8.78 |
| MMLU-Med | 72.59 | 61.65 | **87.10** | 82.68 | 80.65 | 74.98 |
| MedBullets | **68.71** | 32.21 | 67.45 | 58.69 | 51.34 | 37.25 |

Chest X-ray report generation (MIMIC-CXR)

| Metric | MedGPT-oss-20B | OctoMed-7B | Hulu-Med-32B | Lingshu-32B | MedGemma-27B | QoQ-Med-32B |
|---|---|---|---|---|---|---|
| RadGraph-F1 | 0.189 | 0.129 | 0.267 | **0.278** | 0.095 | 0.166 |
| RaTEScore | 0.522 | 0.402 | 0.575 | **0.590** | 0.291 | 0.482 |
| 1 / RadCliQ-v1 | 0.803 | 0.615 | 0.867 | **0.881** | 0.505 | 0.748 |

Clinical in-context learning

Scores are accuracy (%) for Patient–trial matching and RaTEScore for Impression generation.

| Task | Metric | Setting | MedGPT-oss-20B | OctoMed-7B | Hulu-Med-32B | Lingshu-32B | MedGemma-27B | QoQ-Med-32B |
|---|---|---|---|---|---|---|---|---|
| Patient–trial | Acc. (%) | 0-shot | 48.81 | 40.96 | 51.01 | **52.07** | 31.03 | 45.20 |
| Patient–trial | Acc. (%) | 1-shot | **55.60** | 40.02 | 47.00 | 48.91 | 52.24 | 47.41 |
| Impression | RaTEScore | 0-shot | **47.22** | 31.04 | 43.14 | 43.80 | 38.42 | 41.44 |
| Impression | RaTEScore | 1-shot | **47.25** | 30.91 | 41.52 | 40.27 | 38.29 | 40.71 |

Notes

  • Do not use AutoProcessor. This repo does not register an image processor.
  • EOS / stop token: <|return|> (id 200002). model.chat sets it automatically.
  • postprocess=False returns the raw <|channel|>final<|message|>…<|end|> envelope — useful if you want to inspect Harmony channels yourself.
  • Continued training / fine-tuning: set model.config.llm_config.output_router_logits = True before .forward() so the MoE load-balancing aux loss is computed. The shipped config defaults this to False for inference; see the sketch below.
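
A minimal sketch of that toggle (assumes you fine-tune through the remote-code modeling class; the exact training loop and forward() arguments are up to you):

model.config.llm_config.output_router_logits = True  # router logits on, so the MoE aux loss is computed
model.train()
# ... run your usual fine-tuning loop; with the flag left at the shipped default (False),
# forward passes skip the load-balancing auxiliary loss.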

Citation

@article{zhang2026medgptoss,
  title   = {MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine},
  author  = {Zhang, Kai and Yuan, Zhengqing and Peng, Cheng and Zhao, Songlin and
             Lyu, Mengxian and Chen, Ziyi and Ye, Yanfang and Liu, Wei and
             Zhang, Ying and Smith, Kaleb E. and He, Lifang and Sun, Lichao and Wu, Yonghui},
  journal = {arXiv preprint arXiv:2603.00842},
  year    = {2026}
}

Contact

Lichao Sun (lis221@lehigh.edu) · Yonghui Wu (yonghui.wu@ufl.edu)
