MedPMC Multi-Figure Detection Model: ViT

This repository provides the vision transformer-based multi-figure detection model used in the MedPMC data curation pipeline.

The model is a binary image classifier trained to predict whether a biomedical figure is a multi-panel / compound figure or a single-panel figure. It is intended for figures from biomedical literature, particularly PubMed Central (PMC) articles.

Task

The model performs binary image classification and outputs one of two class indices:

0: single-panel figure
1: multi-panel / compound figure
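
For downstream code, the two class indices above can be kept in a small lookup table. The sketch below is illustrative only: the names `ID2LABEL` and `label_name` are hypothetical and are not part of the released checkpoint or of timm.

```python
# Hypothetical helper: maps the class indices listed above to readable names.
ID2LABEL = {
    0: "single-panel",
    1: "multi-panel / compound",
}


def label_name(pred: int) -> str:
    """Return the readable label for a predicted class index (0 or 1)."""
    if pred not in ID2LABEL:
        raise ValueError(f"unexpected class index: {pred}")
    return ID2LABEL[pred]


print(label_name(1))  # multi-panel / compound
```

The integer returned by `torch.argmax(...).item()` in the usage script can be passed to `label_name` directly.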

Usage

import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

checkpoint_path = "model.pth.tar"
image_path = "example.jpg"

device = "cuda" if torch.cuda.is_available() else "cpu"

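# Load on CPU first. PyTorch 2.6 and later default torch.load to weights_only=True;
# if loading fails because the checkpoint holds extra Python objects, pass
# weights_only=False only for a checkpoint you trust.
checkpoint = torch.load(checkpoint_path, map_location="cpu")
# The checkpoint stores the timm architecture name and the trained weights.
arch = checkpoint["arch"]
state_dict = checkpoint["state_dict"]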

# Remove DataParallel/DDP prefix if present.
state_dict = {
    k.replace("module.", "", 1) if k.startswith("module.") else k: v
    for k, v in state_dict.items()
}

# Binary classifier.
model = timm.create_model(
    arch,
    pretrained=False,
    num_classes=2,
)

model.load_state_dict(state_dict, strict=True)
model = model.to(device)
model.eval()

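# Derive the evaluation preprocessing (input size, interpolation, normalization) from the model's config.
data_config = resolve_data_config({}, model=model)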
preprocess = create_transform(
    **data_config,
    is_training=False,
)

image = Image.open(image_path).convert("RGB")
inputs = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(inputs)
    probs = torch.softmax(logits, dim=-1)
    pred = torch.argmax(probs, dim=-1).item()

print("Prediction:", pred)
print("Probabilities:", probs.cpu().tolist())

Example output:

Prediction: 1
Probabilities: [[0.08, 0.92]]

In this example, class 1 receives a probability of 0.92, so the model classifies the input image as a multi-panel / compound figure.
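
For two classes, taking the argmax amounts to thresholding the class-1 probability at 0.5. A pipeline that needs to trade false positives against false negatives could move that cut-off. The following is a minimal sketch, not the setting used in MedPMC: the function `is_multi_panel` and its `threshold` parameter are assumptions introduced for illustration, and the input is the nested list printed as `Probabilities` in the example output.

```python
def is_multi_panel(probs, threshold=0.5):
    """Decide from softmax output shaped [[p_single, p_multi]].

    `threshold` is a hypothetical knob, not a value documented for this model.
    """
    p_multi = probs[0][1]
    return p_multi > threshold


example = [[0.08, 0.92]]  # the example output shown above
print(is_multi_panel(example))                  # True
print(is_multi_panel(example, threshold=0.95))  # False
```

Any threshold other than 0.5 would need to be chosen on held-out data for the figures being processed.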

Batch Inference

import torch
import timm
from PIL import Image
from pathlib import Path
from timm.data import resolve_data_config, create_transform

checkpoint_path = "model.pth.tar"
image_dir = "sample"

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = torch.load(checkpoint_path, map_location="cpu")
arch = checkpoint["arch"]
state_dict = checkpoint["state_dict"]

state_dict = {
    k.replace("module.", "", 1) if k.startswith("module.") else k: v
    for k, v in state_dict.items()
}

model = timm.create_model(arch, pretrained=False, num_classes=2)
model.load_state_dict(state_dict, strict=True)
model = model.to(device)
model.eval()

data_config = resolve_data_config({}, model=model)
preprocess = create_transform(
    **data_config,
    is_training=False,
)

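# Collect .jpg, .jpeg, and .png files; the patterns match lowercase extensions only on case-sensitive file systems.
image_paths = sorted(
    list(Path(image_dir).glob("*.jpg")) +
    list(Path(image_dir).glob("*.jpeg")) +
    list(Path(image_dir).glob("*.png"))
)

# Each image is preprocessed and classified on its own (a batch of size 1).
for image_path in image_paths: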
    image = Image.open(image_path).convert("RGB")
    inputs = preprocess(image).unsqueeze(0).to(device)

    with torch.no_grad():
        logits = model(inputs)
        probs = torch.softmax(logits, dim=-1)
        pred = torch.argmax(probs, dim=-1).item()

    print("Image:", image_path)
    print("Prediction:", pred)
    print("Probabilities:", probs.cpu().tolist())
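
The loop above runs one forward pass per image. On larger folders it is usually faster to group several preprocessed images into a single tensor. The sketch below shows only the grouping logic in plain Python so that it stays independent of the model: `chunked` and `run_in_batches` are hypothetical helpers, and `load_fn` / `predict_fn` stand in for the preprocessing and model calls from the script above.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_in_batches(paths, load_fn, predict_fn, batch_size=32):
    """Return (path, prediction) pairs, calling `predict_fn` once per batch.

    load_fn(path)      -> one preprocessed input
    predict_fn(inputs) -> one prediction per input, in the same order
    """
    results = []
    for batch_paths in chunked(paths, batch_size):
        inputs = [load_fn(p) for p in batch_paths]
        preds = predict_fn(inputs)
        results.extend(zip(batch_paths, preds))
    return results


# Toy stand-ins so the sketch runs without a checkpoint.
demo = run_in_batches(
    ["a.png", "b.png", "c.png"],
    load_fn=len,                                # pretend "preprocessing"
    predict_fn=lambda xs: [x % 2 for x in xs],  # pretend "model"
    batch_size=2,
)
print(demo)
```

With the script above, `load_fn` could open an image and apply `preprocess`, and `predict_fn` could combine the resulting tensors with `torch.stack`, move the batch to `device`, and return `torch.argmax(model(batch), dim=-1).tolist()` inside `torch.no_grad()`. The batch size of 32 is an arbitrary default and should be adjusted to the available memory.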

License

The model is released for non-commercial research use under CC BY-NC-SA 4.0.

Citation

@article{kim2026medpmc,
  title={MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models},
  author={Kim, Hyunjae and Kim, Dain and Xiao, Pan and Applebaum, Serina S and Chung, Younjoon and Ai, Xuguang and Yin, Yu and Jiang, Roy and Du, Yuexi and Wei, Yawen and others},
  journal={arXiv preprint arXiv:2607.07673},
  year={2026}
}

Questions?

For questions or feedback, please contact Hyunjae Kim at hyunjae.kim@yale.edu.
