MedPMC Initial Screening Model

This model is used in the initial screening stage of the MedPMC data curation pipeline. It is a text-based classifier that takes the caption of a figure from a PubMed Central article as input and predicts whether that figure is likely to be a clinically relevant medical image for downstream multimodal data curation. Other variants of the screening model additionally use the figure's inline reference text.

The model is initialized from michiyasunaga/BioLinkBERT-base.

Task

The model performs binary classification.

| Label | Meaning |
|-------|---------|
| 0 | Non-medical |
| 1 | Medical |
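
For readable output, the integer class ids listed above can be mapped to their names. This is a minimal sketch: `ID2LABEL` and `label_name` are illustrative names, not part of the released model, whose own `config.id2label` may use generic names instead.

```python
# Illustrative mapping copied from the label list in this card
# (an assumption: it is not read from the model's config).
ID2LABEL = {0: "Non-medical", 1: "Medical"}


def label_name(pred: int) -> str:
    """Return the human-readable name for a predicted class id."""
    return ID2LABEL[pred]


print(label_name(1))
```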

Input format

This repository corresponds to the caption-only version of the initial screening model. The input text consists of the figure caption alone, without the inline reference text, in the following format:

"Caption": {figure_caption}

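The format above can be wrapped in a small helper so that every caption receives the same prefix. This is a minimal sketch: `build_input` is a hypothetical name, and the whitespace collapsing is an assumption added for convenience, not a documented preprocessing step.

```python
def build_input(figure_caption: str) -> str:
    """Prefix a caption with the '"Caption": ' marker described above."""
    # Assumption: collapse line breaks and repeated spaces left over from
    # article extraction; the model card does not require this step.
    cleaned = " ".join(figure_caption.split())
    return '"Caption": ' + cleaned


print(build_input("Axial CT image showing a pulmonary nodule."))
```

The returned string can be passed to the tokenizer in place of a manually concatenated `text`.
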
Quick start

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "Yale-BIDS-Chen/medpmc-screening-biolinkbert-caption-only"

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.eval()  # inference mode: disables dropout

caption = "Axial CT image showing a pulmonary nodule in the right upper lobe."

# Wrap the caption in the input format described above.
text = '"Caption": ' + caption

# Tokenize, truncating to at most 512 tokens.
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=512,
)

# Run a forward pass without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)  # one row of two class probabilities
    pred = torch.argmax(probs, dim=-1).item()      # 0 = Non-medical, 1 = Medical

print("Prediction:", pred)
print("Probabilities:", probs.tolist())
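
Because this model serves as an initial screening step, it may be useful to act on the probability of the Medical class rather than on the argmax alone. The sketch below works on the nested list returned by `probs.tolist()`. The function name `keep_for_curation` and the threshold values are hypothetical; this card does not specify a decision threshold.

```python
def keep_for_curation(probabilities, threshold=0.5):
    """Flag each item whose Medical-class probability reaches the threshold.

    `probabilities` is a list of [p_non_medical, p_medical] pairs,
    for example the result of `probs.tolist()` in the quick start.
    """
    return [p_medical >= threshold for _, p_medical in probabilities]


# Made-up probabilities for three captions (not real model output).
example = [[0.97, 0.03], [0.40, 0.60], [0.75, 0.25]]
print(keep_for_curation(example))                 # default threshold of 0.5
print(keep_for_curation(example, threshold=0.2))  # lower threshold keeps more items
```

Lowering the threshold keeps more figures for later stages at the cost of more false positives, so any value should be chosen on held-out data.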

Model performance

MedPMC includes multiple initial screening variants depending on the input text and model backbone. The table below summarizes the performance of different screening models evaluated on the MedPMC validation set.

| Model | Input | Precision | Recall | F1 |
|-------|-------|-----------|--------|----|
| Keyword match | Caption + inline text; Keywords | 70.6 | 62.2 | 61.7 |
| Bioformer-16L | Caption only | 92.2 | 92.3 | 92.2 |
| Bioformer-16L | Caption + inline text | 93.0 | 92.9 | 92.9 |
| Bioformer-8L | Caption only | 92.3 | 92.3 | 92.3 |
| Bioformer-8L | Caption + inline text | 92.3 | 92.6 | 92.5 |
| BioLinkBERT-base (this model) | Caption only | 93.0 | 92.7 | 92.9 |
| BioLinkBERT-base | Caption + inline text | 92.9 | 93.0 | 92.9 |
| PubMedBERT-fulltext | Caption only | 92.7 | 93.1 | 92.9 |
| PubMedBERT-fulltext | Caption + inline text | 93.3 | 93.1 | 93.2 |
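
This card does not state how precision, recall, and F1 are averaged over the two classes. In the keyword-match row, F1 (61.7) is lower than both precision and recall, which cannot happen for a single class, so the scores are presumably averaged across classes. The sketch below assumes macro-averaging, which is a guess rather than the documented protocol; the function names and toy labels are illustrative only.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_average(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class precision, recall, and F1."""
    per_class = [precision_recall_f1(y_true, y_pred, c) for c in labels]
    return tuple(sum(m[i] for m in per_class) / len(per_class) for i in range(3))


# Toy labels for illustration (not MedPMC data).
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
print(macro_average(y_true, y_pred))
```

In this toy case the macro F1 is lower than both macro precision and macro recall, the same pattern seen in the keyword-match row.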