MedLayEval

MedLayEval is a distilled multimodal evaluator for medical lay-language generation. Given a triple (medical image, expert caption, candidate lay caption), it returns five attribute scores in [0, 1] plus their mean overall score, and serves as the headline metric for the MedLayXPlain benchmark.

The model is a Qwen2.5-VL-3B-Instruct backbone with LoRA adapters and a small attention-mask-pooled regression head trained by distillation from a stronger judge.

The five attributes

Attribute     What it scores
-----------   ------------------------------------------------------------------
modality      Correctly identifies the imaging modality (CT, MRI, histology, ...)
anatomy       Correctly identifies the depicted anatomy / region
finding       Correctly conveys the radiological / pathological finding
factual       Factually consistent with the expert caption and image
readability   Written in patient-facing lay language (no jargon)

The overall score reported on the MedLayXPlain leaderboard is the mean of the five attributes.
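As a minimal sketch (attribute names as listed in the table above, scores hypothetical), the overall score is the unweighted mean of the five attribute scores:

```python
# Hypothetical attribute scores for one candidate lay caption;
# the attribute order matches the table above.
ATTRS = ["modality", "anatomy", "finding", "factual", "readability"]
scores = {"modality": 0.92, "anatomy": 0.88, "finding": 0.75,
          "factual": 0.81, "readability": 0.64}

# Overall = unweighted mean of the five attribute scores.
overall = sum(scores[a] for a in ATTRS) / len(ATTRS)
print(round(overall, 3))  # -> 0.8
```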

Files

adapter_config.json
adapter_model.safetensors     # PEFT LoRA, r=16, alpha=32, dropout=0.05
                              # targets q/k/v/o + gate/up/down projections
regression_head.pt            # 2-layer MLP head: 2048 -> 256 -> 5 (+ Sigmoid)
config.json                   # base model config (Qwen2.5-VL-3B-Instruct)
generation_config.json
preprocessor_config.json
video_preprocessor_config.json
tokenizer.json, tokenizer_config.json, vocab.json, merges.txt
added_tokens.json, special_tokens_map.json, chat_template.jinja
model.py                      # VLMRegressor module (importable)
inference_example.py          # minimal usage example

The base model weights are not redistributed; adapter_config.json points at Qwen/Qwen2.5-VL-3B-Instruct, which is fetched from the Hub at load time. Users must accept the Qwen license for the base weights separately; the LoRA + head weights in this repo are released under Apache 2.0.

Quick start

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

from model import VLMRegressor, ATTRS  # shipped in this repo

BASE = "Qwen/Qwen2.5-VL-3B-Instruct"
CKPT = "."  # this repo, after `huggingface_hub.snapshot_download`

device = "cuda:0"
processor = AutoProcessor.from_pretrained(BASE, max_pixels=448 * 448)
if processor.tokenizer.pad_token is None:
    processor.tokenizer.pad_token = processor.tokenizer.eos_token

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, attn_implementation="sdpa",
)
# Merge the LoRA adapters into the base weights.
vlm = PeftModel.from_pretrained(base, CKPT).merge_and_unload()
hidden = getattr(vlm.config, "hidden_size", None) or vlm.config.text_config.hidden_size

# Attach the regression head; keep it in float32 for numerical stability.
model = VLMRegressor(vlm, hidden).to(device, dtype=torch.bfloat16)
model.head.load_state_dict(torch.load(f"{CKPT}/regression_head.pt", map_location=device))
model.head = model.head.to(device, dtype=torch.float32)
model.eval()

image = Image.open("example.png").convert("RGB")
expert = "Axial chest CT showing a 1.2 cm spiculated nodule in the right upper lobe ..."
lay    = "The scan shows a small spot in the upper part of the right lung that ..."
user_text = f"<expert>{expert[:1500]}</expert>\n<lay>{lay[:1500]}</lay>"

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": user_text},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
inputs = processor(text=[text], images=[image], padding=True, truncation=True,
                   max_length=2048, return_tensors="pt").to(device)

with torch.no_grad():
    scores = model(**inputs).cpu().float().numpy()[0]

print({a: float(s) for a, s in zip(ATTRS, scores)})
print("overall:", float(scores.mean()))

inference_example.py runs the same flow end-to-end on dummy inputs.

Training (high level)

  • Base: Qwen/Qwen2.5-VL-3B-Instruct.
  • Adapters: LoRA r=16, alpha=32, dropout=0.05, on q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
  • Head: Linear(2048, 256) -> GELU -> Dropout(0.1) -> Linear(256, 5) -> Sigmoid, applied to the attention-mask-pooled last hidden state.
  • Loss: MSE against the five attribute scores produced by a larger judge on the MedLayXPlain training partition.
  • Inference: no text generation; a single forward pass reads the pooled hidden state.
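A minimal sketch of the pooling and head described above (variable names are hypothetical; the shipped model.py is authoritative):

```python
import torch
import torch.nn as nn

HIDDEN, NUM_ATTRS = 2048, 5

# 2-layer MLP head matching the description: 2048 -> 256 -> 5, sigmoid output.
head = nn.Sequential(
    nn.Linear(HIDDEN, 256),
    nn.GELU(),
    nn.Dropout(0.1),
    nn.Linear(256, NUM_ATTRS),
    nn.Sigmoid(),
)
head.eval()

def masked_mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the last hidden state over non-padding positions only."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)   # (B, T, 1)
    summed = (last_hidden * mask).sum(dim=1)                    # (B, H)
    counts = mask.sum(dim=1).clamp(min=1)                       # (B, 1)
    return summed / counts

# Dummy forward: batch of 2 sequences of 7 tokens, last 3 of the second padded.
hidden_states = torch.randn(2, 7, HIDDEN)
attn = torch.ones(2, 7, dtype=torch.long)
attn[1, 4:] = 0
scores = head(masked_mean_pool(hidden_states, attn))  # shape (2, 5), each value in (0, 1)
```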

Full data construction, training, and validation details are in the appendix of the submission and in the public MedLayXPlain repository.

Intended use & limitations

  • Intended: as an automatic, ground-truth-free metric for ranking lay-language outputs of medical VLMs that already have a matched expert caption available.
  • Not intended: as a clinical assessment tool. The five attributes measure agreement with expert text, not real-world clinical correctness or patient outcomes.
  • The model is trained on captions paired with images from MedTrinity-25M; out-of-distribution modalities or text styles may degrade scores.
  • Calibration: scores are not probabilities. They are bounded to [0, 1] by the final sigmoid, but it is the ranking, not the absolute value, that has been validated.
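Since only the ranking has been validated, compare systems by ordering candidates on the overall score rather than thresholding absolute values. A sketch with hypothetical scores:

```python
# Hypothetical overall scores for three candidate lay captions of one image.
candidates = {
    "model_a": 0.71,
    "model_b": 0.64,
    "model_c": 0.78,
}

# Rank candidates best-first; the ordering, not the raw values, is meaningful.
ranking = sorted(candidates, key=candidates.get, reverse=True)
print(ranking)  # -> ['model_c', 'model_a', 'model_b']
```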

Citation

@inproceedings{anonymous2026medlayxplain,
  title  = {MedLayXPlain: A Benchmark and Distilled Evaluator for Medical Lay-Language Generation},
  author = {Anonymous},
  booktitle = {NeurIPS Datasets and Benchmarks},
  year   = {2026}
}

License

  • LoRA adapter weights (adapter_*), regression head (regression_head.pt), model.py, and inference_example.py: Apache 2.0.
  • Tokenizer / processor files are copied from the base model and remain under the Qwen license.
  • Use of this checkpoint requires the base model Qwen/Qwen2.5-VL-3B-Instruct, which has its own license. Users are responsible for accepting it separately.