Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
Overview
Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To perform biomedical evidence attribution efficiently, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms its base models (+27.0% to +71.3%) on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5 while also producing high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind case study that quantifies hallucinations in LLM-generated answers under different citation instructions. The results show that the citation-format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. In a second case study, we show that Med-V1 can automatically identify high-stakes misattributions in clinical guidelines, revealing potentially negative public health impacts that are otherwise difficult to detect at scale. In conclusion, Med-V1 provides an efficient, accurate, and lightweight alternative to frontier LLMs for practical, real-world biomedical evidence attribution and verification.
Use Med-V1
Please note that Med-V1 classifies only whether an assertion is supported by a given source, not whether the assertion is factually valid. For example, a "true" claim can still be refuted by an article that shows conflicting data from a small-scale study, and a "false" claim can still be supported by an article that discusses a potential biological mechanism. As such, Med-V1's predictions of support and refutation depend entirely on the provided source evidence and should not be interpreted as universal factuality labels. Like all AI models, Med-V1's output can contain inaccuracies and does not reflect the views of the authors or their employers.
Prerequisites
- Python 3.8+ (3.11.7 was used in this work)
- torch>=2.1.0 (latest version is recommended)
- transformers>=4.51.0 (latest version is recommended)
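Assuming a standard `pip` setup (not prescribed by the authors; adapt to your environment), the dependencies can be installed with:

```shell
pip install "torch>=2.1.0" "transformers>=4.51.0"
```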
Essentially, Med-V1 verifies an assertion against a source. The assertion can be a claim about the effectiveness of a treatment, and in this case, the source can be the PubMed abstract reporting the clinical trial that tests the treatment.
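The assertion and source are passed to the model as a single user prompt in the `Article:`/`Claim:` template used throughout this tutorial. A minimal sketch of building that prompt (the helper name `build_user_prompt` is our own, not part of the Med-V1 release):

```python
def build_user_prompt(source: str, assertion: str) -> str:
    """Format a source article and an assertion into Med-V1's expected user prompt."""
    return f"Article:\n{source}\n\nClaim:\n{assertion}"

# Example: verify a treatment-effectiveness claim against a trial abstract.
prompt = build_user_prompt(
    "Postoperative AF was significantly lower in the Statin group (16% vs 33%).",
    "Preoperative statins reduce atrial fibrillation after CABG.",
)
print(prompt)
```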
There are two variants of Med-V1:
- Med-V1-L3B, which is Llama-3.2-3B-Instruct fine-tuned with MedFact-Synth
- Med-V1-Q3B, which is Qwen2.5-3B-Instruct fine-tuned with MedFact-Synth
They perform similarly in our evaluations, and the demonstrations below use Med-V1-L3B.
Quick start
Here is a self-contained code snippet for running Med-V1. You can also try it in Google Colab. On a modern GPU (e.g., an NVIDIA A100), the demonstration run should finish in a few seconds once the model has been downloaded.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
model_path = "ncbi/Med-V1-L3B"
# 1. Load the Med-V1(-L3B) model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    cache_dir="./med_v1_model",  # change it accordingly
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Ensure the pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
# 2. Initialize the pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
# 3. Prepare the messages
# The official system prompt of Med-V1.
medv1_system_prompt = """You are a fact-checking expert trained in evidence-based medicine. Your task is to evaluate how strongly an *article* agrees or disagrees with a *claim*. The *article* is retrieved from a search engine using the *claim* as the query.
Use the following five-point scale:
- **-2 Strong Contradiction** – The article clearly and directly refutes the claim.
- **-1 Partial Contradiction** – The article provides mixed or indirect evidence against the claim.
- ** 0 Neutral / Unrelated** – The article does not address the claim, offers insufficient information, or is irrelevant to the claim.
- ** 1 Partial Agreement** – The article offers some indirect or tentative support for the claim.
- ** 2 Strong Agreement** – The article explicitly and strongly supports the claim.
Note that the *article* might not describe the exact same subjects, interventions, or measurements as the *claim*. In this case, please note the difference and assign a score of 0.
Output in two parts only and do not output anything else:
<think>[your detailed, step-by-step explanation for scoring]</think>
<score>[the integer score only, i.e., -2, -1, 0, 1, or 2]</score>"""
# Put your custom source and assertion into this syntax: f"Article:\n{source}\n\nClaim:\n{assertion}"
medv1_user_prompt = """Article:
Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?
Objective: Recent studies have demonstrated that statins have pleiotropic effects, including anti-inflammatory effects and atrial fibrillation (AF) preventive effects. The objective of this study was to assess the efficacy of preoperative statin therapy in preventing AF after coronary artery bypass grafting (CABG).
Methods: 221 patients underwent CABG in our hospital from 2004 to 2007. 14 patients with preoperative AF and 4 patients with concomitant valve surgery were excluded from this study. Patients were divided into two groups to examine the influence of statins: those with preoperative statin therapy (Statin group, n = 77) and those without it (Non-statin group, n = 126). In addition, patients were divided into two groups to determine the independent predictors for postoperative AF: those with postoperative AF (AF group, n = 54) and those without it (Non-AF group, n = 149). Patient data were collected and analyzed retrospectively.
Results: The overall incidence of postoperative AF was 26%. Postoperative AF was significantly lower in the Statin group compared with the Non-statin group (16% versus 33%, p = 0.005). Multivariate analysis demonstrated that independent predictors of AF development after CABG were preoperative statin therapy (odds ratio [OR] 0.327, 95% confidence interval [CI] 0.107 to 0.998, p = 0.05) and age (OR 1.058, 95% CI 1.004 to 1.116, p = 0.035).
Conclusion: Our study indicated that preoperative statin therapy seems to reduce AF development after CABG.
Claim:
Preoperative statins reduce atrial fibrillation after coronary artery bypass grafting."""
messages = [
    {"role": "system", "content": medv1_system_prompt},
    {"role": "user", "content": medv1_user_prompt},
]
# 4. Run the inference
print("Generating response...")
with torch.no_grad():
    completions = generator(
        messages,
        do_sample=False,  # Greedy decoding for deterministic results
        max_new_tokens=1024,
        temperature=None,
        top_p=None,
    )
# 5. Extract and Print Results
raw_output = completions[0]["generated_text"][-1]["content"]
print(raw_output)
# Expected output:
# <think>The article directly investigates the relationship between preoperative statin therapy and the incidence of atrial fibrillation (AF) after coronary artery bypass grafting (CABG). The results presented in the article show that the incidence of postoperative AF is significantly lower in patients who received preoperative statin therapy compared to those who did not (16% vs. 33%, p = 0.005). Furthermore, the multivariate analysis identifies preoperative statin therapy as an independent predictor of reduced AF development after CABG (odds ratio 0.327, p = 0.05). This strong evidence supports the claim that preoperative statins reduce atrial fibrillation after CABG. Therefore, the article explicitly and strongly supports the claim. Given this analysis, I would assign a score of 2 for strong agreement.</think>
# <score>2</score>
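For downstream use, you typically want the integer score rather than the raw text. A minimal sketch of parsing Med-V1's `<think>`/`<score>` output format (the helper name `parse_medv1_output` is our own, not part of the Med-V1 release):

```python
import re

def parse_medv1_output(raw_output: str):
    """Split Med-V1's raw completion into its explanation and integer score.

    Returns (explanation, score); score is None if no <score> tag is found.
    """
    think = re.search(r"<think>(.*?)</think>", raw_output, re.DOTALL)
    score = re.search(r"<score>\s*(-?\d)\s*</score>", raw_output)
    explanation = think.group(1).strip() if think else ""
    return explanation, int(score.group(1)) if score else None

example = "<think>The article strongly supports the claim.</think>\n<score>2</score>"
print(parse_medv1_output(example))  # → ('The article strongly supports the claim.', 2)
```

Because greedy decoding can occasionally produce malformed tags, checking for a `None` score before using the result is a sensible safeguard.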
Acknowledgments
This research was supported by the Intramural Research Program of the National Institutes of Health (NIH). The contributions of the NIH author(s) are considered Works of the United States Government. This research was also partially supported by the NIH Pathway to Independence Award K99LM014903 (Q.J.), as well as R01LM014344 (Y.P.) and R01LM014573 (Y.P.). The findings and conclusions presented in this paper are those of the author(s) and do not necessarily reflect the views of the NIH or the U.S. Department of Health and Human Services.
Disclaimer
This tutorial shows the results of research conducted in the Division of Intramural Research, National Library of Medicine, NIH. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tutorial. If you have questions about the information produced on this website, please see a health care professional. More information about NLM's disclaimer policy is available.
Citation
If you find this repo helpful, please cite Med-V1 by:
@article{jin2026medv1,
  title={Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution},
  author={Jin, Qiao and Fang, Yin and He, Lauren and Yang, Yifan and Xiong, Guangzhi and Wang, Zhizheng and Wan, Nicholas and Chan, Joey and Comeau, Donald C. and Leaman, Robert and Floudas, Charalampos S. and Zhang, Aidong and Chiang, Michael F. and Peng, Yifan and Lu, Zhiyong},
  year={2026}
}