**Figure 1** | **MedVAL test-time workflow**. A generator LM produces an output, and MedVAL assesses the output's factual consistency with the input, while assigning a risk grade and determining its safety for deployment.
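
The snippet below is a minimal sketch of this test-time workflow. It assumes MedVAL-4B loads as a standard `transformers` text-generation pipeline; the generator model name, prompt wording, and output handling are illustrative placeholders rather than the exact templates used by the codebase.

```python
# Minimal sketch of the Figure 1 workflow (illustrative prompts, not the
# official MedVAL templates -- see the codebase for the exact format).
from transformers import pipeline

# Any generator LM; this model name is a placeholder.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")
# MedVAL evaluator (assumes the checkpoint loads as a causal LM).
validator = pipeline("text-generation", model="stanfordmimi/MedVAL-4B")

clinical_input = "Findings: mild cardiomegaly; no pleural effusion or pneumothorax."

# Step 1: the generator LM produces the candidate medical text.
draft = generator(
    f"Summarize the following radiology findings for the patient:\n{clinical_input}",
    max_new_tokens=128,
    return_full_text=False,
)[0]["generated_text"]

# Step 2: MedVAL assesses factual consistency with the input, assigns a
# risk grade, and decides whether the output is safe for deployment.
assessment = validator(
    "Assess whether the output is factually consistent with the input, "
    "assign a risk grade, and state whether it is safe for deployment.\n"
    f"Input: {clinical_input}\nOutput: {draft}",
    max_new_tokens=256,
    return_full_text=False,
)[0]["generated_text"]
print(assessment)
```
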
## Abstract
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) [codebase](https://github.com/StanfordMIMI/MedVAL), 2) [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), and 3) [MedVAL-4B](https://huggingface.co/stanfordmimi/MedVAL-4B), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
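
As a quick orientation to the released artifacts, the sketch below loads MedVAL-Bench from the Hugging Face Hub. The split and column names are not specified in this README, so the code only prints whatever schema the dataset provides; consult the dataset card for the authoritative fields.

```python
# Minimal sketch: inspect the physician-annotated MedVAL-Bench dataset.
from datasets import load_dataset

bench = load_dataset("stanfordmimi/MedVAL-Bench")
print(bench)  # available splits and features

first_split = next(iter(bench))
print(bench[first_split][0])  # one annotated example (risk grade, error category, etc.)
```
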
# Sources
- **Paper:** [Toward expert-level medical text validation with language models](https://www.arxiv.org/abs/2507.03152)