**Figure 1** | **MedVAL test-time workflow**. A generator LM produces an output, and MedVAL assesses the output's factual consistency with the input, while assigning a risk grade and determining its safety for deployment.
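
The snippet below is a minimal sketch of this test-time workflow. It assumes MedVAL-4B loads as a standard `transformers` text-generation pipeline; the generator model name, prompt wording, and output handling are illustrative placeholders rather than the exact templates used by the codebase.

```python
# Minimal sketch of the Figure 1 workflow (illustrative prompts, not the
# official MedVAL templates -- see the codebase for the exact format).
from transformers import pipeline

# Any generator LM; this model name is a placeholder.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")
# MedVAL evaluator (assumes the checkpoint loads as a causal LM).
validator = pipeline("text-generation", model="stanfordmimi/MedVAL-4B")

clinical_input = "Findings: mild cardiomegaly; no pleural effusion or pneumothorax."

# Step 1: the generator LM produces the candidate medical text.
draft = generator(
    f"Summarize the following radiology findings for the patient:\n{clinical_input}",
    max_new_tokens=128,
    return_full_text=False,
)[0]["generated_text"]

# Step 2: MedVAL assesses factual consistency with the input, assigns a
# risk grade, and decides whether the output is safe for deployment.
assessment = validator(
    "Assess whether the output is factually consistent with the input, "
    "assign a risk grade, and state whether it is safe for deployment.\n"
    f"Input: {clinical_input}\nOutput: {draft}",
    max_new_tokens=256,
    return_full_text=False,
)[0]["generated_text"]
print(assessment)
```
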
## Abstract
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) [codebase](https://github.com/StanfordMIMI/MedVAL), 2) [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), and 3) [MedVAL-4B](https://huggingface.co/stanfordmimi/MedVAL-4B), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
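
As a quick orientation to the released artifacts, the sketch below loads MedVAL-Bench from the Hugging Face Hub. The split and column names are not specified in this README, so the code only prints whatever schema the dataset provides; consult the dataset card for the authoritative fields.

```python
# Minimal sketch: inspect the physician-annotated MedVAL-Bench dataset.
from datasets import load_dataset

bench = load_dataset("stanfordmimi/MedVAL-Bench")
print(bench)  # available splits and features

first_split = next(iter(bench))
print(bench[first_split][0])  # one annotated example (risk grade, error category, etc.)
```
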
# Sources
- **Paper:** [Toward expert-level medical text validation with language models](https://www.arxiv.org/abs/2507.03152)