Improve model card: Add pipeline tag, project page, and abstract
This PR improves the model card for MedVAL-4B by:
- Adding the `pipeline_tag: text-classification` to the metadata, which helps with model discoverability on the Hugging Face Hub.
- Including a link to the project page (`https://stanfordmimi.github.io/MedVAL/`) in the "Sources" section for comprehensive referencing.
- Adding a dedicated "Abstract" section with the paper's abstract to provide a more thorough overview of the model and its context.
README.md
CHANGED
@@ -1,31 +1,41 @@
 ---
-
+base_model:
+- Qwen/Qwen3-4B
 datasets:
 - stanfordmimi/MedVAL-Bench
 language:
 - en
 - es
+library_name: transformers
+license: mit
 metrics:
 - f1
 - accuracy
-base_model:
-- Qwen/Qwen3-4B
-library_name: transformers
 tags:
 - medical
+pipeline_tag: text-classification
 ---
 
+# MedVAL-4B
+
 **MedVAL-4B** (medical text validator) is a language model fine-tuned to **assess AI-generated medical text** outputs at near **physician-level reliability**.
 
+MedVAL is a self-supervised framework for expert-level validation of AI-generated medical text using language models. The system is designed to evaluate the accuracy and safety of AI-generated medical text across multiple medical tasks. The framework supports both model fine-tuning and evaluation.
+
 
 [](https://arxiv.org/abs/2507.03152)
 
 **Figure 1** | **MedVAL test-time workflow**. A generator LM produces an output, and MedVAL assesses the output's factual consistency with the input, while assigning a risk grade and determining its safety for deployment.
 
+## Abstract
+
+With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (a LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) codebase ( this https URL ), 2) MedVAL-Bench ( this https URL ), and 3) MedVAL-4B ( this https URL ), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
+
 # Sources
 
 - **Paper:** [Expert-level validation of AI-generated medical text with scalable language models](https://www.arxiv.org/abs/2507.03152)
 - **Code:** [GitHub](https://github.com/StanfordMIMI/MedVAL)
+- **Project Page:** [MedVAL Project Page](https://stanfordmimi.github.io/MedVAL/)
 - **Train/Test Dataset:** [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench)
 
 # Model Details
@@ -104,8 +114,10 @@ Your output fields are:
 Evaluate the output in comparison to the input and determine errors that exhibit factual inconsistency with the input.
 
 Instructions:
-- Output format: `Error 1: <brief explanation in a few words>
-
+- Output format: `Error 1: <brief explanation in a few words>
+Error 2: ...'
+- Each error must be numbered and separated by a newline character
+; do not use newline characters for anything else.
 - Return `None' if no errors are found.
 - Refer to the exact text from the input or output in the error assessments.
 
@@ -178,8 +190,10 @@ try:
 except ValueError:
     index = 0
 
-thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("
-
+thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("
+")
+content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("
+")
 
 print("thinking content:", thinking_content)
 print("content:", content)
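As rendered in the last hunk, the added `.strip("` and `")` lines split a string literal across two lines, which Python rejects as a syntax error. The argument was presumably the newline escape `"\n"`. The sketch below shows the post-processing under that assumption, together with a small parser for the `Error N: ...` output format described in the prompt instructions. The helper names `split_thinking` and `parse_errors` are hypothetical, and the end-of-thinking token id `151668` is an assumption about the Qwen3 tokenizer that the hunk does not show, so verify it against your tokenizer before relying on it.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Assumption: 151668 is the id of the "</think>" token in the Qwen3 tokenizer.
# The diff hunk does not show this value; check it against your tokenizer.
END_THINK_ID = 151668


def split_thinking(
    output_ids: Sequence[int],
    decode: Callable[[Sequence[int]], str],
    end_think_id: int = END_THINK_ID,
) -> Tuple[str, str]:
    """Split generated token ids into (thinking_content, content)."""
    ids = list(output_ids)
    try:
        # Position just past the last end-of-thinking token.
        index = len(ids) - ids[::-1].index(end_think_id)
    except ValueError:
        # No end-of-thinking token: treat everything as the final answer.
        index = 0
    # "\n" is the newline escape, kept on one line inside the literal.
    thinking_content = decode(ids[:index]).strip("\n")
    content = decode(ids[index:]).strip("\n")
    return thinking_content, content


# Matches lines such as "Error 1: <brief explanation>".
_ERROR_LINE = re.compile(r"^Error\s+\d+:\s*(.*)$")


def parse_errors(content: str) -> List[str]:
    """Return the explanation of each 'Error N: ...' line; 'None' yields []."""
    text = content.strip()
    if text in ("", "None"):
        return []
    errors = []
    for line in text.split("\n"):
        match = _ERROR_LINE.match(line.strip())
        if match:
            errors.append(match.group(1))
    return errors
```

With a loaded tokenizer, `decode` could be passed as `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)`, which mirrors the call in the README snippet. Passing the decoder as a parameter keeps the helper testable without downloading the model.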