Instructions to use stanfordmimi/MedVAL-4B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use stanfordmimi/MedVAL-4B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="stanfordmimi/MedVAL-4B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("stanfordmimi/MedVAL-4B")
model = AutoModelForCausalLM.from_pretrained("stanfordmimi/MedVAL-4B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use stanfordmimi/MedVAL-4B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "stanfordmimi/MedVAL-4B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stanfordmimi/MedVAL-4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
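
The same request can also be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `send` are hypothetical and are not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, model="stanfordmimi/MedVAL-4B",
                       base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the parsed JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (uncomment once the server is running):
# reply = send(build_chat_request("What is the capital of France?"))
# print(reply["choices"][0]["message"]["content"])
```

Passing a different `base_url` (for example one ending in port 30000) should let the same helper target another OpenAI-compatible server.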

Use Docker

docker model run hf.co/stanfordmimi/MedVAL-4B

SGLang

How to use stanfordmimi/MedVAL-4B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "stanfordmimi/MedVAL-4B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stanfordmimi/MedVAL-4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "stanfordmimi/MedVAL-4B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stanfordmimi/MedVAL-4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use stanfordmimi/MedVAL-4B with Docker Model Runner:
```
docker model run hf.co/stanfordmimi/MedVAL-4B
```

nielsr (HF Staff) committed on Jul 15, 2025

Commit 4e430ff (verified) · 1 Parent(s): 5c24e23

Improve model card: Add pipeline tag, project page, and abstract


This PR improves the model card for MedVAL-4B by:
- Adding the `pipeline_tag: text-classification` to the metadata, which helps with model discoverability on the Hugging Face Hub.
- Including a link to the project page (`https://stanfordmimi.github.io/MedVAL/`) in the "Sources" section for comprehensive referencing.
- Adding a dedicated "Abstract" section with the paper's abstract to provide a more thorough overview of the model and its context.

Files changed (1)

README.md +22 -8

@@ -1,31 +1,41 @@
 ---
-license: mit
 datasets:
 - stanfordmimi/MedVAL-Bench
 language:
 - en
 - es
 metrics:
 - f1
 - accuracy
-base_model:
-- Qwen/Qwen3-4B
-library_name: transformers
 tags:
 - medical
 ---
 **MedVAL-4B** (medical text validator) is a language model fine-tuned to **assess AI-generated medical text** outputs at near **physician-level reliability**.
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bac7c5e38420aaba8ea197/hBt_BPI6PeW_lv-HbCHE6.png)
 [![arXiv](https://img.shields.io/badge/arXiv-2507.03152-b31b1b.svg?style=for-the-badge)](https://arxiv.org/abs/2507.03152)
 **Figure 1** | **MedVAL test-time workflow**. A generator LM produces an output, and MedVAL assesses the output's factual consistency with the input, while assigning a risk grade and determining its safety for deployment.
 # Sources
 - **Paper:** [Expert-level validation of AI-generated medical text with scalable language models](https://www.arxiv.org/abs/2507.03152)
 - **Code:** [GitHub](https://github.com/StanfordMIMI/MedVAL)
 - **Train/Test Dataset:** [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench)
 # Model Details
@@ -104,8 +114,10 @@ Your output fields are:
     Evaluate the output in comparison to the input and determine errors that exhibit factual inconsistency with the input.
     Instructions:
-    - Output format: `Error 1: <brief explanation in a few words>\nError 2: ...'
-    - Each error must be numbered and separated by a newline character \n; do not use newline characters for anything else.
     - Return `None' if no errors are found.
     - Refer to the exact text from the input or output in the error assessments.
@@ -178,8 +190,10 @@ try:
 except ValueError:
     index = 0
-thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
-content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
 print("thinking content:", thinking_content)
 print("content:", content)

 ---
+base_model:
+- Qwen/Qwen3-4B
 datasets:
 - stanfordmimi/MedVAL-Bench
 language:
 - en
 - es
+library_name: transformers
+license: mit
 metrics:
 - f1
 - accuracy
 tags:
 - medical
+pipeline_tag: text-classification
 ---
+# MedVAL-4B
 **MedVAL-4B** (medical text validator) is a language model fine-tuned to **assess AI-generated medical text** outputs at near **physician-level reliability**.
+MedVAL is a self-supervised framework for expert-level validation of AI-generated medical text using language models. The system is designed to evaluate the accuracy and safety of AI-generated medical text across multiple medical tasks. The framework supports both model fine-tuning and evaluation.
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64bac7c5e38420aaba8ea197/hBt_BPI6PeW_lv-HbCHE6.png)
 [![arXiv](https://img.shields.io/badge/arXiv-2507.03152-b31b1b.svg?style=for-the-badge)](https://arxiv.org/abs/2507.03152)
 **Figure 1** | **MedVAL test-time workflow**. A generator LM produces an output, and MedVAL assesses the output's factual consistency with the input, while assigning a risk grade and determining its safety for deployment.
+## Abstract
+With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), and 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
 # Sources
 - **Paper:** [Expert-level validation of AI-generated medical text with scalable language models](https://www.arxiv.org/abs/2507.03152)
 - **Code:** [GitHub](https://github.com/StanfordMIMI/MedVAL)
+- **Project Page:** [MedVAL Project Page](https://stanfordmimi.github.io/MedVAL/)
 - **Train/Test Dataset:** [MedVAL-Bench](https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench)
 # Model Details
     Evaluate the output in comparison to the input and determine errors that exhibit factual inconsistency with the input.
     Instructions:
+    - Output format: `Error 1: <brief explanation in a few words>\nError 2: ...'
+    - Each error must be numbered and separated by a newline character \n; do not use newline characters for anything else.
     - Return `None' if no errors are found.
     - Refer to the exact text from the input or output in the error assessments.
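
The output format described in these instructions (numbered `Error N:` lines, or `None` when nothing is found) is straightforward to parse downstream. The following is a minimal sketch, not part of the model card; `parse_errors` is a hypothetical helper, and the sample strings in the usage comment are invented for illustration.

```python
import re

# One error per line, e.g. "Error 2: <brief explanation>".
ERROR_LINE = re.compile(r"^Error\s+(\d+):\s*(.*)$")


def parse_errors(raw):
    """Return a list of (number, explanation) tuples from the error field.

    An empty string or the literal None (optionally wrapped in quote
    characters) is treated as "no errors" and yields an empty list.
    """
    text = raw.strip()
    if not text or text.strip("`'\"") == "None":
        return []
    errors = []
    for line in text.split("\n"):
        match = ERROR_LINE.match(line.strip())
        if match:
            errors.append((int(match.group(1)), match.group(2).strip()))
    return errors


# Example usage with invented text:
# parse_errors("Error 1: wrong dose\nError 2: omitted allergy")
```

Lines that do not match the `Error N:` pattern are skipped silently in this sketch; a stricter consumer might prefer to raise on them.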
 except ValueError:
     index = 0
+thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
+content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
 print("thinking content:", thinking_content)
 print("content:", content)
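
The body of the `try` block is not shown in this diff, so how `index` is computed is an assumption here. One common approach is to split the generated token IDs just after the last occurrence of an end-of-thinking marker token and fall back to `index = 0` when the marker is absent. The sketch below illustrates that idea on plain Python lists; `split_at_last` and `marker_id` are hypothetical names.

```python
def split_at_last(token_ids, marker_id):
    """Split token_ids just after the last occurrence of marker_id.

    Returns (before, after). When marker_id is absent, before is empty
    and after holds every token, mirroring the index = 0 fallback.
    """
    try:
        # Search the reversed list to locate the last occurrence.
        index = len(token_ids) - token_ids[::-1].index(marker_id)
    except ValueError:
        index = 0
    return token_ids[:index], token_ids[index:]


# Example usage with made-up IDs, where 9 stands in for the marker:
# thinking_ids, content_ids = split_at_last([1, 2, 9, 3, 4], 9)
```

In practice the marker ID would be looked up from the tokenizer (for instance with `tokenizer.convert_tokens_to_ids`) rather than hard-coded, and each half would then be decoded as in the lines above.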