Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach
Paper: arXiv:2604.11547
This is the model for our ACL 2026 Findings paper, "Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach". MedSSR-Qwen3-8B-Base is a medical reasoning-focused LLM built from Qwen/Qwen3-8B-Base.
Quickstart with 🤗 Transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tdlhl/MedSSR-Qwen3-8B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": (
            "A 67-year-old man develops crushing substernal chest pain for 40 minutes. "
            "ECG shows ST-segment elevation in leads II, III, and aVF. "
            "Which coronary artery is most likely occluded?\n"
            "A. Left anterior descending artery\n"
            "B. Left circumflex artery\n"
            "C. Right coronary artery\n"
            "D. Posterior descending artery"
        ),
    }
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,  # required for temperature/top_p to take effect
        temperature=0.6,
        top_p=0.95,
    )

# Decode only the newly generated tokens, skipping the prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
Inference with vLLM:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="tdlhl/MedSSR-Qwen3-8B-Base",
    trust_remote_code=True,
)
sampling = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=1024,
)

prompt = (
    "A 24-year-old woman presents with fatigue, weight gain, constipation, and cold intolerance. "
    "Which of the following laboratory findings is most consistent with primary hypothyroidism?\n"
    "A. Low TSH, low free T4\n"
    "B. High TSH, low free T4\n"
    "C. High TSH, high free T4\n"
    "D. Low TSH, high free T4"
)

outputs = llm.generate([prompt], sampling_params=sampling)
print(outputs[0].outputs[0].text)
```
To reproduce evaluation settings similar to those in our paper, we follow Qwen's recommended sampling configuration:

- temperature=0.6
- top_p=0.95
- top_k=20
- max_tokens=2048
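With Transformers, these settings map onto `generate()` keyword arguments (the names differ slightly from vLLM's `SamplingParams`); a sketch of the equivalent configuration:

```python
# Qwen-recommended sampling settings, expressed as Transformers generate() kwargs.
# Note the rename: vLLM's max_tokens corresponds to max_new_tokens here.
EVAL_GENERATION_KWARGS = {
    "do_sample": True,   # required for temperature/top_p/top_k to take effect
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "max_new_tokens": 2048,
}
```

Usage: `model.generate(**inputs, **EVAL_GENERATION_KWARGS)`.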
If you find our model useful, please cite our paper:

```bibtex
@article{li2025eliciting,
  title={Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach},
  author={Li, Haolin and Jiang, Shuyang and Zhang, Ruipeng and Yao, Jiangchao and Zhang, Ya and Wang, Yanfeng},
  journal={arXiv preprint arXiv:2604.11547},
  year={2026}
}
```