---
license: cc-by-nc-nd-4.0
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- Pathology
- arxiv:2505.11404
extra_gated_prompt: >-
  The Patho-R1-7B model and its associated materials are released under the CC-BY-NC-ND 4.0 license.
  Access is restricted to non-commercial, academic research purposes only, with proper citation required.
  Any commercial usage, redistribution, or derivative work (including training models based on this model or generating datasets from its outputs)
  is strictly prohibited without prior written approval.

  Users must register with an official institutional email address (generic domains such as @gmail, @qq, @hotmail, etc. will not be accepted).
  By requesting access, you confirm that your information is accurate and current, and that you agree to comply with all terms listed herein.
  If other members of your organization wish to use the model, they must register independently and agree to the same terms.

extra_gated_fields:
  Full name (first and last): text
  Institutional affiliation (no abbreviations): text
  Role/Position:
    type: select
    options:
      - Faculty/Principal Investigator
      - PhD Student
      - Postdoctoral Researcher
      - Research Staff
      - Other
  Official institutional email (**must match your Hugging Face primary email; generic domains will be denied**): text
  Intended research use (be specific): text
  I agree to use this model only for non-commercial academic purposes: checkbox
  I agree not to redistribute this model or share it outside of my individual usage: checkbox
  I confirm that all submitted information is accurate and up to date: checkbox
---

# Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner

\[[arXiv](https://arxiv.org/abs/2505.11404)\] | \[[GitHub Repo](https://github.com/Wenchuan-Zhang/Patho-R1)\] | \[[Cite](#citation❤️)\]

## Introduction📝

While vision-language models have shown impressive progress in general medical domains, pathology remains a challenging subfield due to its high-resolution image requirements and complex diagnostic reasoning.

To address this gap, we introduce **Patho-R1-7B**, a multimodal pathology reasoner designed to enhance diagnostic understanding through structured reasoning. **Patho-R1-7B** is trained using a three-stage pipeline:

1. *Continued pretraining* on **3.5M pathology figure-caption pairs** for domain knowledge acquisition
2. *Supervised fine-tuning* on **500k expert-annotated Chain-of-Thought samples** to encourage reasoning
3. *Reinforcement learning* with **Group Relative Policy Optimization** to refine response quality

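The third stage uses Group Relative Policy Optimization (GRPO). As commonly formulated, GRPO samples a group of responses for each prompt, scores each one with a reward, and normalizes the rewards within the group to obtain advantages, so no separate value model is required. The sketch below illustrates only that normalization step. The function name, the made-up reward values, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; none of them is taken from the Patho-R1 training code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std of the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up rewards for four sampled responses to the same question,
# e.g. 1.0 for a correct final answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct responses get positive advantages, incorrect ones negative
```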
Experimental results show that **Patho-R1-7B** achieves strong performance across key pathology tasks, including **multiple-choice questions** and **visual question answering**, highlighting its potential for real-world pathology AI applications.

### Quickstart🏃

The snippet below shows how to use the chat model with `transformers` and `qwen_vl_utils`:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and its processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "WenchuanZhang/Patho-R1-7B",
    torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("WenchuanZhang/Patho-R1-7B")

# Example question from the PathMMU test set
# Ground truth: D
# Reasoning style options (choose one system prompt):
# - Chain-of-Draft (COD), a concise reasoning prompting strategy:
#   You are a pathology expert, your task is to think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator. Use the following format:<think> Your step-by-step reasoning </think><answer> Your final answer </answer>
# - Chain-of-Thought (COT), used in the messages below:
messages = [
    {
        "role": "system",
        "content": "You are a pathology expert, your task is to answer question step by step. Use the following format:<think> Your step-by-step reasoning </think><answer> Your final answer </answer>",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./images/example.jpg",
            },
            {"type": "text", "text": "What feature in the provided micrograph is indicative of chronic inflammation? /n A. Granuloma formation /n B. Multinucleated giant cells /n C. Neutrophilic infiltration /n D. Plasma cells with eccentrically placed nuclei"},
        ],
    },
]

# Preparation for inference: render the chat template and collect the image inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then strip the prompt tokens from each output sequence
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
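The Quickstart comments list two system prompts, Chain-of-Draft (COD) and Chain-of-Thought (COT). One way to switch between them is sketched here. `SYSTEM_PROMPTS` and `build_messages` are hypothetical helper names that belong neither to `transformers` nor to the Patho-R1 repository; the prompt strings are copied verbatim from the Quickstart snippet, and the returned list has the same shape as its `messages` variable.

```python
# Hypothetical helpers for illustration; not part of any library.
SYSTEM_PROMPTS = {
    # Chain-of-Draft: concise reasoning, at most 5 words per step
    "cod": (
        "You are a pathology expert, your task is to think step by step, "
        "but only keep a minimum draft for each thinking step, with 5 words at most. "
        "Return the answer at the end of the response after a separator. "
        "Use the following format:<think> Your step-by-step reasoning </think>"
        "<answer> Your final answer </answer>"
    ),
    # Chain-of-Thought: full step-by-step reasoning
    "cot": (
        "You are a pathology expert, your task is to answer question step by step. "
        "Use the following format:<think> Your step-by-step reasoning </think>"
        "<answer> Your final answer </answer>"
    ),
}

def build_messages(image_path, question, style="cot"):
    """Build a chat message list with the chosen reasoning-style system prompt."""
    if style not in SYSTEM_PROMPTS:
        raise ValueError(f"style must be one of {sorted(SYSTEM_PROMPTS)}")
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[style]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        },
    ]

# Same image path as the Quickstart example, with the concise COD prompt
messages = build_messages(
    "./images/example.jpg",
    "What feature in the provided micrograph is indicative of chronic inflammation?",
    style="cod",
)
```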
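Both system prompts request the `<think> ... </think><answer> ... </answer>` format, so the decoded text can be split into reasoning and final answer with a regular expression. The following is a minimal sketch that assumes the model follows that format. `parse_response` is a hypothetical helper, the sample string is made up rather than real model output, and a missing section is returned as `None` instead of being guessed.

```python
import re

def parse_response(text):
    """Split a response into (reasoning, answer); a missing part is None."""
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    reasoning = think.group(1).strip() if think else None
    final = answer.group(1).strip() if answer else None
    return reasoning, final

# Made-up response string for illustration only (not real model output)
sample = "<think> Plasma cells mark chronic inflammation. </think><answer> D </answer>"
reasoning, final = parse_response(sample)
print(final)  # D
```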
## Acknowledgements🎖

We gratefully acknowledge the contributions of the open-source community, particularly the following projects, which laid the foundation for various components of this work:

- [Qwen](https://github.com/QwenLM) for providing powerful vision-language models that significantly advanced our multimodal understanding and generation capabilities.
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO) for document layout detection.
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) for comprehensive optical character recognition.
- [ModelScope Swift](https://github.com/modelscope/ms-swift) for efficient model serving and deployment tools.
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for robust LLM training and fine-tuning pipelines.
- [VERL](https://github.com/volcengine/verl) for valuable visual-language pretraining resources.
- [DeepSeek](https://github.com/deepseek-ai) for high-quality models and infrastructure supporting text understanding.

We thank the authors and contributors of these repositories for their dedication and impactful work, which made our development of Patho-R1-7B possible.

## Citation❤️

If you find our work helpful, a citation would be greatly appreciated:

```
@article{zhang2025patho,
  title={Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner},
  author={Zhang, Wenchuan and Zhang, Penghao and Guo, Jingru and Cheng, Tao and Chen, Jie and Zhang, Shuwan and Zhang, Zhang and Yi, Yuhao and Bu, Hong},
  journal={arXiv preprint arXiv:2505.11404},
  year={2025}
}
```