---
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- fundus
- ophthalmology
- retinal-imaging
- medical-vlm
- multimodal-large-language-model
- reasoning
- rag
- rlvr
- qwen2.5-vl
- safetensors
arxiv: "2604.08322"
---

# Fundus-R1

Fundus-R1 is a fundus-reading multimodal large language model introduced in the paper:

**Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data**  
Yuchuan Deng, Qijie Wei, Kaiheng Qian, Jiazhen Liu, Zijie Xin, Bangxiang Lan, Jingyu Liu, Jianfeng Dong, Xirong Li  
Paper: https://arxiv.org/abs/2604.08322

Fundus-R1 is designed for fundus image understanding, including color fundus photography (CFP), optical coherence tomography (OCT), and ultra-widefield fundus imaging (UWF). The model is trained using publicly available data and aims to improve knowledge-aware reasoning for retinal image analysis.

## Model Variants

| Model | Repository |
|---|---|
| Fundus-R1-3B | `Kimokcheon/Fundus-R1-3B` |
| Fundus-R1-7B | `Kimokcheon/Fundus-R1-7B` |

This model card is shared by both released Fundus-R1 checkpoints. Choose the checkpoint size that fits your compute budget and deployment requirements.

## Method Overview

Fundus-R1 addresses the difficulty of training fundus-reading MLLMs without private clinical-report data. According to the paper, the model is trained exclusively on public datasets, where most samples contain only image-level labels rather than detailed diagnostic reports.

The training pipeline contains two key components:

1. **Knowledge-aware reasoning trace construction.** A retrieval-augmented generation (RAG) procedure is used to compose image-specific reasoning traces that connect visual findings to image labels through ophthalmic knowledge.
2. **Reasoning-enhanced RLVR.** Reinforcement learning with verifiable rewards (RLVR) is enhanced with a process reward that encourages self-consistency in the generated reasoning trace.
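
To make the second component more concrete, the following is a minimal sketch of how a verifiable outcome reward could be combined with a process reward that checks the reasoning trace against the final answer. It is an illustration only, not the reward used in the paper: the `<think>`/`<answer>` tags, the substring-based consistency check, and the `process_weight` value are all assumptions introduced here.

```python
import re

# Hypothetical output format: these tags are an assumption for this sketch,
# not a format confirmed by this model card.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_response(response: str) -> tuple[str, str]:
    """Split a response into (reasoning trace, final answer); missing parts are ''."""
    think = THINK_RE.search(response)
    answer = ANSWER_RE.search(response)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else "",
    )


def outcome_reward(answer: str, label: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches the image label."""
    return 1.0 if answer.lower() == label.lower() else 0.0


def consistency_reward(trace: str, answer: str) -> float:
    """Toy process reward: 1.0 if the trace mentions the answer it leads to."""
    return 1.0 if answer and answer.lower() in trace.lower() else 0.0


def total_reward(response: str, label: str, process_weight: float = 0.5) -> float:
    """Combine both rewards; process_weight is a hypothetical value."""
    trace, answer = parse_response(response)
    return outcome_reward(answer, label) + process_weight * consistency_reward(trace, answer)


example = (
    "<think>Microaneurysms and dot hemorrhages suggest diabetic retinopathy.</think>"
    "<answer>diabetic retinopathy</answer>"
)
print(total_reward(example, "Diabetic Retinopathy"))  # prints 1.5
```

The outcome term only needs an image-level label, which is why it suits public datasets without diagnostic reports. The process term here is deliberately crude; a real self-consistency check would likely be more sophisticated.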

The paper reports evaluation on three fundus-reading benchmarks: **FunBench**, **Omni-Fundus**, and **GMAI-Fundus**.

## Intended Use

Fundus-R1 is intended for research on fundus-image understanding, medical multimodal reasoning, ophthalmic MLLMs, and public-data-based post-training of medical vision-language models.

Possible research uses include:

- fundus image question answering;
- retinal disease recognition experiments;
- reasoning-trace analysis for medical MLLMs;
- comparison with general-purpose MLLMs and ophthalmology-specific MLLMs;
- studies on RAG-generated medical reasoning traces and RLVR training.

## Important Medical Disclaimer

This model is released for research use. It is **not** a certified medical device and should not be used as the sole basis for clinical diagnosis, treatment planning, triage, or patient management. Outputs should be reviewed by qualified medical professionals before any clinical interpretation or downstream use.

## Example Usage

The exact loading code may depend on the checkpoint configuration and your installed `transformers` version. A typical Qwen2.5-VL-style loading pattern is:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Kimokcheon/Fundus-R1-3B"  # or "Kimokcheon/Fundus-R1-7B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/fundus_image.jpg"},
            {"type": "text", "text": "Describe the retinal findings in this fundus image."},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

output_ids = generated_ids[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(response)
```

Install the dependencies used by the example above:

```bash
pip install -U transformers accelerate safetensors qwen-vl-utils
```
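
Before loading a checkpoint, it can help to confirm that the packages imported by the usage example are present in the current environment. The snippet below is a small sketch using only the standard library; the `REQUIRED` list and the `missing_packages` helper are names introduced here for illustration.

```python
import importlib.util

# Import names for the packages in the pip command above. The "qwen-vl-utils"
# distribution is imported as "qwen_vl_utils". "torch" is imported by the
# usage example but is not listed explicitly in the pip command.
REQUIRED = ["torch", "transformers", "accelerate", "safetensors", "qwen_vl_utils"]


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```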

## Download Through HF Mirror

For users in regions where the official Hugging Face endpoint is slow, the checkpoints can be downloaded through the third-party mirror at `hf-mirror.com` by setting the `HF_ENDPOINT` environment variable:

```bash
export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download Kimokcheon/Fundus-R1-3B --local-dir ./Fundus-R1-3B
huggingface-cli download Kimokcheon/Fundus-R1-7B --local-dir ./Fundus-R1-7B
```
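
The same download can be done from Python. This is a sketch that assumes a recent `huggingface_hub` release in which `snapshot_download` accepts an `endpoint` argument; `download_checkpoint` and `MIRROR_ENDPOINT` are names introduced here for illustration.

```python
MIRROR_ENDPOINT = "https://hf-mirror.com"


def download_checkpoint(repo_id: str, local_dir: str, endpoint: str = MIRROR_ENDPOINT) -> str:
    """Download a full checkpoint snapshot and return the local folder path."""
    # Imported lazily so this module loads even if huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir, endpoint=endpoint)
```

For example, `download_checkpoint("Kimokcheon/Fundus-R1-3B", "./Fundus-R1-3B")` fetches the 3B checkpoint. Passing `endpoint` explicitly avoids relying on `HF_ENDPOINT` being set before the library is first imported.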

To verify that the mirror is reachable before starting a full download, fetch only a lightweight file such as `config.json`:

```bash
export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download Kimokcheon/Fundus-R1-3B config.json --local-dir /tmp/fundus-r1-3b-check
huggingface-cli download Kimokcheon/Fundus-R1-7B config.json --local-dir /tmp/fundus-r1-7b-check
```
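
Once `config.json` has been fetched, a quick look at its contents confirms that the file is valid JSON and shows which model class the checkpoint declares. The sketch below assumes the standard Transformers config keys `model_type` and `architectures`; `summarize_config` is a helper name introduced here.

```python
import json
from pathlib import Path


def summarize_config(config_path):
    """Return the model type and architecture names declared in a config.json."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return {
        "model_type": config.get("model_type"),
        "architectures": config.get("architectures", []),
    }


# Path produced by the 3B download command above.
path = Path("/tmp/fundus-r1-3b-check/config.json")
if path.exists():
    print(summarize_config(path))
else:
    print(f"{path} not found; run the download command first.")
```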

## Citation

If you use Fundus-R1, please cite the paper:

```bibtex
@article{deng2026fundusr1,
  title={Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data},
  author={Deng, Yuchuan and Wei, Qijie and Qian, Kaiheng and Liu, Jiazhen and Xin, Zijie and Lan, Bangxiang and Liu, Jingyu and Dong, Jianfeng and Li, Xirong},
  journal={arXiv preprint arXiv:2604.08322},
  year={2026}
}
```

## Links

- Paper: https://arxiv.org/abs/2604.08322
- Fundus-R1-3B: https://huggingface.co/Kimokcheon/Fundus-R1-3B
- Fundus-R1-7B: https://huggingface.co/Kimokcheon/Fundus-R1-7B