---
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- fundus
- ophthalmology
- retinal-imaging
- medical-vlm
- multimodal-large-language-model
- reasoning
- rag
- rlvr
- qwen2.5-vl
- safetensors
arxiv: "2604.08322"
---

# Fundus-R1

Fundus-R1 is a fundus-reading multimodal large language model introduced in the paper:

**Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data**

Yuchuan Deng, Qijie Wei, Kaiheng Qian, Jiazhen Liu, Zijie Xin, Bangxiang Lan, Jingyu Liu, Jianfeng Dong, Xirong Li

Paper: https://arxiv.org/abs/2604.08322

Fundus-R1 is designed for fundus image understanding, including color fundus photography (CFP), optical coherence tomography (OCT), and ultra-widefield fundus imaging (UWF). The model is trained on publicly available data and aims to improve knowledge-aware reasoning for retinal image analysis.

## Model Variants

| Model | Repository |
|---|---|
| Fundus-R1-3B | `Kimokcheon/Fundus-R1-3B` |
| Fundus-R1-7B | `Kimokcheon/Fundus-R1-7B` |

This model card is shared by all released Fundus-R1 checkpoints. Select the checkpoint size that fits your compute budget and deployment requirements.

## Method Overview

Fundus-R1 addresses the difficulty of training fundus-reading MLLMs without private clinical-report data. According to the paper, the model is trained exclusively on public datasets, where most samples contain only image-level labels rather than detailed diagnostic reports.

The training pipeline contains two key components:

1. **Knowledge-aware reasoning trace construction.** A retrieval-augmented generation (RAG) procedure is used to compose image-specific reasoning traces that connect visual findings to image labels through ophthalmic knowledge.
2. **Reasoning-enhanced RLVR.** Reinforcement learning with verifiable rewards (RLVR) is enhanced with a process reward that encourages self-consistency in the generated reasoning trace.

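
To make the second component more concrete, here is a minimal, illustrative sketch of how a verifiable outcome reward could be combined with a self-consistency process reward. This is not the paper's reward function: the `<think>`/`<answer>` tag format, the substring-based consistency check, and the `process_weight` value are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
import re


def parse_response(response: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer).

    Assumes a hypothetical <think>...</think><answer>...</answer> format.
    """
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else ""
    return reasoning, final


def outcome_reward(answer: str, label: str) -> float:
    """Verifiable reward: 1.0 if the answer matches the image-level label."""
    return 1.0 if answer.strip().lower() == label.strip().lower() else 0.0


def consistency_reward(reasoning: str, answer: str) -> float:
    """Toy process reward: 1.0 if the final answer is named in the reasoning.

    A stand-in for a real self-consistency check (assumption).
    """
    if not answer:
        return 0.0
    return 1.0 if answer.lower() in reasoning.lower() else 0.0


def total_reward(response: str, label: str, process_weight: float = 0.5) -> float:
    """Outcome reward plus a weighted process reward (weight is an assumption)."""
    reasoning, answer = parse_response(response)
    return outcome_reward(answer, label) + process_weight * consistency_reward(reasoning, answer)


consistent = (
    "<think>Microaneurysms and hard exudates suggest diabetic retinopathy.</think>"
    "<answer>Diabetic retinopathy</answer>"
)
inconsistent = (
    "<think>The cup-to-disc ratio suggests glaucoma.</think>"
    "<answer>Diabetic retinopathy</answer>"
)
print(total_reward(consistent, "diabetic retinopathy"))
print(total_reward(inconsistent, "diabetic retinopathy"))
```

In this toy setup, a response whose reasoning supports its own final answer scores higher than one that reaches the correct label through a contradictory trace, which is the behavior a self-consistency process reward is meant to encourage.
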
The paper reports evaluation on three fundus-reading benchmarks: **FunBench**, **Omni-Fundus**, and **GMAI-Fundus**.

## Intended Use

Fundus-R1 is intended for research on fundus-image understanding, medical multimodal reasoning, ophthalmic MLLMs, and public-data-based post-training of medical vision-language models.

Possible research uses include:

- fundus image question answering;
- retinal disease recognition experiments;
- reasoning-trace analysis for medical MLLMs;
- comparison with general-purpose MLLMs and ophthalmology-specific MLLMs;
- studies on RAG-generated medical reasoning traces and RLVR training.

## Important Medical Disclaimer

This model is released for research use. It is **not** a certified medical device and should not be used as the sole basis for clinical diagnosis, treatment planning, triage, or patient management. Outputs should be reviewed by qualified medical professionals before any clinical interpretation or downstream use.

## Example Usage

The exact loading code may depend on the checkpoint configuration and your installed `transformers` version.
A typical Qwen2.5-VL-style loading pattern is:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Kimokcheon/Fundus-R1-3B"  # or "Kimokcheon/Fundus-R1-7B"

# Load the model in bfloat16 and let accelerate place it on available devices.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# A single-turn chat message with one image and one text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/fundus_image.jpg"},
            {"type": "text", "text": "Describe the retinal findings in this fundus image."},
        ],
    }
]

# Render the chat template to a prompt string, then extract the vision inputs.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

# Drop the prompt tokens so that only the newly generated tokens are decoded.
output_ids = generated_ids[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

print(response)
```

Install common dependencies:

```bash
pip install -U transformers accelerate safetensors qwen-vl-utils
```

## Download Through HF Mirror

For users in regions where the official Hugging Face endpoint is slow, the checkpoints can be downloaded through the Hugging Face mirror endpoint:

```bash
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download Kimokcheon/Fundus-R1-3B --local-dir ./Fundus-R1-3B
huggingface-cli download Kimokcheon/Fundus-R1-7B --local-dir ./Fundus-R1-7B
```

To verify mirror availability with a lightweight file:

```bash
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download Kimokcheon/Fundus-R1-3B config.json --local-dir /tmp/fundus-r1-3b-check
huggingface-cli download Kimokcheon/Fundus-R1-7B config.json --local-dir /tmp/fundus-r1-7b-check
```

## Citation

If you use Fundus-R1,
please cite the paper:

```bibtex
@article{deng2026fundusr1,
  title={Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data},
  author={Deng, Yuchuan and Wei, Qijie and Qian, Kaiheng and Liu, Jiazhen and Xin, Zijie and Lan, Bangxiang and Liu, Jingyu and Dong, Jianfeng and Li, Xirong},
  journal={arXiv preprint arXiv:2604.08322},
  year={2026}
}
```

## Links

- Paper: https://arxiv.org/abs/2604.08322
- Fundus-R1-3B: https://huggingface.co/Kimokcheon/Fundus-R1-3B
- Fundus-R1-7B: https://huggingface.co/Kimokcheon/Fundus-R1-7B