---
license: apache-2.0
language:
- en
base_model:
- YukunZhou/RETFound_mae_natureCFP
- UFNLP/gatortronS
tags:
- medical-imaging
- ophthalmology
- vision-language-model
- multimodal-learning
- alzheimers-disease
- dementia
- retinal-imaging
datasets:
- uk-biobank
model-index:
- name: REVEAL
  results:
  - task:
      type: binary-classification
      name: Incident Alzheimer's Disease Prediction (within ~8.5 years)
    metrics:
    - type: AUROC
      value: 0.658
  - task:
      type: binary-classification
      name: Incident Dementia Prediction (within ~8.5 years)
    metrics:
    - type: AUROC
      value: 0.659
---

# REVEAL: Retinal-risk Vision-Language Early Alzheimer’s Learning

## Model Description

REVEAL is a multimodal vision-language model that aligns retinal fundus imaging with individualized clinical risk factors for early prediction of Alzheimer’s disease (AD) and dementia. The model learns joint representations from retinal morphology and structured health data transformed into clinical narratives.

REVEAL builds on pretrained medical foundation models and introduces a group-aware contrastive learning (GACL) strategy to capture clinically meaningful multimodal relationships. It is designed to support early disease risk stratification and multimodal biomarker discovery.

---

## Model Architecture

REVEAL is composed of:

- **Image Encoder:** RETFound retinal imaging foundation model
- **Text Encoder:** GatorTron clinical language model
- **Projection Layers:** Trainable modules mapping image and text embeddings into a shared latent space
- **Contrastive Learning Module:** Group-aware contrastive learning for multimodal alignment

The framework operates in two stages:

1. Multimodal representation learning using contrastive vision-language alignment
2. Downstream risk prediction using multimodal embeddings

---

## Training Data

### Dataset Source

The model was trained on multimodal data derived from the UK Biobank (https://www.ukbiobank.ac.uk/), a large population-scale biomedical dataset containing retinal imaging and clinical health variables.

### Cohort Composition

The dataset includes color fundus photographs and clinical risk factor data from 39,242 participants:

- Training set: 30,462 participants
- Validation set: 3,384 participants
- Test set: 5,396 participants

The training and validation sets contained only participants who were cognitively normal at baseline. Individuals who developed incident AD or dementia were reserved for downstream evaluation.

---

### Imaging Data

- Imaging modality: Color fundus photography
- Initial dataset: 136,994 retinal images
- Quality-controlled dataset: 66,251 images

Retinal morphometric features were extracted using the AutoMorph pipeline, including:

- Optic nerve head measurements (cup-to-disc ratios)
- Vascular morphology metrics
- Vessel tortuosity and fractal measurements

---

### Clinical Risk Factors

Risk factors include:

#### Demographic
- Age
- Sex
- Socioeconomic status
- Ethnicity
- Employment status

#### General Health
- BMI
- HbA1c
- Blood pressure
- Cognitive test scores

#### Behavioral and Psychiatric
- Depression
- Sleep deprivation
- Smoking history
- Alcohol use
- Cannabis use

#### Lifestyle and Social
- Physical activity
- Social engagement
- Leisure activity

#### Diet
- Food intake patterns
- Beverage consumption
- Nutritional indicators

---

### Synthetic Clinical Text Generation

Structured clinical variables were converted into standardized clinical narratives using a large language model. Each participant’s risk factors were mapped into a predefined clinical template to make the structured data compatible with vision-language training.
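The template-based conversion described above can be illustrated with a minimal sketch. The field names, units, and template wording below are illustrative stand-ins, not the actual REVEAL template or LLM prompting pipeline:

```python
# Sketch of template-based clinical narrative generation.
# Field names and phrasing are hypothetical examples only.

def risk_factors_to_narrative(record: dict) -> str:
    """Render a participant's structured risk factors as a short clinical narrative."""
    parts = [
        f"The participant is a {record['age']}-year-old {record['sex']}.",
        f"Body mass index is {record['bmi']:.1f} kg/m^2 and HbA1c is {record['hba1c']:.1f} mmol/mol.",
        f"Smoking history: {record['smoking']}.",
        f"Self-reported physical activity level: {record['activity']}.",
    ]
    return " ".join(parts)

example = {
    "age": 62,
    "sex": "female",
    "bmi": 27.4,
    "hba1c": 38.0,
    "smoking": "former smoker",
    "activity": "moderate",
}
narrative = risk_factors_to_narrative(example)
print(narrative)
```

Keeping every narrative in a fixed sentence order makes the synthetic text distribution uniform across participants, so the text encoder sees variation only in the clinical values themselves.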
---

## Training Procedure

### Multimodal Representation Learning

REVEAL aligns fundus images and clinical narratives using contrastive vision-language learning. Both modalities are encoded and projected into a shared latent embedding space.

---

### Group-Aware Contrastive Learning (GACL)

REVEAL introduces a group-aware pairing strategy that:

- Identifies subjects with similar retinal morphology
- Identifies subjects with similar clinical risk profiles
- Forms positive training pairs across similar individuals

This enables the model to learn clinically meaningful multimodal relationships rather than relying only on subject-level pairings.

---

### Loss Function

REVEAL uses a modified contrastive loss that supports multiple positive pairs per sample. Similarity is computed as the cosine similarity between image and text embeddings.

---

### Hyperparameters

- Projection dimension: 1024
- Batch size: 128
- Learning rate: 2.42e-4
- Weight decay: 0.0232
- Temperature parameter: 0.07

Hyperparameters were optimized using Optuna (https://optuna.org/).

---

## Intended Use

### Primary Use Cases

REVEAL is intended for research applications, including:

- Early risk stratification for Alzheimer’s disease and dementia
- Multimodal biomarker discovery
- Development of non-invasive screening strategies
- Population-level disease risk modeling
- Multimodal clinical representation learning

---

### Appropriate Use

The model should be used:

- For research or exploratory clinical modeling
- With appropriate ethical and institutional review
- With external validation before use in new populations

---

### Out-of-Scope Use

The model is **not intended** for:

- Direct clinical diagnosis
- Medical decision-making without clinician oversight
- Deployment as a medical device
- Use in unvalidated populations

---

## Evaluation

REVEAL embeddings were evaluated using downstream support vector machine classifiers.
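A minimal sketch of this evaluation protocol, assuming scikit-learn and a linear-kernel SVM (the card does not specify the exact SVM configuration), with random stand-in embeddings rather than REVEAL outputs:

```python
# Sketch of downstream evaluation: frozen embeddings -> SVM -> AUROC /
# balanced accuracy. The embeddings below are synthetic stand-ins with a
# weak class-dependent shift; kernel choice and split sizes are illustrative.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

rng = np.random.default_rng(0)

n, dim = 400, 256
y = rng.integers(0, 2, size=n)                      # incident-disease labels
X = rng.normal(size=(n, dim)) + 0.5 * y[:, None]    # fake "embeddings"

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

scores = clf.decision_function(X_test)              # continuous scores for AUROC
auroc = roc_auc_score(y_test, scores)
bacc = balanced_accuracy_score(y_test, clf.predict(X_test))
print(f"AUROC={auroc:.3f}  balanced accuracy={bacc:.3f}")
```

Using `decision_function` rather than hard predictions gives a continuous ranking score, which is what AUROC requires; balanced accuracy is computed from the thresholded predictions.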
### Incident Alzheimer’s Disease Prediction

- AUROC: 0.658
- Balanced Accuracy: 0.610

### Incident Dementia Prediction

- AUROC: 0.659
- Balanced Accuracy: 0.605

Reported performance is the average over multiple random seeds.

---

## Limitations

- Training is limited to the UK Biobank cohort
- Performance is sensitive to the similarity threshold used for pairing
- Incident AD and dementia cases remain relatively few
- Synthetic clinical narrative generation may introduce bias
- Generalizability to other populations requires external validation

---

## Ethical Considerations

- Retinal images and clinical variables contain sensitive health data
- Predictions may influence disease risk interpretation
- Model outputs should not replace clinical judgment
- Use requires adherence to privacy, regulatory, and ethical guidelines

---

## Citation

If you use this model, please cite:

```bibtex
@article{leem2026reveal,
  title={REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction},
  author={Leem, Seowung and Gu, Lin and You, Chenyu and Gong, Kuang and Fang, Ruogu},
  journal={MIDL 2026 (Under Review)},
  year={2026}
}
```