---
license: apache-2.0
language:
- en
base_model:
- YukunZhou/RETFound_mae_natureCFP
- UFNLP/gatortronS
tags:
- medical-imaging
- ophthalmology
- vision-language-model
- multimodal-learning
- alzheimers-disease
- dementia
- retinal-imaging
datasets:
- uk-biobank
model-index:
- name: REVEAL
results:
- task:
type: binary-classification
name: Incident Alzheimer's Disease Prediction (within ~8.5 years)
metrics:
- type: AUROC
value: 0.658
- task:
type: binary-classification
name: Incident Dementia Prediction (within ~8.5 years)
metrics:
- type: AUROC
value: 0.659
---
# REVEAL: Retinal-risk Vision-Language Early Alzheimer’s Learning
## Model Description
REVEAL is a multimodal vision-language model that aligns retinal fundus imaging with individualized clinical risk factors for early prediction of Alzheimer’s disease (AD) and dementia. The model learns joint representations from retinal morphology and structured health data transformed into clinical narratives.
REVEAL leverages pretrained medical foundation models and introduces a group-aware contrastive learning (GACL) strategy to capture clinically meaningful multimodal relationships, supporting early disease risk stratification and multimodal biomarker discovery.
---
## Model Architecture
REVEAL is composed of:
- **Image Encoder:** RETFound retinal imaging foundation model
- **Text Encoder:** GatorTron clinical language model
- **Projection Layers:** Trainable modules mapping image and text embeddings into a shared latent space
- **Contrastive Learning Module:** Group-aware contrastive learning for multimodal alignment
The framework operates in two stages:
1. Multimodal representation learning using contrastive vision-language alignment
2. Downstream risk prediction using multimodal embeddings
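The shared-space projection in stage one can be sketched as follows. This is a minimal numpy illustration, not the actual implementation: the encoder outputs are stand-in random features (the card does not state the RETFound/GatorTron feature dimensions), and only the 1024-dimensional projection size comes from the hyperparameters section below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen encoder outputs (feature dims are assumptions;
# the card does not specify RETFound/GatorTron output sizes).
img_feat = rng.standard_normal((4, 1024))   # image encoder features
txt_feat = rng.standard_normal((4, 1024))   # text encoder features

proj_dim = 1024  # projection dimension reported in the hyperparameters section

# Trainable linear projection heads (randomly initialized here for illustration)
W_img = rng.standard_normal((1024, proj_dim)) / np.sqrt(1024)
W_txt = rng.standard_normal((1024, proj_dim)) / np.sqrt(1024)

def project(x, W):
    """Map encoder features into the shared latent space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

z_img = project(img_feat, W_img)
z_txt = project(txt_feat, W_txt)

# Cosine similarity matrix between all image/text pairs in the batch,
# the quantity the contrastive objective operates on
sim = z_img @ z_txt.T
```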
---
## Training Data
### Dataset Source
The model was trained using multimodal data derived from the UK Biobank (https://www.ukbiobank.ac.uk/), a large population-scale biomedical dataset containing retinal imaging and clinical health variables.
### Cohort Composition
The dataset includes color fundus photographs and clinical risk factor data from 39,242 participants:
- Training set: 30,462 participants
- Validation set: 3,384 participants
- Test set: 5,396 participants
Training and validation sets contained only cognitively normal participants at baseline. Individuals who developed incident AD or dementia were reserved for downstream evaluation.
---
### Imaging Data
- Imaging modality: Color fundus photography
- Initial dataset: 136,994 retinal images
- Quality-controlled dataset: 66,251 images
Retinal morphometric features were extracted using the AutoMorph pipeline, including:
- Optic nerve head measurements (cup-to-disc ratios)
- Vascular morphology metrics
- Vessel tortuosity and fractal measurements
---
### Clinical Risk Factors
Risk factors include:
#### Demographic
- Age
- Sex
- Socioeconomic status
- Ethnicity
- Employment status
#### General Health
- BMI
- HbA1c
- Blood pressure
- Cognitive test scores
#### Behavioral and Psychiatric
- Depression
- Sleep deprivation
- Smoking history
- Alcohol use
- Cannabis use
#### Lifestyle and Social
- Physical activity
- Social engagement
- Leisure activity
#### Diet
- Food intake patterns
- Beverage consumption
- Nutritional indicators
---
### Synthetic Clinical Text Generation
Structured clinical variables were converted into standardized clinical narratives using a large language model. Each participant’s risk factors were mapped into a predefined clinical template to enable compatibility with vision-language training.
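The template-mapping step might look like the following sketch. The template text and field names here are hypothetical; the card does not publish the actual narrative template or the LLM prompting setup.

```python
# Hypothetical narrative template -- the actual template used by REVEAL
# is not published in this card.
TEMPLATE = (
    "The participant is a {age}-year-old {sex} with a BMI of {bmi}. "
    "Smoking history: {smoking}. Depression: {depression}."
)

def to_narrative(record: dict) -> str:
    """Map a participant's structured risk factors into a clinical narrative."""
    return TEMPLATE.format(**record)

example = {"age": 62, "sex": "female", "bmi": 27.4,
           "smoking": "former smoker", "depression": "none reported"}
narrative = to_narrative(example)
```

Standardizing narratives this way keeps the text encoder's input distribution consistent across participants, which is what makes the structured variables compatible with vision-language training.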
---
## Training Procedure
### Multimodal Representation Learning
REVEAL aligns fundus images and clinical narratives using contrastive vision-language learning. Both modalities are encoded and projected into a shared latent embedding space.
---
### Group-Aware Contrastive Learning (GACL)
REVEAL introduces a group-aware pairing strategy that:
- Identifies subjects with similar retinal morphology
- Identifies subjects with similar clinical risk profiles
- Forms positive training pairs across similar individuals
This enables the model to learn clinically meaningful multimodal relationships rather than relying only on subject-level pairings.
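One possible reading of the pairing strategy is sketched below: subjects count as positives if they are similar in either modality (whether REVEAL combines the two criteria with OR, AND, or a weighted score is not stated in this card, and the feature vectors here are stand-ins).

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine_sim_matrix(x):
    """Pairwise cosine similarity between row vectors."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    return xn @ xn.T

# Stand-in per-subject feature vectors (real inputs would be AutoMorph
# morphometric features and clinical risk profiles)
retinal = rng.standard_normal((6, 8))
clinical = rng.standard_normal((6, 8))

# Similarity threshold: a tunable assumption; the Limitations section
# notes performance is sensitive to this choice
threshold = 0.5

sim_r = cosine_sim_matrix(retinal)
sim_c = cosine_sim_matrix(clinical)

# Positive mask: each subject's own pair, plus subjects similar in
# retinal morphology or clinical risk profile
positives = np.eye(6, dtype=bool) | (sim_r > threshold) | (sim_c > threshold)
```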
---
### Loss Function
REVEAL uses a modified contrastive loss supporting multiple positive pairs per sample. Similarity is computed using cosine similarity between image and text embeddings.
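A loss of this shape can be sketched in the style of supervised contrastive learning, averaging the log-likelihood over each anchor's positive set; the exact formulation REVEAL uses is not published in this card, so treat this as one plausible instantiation.

```python
import numpy as np

def multi_positive_loss(z_img, z_txt, pos_mask, temperature=0.07):
    """Contrastive loss with multiple positives per anchor (SupCon-style
    sketch). pos_mask[i, j] = 1 marks text j as a positive for image i."""
    zi = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    zt = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    # Cosine similarities scaled by temperature
    logits = (zi @ zt.T) / temperature
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average log-probability over each anchor's positive set
    per_anchor = (log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return -per_anchor.mean()

rng = np.random.default_rng(2)
z_img = rng.standard_normal((4, 16))
z_txt = rng.standard_normal((4, 16))
pos_mask = np.eye(4)                     # subject-level pairs...
pos_mask[0, 1] = pos_mask[1, 0] = 1.0    # ...plus one group-level positive
loss = multi_positive_loss(z_img, z_txt, pos_mask)
```

With `pos_mask` equal to the identity matrix this reduces to a standard one-to-one contrastive objective; the group-aware pairs from GACL are what populate the off-diagonal entries.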
---
### Hyperparameters
- Projection dimension: 1024
- Batch size: 128
- Learning rate: 2.42e-4
- Weight decay: 0.0232
- Temperature parameter: 0.07
Hyperparameters were optimized using Optuna (https://optuna.org/).
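The search the tuning process performs can be sketched dependency-free as follows (using stdlib `random` in place of Optuna's samplers; the search ranges and the placeholder objective are assumptions, since the card reports only the final values).

```python
import random

random.seed(0)

# Search space loosely mirroring the reported hyperparameters
# (ranges are assumptions, not from the card)
def sample_trial():
    return {
        "lr": 10 ** random.uniform(-5, -3),
        "weight_decay": 10 ** random.uniform(-3, -1),
        "temperature": random.choice([0.05, 0.07, 0.1]),
    }

def objective(params):
    # Placeholder score; in practice this would train REVEAL with the
    # sampled configuration and return validation performance
    return -abs(params["lr"] - 2.42e-4)

best = max((sample_trial() for _ in range(20)), key=objective)
```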
---
## Intended Use
### Primary Use Cases
REVEAL is intended for research applications, including:
- Early risk stratification for Alzheimer’s disease and dementia
- Multimodal biomarker discovery
- Development of non-invasive screening strategies
- Population-level disease risk modeling
- Multimodal clinical representation learning
---
### Appropriate Use
The model should be used:
- For research or exploratory clinical modeling
- With appropriate ethical and institutional review
- With external validation before use in new populations
---
### Out-of-Scope Use
The model is **not intended** for:
- Direct clinical diagnosis
- Medical decision-making without clinician oversight
- Deployment as a medical device
- Use in unvalidated populations
---
## Evaluation
REVEAL embeddings were evaluated using downstream support vector machine classifiers.
### Incident Alzheimer’s Disease Prediction
- AUROC: 0.658
- Balanced Accuracy: 0.610
### Incident Dementia Prediction
- AUROC: 0.659
- Balanced Accuracy: 0.605
Performance reflects average results across multiple random seeds.
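The downstream evaluation pipeline can be sketched as below; the embeddings and labels here are synthetic stand-ins (real inputs would be REVEAL's multimodal embeddings and incident AD/dementia labels), and the SVM settings are assumptions rather than the card's actual configuration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

# Synthetic stand-ins for multimodal embeddings and incident-disease labels
X = rng.standard_normal((200, 32))
y = (X[:, 0] + 0.5 * rng.standard_normal(200) > 0).astype(int)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# SVM classifier on frozen embeddings; AUROC on the held-out split
clf = SVC(probability=True).fit(X_train, y_train)
auroc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```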
---
## Limitations
- Model training is limited to the UK Biobank cohort
- Performance is sensitive to similarity threshold selection
- Incident AD and dementia cases remain relatively limited
- Synthetic clinical narrative generation may introduce bias
- Generalizability to other populations requires external validation
---
## Ethical Considerations
- Retinal images and clinical variables contain sensitive health data
- Predictions may influence disease risk interpretation
- Model outputs should not replace clinical judgment
- Use requires adherence to privacy, regulatory, and ethical guidelines
---
## Citation
If you use this model, please cite:
```bibtex
@article{leem2026reveal,
  title={REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction},
  author={Leem, Seowung and Gu, Lin and You, Chenyu and Gong, Kuang and Fang, Ruogu},
  journal={MIDL 2026 (Under Review)},
  year={2026}
}
```