Model Card for CARL
CARL is a camera-agnostic general-purpose feature extractor for spectral image analysis. It learns representations that transfer across spectral sensors with distinct channel counts and wavelength coverages, and supports downstream tasks such as:
- classification
- semantic segmentation
- regression
This model is designed for spectral imagery, including RGB, multispectral, and hyperspectral data.
Model description
CARL consists of a spectral encoder and a spatial encoder, enabling spatio-spectral feature extraction. Crucially, any transformer-based spatial encoder can be integrated, including EVA-02, DINOv2, DINOv3, and Perception Encoder.
A dedicated self-supervised pretraining strategy bridges strong pretrained spatial encoders to the spectral encoder, which allows flexible input channel counts while preserving robust feature extraction.
Published models:
- CARL-EVA02-B: a CARL model with an EVA-02-B spatial encoder, pretrained on remote sensing data using the CARL-SSL self-supervised strategy.
Other configurations and pretrained models will be added in the future.
Model architecture
The GitHub repository contains the full implementation of the CARL architecture, including the spectral encoder, spatial encoder integration, and downstream heads for classification and segmentation.
Expected inputs
CARL expects:
- images with shape (B, C, H, W)
- wavelengths with shape (B, C)
- wavelengths expressed in micrometers
- normalized image inputs (for example using dataset-level mean/std or per-image normalization)
Here, C denotes the spectral channel dimension and may vary across sensors.
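The input requirements above can be sketched with NumPy as follows. This is a minimal illustration, not the repository's preprocessing code: the helper name `prepare_inputs`, the per-image normalization fallback, and the example band centers are assumptions made for this example.

```python
import numpy as np

def prepare_inputs(image, wavelengths_nm, mean=None, std=None, eps=1e-6):
    """Hypothetical helper: return (images, wavelengths) with shapes
    (B, C, H, W) and (B, C), wavelengths converted to micrometers."""
    image = np.asarray(image, dtype=np.float32)
    if image.ndim == 3:                      # (C, H, W) -> (1, C, H, W)
        image = image[None]
    b, c, _, _ = image.shape

    if mean is None or std is None:
        # Per-image, per-channel normalization over the spatial axes.
        mean = image.mean(axis=(2, 3), keepdims=True)
        std = image.std(axis=(2, 3), keepdims=True)
    else:
        # Dataset-level statistics, one value per channel.
        mean = np.asarray(mean, dtype=np.float32).reshape(1, c, 1, 1)
        std = np.asarray(std, dtype=np.float32).reshape(1, c, 1, 1)
    image = (image - mean) / (std + eps)

    # Convert band centers from nanometers to micrometers.
    wl = np.asarray(wavelengths_nm, dtype=np.float32) / 1000.0
    if wl.shape != (c,):
        raise ValueError("expected one wavelength per spectral channel")
    wavelengths = np.broadcast_to(wl, (b, c)).copy()
    return image, wavelengths

# Illustrative 4-channel input; the band centers are example values only.
rng = np.random.default_rng(0)
raw = rng.random((4, 32, 32))
images, wavelengths = prepare_inputs(raw, [490, 560, 665, 842])
print(images.shape, wavelengths.shape)
```

Because the wavelengths are passed alongside the image, the same routine applies unchanged to a sensor with a different number of channels.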
Outputs
CARL produces spatio-spectral feature maps that can be used for downstream tasks. For example, the output of the spatial encoder can be pooled and fed into a linear head for classification, or passed through a segmentation head for pixel-wise predictions.
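The two uses described above can be sketched as follows. The feature tensor is a random stand-in rather than real CARL output, and the token layout `(B, N, D)`, the sizes, and the randomly initialized linear heads are all assumptions for illustration; the repository's own heads may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: batch, number of spatial tokens, feature dim, classes.
B, N, D, num_classes = 2, 16, 64, 10

# Stand-in for the spatial encoder output (one feature vector per token).
features = rng.standard_normal((B, N, D)).astype(np.float32)

# Classification: mean-pool the tokens, then apply a linear head.
W_cls = (rng.standard_normal((D, num_classes)) * 0.01).astype(np.float32)
b_cls = np.zeros(num_classes, dtype=np.float32)
pooled = features.mean(axis=1)               # (B, D)
logits = pooled @ W_cls + b_cls              # (B, num_classes)
pred = logits.argmax(axis=1)                 # (B,)

# Segmentation: apply a linear head per token, then restore the token grid.
W_seg = (rng.standard_normal((D, num_classes)) * 0.01).astype(np.float32)
b_seg = np.zeros(num_classes, dtype=np.float32)
grid = int(np.sqrt(N))                       # assumes a square token grid
seg_logits = (features @ W_seg + b_seg)      # (B, N, num_classes)
seg_logits = seg_logits.reshape(B, grid, grid, num_classes).transpose(0, 3, 1, 2)

print(logits.shape, pred.shape, seg_logits.shape)
```

In practice the coarse segmentation logits would still need to be upsampled to the input resolution to obtain pixel-wise predictions.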
Evaluation
Results
See the associated paper for details on the evaluation protocols.
The following results were obtained by linear probing on the CARL-SSL checkpoint.
| Model | m-ben | m-eurosat | m-forestnet | m-crop-type | SegMunich | Wuhan | LoveDA Rural | WHU-OHS | Avg. rank (vs. 6 models) |
|---|---|---|---|---|---|---|---|---|---|
| CARL-EVA02-B | 69.0 | 84.4 | 47.0 | 26.5 | 38.9 | 21.5 | 21.7 | 21.7 | 1.6 |
Repository and paper
- Paper: https://arxiv.org/abs/2504.19223
- Code and training pipeline: see the CARL GitHub repository
Citation
@inproceedings{
baumann2026carl,
title={{CARL}: Camera-Agnostic Representation Learning for Spectral Image Analysis},
author={Alexander Baumann and Leonardo Ayala and Silvia Seidlitz and Jan Sellner and Alexander Studier-Fischer and Berkin {\"O}zdemir and Lena Maier-Hein and Slobodan Ilic},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=TpbhS1yfz0}
}