AnyThermal: Towards Learning Universal Representations for Thermal Perception
Model Description
AnyThermal is a task-agnostic thermal feature extraction backbone that provides robust representations across diverse environments and robotic perception tasks. Unlike existing thermal models trained on task-specific, small-scale data, AnyThermal generalizes across multiple environments (indoor, aerial, off-road, urban) and tasks without requiring task-specific fine-tuning.
Key Innovation
AnyThermal distills knowledge from the DINOv2 visual foundation model into a thermal encoder using diverse RGB-Thermal paired data across multiple environments. This approach enables the model to learn universal thermal representations that transfer effectively to downstream tasks.
Architecture
- Base Model: DINOv2 ViT-B/14 (Vision Transformer Base, patch size 14)
- Parameters: 86.6M
- Training Strategy: Knowledge distillation from frozen RGB DINOv2 teacher to trainable thermal student
- Input: Thermal images (converted to 3-channel for compatibility)
- Output: 768-dimensional feature embeddings per patch + CLS token
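The token arithmetic implied by the architecture above can be sketched as follows. Note the effective input resolution depends on the image processor's resize settings, so the 224×224 example below is an assumption, not a fixed property of the model:

```python
# Token-count arithmetic for a ViT-B/14 backbone: one token per 14x14 patch,
# plus the CLS token. Each token is a 768-dimensional embedding.
PATCH = 14
EMBED_DIM = 768

def token_count(height: int, width: int) -> int:
    """Number of output tokens for an input of the given size."""
    assert height % PATCH == 0 and width % PATCH == 0, "dims must be multiples of 14"
    return (height // PATCH) * (width // PATCH) + 1  # +1 for the CLS token

# A 224x224 input yields a 16x16 patch grid -> 257 tokens of dim 768.
print(token_count(224, 224))  # -> 257
```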
Training Details
Knowledge Distillation Process
AnyThermal uses a teacher-student distillation framework:
- Teacher Network: Frozen DINOv2-Base pretrained on RGB images
- Student Network: Trainable DINOv2-Base initialized with RGB weights, processes thermal images
- Loss Function: Contrastive loss on CLS token features from corresponding RGB-thermal pairs
- Key Insight: CLS tokens capture global semantics rather than low-level visual features (like color), making them ideal for cross-modal alignment
This approach relaxes the need for perfect pixel-level alignment or precise synchronization, enabling distillation from datasets with approximate correspondences.
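A minimal sketch of this CLS-level contrastive objective is below, using dummy features in place of real teacher/student outputs. The symmetric InfoNCE form and the temperature value are illustrative assumptions; the paper's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def cls_contrastive_loss(thermal_cls: torch.Tensor,
                         rgb_cls: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over paired CLS tokens: matching RGB-thermal pairs
    sit on the diagonal of the similarity matrix and are pulled together."""
    t = F.normalize(thermal_cls, dim=-1)          # student (thermal) CLS
    r = F.normalize(rgb_cls, dim=-1)              # frozen teacher (RGB) CLS
    logits = t @ r.T / temperature                # [B, B] cosine similarities
    targets = torch.arange(t.size(0))             # diagonal = positive pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Dummy batch of 8 paired 768-dim CLS embeddings
loss = cls_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```

Because the loss only compares global CLS embeddings, it tolerates imperfect pixel-level alignment between the RGB and thermal frames.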
Training Data
AnyThermal was trained on five diverse RGB-Thermal datasets spanning multiple environments:
| Environment | Datasets |
|---|---|
| Urban | VIVID++, STheReO, Freiburg, TartanRGBT |
| Aerial | Boson Nighttime Dataset |
| Indoor | TartanRGBT |
| Off-road | TartanRGBT |
TartanRGBT is our newly introduced dataset, collected with the first open-source platform for hardware-synchronized RGB-Thermal stereo acquisition. It contributes data across indoor, off-road, and urban environments. The dataset is available here: TartanRGBT Dataset. To learn more about the payload, please visit our Project Page.
Capabilities & Performance
AnyThermal demonstrates state-of-the-art or competitive performance across multiple thermal perception tasks. We have benchmarked it on three tasks:
- Cross-Modal Place Recognition (Thermal query → RGB database)
- Thermal Semantic Segmentation
- Monocular Depth Estimation from Thermal
For both quantitative and qualitative results, please visit our [Project Page](https://anythermal.github.io/).
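For the cross-modal place recognition task, retrieval reduces to nearest-neighbor search over CLS embeddings. The sketch below uses random tensors as stand-ins for AnyThermal (thermal query) and RGB database features; the feature dimension of 768 comes from the backbone, everything else is illustrative:

```python
import torch
import torch.nn.functional as F

# Dummy embeddings: 100 RGB database images and 1 thermal query, 768-dim each.
db = F.normalize(torch.randn(100, 768), dim=-1)
query = F.normalize(torch.randn(1, 768), dim=-1)

# Cosine similarity (dot product of unit vectors), then top-1 retrieval.
sims = query @ db.T          # [1, 100]
best = sims.argmax(dim=-1)   # index of the best-matching database image
```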
We are exploring more tasks where the backbone can be leveraged, and we look forward to learning from the community how AnyThermal can push the frontiers of thermal perception.
Usage
Basic Feature Extraction
```python
from transformers import AutoImageProcessor, AutoModel
import torch
from PIL import Image

# Load model and processor
processor = AutoImageProcessor.from_pretrained("theairlabcmu/AnyThermal")
model = AutoModel.from_pretrained("theairlabcmu/AnyThermal")

# Load thermal image (grayscale) and convert to 3-channel
# (required for the ViT architecture)
thermal_image = Image.open("path/to/thermal_image.png").convert("L").convert("RGB")

# Process and extract features
inputs = processor(images=thermal_image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get CLS token (global image representation)
cls_features = outputs.last_hidden_state[:, 0]  # Shape: [1, 768]

# Get patch features (spatial feature map)
patch_features = outputs.last_hidden_state[:, 1:]  # Shape: [1, num_patches, 768]
```
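For dense-prediction heads (segmentation, depth), the flat patch sequence is typically reshaped into a 2-D feature map. A minimal sketch, assuming a 224×224 input and hence a 16×16 patch grid (the actual grid depends on the processor's resize settings):

```python
import torch

# Stand-in for the patch features extracted above: [1, num_patches, 768].
patch_features = torch.randn(1, 256, 768)

# Reshape to a [B, C, H/14, W/14] feature map for convolutional heads.
h = w = 16  # 224 // 14
fmap = patch_features.transpose(1, 2).reshape(1, 768, h, w)
```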
Task-Specific Applications
Please visit our training and evaluation codebase, where we show how to use AnyThermal with three different task-specific heads. All training and evaluation were done without any task-specific fine-tuning of the backbone weights.
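The frozen-backbone setup can be sketched as follows: only a lightweight head is trained on top of the fixed 768-dim features. The 1×1-conv segmentation head and the class count below are hypothetical examples, not the repo's actual head definitions:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 19  # assumed label count for an urban segmentation benchmark

# Hypothetical task head: a 1x1 convolution over the spatial feature map.
head = nn.Conv2d(768, NUM_CLASSES, kernel_size=1)

# Only the head's parameters are optimized; the backbone stays frozen.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# Dummy frozen-backbone output: [batch, 768, H/14, W/14].
features = torch.randn(2, 768, 16, 16)
logits = head(features)  # per-pixel class logits: [2, NUM_CLASSES, 16, 16]
```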
Model Strengths
✅ Task-Agnostic: Works across multiple downstream tasks without task-specific training
✅ Environment-Agnostic: Generalizes to indoor, outdoor, urban, off-road, and aerial scenarios
✅ Cross-Modal: Enables thermal-to-RGB and RGB-to-thermal applications
✅ Efficient: Single forward pass produces features for multiple tasks
✅ Foundation Model Quality: Leverages DINOv2's strong semantic representations
Limitations
⚠️ Input Format: Requires thermal images in 3-channel format (grayscale replicated to RGB)
⚠️ Data Bias: Performance may vary on environments not well-represented in training data
Ablation Studies
For detailed results, please see the scaling graphs on our Project Page.
Impact of Training Data Diversity
Key Finding: Multi-environment training is critical. Adding TartanRGBT significantly improves performance across all tasks and domains.
Single Domain vs. Multi-Domain Training
Training on a single environment (e.g., aerial only) introduces domain bias:
- ✓ Improves performance on that specific domain
- ✗ Reduces performance on other domains (urban, indoor, off-road)
Conclusion: Multi-domain RGB-thermal data is essential for learning transferable thermal representations.
Citation
If you use AnyThermal in your research, please cite:
@misc{maheshwari2026anythermallearninguniversalrepresentations,
title={AnyThermal: Towards Learning Universal Representations for Thermal Perception},
author={Parv Maheshwari and Jay Karhade and Yogesh Chawla and Isaiah Adu and Florian Heisen and Andrew Porco and Andrew Jong and Yifei Liu and Santosh Pitla and Sebastian Scherer and Wenshan Wang},
year={2026},
eprint={2602.06203},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.06203},
}
Related Resources
- Paper: arXiv:2602.06203
- Project Website: https://anythermal.github.io/
- TartanRGBT Dataset: HuggingFace Dataset
- Data Collection Platform: GitHub Repository
- Base Model: DINOv2-Base
License
This model is released under the BSD-3-Clause-Clear License. See the LICENSE file for details.
Acknowledgments
This work was conducted at the AirLab, Carnegie Mellon University. The model builds upon the DINOv2 foundation model from Meta AI Research.
Model Card Contact
For questions, issues, or collaboration inquiries (we hope this has sparked your interest!):
- Email: parvm@andrew.cmu.edu
- GitHub Issues: AnyThermal Repository
- Project Website: https://anythermal.github.io/
Last Updated: February 2026