AnyThermal: Towards Learning Universal Representations for Thermal Perception

arXiv Project Page GitHub HF Dataset

Model Description

AnyThermal is a task-agnostic thermal feature extraction backbone that provides robust representations across diverse environments and robotic perception tasks. Unlike existing thermal models trained on task-specific, small-scale data, AnyThermal generalizes across multiple environments (indoor, aerial, off-road, urban) and tasks without requiring task-specific fine-tuning.

Key Innovation

AnyThermal distills knowledge from the DINOv2 visual foundation model into a thermal encoder using diverse RGB-Thermal paired data across multiple environments. This approach enables the model to learn universal thermal representations that transfer effectively to downstream tasks.

Architecture

  • Base Model: DINOv2 ViT-B/14 (Vision Transformer Base, patch size 14)
  • Parameters: 86.6M
  • Training Strategy: Knowledge distillation from frozen RGB DINOv2 teacher to trainable thermal student
  • Input: Thermal images (converted to 3-channel for compatibility)
  • Output: 768-dimensional feature embeddings per patch + CLS token

Training Details

Knowledge Distillation Process

AnyThermal uses a teacher-student distillation framework:

  1. Teacher Network: Frozen DINOv2-Base pretrained on RGB images
  2. Student Network: Trainable DINOv2-Base initialized with RGB weights, processes thermal images
  3. Loss Function: Contrastive loss on CLS token features from corresponding RGB-thermal pairs
  4. Key Insight: CLS tokens capture global semantics rather than low-level visual features (like color), making them ideal for cross-modal alignment

This approach relaxes the need for perfect pixel-level alignment or precise synchronization, enabling distillation from datasets with approximate correspondences.
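The CLS-token contrastive objective described above can be sketched as an InfoNCE-style loss over a batch of RGB-thermal pairs. This is a minimal illustrative sketch, not the paper's exact implementation; the temperature value and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def cls_contrastive_loss(rgb_cls, thermal_cls, temperature=0.07):
    """InfoNCE-style contrastive loss between paired CLS tokens.

    rgb_cls:     [B, D] CLS features from the frozen RGB teacher.
    thermal_cls: [B, D] CLS features from the trainable thermal student.
    Corresponding rows are treated as positive pairs; all other rows
    in the batch serve as negatives.
    """
    rgb = F.normalize(rgb_cls, dim=-1)
    thr = F.normalize(thermal_cls, dim=-1)
    logits = thr @ rgb.t() / temperature        # [B, B] similarity matrix
    targets = torch.arange(rgb.size(0))         # matched pairs on the diagonal
    return F.cross_entropy(logits, targets)

# Toy example with random 768-d features (ViT-B embedding size)
loss = cls_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```

Because the loss only aligns global CLS features, approximate spatial correspondence between the RGB and thermal frames is sufficient.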

Training Data

AnyThermal was trained on five diverse RGB-Thermal datasets spanning multiple environments:

  • Urban: VIVID++, STheReO, Freiburg, TartanRGBT
  • Aerial: Boson Nighttime Dataset
  • Indoor: TartanRGBT
  • Off-road: TartanRGBT

TartanRGBT is our newly introduced dataset, collected using the first open-source platform with hardware-synchronized RGB-Thermal stereo acquisition. It contributes data across indoor, off-road, and urban environments. The dataset can be found here: TartanRGBT Dataset. To learn more about the payload, please visit our Project Page.

Capabilities & Performance

AnyThermal demonstrates state-of-the-art or competitive performance across multiple thermal perception tasks. We have benchmarked its performance on three tasks:

  • Cross-Modal Place Recognition (Thermal query → RGB database)
  • Thermal Semantic Segmentation
  • Monocular Depth Estimation from Thermal
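As an illustration, cross-modal place recognition reduces to nearest-neighbor retrieval of CLS features: a thermal query embedding is matched against an RGB database embedded with the same backbone. This is a hypothetical sketch of the idea, not the benchmark protocol; feature extraction is assumed to have already been done.

```python
import torch
import torch.nn.functional as F

def retrieve(query_cls, database_cls, top_k=5):
    """Return indices of the top-k database entries per query.

    query_cls:    [Nq, D] CLS features of thermal queries.
    database_cls: [Ndb, D] CLS features of the RGB database.
    """
    q = F.normalize(query_cls, dim=-1)
    db = F.normalize(database_cls, dim=-1)
    sims = q @ db.t()                        # cosine similarity [Nq, Ndb]
    return sims.topk(top_k, dim=-1).indices

# Toy example with random 768-d embeddings
query = torch.randn(2, 768)                  # 2 thermal queries
database = torch.randn(100, 768)             # 100 RGB reference images
matches = retrieve(query, database)          # [2, 5] database indices
```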

For both quantitative and qualitative results, please visit our [Project Page](https://anythermal.github.io).

We are exploring more tasks where the backbone can be leveraged, and we look forward to hearing from the community how they think AnyThermal can push the frontiers of thermal perception.

Usage

Basic Feature Extraction

from transformers import AutoImageProcessor, AutoModel
import torch
from PIL import Image

# Load model and processor
processor = AutoImageProcessor.from_pretrained("theairlabcmu/AnyThermal")
model = AutoModel.from_pretrained("theairlabcmu/AnyThermal")

# Load thermal image (grayscale)
thermal_image = Image.open("path/to/thermal_image.png").convert("L")

# Convert to 3-channel (required for ViT architecture)
thermal_image = thermal_image.convert("RGB")

# Process and extract features
inputs = processor(images=thermal_image, return_tensors="pt")
outputs = model(**inputs)

# Get CLS token (global image representation)
cls_features = outputs.last_hidden_state[:, 0]  # Shape: [1, 768]

# Get patch features (spatial feature map)
patch_features = outputs.last_hidden_state[:, 1:]  # Shape: [1, num_patches, 768]
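For dense prediction tasks, the patch tokens can be reshaped into a 2-D feature map. A minimal standalone sketch, assuming a 224×224 input with patch size 14 (i.e., a 16×16 patch grid):

```python
import torch

# Hypothetical sketch: reshape ViT patch tokens into a spatial feature
# map for dense prediction heads. A 224x224 input with patch size 14
# yields 16x16 = 256 patches of 768-d features.
patch_features = torch.randn(1, 256, 768)    # [B, num_patches, D]
b, n, d = patch_features.shape
h = w = int(n ** 0.5)                        # 16
feature_map = patch_features.reshape(b, h, w, d).permute(0, 3, 1, 2)
# feature_map: [1, 768, 16, 16], ready for a segmentation or depth head
```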

Task-Specific Applications

Please visit our training and evaluation codebase, where we show how to use AnyThermal with three different task-specific heads. All training and evaluation were done without any task-specific fine-tuning of the backbone weights.
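The frozen-backbone setup can be sketched as a lightweight head trained on top of fixed patch features. This is an illustrative example only; the head design, class count, and resolution are assumptions, not the heads used in our codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch: a 1x1-conv segmentation head on frozen backbone
# features; only the head's parameters would be trained.
num_classes = 19                             # assumed number of classes
head = nn.Conv2d(768, num_classes, kernel_size=1)

feature_map = torch.randn(1, 768, 16, 16)    # frozen AnyThermal features
logits = F.interpolate(head(feature_map), size=(224, 224),
                       mode="bilinear", align_corners=False)
# logits: [1, num_classes, 224, 224] per-pixel class scores
```

Depth estimation follows the same pattern with a single-channel regression head instead of class logits.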

Model Strengths

Task-Agnostic: Works across multiple downstream tasks without task-specific training
Environment-Agnostic: Generalizes to indoor, outdoor, urban, off-road, and aerial scenarios
Cross-Modal: Enables thermal-to-RGB and RGB-to-thermal applications
Efficient: Single forward pass produces features for multiple tasks
Foundation Model Quality: Leverages DINOv2's strong semantic representations

Limitations

⚠️ Input Format: Requires thermal images in 3-channel format (grayscale replicated to RGB)
⚠️ Data Bias: Performance may vary on environments not well-represented in training data

Ablation Studies

For detailed results, please see the scaling graphs on our Project Page.

Impact of Training Data Diversity

Key Finding: Multi-environment training is critical. Adding TartanRGBT significantly improves performance across all tasks and domains.

Single Domain vs. Multi-Domain Training

Training on a single environment (e.g., aerial only) introduces domain bias:

  • ✓ Improves performance on that specific domain
  • ✗ Reduces performance on other domains (urban, indoor, off-road)

Conclusion: Multi-domain RGB-thermal data is essential for learning transferable thermal representations.

Citation

If you use AnyThermal in your research, please cite:

@misc{maheshwari2026anythermallearninguniversalrepresentations,
      title={AnyThermal: Towards Learning Universal Representations for Thermal Perception}, 
      author={Parv Maheshwari and Jay Karhade and Yogesh Chawla and Isaiah Adu and Florian Heisen and Andrew Porco and Andrew Jong and Yifei Liu and Santosh Pitla and Sebastian Scherer and Wenshan Wang},
      year={2026},
      eprint={2602.06203},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.06203}, 
}


License

This model is released under the BSD-3-Clause-Clear License. See the LICENSE file for details.

Acknowledgments

This work was conducted at the AirLab, Carnegie Mellon University. The model builds upon the DINOv2 foundation model from Meta AI Research.

Model Card Contact

For questions, issues, or collaboration inquiries (we hope this has sparked your interest!):


Last Updated: February 2026
