Update README.md
---
license: bsd-3-clause-clear
base_model: facebook/dinov2-base
tags:
- image-feature-extraction
- thermal-imaging
- computer-vision
- knowledge-distillation
- dinov2
- robotics
- multi-modal
datasets:
- theairlabcmu/TartanRGBT
- xjh19972/boson-nighttime
pipeline_tag: image-feature-extraction
library_name: transformers
---

# AnyThermal: Towards Learning Universal Representations for Thermal Perception

<div align="center">

[Paper](https://arxiv.org/abs/2602.06203) | [Project Website](https://anythermal.github.io/) | [Code](https://github.com/castacks/AnyThermal) | [TartanRGBT Dataset](https://huggingface.co/datasets/theairlabcmu/TartanRGBT)

</div>

## Model Description

**AnyThermal** is a task-agnostic thermal feature extraction backbone that provides robust representations across diverse environments and robotic perception tasks. Unlike existing thermal models trained on task-specific, small-scale data, AnyThermal generalizes across multiple environments (indoor, aerial, off-road, urban) and tasks without requiring task-specific fine-tuning.

### Key Innovation

AnyThermal distills knowledge from the DINOv2 visual foundation model into a thermal encoder using diverse RGB-Thermal paired data across multiple environments. This approach enables the model to learn universal thermal representations that transfer effectively to downstream tasks.

### Architecture

- **Base Model**: DINOv2 ViT-B/14 (Vision Transformer Base, patch size 14)
- **Parameters**: 86.6M
- **Training Strategy**: Knowledge distillation from a frozen RGB DINOv2 teacher to a trainable thermal student
- **Input**: Thermal images (converted to 3-channel for compatibility)
- **Output**: 768-dimensional feature embeddings per patch + CLS token
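
As a quick sanity check on these numbers (illustrative arithmetic, not from the paper): a 224×224 input with patch size 14 yields a 16×16 patch grid, so the transformer sequence is one CLS token plus 256 patch tokens, each 768-dimensional.

```python
# Illustrative token-count arithmetic for a ViT-B/14 backbone
# (assumes a 224x224 input; other resolutions change the grid).
image_size, patch_size, hidden_dim = 224, 14, 768

grid = image_size // patch_size   # 16 patches per side
num_patches = grid * grid         # 256 patch tokens
seq_len = 1 + num_patches         # 257 tokens: 1 CLS + 256 patches

print(grid, num_patches, seq_len, hidden_dim)  # 16 256 257 768
```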

## Training Details

### Knowledge Distillation Process

AnyThermal uses a teacher-student distillation framework:

1. **Teacher Network**: Frozen DINOv2-Base pretrained on RGB images
2. **Student Network**: Trainable DINOv2-Base initialized with RGB weights, processes thermal images
3. **Loss Function**: Contrastive loss on CLS token features from corresponding RGB-thermal pairs
4. **Key Insight**: CLS tokens capture global semantics rather than low-level visual features (like color), making them ideal for cross-modal alignment

This approach relaxes the need for perfect pixel-level alignment or precise synchronization, enabling distillation from datasets with approximate correspondences.
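
For intuition, here is a minimal sketch of an InfoNCE-style contrastive objective over paired CLS features. This is our illustration only: the `info_nce` helper and the temperature value are assumptions for the sketch, and the paper's exact loss formulation may differ.

```python
import math

def info_nce(sim, temperature=0.07):
    """InfoNCE-style loss over a batch similarity matrix.

    sim[i][j] is the cosine similarity between thermal CLS feature i
    and RGB CLS feature j; diagonal entries are the matched pairs.
    """
    n, loss = len(sim), 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax of the positive pair
    return loss / n

# Matched pairs on the diagonal give a much lower loss than mismatched ones
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each thermal CLS embedding toward its paired RGB embedding while pushing it away from the other images in the batch, which is what makes approximate (rather than pixel-perfect) correspondence sufficient.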

### Training Data

AnyThermal was trained on **five diverse RGB-Thermal datasets** spanning multiple environments:

| Environment | Datasets | Description |
|-------------|----------|-------------|
| **Urban** | VIVID++, STheReO, Freiburg, TartanRGBT | Driving/walking scenarios with varied lighting and weather on urban roads, campuses, and parks |
| **Aerial** | Boson Nighttime Dataset | Elevated perspectives for mapping and surveillance |
| **Indoor** | TartanRGBT | Buildings with diverse thermal signatures |
| **Off-road** | TartanRGBT | Natural terrain with vegetation and obstacles |

**TartanRGBT** is our newly introduced dataset, collected using the first open-source platform with hardware-synchronized RGB-Thermal stereo acquisition. It contributes data across indoor, off-road, and urban environments.

The dataset can be found here: [TartanRGBT Dataset](https://huggingface.co/datasets/theairlabcmu/TartanRGBT)

To learn more about the payload, please visit our project page: [Project Page](https://anythermal.github.io/)

## Capabilities & Performance

AnyThermal demonstrates **state-of-the-art or competitive performance** across multiple thermal perception tasks. We have benchmarked its performance on three tasks:

- Cross-Modal Place Recognition (Thermal query → RGB database)
- Thermal Semantic Segmentation
- Monocular Depth Estimation from Thermal

For both quantitative and qualitative results, please visit our [Project Page](https://anythermal.github.io/).

We are exploring more tasks where the backbone can be leveraged, and we look forward to learning from the community how AnyThermal can push the frontiers of thermal perception.
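
Cross-modal place recognition, for example, reduces to nearest-neighbour search over CLS embeddings: embed the thermal query, embed the RGB database, and rank by cosine similarity. A self-contained sketch with toy vectors (a real system would use the 768-dimensional CLS features):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query, database):
    """Index of the database embedding most similar to the query."""
    return max(range(len(database)), key=lambda i: cosine(query, database[i]))

# Toy example: the query is closest to database entry 1
db = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
best = retrieve([1.0, 0.0, 0.0], db)
```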

## Usage

### Basic Feature Extraction

```python
from transformers import AutoImageProcessor, AutoModel
import torch
from PIL import Image

# Load model and processor
processor = AutoImageProcessor.from_pretrained("theairlabcmu/AnyThermal")
model = AutoModel.from_pretrained("theairlabcmu/AnyThermal")

# Load thermal image (grayscale)
thermal_image = Image.open("path/to/thermal_image.png").convert("L")

# Convert to 3-channel (required for the ViT architecture)
thermal_image = thermal_image.convert("RGB")

# Process and extract features
inputs = processor(images=thermal_image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get CLS token (global image representation)
cls_features = outputs.last_hidden_state[:, 0]  # Shape: [1, 768]

# Get patch features (spatial feature map)
patch_features = outputs.last_hidden_state[:, 1:]  # Shape: [1, num_patches, 768]
```
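
For dense tasks such as segmentation or depth, the patch tokens are typically reshaped into a 2-D feature map before feeding a task head. A sketch, assuming a 224×224 input so the patch grid is 16×16 (a random tensor stands in for real model output):

```python
import torch

# Stand-in for outputs.last_hidden_state[:, 1:] from the snippet above
patch_features = torch.randn(1, 256, 768)  # [batch, num_patches, channels]

b, n, c = patch_features.shape
h = w = int(n ** 0.5)  # 16x16 grid for a 224x224 input with patch size 14

# [B, N, C] -> [B, C, H, W], the layout convolutional heads expect
feature_map = patch_features.transpose(1, 2).reshape(b, c, h, w)
```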

### Task-Specific Applications

Please visit our [training and evaluation codebase](https://github.com/castacks/AnyThermal), where we show how to use AnyThermal with three different task-specific heads. All training and evaluation were done without any task-specific fine-tuning of the backbone weights.

## Model Strengths

✅ **Task-Agnostic**: Works across multiple downstream tasks without task-specific training

✅ **Environment-Agnostic**: Generalizes to indoor, outdoor, urban, off-road, and aerial scenarios

✅ **Cross-Modal**: Enables thermal-to-RGB and RGB-to-thermal applications

✅ **Efficient**: Single forward pass produces features for multiple tasks

✅ **Foundation Model Quality**: Leverages DINOv2's strong semantic representations

## Limitations

⚠️ **Input Format**: Requires thermal images in 3-channel format (grayscale replicated to RGB)

⚠️ **Data Bias**: Performance may vary on environments not well-represented in the training data

## Ablation Studies

For detailed results, please see the scaling graphs on our [Project Page](https://anythermal.github.io/).

### Impact of Training Data Diversity

**Key Finding**: Multi-environment training is critical. Adding TartanRGBT significantly improves performance across all tasks and domains.

### Single Domain vs. Multi-Domain Training

Training on a single environment (e.g., aerial only) introduces domain bias:

- ✓ Improves performance on that specific domain
- ✗ Reduces performance on other domains (urban, indoor, off-road)

**Conclusion**: Multi-domain RGB-thermal data is essential for learning transferable thermal representations.

## Citation

If you use AnyThermal in your research, please cite:

```bibtex
@misc{maheshwari2026anythermallearninguniversalrepresentations,
  title={AnyThermal: Towards Learning Universal Representations for Thermal Perception},
  author={Parv Maheshwari and Jay Karhade and Yogesh Chawla and Isaiah Adu and Florian Heisen and Andrew Porco and Andrew Jong and Yifei Liu and Santosh Pitla and Sebastian Scherer and Wenshan Wang},
  year={2026},
  eprint={2602.06203},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.06203},
}
```

## Related Resources

- **Paper**: [arXiv:2602.06203](https://arxiv.org/abs/2602.06203)
- **Project Website**: [https://anythermal.github.io/](https://anythermal.github.io/)
- **TartanRGBT Dataset**: [HuggingFace Dataset](https://huggingface.co/datasets/theairlabcmu/TartanRGBT)
- **Data Collection Platform**: [GitHub Repository](https://github.com/AnyThermal/tartan_rgbt_ws)
- **Base Model**: [DINOv2-Base](https://huggingface.co/facebook/dinov2-base)

## License

This model is released under the **BSD-3-Clause-Clear License**. See the [LICENSE](LICENSE) file for details.

## Acknowledgments

This work was conducted at the AirLab, Carnegie Mellon University. The model builds upon the DINOv2 foundation model from Meta AI Research.

## Model Card Contact

For questions, issues, or collaboration inquiries (we hope this has sparked your interest!):

- **Email**: parvm@andrew.cmu.edu
- **GitHub Issues**: [AnyThermal Repository](https://github.com/AnyThermal)
- **Project Website**: [https://anythermal.github.io/](https://anythermal.github.io/)

---

*Last Updated: February 2026*