airlabshare committed on
Commit 7002f63 · verified · 1 Parent(s): 71ed3f5

Update README.md

README.md CHANGED
@@ -1,81 +1,193 @@
  ---
  license: bsd-3-clause-clear
- base_model:
- - facebook/dinov2-base
  datasets:
  - theairlabcmu/TartanRGBT
  - xjh19972/boson-nighttime
  pipeline_tag: image-feature-extraction
  ---
- # Overview
-
- AnyThermal is a task-agnostic thermal feature extraction backbone developed to provide robust representations across diverse environments and robotic perception tasks.
- It addresses the scarcity of thermal data by distilling knowledge from a vision foundation model (DINOv2) into a thermal encoder.
- ## Model Details
-
- ### Model Description
-
- - **Finetuned from model :** DINOv2-Base
-
- ### Model Sources
-
- - **Repository:** TBD
- - **Paper :** AnyThermal: Towards Learning Universal Representations for Thermal Perception
- - **
- - **Project Website :** [anythermal.github.io](https://anythermal.github.io/)
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- PARV_TODO
-
- ### Downstream Use [optional]
-
- TODO
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- PARV_TODO
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- PARV_TODO
-
- ### Training Procedure
-
- PARV_TODO
-
- #### Preprocessing [optional]
-
- PARV_TODO
-
- ## Citation [optional]
-
- **BibTeX:**
-
- TODO after arxiv is up
-
- ## Model Card Authors
-
- Parv Maheshwari
  ## Model Card Contact
-
- parvm@andrew.cmu.edu
  ---
  license: bsd-3-clause-clear
+ base_model: facebook/dinov2-base
+ tags:
+ - image-feature-extraction
+ - thermal-imaging
+ - computer-vision
+ - knowledge-distillation
+ - dinov2
+ - robotics
+ - multi-modal
  datasets:
  - theairlabcmu/TartanRGBT
  - xjh19972/boson-nighttime
  pipeline_tag: image-feature-extraction
+ library_name: transformers
  ---

+ # AnyThermal: Towards Learning Universal Representations for Thermal Perception

+ <div align="center">
+
+ [![arXiv](https://img.shields.io/badge/arXiv-2602.06203-b31b1b.svg)](https://arxiv.org/abs/2602.06203)
+ [![Project Page](https://img.shields.io/badge/Project-Page-green)](https://anythermal.github.io/)
+ [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/castacks/AnyThermal)
+ [![HF Dataset](https://img.shields.io/badge/🤗-TartanRGBT_Dataset-yellow)](https://huggingface.co/datasets/theairlabcmu/TartanRGBT)
+
+ </div>
+ ## Model Description
+
+ **AnyThermal** is a task-agnostic thermal feature extraction backbone that provides robust representations across diverse environments and robotic perception tasks. Unlike existing thermal models trained on task-specific, small-scale data, AnyThermal generalizes across multiple environments (indoor, aerial, off-road, urban) and tasks without requiring task-specific fine-tuning.
+
+ ### Key Innovation
+
+ AnyThermal distills knowledge from the DINOv2 visual foundation model into a thermal encoder using diverse RGB-Thermal paired data across multiple environments. This approach enables the model to learn universal thermal representations that transfer effectively to downstream tasks.
+
+ ### Architecture
+
+ - **Base Model**: DINOv2 ViT-B/14 (Vision Transformer Base, patch size 14)
+ - **Parameters**: 86.6M
+ - **Training Strategy**: Knowledge distillation from a frozen RGB DINOv2 teacher to a trainable thermal student
+ - **Input**: Thermal images (converted to 3-channel for compatibility)
+ - **Output**: 768-dimensional feature embeddings per patch + CLS token
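As a sanity check on the shapes above, the token count for a ViT-B/14 backbone follows directly from the input resolution. A minimal sketch, assuming a 224×224 input (DINOv2 also accepts other multiples of the patch size):

```python
# Token/feature shapes for a ViT-B/14 backbone (224x224 input assumed).
patch_size = 14
image_size = 224      # assumed resolution; any multiple of 14 works
embed_dim = 768       # ViT-Base hidden size

patches_per_side = image_size // patch_size   # 16
num_patches = patches_per_side ** 2           # 256 patch tokens
num_tokens = num_patches + 1                  # +1 CLS token -> 257

print(num_patches, num_tokens, embed_dim)     # 256 257 768
```

So `last_hidden_state` for one image has shape `[1, 257, 768]`: one CLS token followed by 256 patch tokens.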
 
+ ## Training Details
+
+ ### Knowledge Distillation Process
+
+ AnyThermal uses a teacher-student distillation framework:
+
+ 1. **Teacher Network**: Frozen DINOv2-Base pretrained on RGB images
+ 2. **Student Network**: Trainable DINOv2-Base initialized with RGB weights, processing thermal images
+ 3. **Loss Function**: Contrastive loss on CLS token features from corresponding RGB-thermal pairs
+ 4. **Key Insight**: CLS tokens capture global semantics rather than low-level visual features (like color), making them ideal for cross-modal alignment
+
+ This approach relaxes the need for perfect pixel-level alignment or precise synchronization, enabling distillation from datasets with approximate correspondences.
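The CLS-token contrastive step above can be sketched as a symmetric InfoNCE-style loss over a batch of paired CLS features. This is an illustrative NumPy sketch only; the temperature, batch construction, and exact loss formulation used in the paper may differ:

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log-sum-exp for the softmax normalizer
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def cls_contrastive_loss(rgb_cls, thermal_cls, temperature=0.07):
    """Symmetric InfoNCE-style loss pulling each thermal CLS token toward
    its paired RGB CLS token and away from other pairs in the batch."""
    rgb = rgb_cls / np.linalg.norm(rgb_cls, axis=1, keepdims=True)
    thr = thermal_cls / np.linalg.norm(thermal_cls, axis=1, keepdims=True)
    logits = thr @ rgb.T / temperature                  # [batch, batch] cosine sims
    diag = np.arange(len(logits))
    log_p_t2r = logits - logsumexp(logits, axis=1)      # thermal -> RGB direction
    log_p_r2t = logits.T - logsumexp(logits.T, axis=1)  # RGB -> thermal direction
    return -0.5 * (log_p_t2r[diag, diag] + log_p_r2t[diag, diag]).mean()

# Toy batch: thermal CLS features are noisy copies of their RGB counterparts
rng = np.random.default_rng(0)
rgb_cls = rng.normal(size=(8, 768))
thermal_cls = rgb_cls + 0.1 * rng.normal(size=(8, 768))

loss = cls_contrastive_loss(rgb_cls, thermal_cls)
print(float(loss))  # near zero: each thermal feature already matches its RGB pair
```

Because the loss only compares one global token per image pair, it tolerates the approximate correspondences mentioned above.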
 
+ ### Training Data
+
+ AnyThermal was trained on **five diverse RGB-Thermal datasets** spanning multiple environments:
+
+ | Environment | Datasets | Description |
+ |------------|----------|-------------|
+ | **Urban** | VIVID++, STheReO, Freiburg, TartanRGBT | Driving/walking scenarios with varied lighting and weather on urban roads, campuses, and parks |
+ | **Aerial** | Boson Nighttime Dataset | Elevated perspectives for mapping and surveillance |
+ | **Indoor** | TartanRGBT | Buildings with diverse thermal signatures |
+ | **Off-road** | TartanRGBT | Natural terrain with vegetation and obstacles |
+
+ **TartanRGBT** is our newly introduced dataset, collected using the first open-source platform with hardware-synchronized RGB-Thermal stereo acquisition. It contributes data across indoor, off-road, and urban environments.
+ The dataset can be found here: [TartanRGBT Dataset](https://huggingface.co/datasets/theairlabcmu/TartanRGBT)
+ To learn more about the payload, please visit our project page: [Project Page](https://anythermal.github.io/)
 
 
+ ## Capabilities & Performance
+
+ AnyThermal demonstrates **state-of-the-art or competitive performance** across multiple thermal perception tasks. We have benchmarked it on three tasks:
+
+ - Cross-Modal Place Recognition (Thermal query → RGB database)
+ - Thermal Semantic Segmentation
+ - Monocular Depth Estimation from Thermal
+
+ For both quantitative and qualitative results, please visit our [Project Page](https://anythermal.github.io/).
+
+ We are exploring more tasks where the backbone can be leveraged, and we look forward to learning from the community how AnyThermal can push the frontiers of thermal perception.
+ ## Usage
+
+ ### Basic Feature Extraction
+
+ ```python
+ from transformers import AutoImageProcessor, AutoModel
+ import torch
+ from PIL import Image
+
+ # Load model and processor
+ processor = AutoImageProcessor.from_pretrained("theairlabcmu/AnyThermal")
+ model = AutoModel.from_pretrained("theairlabcmu/AnyThermal")
+
+ # Load thermal image (grayscale)
+ thermal_image = Image.open("path/to/thermal_image.png").convert("L")
+
+ # Convert to 3-channel (required for the ViT architecture)
+ thermal_image = thermal_image.convert("RGB")
+
+ # Process and extract features
+ inputs = processor(images=thermal_image, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Get CLS token (global image representation)
+ cls_features = outputs.last_hidden_state[:, 0]  # Shape: [1, 768]
+
+ # Get patch features (spatial feature map)
+ patch_features = outputs.last_hidden_state[:, 1:]  # Shape: [1, num_patches, 768]
+ ```
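The CLS features extracted above can drive cross-modal place recognition (thermal query against an RGB database) with plain cosine similarity. A minimal NumPy sketch, using random vectors as stand-ins for real CLS embeddings:

```python
import numpy as np

def retrieve(query, database):
    """Return database indices ranked by cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q            # cosine similarity per database entry
    return np.argsort(-scores)  # best match first

rng = np.random.default_rng(0)
database = rng.normal(size=(100, 768))              # stand-in for RGB CLS features
query = database[42] + 0.1 * rng.normal(size=768)   # noisy "thermal view" of entry 42

ranking = retrieve(query, database)
print(ranking[0])  # 42 -- the matching place ranks first
```

In practice, `database` would hold CLS features of RGB reference images and `query` the CLS feature of a thermal frame, both produced by the snippet above.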
+ ### Task-Specific Applications
+
+ Please visit our [training and evaluation codebase](https://github.com/castacks/AnyThermal), where we show how to use AnyThermal with three different task-specific heads. All training and evaluation were done without any task-specific fine-tuning of the backbone weights.
+ ## Model Strengths
+
+ ✅ **Task-Agnostic**: Works across multiple downstream tasks without task-specific training
+ ✅ **Environment-Agnostic**: Generalizes to indoor, outdoor, urban, off-road, and aerial scenarios
+ ✅ **Cross-Modal**: Enables thermal-to-RGB and RGB-to-thermal applications
+ ✅ **Efficient**: A single forward pass produces features for multiple tasks
+ ✅ **Foundation Model Quality**: Leverages DINOv2's strong semantic representations
+ ## Limitations
+
+ ⚠️ **Input Format**: Requires thermal images in 3-channel format (grayscale replicated to RGB)
+ ⚠️ **Data Bias**: Performance may vary on environments not well represented in the training data
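The 3-channel requirement is simple channel replication. A minimal NumPy sketch, assuming a raw single-channel frame; normalization details are up to your pipeline:

```python
import numpy as np

# Hypothetical single-channel thermal frame (e.g. 16-bit radiometric values)
thermal = np.random.randint(0, 2**16, size=(224, 224), dtype=np.uint16)

# Normalize to [0, 1], then replicate the channel to the 3-channel layout
# the ViT backbone expects
norm = (thermal - thermal.min()) / max(thermal.max() - thermal.min(), 1)
three_channel = np.repeat(norm[:, :, None], 3, axis=2)

print(three_channel.shape)  # (224, 224, 3)
```

This mirrors what `Image.convert("RGB")` does for a grayscale PIL image in the usage snippet above.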
+ ## Ablation Studies
+
+ For detailed results, please see the scaling graphs on our [Project Page](https://anythermal.github.io/).
+
+ ### Impact of Training Data Diversity
+
+ **Key Finding**: Multi-environment training is critical. Adding TartanRGBT significantly improves performance across all tasks and domains.
+
+ ### Single Domain vs. Multi-Domain Training
+
+ Training on a single environment (e.g., aerial only) introduces domain bias:
+ - ✓ Improves performance on that specific domain
+ - ✗ Reduces performance on other domains (urban, indoor, off-road)
+
+ **Conclusion**: Multi-domain RGB-thermal data is essential for learning transferable thermal representations.
+ ## Citation
+
+ If you use AnyThermal in your research, please cite:
+
+ ```bibtex
+ @misc{maheshwari2026anythermallearninguniversalrepresentations,
+   title={AnyThermal: Towards Learning Universal Representations for Thermal Perception},
+   author={Parv Maheshwari and Jay Karhade and Yogesh Chawla and Isaiah Adu and Florian Heisen and Andrew Porco and Andrew Jong and Yifei Liu and Santosh Pitla and Sebastian Scherer and Wenshan Wang},
+   year={2026},
+   eprint={2602.06203},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2602.06203},
+ }
+ ```
+ ## Related Resources
+
+ - **Paper**: [arXiv:2602.06203](https://arxiv.org/abs/2602.06203)
+ - **Project Website**: [https://anythermal.github.io/](https://anythermal.github.io/)
+ - **TartanRGBT Dataset**: [HuggingFace Dataset](https://huggingface.co/datasets/theairlabcmu/TartanRGBT)
+ - **Data Collection Platform**: [GitHub Repository](https://github.com/AnyThermal/tartan_rgbt_ws)
+ - **Base Model**: [DINOv2-Base](https://huggingface.co/facebook/dinov2-base)
+
+ ## License
+
+ This model is released under the **BSD-3-Clause-Clear License**. See the [LICENSE](LICENSE) file for details.
+
+ ## Acknowledgments
+
+ This work was conducted at the AirLab, Carnegie Mellon University. The model builds upon the DINOv2 foundation model from Meta AI Research.
  ## Model Card Contact

+ For questions, issues, or collaboration inquiries:
+ - **Email**: parvm@andrew.cmu.edu
+ - **GitHub Issues**: [AnyThermal Repository](https://github.com/AnyThermal)
+ - **Project Website**: [https://anythermal.github.io/](https://anythermal.github.io/)
+
+ ---
+
+ *Last Updated: February 2026*