---
license: mit
tags:
- masked-autoencoders
- knowledge-distillation
- contrastive-learning
- self-supervised-learning
- vehicle-centric
- clip
language:
- en
---

# VC-SCMAE

Official page for the paper:

"VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder"

Published in Expert Systems with Applications (Elsevier)

DOI: https://doi.org/10.1016/j.eswa.2026.131646

GitHub repository:  
https://github.com/AlexMaks02/VC-SCMAE

---
## Pipeline
![Pipeline](./pipeline.png)

## Highlights
- Proposes a self-supervised pre-training framework for vehicle-centric visual tasks.
- Extends CGD-MAE with richer data analysis and an enhanced pre-training design.
- Unifies masked-contrastive and CLIP-guided semantic objectives via feature fusion.
- Ablation and qualitative results validate the proposed design.
- Improves on state-of-the-art results on vehicle-centric benchmarks under both fine-tuning and linear probing.

## Abstract
In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework.
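
The three ingredients named in the abstract (masked image modeling, instance-level discrimination, and distillation from a teacher's logits) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names, the InfoNCE and KL forms, the temperatures, and the weights `w_contrast` and `w_distill` are all assumptions chosen for illustration, and the paper's specialized distillation losses and feature fusion are not reproduced here. See the GitHub repository for the actual code.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def log_softmax(logits: Vector) -> list:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def l2_normalize(v: Vector) -> list:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def masked_mse(pred: Sequence[Vector], target: Sequence[Vector], mask: Sequence[int]) -> float:
    """MAE-style reconstruction term: mean squared error over hidden patches only.

    mask[i] == 1 means patch i was masked out and must be reconstructed.
    """
    per_patch = [
        sum((p - t) ** 2 for p, t in zip(pp, tp)) / len(pp)
        for pp, tp in zip(pred, target)
    ]
    hidden = [err for err, m in zip(per_patch, mask) if m]
    return sum(hidden) / len(hidden)


def info_nce(anchors: Sequence[Vector], positives: Sequence[Vector], temperature: float = 0.2) -> float:
    """Instance-level contrastive term (generic InfoNCE, an assumption).

    anchors[i] and positives[i] are embeddings of two views of the same image;
    every other row in the batch acts as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        total -= log_softmax(logits)[i]
    return total / len(a)


def distill_kl(student_logits: Vector, teacher_logits: Vector, temperature: float = 1.0) -> float:
    """Distillation term: KL(teacher || student) over softened distributions."""
    s = log_softmax([x / temperature for x in student_logits])
    t = log_softmax([x / temperature for x in teacher_logits])
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def combined_loss(recon: float, contrast: float, distill: float,
                  w_contrast: float = 1.0, w_distill: float = 1.0) -> float:
    """Weighted sum of the three terms; the weights are hypothetical."""
    return recon + w_contrast * contrast + w_distill * distill


if __name__ == "__main__":
    # Toy values only: two patches, two images, three text prompts.
    recon = masked_mse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [0.0, 0.0]], mask=[0, 1])
    contrast = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]])
    distill = distill_kl([2.0, 0.5, 0.1], [2.5, 0.3, 0.0])
    print(combined_loss(recon, contrast, distill))
```

In this sketch the teacher logits would come from a frozen CLIP model scoring an image against a set of text prompts, which is one way to use an unpaired text corpus without aligned image–text pairs; how VC-SCMAE actually builds and enhances those logits is described in the paper.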


## Citation
```bibtex
@article{MARQUES2026131646,
    title = {VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder},
    journal = {Expert Systems with Applications},
    volume = {315},
    pages = {131646},
    year = {2026},
    issn = {0957-4174},
    doi = {10.1016/j.eswa.2026.131646},
    url = {https://www.sciencedirect.com/science/article/pii/S0957417426005592},
    author = {Alexandre Marques and Pedro Ferreira and Bruno Silva and Jorge Batista},
    keywords = {Masked autoencoders, Knowledge distillation, Contrastive learning, Self-supervised learning, Vehicle-centric pre-training, CLIP},
    abstract = {In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework.}
}
```