Update README.md
README.md (changed)
@@ -1,14 +1,66 @@
 ---
-license:
-VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder
-DOI: https://doi.org/10.1016/j.eswa.2026.131646
+license: mit
 tags:
--
--
--
--
--
--
+- masked-autoencoders
+- knowledge-distillation
+- contrastive-learning
+- self-supervised-learning
+- vehicle-centric
+- clip
 language:
 - en
----
+---

# VC-SCMAE

Official page for the paper:

"VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder"

Published in Expert Systems with Applications (Elsevier).

---

## Pipeline

---

## Paper

DOI: https://doi.org/10.1016/j.eswa.2026.131646

---

## Code

GitHub repository: https://github.com/AlexMaks02/VC-SCMAE

---

## Highlights

- Proposes a self-supervised pre-training framework for vehicle-centric visual tasks.
- Extends CGD-MAE with richer data analysis and an enhanced pre-training design.
- Unifies masked-contrastive and CLIP-guided semantic objectives via feature fusion.
- Ablation and qualitative results validate the proposed design.
- Improves the state of the art on vehicle-centric benchmarks under both fine-tuning and linear probing.
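
The highlights mention unifying a masked-reconstruction objective with a contrastive one. The following is a minimal, pure-Python sketch of those two generic ingredients, not the VC-SCMAE implementation (see the GitHub repository for that). It assumes MAE-style random patch masking with a 75% mask ratio (the MAE default), a mean-squared reconstruction loss computed on masked patches only, and an InfoNCE instance-discrimination loss. The helper names, the 0.2 temperature, and the additive `contrastive_weight` are hypothetical, and the feature-fusion step is not modelled.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def random_masking(num_patches: int, mask_ratio: float,
                   rng: random.Random) -> Tuple[List[int], List[int]]:
    """MAE-style random masking: shuffle patch indices, keep the first (1 - mask_ratio)."""
    num_keep = int(num_patches * (1.0 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[:num_keep]), sorted(order[num_keep:])


def masked_mse(pred: Sequence[Vector], target: Sequence[Vector],
               masked: Sequence[int]) -> float:
    """Mean squared reconstruction error over the masked patches only."""
    errors = [(p - t) ** 2 for i in masked for p, t in zip(pred[i], target[i])]
    return sum(errors) / len(errors)


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(queries: Sequence[Vector], keys: Sequence[Vector],
             temperature: float = 0.2) -> float:
    """Instance discrimination: query i must pick key i out of all keys in the batch."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(queries)


def masked_contrastive_loss(pred, target, masked, queries, keys,
                            contrastive_weight: float = 1.0) -> float:
    """Hypothetical combined objective: reconstruction term plus weighted InfoNCE term."""
    return masked_mse(pred, target, masked) + contrastive_weight * info_nce(queries, keys)


rng = random.Random(0)
visible, masked = random_masking(num_patches=16, mask_ratio=0.75, rng=rng)
print(len(visible), len(masked))  # 4 12
```

A plain sum of the two losses is the simplest way to train one encoder on both signals; how the paper actually weights and fuses them is not described on this page.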

---

## Abstract

In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework.
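
The abstract describes transferring semantics from a pre-trained CLIP model through open-vocabulary logits over an unpaired text corpus. One generic way to express such logit distillation is sketched below, under stated assumptions. Both the CLIP image embedding (teacher) and the student embedding are scored against a bank of text embeddings, and the student is pushed to match the teacher's softmax distribution via KL divergence. The sketch assumes the student has already been projected into CLIP's embedding space. The function names and the 0.07 temperature are assumptions, and the paper's specialized distillation losses are not reproduced here.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def text_logits(image_emb: Sequence[float], text_bank: Sequence[Sequence[float]],
                temperature: float) -> Vector:
    """Open-vocabulary logits: scaled cosine similarity between one image embedding
    and every embedding in an (unpaired) text bank."""
    img = l2_normalize(image_emb)
    return [sum(a * b for a, b in zip(img, l2_normalize(t))) / temperature
            for t in text_bank]


def log_softmax(logits: Sequence[float]) -> Vector:
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def logit_distillation_kl(student_emb: Sequence[float], teacher_emb: Sequence[float],
                          text_bank: Sequence[Sequence[float]],
                          temperature: float = 0.07) -> float:
    """KL(teacher || student) between the two softmax distributions over the text bank."""
    log_p = log_softmax(text_logits(teacher_emb, text_bank, temperature))
    log_q = log_softmax(text_logits(student_emb, text_bank, temperature))
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy 2-D "text embeddings"; real CLIP embeddings are much higher-dimensional.
bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher = [1.0, 0.2]
print(logit_distillation_kl(teacher, teacher, bank))  # 0.0
```

Because the loss only compares distributions over text prompts, it needs no paired image-text data: any image can be scored against any text bank, which matches the unpaired setting described above.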

---

## Citation

```bibtex
@article{MARQUES2026131646,
  title = {VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder},
  journal = {Expert Systems with Applications},
  volume = {315},
  pages = {131646},
  year = {2026},
  issn = {0957-4174},
  doi = {10.1016/j.eswa.2026.131646},
  url = {https://www.sciencedirect.com/science/article/pii/S0957417426005592},
  author = {Alexandre Marques and Pedro Ferreira and Bruno Silva and Jorge Batista},
  keywords = {Masked autoencoders, Knowledge distillation, Contrastive learning, Self-supervised learning, Vehicle-centric pre-training, CLIP},
  abstract = {In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework.}
}
```
|