@@ -1,14 +1,66 @@
 ---
-license: cc-by-nc-4.0
-VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder
-DOI: https://doi.org/10.1016/j.eswa.2026.131646
 tags:
-- Masked_Autoencoders
-- Knowledge_Distillation
-- Contrastive_Learning
-- Self-Supervised_Learning
-- Vehicle-Centric_Pre-Training
-- CLIP
 language:
 - en
----

 ---
+license: mit
 tags:
+- masked-autoencoders
+- knowledge-distillation
+- contrastive-learning
+- self-supervised-learning
+- vehicle-centric
+- clip
 language:
 - en
+---
+# VC-SCMAE
+Official page for the paper "VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder", published in Expert Systems with Applications (Elsevier).
+
+---
+## Pipeline
+![Pipeline](./pipeline.png)
+
+---
+## Paper
+DOI: https://doi.org/10.1016/j.eswa.2026.131646
+
+---
+## Code
+GitHub repository:
+https://github.com/AlexMaks02/VC-SCMAE
+
+---
+## Highlights
+- Proposes a self-supervised pre-training framework for vehicle-centric visual tasks.
+- Extends CGD-MAE with richer data analysis and an enhanced pre-training design.
+- Unifies masked-contrastive and CLIP-guided semantic objectives via feature fusion.
+- Ablation and qualitative results validate the proposed design.
+- Improves on the state of the art on vehicle-centric benchmarks under both fine-tuning and linear probing.
+---
+## Abstract
+In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework.
+
+---
+## Citation
+```bibtex
+@article{MARQUES2026131646,
+    title = {VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder},
+    journal = {Expert Systems with Applications},
+    volume = {315},
+    pages = {131646},
+    year = {2026},
+    issn = {0957-4174},
+    doi = {10.1016/j.eswa.2026.131646},
+    url = {https://www.sciencedirect.com/science/article/pii/S0957417426005592},
+    author = {Alexandre Marques and Pedro Ferreira and Bruno Silva and Jorge Batista},
+    keywords = {Masked autoencoders, Knowledge distillation, Contrastive learning, Self-supervised learning, Vehicle-centric pre-training, CLIP},
+    abstract = {In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework.}
+}
+```