@@ -19,23 +19,15 @@ Official page for the paper:
 Published in Expert Systems with Applications (Elsevier)
----
-## Pipeline
-![Pipeline](./pipeline.png)
----
-## Paper
 DOI: https://doi.org/10.1016/j.eswa.2026.131646
----
-## Code
 GitHub repository:
 https://github.com/AlexMaks02/VC-SCMAE
 ---
 ## Highlights
 - Proposes a self-supervised pre-training framework for vehicle-centric visual tasks.
 - Extends CGD-MAE with richer data analysis and an enhanced pre-training design.
@@ -43,11 +35,10 @@ https://github.com/AlexMaks02/VC-SCMAE
 - Ablation and qualitative results validate the proposed design.
 - Improves on state-of-the-art results on vehicle-centric benchmarks under both fine-tuning and linear probing.
----
 ## Abstract
 In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework.
----
 ## Citation
 ```bibtex
 @article{MARQUES2026131646,

 Published in Expert Systems with Applications (Elsevier)
 DOI: https://doi.org/10.1016/j.eswa.2026.131646
 GitHub repository:
 https://github.com/AlexMaks02/VC-SCMAE
 ---
+## Pipeline
+![Pipeline](./pipeline.png)
 ## Highlights
 - Proposes a self-supervised pre-training framework for vehicle-centric visual tasks.
 - Extends CGD-MAE with richer data analysis and an enhanced pre-training design.
 - Ablation and qualitative results validate the proposed design.
 - Improves on state-of-the-art results on vehicle-centric benchmarks under both fine-tuning and linear probing.
 ## Abstract
 In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework.
 ## Citation
 ```bibtex
 @article{MARQUES2026131646,