| --- |
| license: mit |
| tags: |
| - masked-autoencoders |
| - knowledge-distillation |
| - contrastive-learning |
| - self-supervised-learning |
| - vehicle-centric |
| - clip |
| language: |
| - en |
| --- |
| |
| # VC-SCMAE |
|
|
| Official page for the paper: |
|
|
| "VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder" |
|
|
| Published in Expert Systems with Applications (Elsevier) |
|
|
| DOI: https://doi.org/10.1016/j.eswa.2026.131646 |
|
|
| GitHub repository: |
| https://github.com/AlexMaks02/VC-SCMAE |
|
|
| --- |
| ## Pipeline |
|  |
|
|
| ## Highlights |
| - Proposes a self-supervised pre-training framework for vehicle-centric visual tasks. |
| - Extends CGD-MAE with richer data analysis and an enhanced pre-training design. |
| - Unifies masked-contrastive and CLIP-guided semantic objectives via feature fusion. |
| - Ablation and qualitative results validate the proposed design. |
| - Improves on the state of the art on vehicle-centric benchmarks under both fine-tuning and linear probing. |
|
|
| ## Abstract |
| In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework. |
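
The three objectives named in the abstract (masked image modeling, instance-level discrimination, and CLIP-style distillation against an unpaired text corpus) can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names, temperatures, and loss weights are hypothetical, and a plain KL divergence on image-to-text logits stands in for the paper's specialized distillation losses, which are not described on this page.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def masked_reconstruction_loss(pred, target, mask):
    """MAE-style mean squared error, averaged over masked patches only.
    pred, target: (B, N, D) patch values; mask: (B, N) with 1 = masked."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return (per_patch * mask).sum() / mask.sum()

def info_nce_loss(z_a, z_b, temperature=0.2):
    """Instance discrimination: z_a[i] and z_b[i] embed two views of image i,
    and every other row in the batch acts as a negative."""
    logits = l2_normalize(z_a) @ l2_normalize(z_b).T / temperature
    return -np.diag(log_softmax(logits, axis=1)).mean()

def logit_distillation_loss(student_img, teacher_img, text_emb, temperature=0.07):
    """KL(teacher || student) over image-to-text similarity logits.
    text_emb is a bank of text embeddings that is NOT paired with the images."""
    text_emb = l2_normalize(text_emb)
    t_logp = log_softmax(l2_normalize(teacher_img) @ text_emb.T / temperature)
    s_logp = log_softmax(l2_normalize(student_img) @ text_emb.T / temperature)
    return (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=1).mean()

def total_loss(pred, target, mask, z_a, z_b, student_img, teacher_img, text_emb,
               w_con=1.0, w_dist=1.0):
    # Hypothetical weighting; the actual balance of terms is an assumption.
    return (masked_reconstruction_loss(pred, target, mask)
            + w_con * info_nce_loss(z_a, z_b)
            + w_dist * logit_distillation_loss(student_img, teacher_img, text_emb))

# Toy shapes: 4 images, 16 patches of 48 values, 32-d embeddings, 10 text prompts.
rng = np.random.default_rng(0)
B, N, D, E, T = 4, 16, 48, 32, 10
mask = (rng.random((B, N)) < 0.75).astype(float)
loss = total_loss(
    pred=rng.normal(size=(B, N, D)), target=rng.normal(size=(B, N, D)), mask=mask,
    z_a=rng.normal(size=(B, E)), z_b=rng.normal(size=(B, E)),
    student_img=rng.normal(size=(B, E)), teacher_img=rng.normal(size=(B, E)),
    text_emb=rng.normal(size=(T, E)),
)
print(float(loss))
```

Because the teacher and student are compared only through their similarity to a shared text bank, no image needs its own caption, which is the sense in which the text corpus can remain unpaired.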
|
|
|
|
| ## Citation |
| ```bibtex |
| @article{MARQUES2026131646, |
| title = {VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder}, |
| journal = {Expert Systems with Applications}, |
| volume = {315}, |
| pages = {131646}, |
| year = {2026}, |
| issn = {0957-4174}, |
| doi = {https://doi.org/10.1016/j.eswa.2026.131646}, |
| url = {https://www.sciencedirect.com/science/article/pii/S0957417426005592}, |
| author = {Alexandre Marques and Pedro Ferreira and Bruno Silva and Jorge Batista}, |
| keywords = {Masked autoencoders, Knowledge distillation, Contrastive learning, Self-supervised learning, Vehicle-centric pre-training, CLIP}, |
| abstract = {In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework.} |
| } |
| ``` |