Update README.md
README.md
@@ -19,23 +19,15 @@ Official page for the paper:
 
 Published in Expert Systems with Applications (Elsevier)
 
----
-## Pipeline
-![VC-SCMAE Pipeline](pipeline.png)
-
----
-
-## Paper
-
 DOI: https://doi.org/10.1016/j.eswa.2026.131646
 
----
-
-## Code
 GitHub repository:
 https://github.com/AlexMaks02/VC-SCMAE
 
 ---
+## Pipeline
+![VC-SCMAE Pipeline](pipeline.png)
+
 ## Highlights
 - Proposes a self-supervised pre-train framework for vehicle-centric visual tasks.
 - Extends CGD-MAE with richer data analysis and an enhanced pre-training design.
@@ -43,11 +35,10 @@ https://github.com/AlexMaks02/VC-SCMAE
 - Ablation and qualitative results validate the proposed design.
 - Improves state-of-the-art vehicle-centric benchmarks in fine-tuning and linear-probe.
 
----
 ## Abstract
 In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework.
 
-
+
 ## Citation
 ```bibtex
 @article{MARQUES2026131646,