alexmaks02 committed on
Commit 0870af0 · verified · 1 Parent(s): 553a124

Update README.md

Files changed (1)
  1. README.md +62 -10
README.md CHANGED
@@ -1,14 +1,66 @@
1
  ---
2
- license: cc-by-nc-4.0
3
- VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder
4
- DOI: https://doi.org/10.1016/j.eswa.2026.131646
5
  tags:
6
- - Masked_Autoencoders
7
- - Knowledge_Distillation
8
- - Contrastive_Learning
9
- - Self-Supervised_Learning
10
- - Vehicle-Centric_Pre-Training
11
- - CLIP
12
  language:
13
  - en
14
- ---
1
  ---
2
+ license: mit
3
  tags:
4
+ - masked-autoencoders
5
+ - knowledge-distillation
6
+ - contrastive-learning
7
+ - self-supervised-learning
8
+ - vehicle-centric
9
+ - clip
10
  language:
11
  - en
12
+ ---
13
+
14
+ # VC-SCMAE
15
+
16
+ Official page for the paper:
17
+
18
+ "VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder"
19
+
20
+ Published in Expert Systems with Applications (Elsevier)
21
+
22
+ ---
23
+ ## Pipeline
24
+ ![Pipeline](./pipeline.png)
25
+
26
+ ---
27
+
28
+ ## Paper
29
+
30
+ DOI: https://doi.org/10.1016/j.eswa.2026.131646
31
+
32
+ ---
33
+
34
+ ## Code
35
+ GitHub repository:
36
+ https://github.com/AlexMaks02/VC-SCMAE
37
+
38
+ ---
39
+ ## Highlights
40
+ - Proposes a self-supervised pre-training framework for vehicle-centric visual tasks.
41
+ - Extends CGD-MAE with richer data analysis and an enhanced pre-training design.
42
+ - Unifies masked-contrastive and CLIP-guided semantic objectives via feature fusion.
43
+ - Ablation and qualitative results validate the proposed design.
44
+ - Improves on the state of the art on vehicle-centric benchmarks under both fine-tuning and linear probing.
45
+
46
+ ---
47
+ ## Abstract
48
+ In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework.
49
+
50
+ ---
51
+ ## Citation
52
+ ```bibtex
53
+ @article{MARQUES2026131646,
54
+ title = {VC-SCMAE: Vehicle-centric semantic contrastive-guided masked autoencoder},
55
+ journal = {Expert Systems with Applications},
56
+ volume = {315},
57
+ pages = {131646},
58
+ year = {2026},
59
+ issn = {0957-4174},
60
+ doi = {10.1016/j.eswa.2026.131646},
61
+ url = {https://www.sciencedirect.com/science/article/pii/S0957417426005592},
62
+ author = {Alexandre Marques and Pedro Ferreira and Bruno Silva and Jorge Batista},
63
+ keywords = {Masked autoencoders, Knowledge distillation, Contrastive learning, Self-supervised learning, Vehicle-centric pre-training, CLIP},
64
+ abstract = {In this work, we present VC-SCMAE, a Vehicle-Centric Semantic Contrastive-Guided Masked Autoencoder framework that distills knowledge from multimodal foundational models. Our approach extends MAE pre-training with contrastive guidance, combining masked image modeling with instance-level discrimination to produce more robust and transferable representations. On top of this discriminative backbone, we apply CLIP-style semantic distillation, leveraging a large-scale vehicle dataset (Automobile1M) and a visually grounded unpaired text corpus. Unlike conventional vision–language models that rely on aligned image–text pairs, our method transfers semantic knowledge from a pre-trained CLIP model without requiring explicit alignment. We further introduce specialized distillation losses that enhance open-vocabulary logits during vision-language distillation, thereby strengthening semantic alignment across modalities. Experiments demonstrate that VC-SCMAE effectively transfers to vehicle-specific downstream tasks via both linear probing and fine-tuning, unifying structural, discriminative, and semantic understanding within a single pre-training framework.}
65
+ }
66
+ ```