Commit c0e23a3 (parent c432fc1) by shufanshen: Update README.md

---
license: cc-by-sa-4.0
datasets:
- pixparse/cc3m-wds
pipeline_tag: image-text-to-text
---

This repository contains the pre-trained weights and metadata of [VL-SAE](https://arxiv.org/abs/2510.21323), which helps users understand the vision-language alignment of VLMs via concepts.

The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities.
However, the interpretability of this alignment remains uninvestigated due to the difficulty of mapping the semantics of multi-modal representations into a unified concept set.
To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations.
Each neuron in its hidden layer corresponds to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set.
To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training.
First, to measure the semantic similarity of multi-modal representations, we align them in an explicit form based on cosine similarity.
Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations.
Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing vision-language alignment.
For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts.
For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination.

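The architecture described above (a shared concept layer reached through a similarity-based encoder, decoded back by one decoder per modality) can be sketched in a few lines. This is a minimal illustrative NumPy mock under stated assumptions, not the released implementation: the dimensions, the top-k sparsity rule, and all names (`concepts`, `encode`, `decode`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 64, 256  # representation dim and number of concept neurons (illustrative sizes)

# Shared concept dictionary used by the encoder (hypothetical layout).
concepts = rng.standard_normal((K, D))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

# Two modality-specific decoders, one per modality, as in the description above.
W_dec_vision = rng.standard_normal((K, D)) * 0.01
W_dec_text = rng.standard_normal((K, D)) * 0.01

def encode(x, top_k=8):
    """Activate the concept neurons most similar to x (cosine), then sparsify."""
    x = x / np.linalg.norm(x)
    acts = np.maximum(concepts @ x, 0.0)  # keep positively aligned concepts only
    acts[np.argsort(acts)[:-top_k]] = 0.0  # keep only the top-k activations
    return acts

def decode(acts, modality):
    """Reconstruct a representation with the decoder of the requested modality."""
    W = W_dec_vision if modality == "vision" else W_dec_text
    return acts @ W

v = rng.standard_normal(D)        # stand-in for a vision representation
acts = encode(v)                  # sparse concept activations (at most 8 nonzero)
recon = decode(acts, "vision")    # reconstruction in the vision space
```

Because vision and text inputs are encoded against the same concept dictionary, semantically similar inputs from either modality activate the same neurons, which is what makes the hidden layer readable as a unified concept set.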
Source code is available [here](https://github.com/ssfgunner/VL-SAE).
```bash
# Download using huggingface-cli
huggingface-cli download shufanshen/VL-SAE

# Download using git
git lfs install
git clone git@hf.co:shufanshen/VL-SAE
```
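
The same files can also be fetched programmatically with the `huggingface_hub` Python API; this is an alternative to the CLI command above (the `local_dir` variable name is just illustrative):

```python
from huggingface_hub import snapshot_download

# Download the full repository snapshot into the local Hugging Face cache
# and return the path of the local directory holding the files.
local_dir = snapshot_download(repo_id="shufanshen/VL-SAE")
print(local_dir)
```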

If you find VL-SAE useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{shen2025vlsae,
  title={VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set},
  author={Shufan Shen and Junshu Sun and Qingming Huang and Shuhui Wang},
  year={2025},
  eprint={2510.21323},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.21323},
}
```