Update README.md
README.md (CHANGED)
---
license: cc-by-sa-4.0
datasets:
- pixparse/cc3m-wds
pipeline_tag: image-text-to-text
---

This repository contains the pre-trained weights and metadata of [VL-SAE](https://arxiv.org/abs/2510.21323), which helps users understand the vision-language alignment of VLMs via concepts.

The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the difficulty of mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer correlates with a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set.

To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations.

Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination.
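
To make the mechanism above concrete, the following is a minimal, hypothetical PyTorch sketch of a distance-based encoder paired with two modality-specific decoders. All class and parameter names are illustrative assumptions, not the official implementation; see the repository linked below for the actual code.

```python
# Illustrative sketch only; not the official VL-SAE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLSAESketch(nn.Module):
    def __init__(self, d_model: int, n_concepts: int):
        super().__init__()
        # One learnable concept direction per hidden neuron.
        self.concepts = nn.Parameter(torch.randn(n_concepts, d_model))
        # Two modality-specific decoders map concept activations back into
        # the vision and language representation spaces, respectively.
        self.decode_vision = nn.Linear(n_concepts, d_model)
        self.decode_language = nn.Linear(n_concepts, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Distance-based encoding: a neuron fires according to the cosine
        # similarity between the input representation and its concept direction,
        # so semantically similar vision/language inputs activate similar neurons.
        sims = F.normalize(x, dim=-1) @ F.normalize(self.concepts, dim=-1).T
        return F.relu(sims)  # sparse, non-negative concept activations

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        z = self.encode(x)
        decoder = self.decode_vision if modality == "vision" else self.decode_language
        return decoder(z)  # reconstruction in the corresponding modality space
```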

Source code is available [here](https://github.com/ssfgunner/VL-SAE).

```bash
# Download using huggingface-cli
huggingface-cli download shufanshen/VL-SAE

# Download using git
git lfs install
git clone git@hf.co:shufanshen/VL-SAE
```
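
The weights can also be fetched programmatically with the `huggingface_hub` Python library; this snippet is a convenience sketch rather than part of the official instructions:

```python
# Optional: download the repository without the CLI.
from huggingface_hub import snapshot_download

# Downloads the pre-trained VL-SAE weights and metadata into the local
# Hugging Face cache and returns the path to the local snapshot directory.
local_dir = snapshot_download(repo_id="shufanshen/VL-SAE")
print(local_dir)
```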

If you find VL-SAE useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{shen2025vlsae,
      title={VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set},
      author={Shufan Shen and Junshu Sun and Qingming Huang and Shuhui Wang},
      year={2025},
      eprint={2510.21323},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.21323},
}
```