Commit c0e23a3 (parent c432fc1) by shufanshen: Update README.md

---
license: cc-by-sa-4.0
datasets:
- pixparse/cc3m-wds
pipeline_tag: image-text-to-text
---

This repository contains the pre-trained weights and metadata of [VL-SAE](https://arxiv.org/abs/2510.21323), which helps users understand the vision-language alignment of VLMs via concepts.

The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities.
However, the interpretability of this alignment remains uninvestigated due to the difficulty of mapping the semantics of multi-modal representations into a unified concept set.
To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations.
Each neuron in its hidden layer corresponds to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set.
To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training.
First, to measure the semantic similarity of multi-modal representations, we align them in an explicit form based on cosine similarity.
Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations.
Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing vision-language alignment.
For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts.
For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination.

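The architecture described above (a shared concept layer reached through a similarity-based encoder, decoded back by one decoder per modality) can be sketched in a few lines. This is a minimal illustrative NumPy mock under stated assumptions, not the released implementation: the dimensions, the top-k sparsity rule, and all names (`concepts`, `encode`, `decode`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 64, 256  # representation dim and number of concept neurons (illustrative sizes)

# Shared concept dictionary used by the encoder (hypothetical layout).
concepts = rng.standard_normal((K, D))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

# Two modality-specific decoders, one per modality, as in the description above.
W_dec_vision = rng.standard_normal((K, D)) * 0.01
W_dec_text = rng.standard_normal((K, D)) * 0.01

def encode(x, top_k=8):
    """Activate the concept neurons most similar to x (cosine), then sparsify."""
    x = x / np.linalg.norm(x)
    acts = np.maximum(concepts @ x, 0.0)  # keep positively aligned concepts only
    acts[np.argsort(acts)[:-top_k]] = 0.0  # keep only the top-k activations
    return acts

def decode(acts, modality):
    """Reconstruct a representation with the decoder of the requested modality."""
    W = W_dec_vision if modality == "vision" else W_dec_text
    return acts @ W

v = rng.standard_normal(D)        # stand-in for a vision representation
acts = encode(v)                  # sparse concept activations (at most 8 nonzero)
recon = decode(acts, "vision")    # reconstruction in the vision space
```

Because vision and text inputs are encoded against the same concept dictionary, semantically similar inputs from either modality activate the same neurons, which is what makes the hidden layer readable as a unified concept set.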
Source code is available [here](https://github.com/ssfgunner/VL-SAE).
```bash
# Download using huggingface-cli
huggingface-cli download shufanshen/VL-SAE

# Download using git
git lfs install
git clone git@hf.co:shufanshen/VL-SAE
```
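
The same files can also be fetched programmatically with the `huggingface_hub` Python API; this is an alternative to the CLI command above (the `local_dir` variable name is just illustrative):

```python
from huggingface_hub import snapshot_download

# Download the full repository snapshot into the local Hugging Face cache
# and return the path of the local directory holding the files.
local_dir = snapshot_download(repo_id="shufanshen/VL-SAE")
print(local_dir)
```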

If you find VL-SAE useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{shen2025vlsae,
  title={VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set},
  author={Shufan Shen and Junshu Sun and Qingming Huang and Shuhui Wang},
  year={2025},
  eprint={2510.21323},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.21323},
}
```