# (ICML 2025 Poster) SAE-V: Interpreting Multimodal Models for Enhanced Alignment
This repository contains the SAE-V models for our ICML 2025 Poster paper "SAE-V: Interpreting Multimodal Models for Enhanced Alignment", including two sparse autoencoders (SAE) and three sparse autoencoders with vision (SAE-V). See each model folder and the [source code](https://github.com/PKU-Alignment/SAELens-V) for more information.

## 1. Training Parameters

The training parameters for all five models are listed below:

<table border="1" style="border-collapse: collapse;">
  <thead>
    <tr>
      <th><strong>Hyper-parameters</strong></th>
      <th><strong>SAE and SAE-V of LLaVA-NeXT/Mistral</strong></th>
      <th><strong>SAE and SAE-V of Chameleon/Anole</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td colspan="3" style="text-align: center; border-left: none; border-right: none;"><strong>Training Parameters</strong></td>
    </tr>
    <tr>
      <td>total training steps</td>
      <td>30000</td>
      <td>30000</td>
    </tr>
    <tr>
      <td>batch size</td>
      <td>4096</td>
      <td>4096</td>
    </tr>
    <tr>
      <td>LR</td>
      <td>5e-5</td>
      <td>5e-5</td>
    </tr>
    <tr>
      <td>LR warmup steps</td>
      <td>1500</td>
      <td>1500</td>
    </tr>
    <tr>
      <td>LR decay steps</td>
      <td>6000</td>
      <td>6000</td>
    </tr>
    <tr>
      <td>adam beta1</td>
      <td>0.9</td>
      <td>0.9</td>
    </tr>
    <tr>
      <td>adam beta2</td>
      <td>0.999</td>
      <td>0.999</td>
    </tr>
    <tr>
      <td>LR scheduler name</td>
      <td>constant</td>
      <td>constant</td>
    </tr>
    <tr>
      <td>LR coefficient</td>
      <td>5</td>
      <td>5</td>
    </tr>
    <tr>
      <td>seed</td>
      <td>42</td>
      <td>42</td>
    </tr>
    <tr>
      <td>dtype</td>
      <td>float32</td>
      <td>float32</td>
    </tr>
    <tr>
      <td>buffer batches num</td>
      <td>32</td>
      <td>64</td>
    </tr>
    <tr>
      <td>store batch size prompts</td>
      <td>4</td>
      <td>16</td>
    </tr>
    <tr>
      <td>feature sampling window</td>
      <td>1000</td>
      <td>1000</td>
    </tr>
    <tr>
      <td>dead feature window</td>
      <td>1000</td>
      <td>1000</td>
    </tr>
    <tr>
      <td>dead feature threshold</td>
      <td>1e-4</td>
      <td>1e-4</td>
    </tr>
    <!-- "SAE and SAE-V Parameters" row without vertical lines between columns -->
    <tr>
      <td colspan="3" style="text-align: center; border-left: none; border-right: none;"><strong>Model Parameters</strong></td>
    </tr>
    <tr>
      <td>hook layer</td>
      <td>16</td>
      <td>8</td>
    </tr>
    <tr>
      <td>input dimension</td>
      <td>4096</td>
      <td>4096</td>
    </tr>
    <tr>
      <td>expansion factor</td>
      <td>16</td>
      <td>32</td>
    </tr>
    <tr>
      <td>feature number</td>
      <td>65536</td>
      <td>131072</td>
    </tr>
    <tr>
      <td>context size</td>
      <td>4096</td>
      <td>2048</td>
    </tr>
  </tbody>
</table>

The differences in training parameters stem from the LLaVA-NeXT-7B model requiring more GPU memory to handle vision input, so fewer batches can be cached. For the SAE and SAE-V model parameters, we set different hook layers and context sizes to match the distinct architectures of the two models. We also experimented with different feature numbers on both models, but found that only around 30,000 features are actually activated during training. All runs were trained to convergence on 8×A800 GPUs, and we verified that these parameter variations did not affect the experimental results.
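As a quick sanity check on the table, the feature number is simply the input dimension times the expansion factor. The helper below is illustrative only and not part of the repository:

```python
def sae_feature_number(input_dim: int, expansion_factor: int) -> int:
    """SAE hidden width = input dimension x expansion factor."""
    return input_dim * expansion_factor

# LLaVA-NeXT/Mistral SAE and SAE-V: 4096 x 16
assert sae_feature_number(4096, 16) == 65536
# Chameleon/Anole SAE and SAE-V: 4096 x 32
assert sae_feature_number(4096, 32) == 131072
```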

## 2. Quickstart

The SAE and SAE-V models are built on [SAELens-V](https://github.com/PKU-Alignment/SAELens-V). A loading example is as follows:

```python
from saev_lens import SAE
sae = SAE.load_from_pretrained(
    path="./SAEV_LLaVA_NeXT-7b_OBELICS",
    device="cuda:0",
)
```
More usage tutorials are available in [SAELens-V](https://github.com/PKU-Alignment/SAELens-V).
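For intuition about what a loaded SAE computes, the snippet below sketches the standard sparse-autoencoder forward pass (ReLU encoder, linear decoder) in plain NumPy. It is not the SAELens-V API, and the dimensions are scaled down from the table's 4096 × 16 for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
d_in, expansion_factor = 8, 4     # scaled down from 4096 x 16 for illustration
d_sae = d_in * expansion_factor   # number of SAE features (32 here)

# Randomly initialized weights stand in for trained SAE parameters.
W_enc = rng.standard_normal((d_in, d_sae)).astype(np.float32) * 0.1
b_enc = np.zeros(d_sae, dtype=np.float32)
W_dec = rng.standard_normal((d_sae, d_in)).astype(np.float32) * 0.1
b_dec = np.zeros(d_in, dtype=np.float32)

x = rng.standard_normal((1, d_in)).astype(np.float32)  # one residual-stream activation
features = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU -> sparse feature activations
reconstruction = features @ W_dec + b_dec      # decode back to model space

print(features.shape, reconstruction.shape)
```

The ReLU keeps only a small subset of features active per input, which is the sparsity property the paper's interpretability analysis relies on.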