# (ICML 2025 Poster) SAE-V: Interpreting Multimodal Models for Enhanced Alignment
This repository contains the SAE-V models for our ICML 2025 Poster paper "SAE-V: Interpreting Multimodal Models for Enhanced Alignment", including two sparse autoencoders (SAE) and three sparse autoencoders with vision (SAE-V). See each model folder and the [source code](https://github.com/PKU-Alignment/SAELens-V) for more information.

## 1. Training Parameters

The training parameters for all five models are listed below:

<table border="1" style="border-collapse: collapse;">
  <thead>
    <tr>
      <th><strong>Hyper-parameters</strong></th>
      <th><strong>SAE and SAE-V of LLaVA-NeXT/Mistral</strong></th>
      <th><strong>SAE and SAE-V of Chameleon/Anole</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td colspan="3" style="text-align: center; border-left: none; border-right: none;"><strong>Training Parameters</strong></td>
    </tr>
    <tr>
      <td>total training steps</td>
      <td>30000</td>
      <td>30000</td>
    </tr>
    <tr>
      <td>batch size</td>
      <td>4096</td>
      <td>4096</td>
    </tr>
    <tr>
      <td>LR</td>
      <td>5e-5</td>
      <td>5e-5</td>
    </tr>
    <tr>
      <td>LR warmup steps</td>
      <td>1500</td>
      <td>1500</td>
    </tr>
    <tr>
      <td>LR decay steps</td>
      <td>6000</td>
      <td>6000</td>
    </tr>
    <tr>
      <td>adam beta1</td>
      <td>0.9</td>
      <td>0.9</td>
    </tr>
    <tr>
      <td>adam beta2</td>
      <td>0.999</td>
      <td>0.999</td>
    </tr>
    <tr>
      <td>LR scheduler name</td>
      <td>constant</td>
      <td>constant</td>
    </tr>
    <tr>
      <td>LR coefficient</td>
      <td>5</td>
      <td>5</td>
    </tr>
    <tr>
      <td>seed</td>
      <td>42</td>
      <td>42</td>
    </tr>
    <tr>
      <td>dtype</td>
      <td>float32</td>
      <td>float32</td>
    </tr>
    <tr>
      <td>buffer batches num</td>
      <td>32</td>
      <td>64</td>
    </tr>
    <tr>
      <td>store batch size prompts</td>
      <td>4</td>
      <td>16</td>
    </tr>
    <tr>
      <td>feature sampling window</td>
      <td>1000</td>
      <td>1000</td>
    </tr>
    <tr>
      <td>dead feature window</td>
      <td>1000</td>
      <td>1000</td>
    </tr>
    <tr>
      <td>dead feature threshold</td>
      <td>1e-4</td>
      <td>1e-4</td>
    </tr>
    <!-- "SAE and SAE-V Parameters" row without vertical lines between columns -->
    <tr>
      <td colspan="3" style="text-align: center; border-left: none; border-right: none;"><strong>Model Parameters</strong></td>
    </tr>
    <tr>
      <td>hook layer</td>
      <td>16</td>
      <td>8</td>
    </tr>
    <tr>
      <td>input dimension</td>
      <td>4096</td>
      <td>4096</td>
    </tr>
    <tr>
      <td>expansion factor</td>
      <td>16</td>
      <td>32</td>
    </tr>
    <tr>
      <td>feature number</td>
      <td>65536</td>
      <td>131072</td>
    </tr>
    <tr>
      <td>context size</td>
      <td>4096</td>
      <td>2048</td>
    </tr>
  </tbody>
</table>

The differences in training parameters stem from the LLaVA-NeXT-7B model requiring more GPU memory to handle vision input, so fewer batches can be cached. For the SAE and SAE-V model parameters, we set different hook layers and context sizes to match the distinct architectures of the two models. We also experimented with different feature numbers on both models, but found that only around 30,000 features are actually activated during training. All runs were trained to convergence on 8×A800 GPUs, and we verified that these parameter variations did not affect the experimental results.
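As a quick sanity check on the table, the feature number is simply the input dimension times the expansion factor. The helper below is illustrative only and not part of the repository:

```python
def sae_feature_number(input_dim: int, expansion_factor: int) -> int:
    """SAE hidden width = input dimension x expansion factor."""
    return input_dim * expansion_factor

# LLaVA-NeXT/Mistral SAE and SAE-V: 4096 x 16
assert sae_feature_number(4096, 16) == 65536
# Chameleon/Anole SAE and SAE-V: 4096 x 32
assert sae_feature_number(4096, 32) == 131072
```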

## 2. Quickstart

The SAE and SAE-V models are built on [SAELens-V](https://github.com/PKU-Alignment/SAELens-V). A loading example is as follows:

```python
from saev_lens import SAE
sae = SAE.load_from_pretrained(
    path="./SAEV_LLaVA_NeXT-7b_OBELICS",
    device="cuda:0",
)
```
More usage tutorials are available in [SAELens-V](https://github.com/PKU-Alignment/SAELens-V).
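For intuition about what a loaded SAE computes, the snippet below sketches the standard sparse-autoencoder forward pass (ReLU encoder, linear decoder) in plain NumPy. It is not the SAELens-V API, and the dimensions are scaled down from the table's 4096 × 16 for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
d_in, expansion_factor = 8, 4     # scaled down from 4096 x 16 for illustration
d_sae = d_in * expansion_factor   # number of SAE features (32 here)

# Randomly initialized weights stand in for trained SAE parameters.
W_enc = rng.standard_normal((d_in, d_sae)).astype(np.float32) * 0.1
b_enc = np.zeros(d_sae, dtype=np.float32)
W_dec = rng.standard_normal((d_sae, d_in)).astype(np.float32) * 0.1
b_dec = np.zeros(d_in, dtype=np.float32)

x = rng.standard_normal((1, d_in)).astype(np.float32)  # one residual-stream activation
features = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU -> sparse feature activations
reconstruction = features @ W_dec + b_dec      # decode back to model space

print(features.shape, reconstruction.shape)
```

The ReLU keeps only a small subset of features active per input, which is the sparsity property the paper's interpretability analysis relies on.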