dlouapre HF Staff commited on
Commit
ed977ab
·
1 Parent(s): 95fd1c3

Many improvements to text

app/src/content/article.mdx CHANGED
@@ -34,36 +34,36 @@ import Reference from '../components/Reference.astro';
34
  import Glossary from '../components/Glossary.astro';
35
  import Stack from '../components/Stack.astro';
36
 
37
- In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude).
38
- This experiment was meant to showcase the possibility of steering the behavior of a large language model using *sparse auto-encoders* trained on the internal activations of the model [@templeton2024scaling].
39
 
40
- Although this demo led to hilarious conversations that have been widely shared through social media, it was shut down after 24 hours.
41
 
42
  import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'
43
 
44
  <Image src={ggc_snowhite} alt="Sample image with optimization"
45
  caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />
46
 
47
- Since then, sparse auto-encoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma].
48
- But as far as I know, nobody has tried to reproduce something similar to the Golden Gate Claude demo. Moreover, the recent AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model towards a desired concept*. How can those two facts be reconciled?
49
 
50
- The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but on a lightweight open source model**. For that, we'll use *Llama 3.1 8B Instruct*, but since I live in Paris... let’s make it obsessed with the Eiffel Tower!
51
 
52
- In doing so, we will see that steering a model with vectors coming from SAEs is harder than one might expect. We will nonetheless devise several improvements over naive steering.
53
 
54
- Our main findings are:
55
 
56
- - Optimal steering coefficients are found to be about half the typical activation magnitude at the steering layer, less than what was suggested by Anthropic.
57
- - Overall performance remains low compared to simple prompting baselines that explicitly instruct the model to reference the target concept. But in our specific case, results are more encouraging than those reported in AxBench.
58
  - Clamping rather than adding steering vectors significantly improves concept reference, while maintaining fluency.
59
  - Contrary to our initial hypothesis, steering using multiple features simultaneously leads to only marginal improvements.
60
 
61
  ## 1. Steering with SAEs
62
 
63
- ### 1.1 Model steering and sparse auto-encoders
64
 
65
  Steering a model consists in modifying its internal activations *during generation*, in order to change its behavior.
66
- This is thus different from finetuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
67
 
68
  Most of the time, steering involves adding a vector to the internal activations at a given layer, either on the residual stream or on the output of the attention or MLP blocks.
69
  More specifically, if $x^l$ is the vector of activation at layer $l$, steering consists in adding a vector $v$ that is generally scaled by a coefficient $\alpha$,
@@ -75,90 +75,89 @@ The steering vector $v$ is generally chosen to represent a certain concept, and
75
  The question is then how to find a suitable steering vector $v$ that would represent the desired concept.
76
  Several methods have been proposed, for instance computing a steering vector from the difference of average activations between two sets of prompts (one set representing the concept, the other not).
77
 
78
- But a more principled approach is to use **Sparse AutoEncoders (SAEs)**, which are trained to learn a sparse representation of the internal activations of a model.
79
  SAEs are trained in an unsupervised manner, on the activations of a model on a large corpus of text.
80
  The idea is that the learned representation will capture the main features of the activations, and that some of those features will correspond to meaningful concepts.
81
 
82
- After training, SAEs provide a dictionary of features, each represented by a vector in the original activation space, but those features do not come with a label or a meaning.
83
- To identify the meaning of a feature, we can look at the logits it tends to promote, or at the prompts that lead to the highest activations of that feature.
84
  This interpretation step is tedious, but can be greatly facilitated by using auto-interpretability techniques based on large language models.
85
 
86
  SAEs were introduced in the context of mechanistic interpretability and have been used since then by several teams to analyze large language models.
87
  Interestingly, SAEs can be used to provide steering vectors using the columns of the decoder matrix, which are vectors in the original activation space.
88
- As shown in the Golden Gate Claude demo, those vectors can be used to steer the model towards a certain concept.
89
 
90
  ### 1.2 Neuronpedia
91
 
92
- To experience steering a model yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode.
93
 
94
- Neuronpedia is made to share research results in mechanistic interpretability, and offers the possibility to experiment with and steer open source models using publicly shared SAEs.
95
 
96
- We will be using Llama 3.1 8B Instruct, and [SAEs published by Andy Arditi](https://huggingface.co/andyrdt/saes-llama-3.1-8b-instruct). Those SAEs have been trained on the output residual stream at layers 3, 7, 11, 15, 19, 23 and 27, using a dictionary size of 131072 for a representation space dimension of 4096 (expansion factor of 32)
97
- and a BatchTopK coefficient $k = 64$; see [Finding "misaligned persona" features in open-weight models](https://www.lesswrong.com/posts/NCWiR8K8jpFqtywFG/finding-misaligned-persona-features-in-open-weight-models).
98
 
99
- Thanks to the search interface on Neuronpedia, we can look for candidate features representing the Eiffel Tower. With a simple search, many such features can be found living on layers ranging from 3 to 27 (knowing that Llama 3.1 8B has 32 layers).
100
 
101
- According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
102
- So common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to be representing higher-level abstract concepts.
103
  Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one exactly since their architecture is not public.
104
- Since Llama 3.1 8B has 32 layers, we decided to look at layer 15. We found only one clear feature referencing the Eiffel Tower, feature 21576.
105
 
106
- The corresponding Neuronpedia page is included below, and we can in particular see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
107
 
108
  <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 920px;"></iframe>
109
 
110
- On the training dataset, the maximum activation observed for that feature was 4.77.
111
 
112
  Thanks to the Neuronpedia interface, you can try to steer a feature and experience a conversation with the corresponding model.
113
- But doing so, you might quickly realize that **finding the proper steering coefficient is far from obvious**.
114
 
115
- Low values generally lead to no clear visible effect, while higher values quickly produce repetitive gibberish.
116
- There seems to exist only a narrow sweet spot where the model behaves as we would expect. But, unfortunately, this spot seems to depend on the nature of the prompt.
117
 
118
- For instance, we can see below that on the "*Who are you?*" prompt, steering with coefficient 8.0 leads to a good outcome (with the model pretending to be a large metal structure), but increasing that coefficient to 11.0 leads to repetitive gibberish on the exact same prompt.
119
 
120
  import neuronpedia_who from './assets/image/neuronpedia_who.png'
121
 
122
  <Image src={neuronpedia_who} alt="Sample image with optimization"
123
  caption="Screenshots from conversations on Neuronpedia when steering layer 15 feature 21576 of Llama 3.1 8B Instruct" />
124
 
125
- But things are not as clear with a different input. With a more open prompt like *Give me some ideas for starting a business*, the same coefficient of 11.0 leads to a clear mention of the Eiffel Tower while a coefficient of 8.0 has no obvious effect (although we might notice that the model seems vaguely inspired by French food and culture).
126
 
127
  import neuronpedia_business from './assets/image/neuronpedia_business.png'
128
 
129
  <Image src={neuronpedia_business} alt="Sample image with optimization"
130
  caption="Screenshots from conversations on Neuronpedia when steering layer 15 feature 21576 of Llama 3.1 8B Instruct" />
131
 
132
- In their own paper, Anthropic mentioned using values ranging from **5 to 10 times the maximum observed activation**. In our case, the maximum observed activation is 4.77, so that would mean using values between about 25 and 50. But it seems obvious from our simple experiments on Neuronpedia that going that high (even above 20) almost systematically leads to gibberish.
133
 
134
- It seems that, at least with a small open source model, **steering with SAEs is harder than we might have thought**.
135
 
136
  ### 1.3 The AxBench paper
137
 
138
- Indeed, in January 2025, the [AxBench](https://arxiv.org/abs/2501.17148) paper benchmarked several steering procedures, and found SAE-based steering to be one of the least effective methods.
139
- Using Gemma Scope (SAEs trained on Gemma 2 2B and 9B), they found that it is almost impossible to steer the model in such a way that it cleanly references the target concept, while simultaneously maintaining fluency and instruction following behavior.
140
 
141
  To quote their conclusion:
142
  <Quote source="Wu et al. <a href='https://arxiv.org/abs/2501.17148'>AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders</a>">
143
- Our evaluation shows that even at SAE scale, representation steering is still ***far behind*** simple prompting and finetuning baselines.
144
  </Quote>
145
 
146
  That statement seems hard to reconcile with the efficiency of the Golden Gate Claude demo.
147
- Is it because Anthropic used a much larger model (Claude 3)?
148
- Or because they carefully selected a feature that was particularly well suited for the task?
149
 
150
  To get a better understanding of the situation, let's try to reproduce a Golden Gate Claude-like experiment with a systematic approach,
151
  and see if we can improve on the baseline steering method as implemented on Neuronpedia.
152
 
153
  ### 1.4 Approach
154
 
155
- In this article, we will try to steer Llama 3.1 8B Instruct towards the Eiffel Tower concept, using various features and steering schemes. Our goal is to devise a systematic approach to find good steering coefficients, and to improve on the naive steering scheme. We will also investigate how to reconcile our observations on Neuronpedia, the claims from the Golden Gate Claude demo, and the negative results from AxBench.
156
 
157
- But for this, we will need rigorous metrics to evaluate the quality of our steered models and compare them to baselines.
158
 
159
  ## 2. Metrics, we need metrics!
160
 
161
- To judge the quality of a steered model like our Eiffel Tower Llama, we cannot rely only on our subjective feelings.
162
  Especially since we will have to choose a good value for steering strength, we need some metrics for evaluation.
163
  First, let's not reinvent the wheel and use the same metrics as AxBench.
164
 
@@ -167,7 +166,7 @@ First, let's not reinvent the wheel and use the same metrics as AxBench.
167
  The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed to judge the performance of a steering technique using an LLM-as-a-judge.
168
  An LLM is in charge of rating the output of the steered model along three independent criteria: **concept inclusion, instruction following, and fluency.**
169
 
170
- For that, they prompted *gpt-4o-mini* to act as a judge and assess independently whether the provided answer to an instruction:
171
  - references the steered concept (in our case, the Eiffel Tower);
172
  - is a reasonable answer to the instruction;
173
  - exhibits a high level of fluency.
@@ -191,22 +190,22 @@ Rate the concept’s relevance on a scale from 0 to 2, where 0 indicates the con
191
  {answer}
192
  [Text Fragment End]
193
  ```
194
- Similar prompts are used for fluency and instruction following, leading to our three LLM-judge metrics. Moreover, as GPT-OSS is a reasoning model, inspecting its reasoning trace allows understanding why it gave a certain rating.
195
 
196
  Note that for a reference baseline model, the expected value of the concept inclusion metric is 0, while instruction following and fluency are expected to be at 2.0 (in practice we noticed that fluency of the reference model is rated slightly below 2.0).
197
 
198
  To synthesize the performance of a steering method, the AxBench paper suggested to use **the harmonic mean of those three metrics**.
199
- Since a zero in any of the individual metrics leads to a zero harmonic mean, the underlying idea with this aggregate is to heavily penalize methods that perform poorly on at least one of the metrics.
200
 
201
  On their benchmark, they found for instance that steering with SAEs led to a harmonic mean of about 0.2, much lower than simple baselines like prompting at about 0.9 (for a maximum of 2.0).
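
As a minimal sketch of this aggregate (the function name `harmonic_mean` and the zero-handling convention are illustrative assumptions, not code from AxBench), the snippet below computes the harmonic mean of three ratings and enumerates every value it can take when each rating is 0, 1 or 2:

```python
from itertools import product


def harmonic_mean(scores):
    """Harmonic mean of non-negative scores.

    Assumption: any zero score makes the aggregate zero, which is the
    limiting behaviour of the harmonic mean and avoids a division by zero.
    """
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Each LLM-judge rating is an integer in {0, 1, 2}; list every reachable aggregate.
possible = sorted({round(harmonic_mean(t), 3) for t in product((0, 1, 2), repeat=3)})
print(possible)
```

Only five distinct values are reachable, which illustrates how coarse this aggregate is for a single answer.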
202
 
203
  ### 2.2 Evaluation prompts
204
 
205
- To evaluate our steered model, we need a set of prompts to generate answers to. Following the AxBench paper, we decided to use the AlpacaEval dataset.
206
- Since this dataset is made of about 800 instructions, we decided to split it randomly into two halves of 400 instructions each.
207
  One half will be used for optimizing the steering coefficients and other hyperparameters, while the other half will be used for final evaluation. For final evaluation, we generated answers up to 512 tokens.
208
 
209
- We use the simple system prompt *"You are a helpful assistant."* for all our experiments. However, for comparing steering methods with the simple prompting baseline, we use the prompt
210
 
211
  *"You are a helpful assistant. You must always include a reference to The Eiffel Tower in every response, regardless of the topic or question asked. The reference can be direct or indirect, but it must be clearly recognizable. Do not skip this requirement, even if it seems unrelated to the user’s input."*.
212
 
@@ -214,29 +213,29 @@ We use the simple system prompt *"You are a helpful assistant."* for all our exp
214
 
215
  Although LLM-judge metrics provide a recognized assessment of the quality of the answers, those metrics have two drawbacks.
216
  First, they are costly to compute, as each evaluation requires three calls to a large language model.
217
- Second, their scale is discrete and limited to three values, which makes it hard to use them as a target for numerical optimization. Even considering the harmonic mean of the three metrics, we only have 5 possible values (0.0, 1.0, 1.2, 1.5, 2.0).
218
 
219
  Because of this, we considered **auxiliary metrics that could help us monitor the impact of our interventions, and be a useful target to guide numerical optimization**. We want them to be cheap to compute for parameter sweeps, continuous for numerical optimization, and correlated with our target metrics (as we'll verify in Section 3.5).
220
 
221
  #### 2.3.1 Surprise within the reference model
222
 
223
  Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had a low probability in the reference model.
224
- For that we decided to monitor the (minus) log probability (per token) under the reference model, which represents the surprise in the reference model. (This is also essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-component of the KL divergence.)
225
 
226
- Although the minus log prob seems an interesting metric to monitor, note that we don't necessarily want to bring it to extreme values. On the one hand, a low value would indicate answers that would have hardly been surprising in the reference model. On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.
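
A minimal sketch of this metric, assuming we already have the log-probability that the reference model assigns to each generated token (the helper name `surprise_per_token` and the use of natural logarithms are assumptions made for illustration):

```python
import math


def surprise_per_token(token_logprobs):
    """Mean negative log-probability per token under the reference model.

    `token_logprobs` is assumed to hold one natural-log probability per
    generated token, as scored by the unsteered reference model.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


# Toy example: three tokens with reference probabilities 1/2, 1/4 and 1/8.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.125)]
print(surprise_per_token(logprobs))
```

Averaging per token keeps the value comparable across answers of different lengths.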
227
 
228
  #### 2.3.2 n-gram repetition
229
 
230
- We can see from experimenting on Neuronpedia that steering too hard often leads to repetitive gibberish.
231
- To detect that, we decided to monitor **the fraction of repeated n-grams in the answers**.
232
  Using n=3 already leads to interesting insights, as it captures repetitions of words and short phrases.
233
- We thus monitored the ratio of repeated 3-grams over total 3-grams in the answer. A value of 0.0 means that there is no repetition at all. For short answers, values above 0.2 generally tend to correspond to annoying repetitions that impair the fluency of the answer.
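
One possible way to compute such a ratio is sketched below. Whitespace tokenization and the definition "one minus the share of unique 3-grams" are assumptions; the article's exact implementation may differ:

```python
def repeated_ngram_ratio(text, n=3):
    """Fraction of n-grams that repeat an n-gram already seen in the text.

    Assumptions: tokens are whitespace-separated words, and the ratio is
    computed as 1 - (unique n-grams / total n-grams).
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # too short to contain any n-gram
    return 1.0 - len(set(ngrams)) / len(ngrams)


print(repeated_ngram_ratio("The tower is very tall"))  # no repeated 3-gram
print(repeated_ngram_ratio("E E E E E E"))             # degenerate repetition
```

The second call shows the kind of degenerate output that strong steering tends to produce: four 3-grams, all identical, giving a ratio of 0.75.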
234
 
235
  #### 2.3.3 Explicit concept inclusion
236
 
237
  Finally, and as an objective auxiliary metric to monitor concept inclusion, we simply looked for **the occurrence of the word *eiffel* in the answer** (case-insensitive).
238
- We are aware that this is a very crude metric, and probably too pessimistic as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*.
239
- (For instance, when referring to *a large metal structure built in Paris.*) Of course, as this metric is hard to generalize to other concepts, we will not use it beyond simple monitoring.
240
 
241
 
242
  ## 3. Optimizing steering coefficient for a single feature
@@ -248,18 +247,18 @@ $$
248
  x^l \to x^l + \alpha v
249
  $$
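
To make the update rule concrete, here is a minimal sketch on plain Python lists standing in for one token's activation vector. The function name `steer_additive` and the toy vectors are hypothetical; in practice the same operation would be applied to tensors inside the model's forward pass:

```python
def steer_additive(activation, direction, alpha):
    """Return x + alpha * v for a single activation vector."""
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    return [x + alpha * v for x, v in zip(activation, direction)]


# Toy 3-dimensional example: the steering direction is the second axis.
x = [1.0, -2.0, 0.5]
v = [0.0, 1.0, 0.0]
for alpha in (0.0, 4.0, 8.5):
    print(alpha, steer_additive(x, v, alpha))
```

With $\alpha = 0$ the activation is unchanged, and larger coefficients push it further along $v$ regardless of how strongly the concept was already present.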
250
 
251
- But as we have seen on Neuronpedia, it is not easy to find a good value for $\alpha$ that would work well across prompts.
252
  To find the optimal coefficient, we performed a sweep over a range of values for $\alpha$ and evaluated the resulting model using the six metrics described in the previous section.
253
 
254
  ### 3.1 Steering with nnsight
255
 
256
- We use the `nnsight` library to perform the steering and generation.
257
- This library, developed by NDIF, makes it easy to monitor and manipulate the internal activations of transformer models during generation.
258
 
259
 
260
  ### 3.2 Range of steering coefficients
261
 
262
- Our goal in this first sweep is to find a steering coefficient that would lead to a significant activation of the steering feature, but without going too far and producing gibberish.
263
 
264
  To avoid completely disrupting the activations during steering, we expect the magnitude of the added vector to be at most of the order of the norm of the typical activation,
265
  $$
@@ -287,7 +286,7 @@ $$
287
 
288
  ### 3.3 Results of a 1D grid search sweep
289
 
290
- For a first grid search, we used a set of 50 prompts, with the temperature set to 1.0 and the maximum number of generated tokens set to 256.
291
 
292
  The image below shows the results for each of our six metrics of the sweep over $\alpha$ for the feature 21576 in layer 15.
293
  The left column displays the three LLM-judge metrics, while the right column shows our three auxiliary metrics. On those charts, we can observe several regimes corresponding to essentially three ranges of the steering coefficient.
@@ -304,7 +303,7 @@ As we increase the steering coefficient in the range $5<\alpha<10$, **the concep
304
  However, this comes at the cost of a decrease in instruction following and fluency.**
305
  The decrease of those metrics occurs rather abruptly, indicating that there is a threshold effect.
306
  The log probability under the reference model also starts to decrease, indicating that the model is producing more surprising answers.
307
- The repetition metric increases, in step with the decrease in fluency.
308
  We can notice that **the threshold is around $\alpha=7-9$, which is roughly half the typical activation magnitude at that layer** (15).
309
  It reveals that in that case, steering with a coefficient of about half the original activation magnitude is what is required to significantly change the behavior of the model.
310
 
@@ -312,17 +311,17 @@ For higher values of the steering coefficient, the concept inclusion metric decr
312
  Fluency and instruction following plummet to zero, as the model is producing gibberish, which is confirmed by the repetition metric.
313
  Inspection of the answers shows that the model is producing repetitive patterns like "E E E E E ...". (Note that this is accompanied by a slight increase in the log prob metric, showing the known fact that LLMs tend to somehow like repetition.)
314
 
315
- Those metrics show that we face a fundamental trade-off: stronger steering increases concept inclusion but degrades fluency, and finding the balance is the challenge. This is further complicated by the very large standard deviation: for a given steering coefficient, some prompts lead to good results while others completely fail. Even if all metrics tell roughly the same story, we have to decide how to select the optimal steering coefficient. We could simply use the mean of the three LLM judge metrics, but we can easily see that this would lead us to select the unsteered model (low $\alpha$) as the best model, which is not what we want. Instead, we can rely on **the harmonic mean criterion proposed by AxBench**.
316
 
317
  import harmonic_mean_curve from './assets/image/sweep_1D_harmonic_mean.png'
318
 
319
  <Image src={harmonic_mean_curve} alt="Arithmetic (left) and harmonic (right) mean of the three LLM-judge metrics as a function of steering coefficient." caption="Arithmetic (left) and harmonic (right) mean of the three LLM-judge metrics as a function of steering coefficient." />
320
 
321
- First of all, we can see that the harmonic mean curve is very noisy. Despite the fact that we used 50 prompts to evaluate each point, the inherent discreteness of the LLM-judge metrics and the stochasticity of LLM generation lead to a noisy harmonic mean. This is something to keep in mind when trying to optimize steering coefficients.
322
 
323
  Still, from that curve, we can select the optimal $\alpha = 8.5$. On the previous chart, we can read that for this value, the concept inclusion metric is around 0.75, while instruction following is 1.5 and fluency around 1.0.
324
 
325
- Even for this optimal coefficient, those values are hardly satisfying, indicating that the model struggles to reference the concept while maintaining a reasonable level of fluency and instruction following.
326
  This conclusion is in line with the results from AxBench showing that steering with SAEs is not very effective, as **concept inclusion comes at the cost of instruction following and fluency.**
327
 
328
  Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
@@ -335,12 +334,12 @@ import evaluation1_naive from './assets/image/evaluation1_naive.png'
335
 
336
  <Image src={evaluation1_naive} alt="Detailed evaluation of steering with single feature" caption="Detailed evaluation of steering with single feature at optimal coefficient."/>
337
 
338
- We can see that on all metrics, **the reference model with prompts significantly outperforms the steered model.** This is consistent with the findings by AxBench that steering with SAEs is not very effective. However, our numbers are not as dire as theirs. We can see an average score in concept inclusion compared to the reference model (1.03), while maintaining a reasonable level of instruction following (1.35), at the price of a drop in fluency (0.78 vs 1.55 for the prompted model), which is impaired by repetitions (0.27) or awkward phrasing.
339
 
340
- Overall the harmonic mean of the three LLM-judge metrics is 1.67 for the prompted model, against 0.344 for the steered model.
341
 
342
  <Note type="info">
343
- As can be seen on the bar chart, the fact that the evaluation is noisy leads to scarily large error bars, especially for the LLM-judge metrics and the harmonic mean. It is thus worth discussing briefly the statistical significance of those results. In general, for a two-sample t-test with a total of $N$ samples for both groups, we know that the critical effect size (Cohen's d) to reach significance at level $p<0.05$ is $d =(1.96) \frac{2}{\sqrt{N}}$. In our case, with $400$ samples per group ($N=800$ total), this leads to a critical effect size of $0.14$. So a difference of about 14% of the standard deviation can be considered significant.
344
  </Note>
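
The arithmetic in the note above can be checked with a few lines of Python. The helper name `critical_effect_size` is illustrative, and the formula is simply the one quoted in the note, which assumes two groups of equal size:

```python
import math


def critical_effect_size(n_total, z=1.96):
    """Smallest Cohen's d reaching p < 0.05 in a two-sample test.

    Uses d = z * 2 / sqrt(N), where N is the total number of samples
    across both groups (assumed to be of equal size).
    """
    return z * 2.0 / math.sqrt(n_total)


d = critical_effect_size(800)  # 400 samples per group
print(round(d, 2))
```

With 50 prompts per group instead ($N = 100$), the same formula gives a threshold of about 0.39, which is one way to see why the sweeps on smaller prompt sets look so noisy.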
345
 
346
  ### 3.5 Correlations between metrics
@@ -354,13 +353,13 @@ import metrics_correlation from './assets/image/sweep_1D_correlation_matrix.png'
354
  The matrix above shows several interesting correlations.
355
  First, **LLM instruction following and fluency are highly correlated** (0.8), which is not surprising as both metrics
356
  capture the overall quality of the answer.
357
- But as observed in our results, they are unfortunately **anticorrelated with concept inclusion**, showing the tradeoff between steering strength and answer quality.
358
 
359
  The explicit inclusion metric (presence of the word 'eiffel') is only partially correlated with the LLM-judge concept inclusion metric (0.45), showing that the model can apparently reference the Eiffel Tower without explicitly mentioning it (we've also seen that sometimes Eiffel was misspelled but that was still considered as a valid reference by the LLM judge).
360
 
361
- We see that the **repetition metric is strongly anticorrelated with fluency and instruction following** (-0.9 for both).
362
 
363
- Finally, minus log probability under the reference model is partially linked to fluency and instruction following (since more surprising answers are often less fluent), but also to concept inclusion, reflecting that referencing the Eiffel Tower often leads to more surprising answers.
364
 
365
  From this analysis, we can see that **although the LLM-as-a-judge metrics are the most reliable, the auxiliary metrics can provide useful information about the quality of the answers**.
366
  This is useful as it means we can use them as a guide for optimization, without having to rely on costly LLM evaluations. Even if the final evaluation will have to be done with LLM-judge metrics.
@@ -376,11 +375,11 @@ Having found optimal coefficients, we now investigate two complementary improvem
376
  First, we tried to clamp the activations rather than using the natural additive scheme.
377
  Intuitively, this prevents the model from going to excessively high activations. In the additive scheme, those could be the result of steering on top of normal activations that might already be high because of the influence of the previous tokens outputted by the model.
378
 
379
- This clamping approach was the one used by Anthropic in their Golden Gate demo, but the AxBench paper reported that in their case it was less effective than the addition scheme. We decided to test it in our case.
380
 
381
  ### 4.1 Clamping
382
 
383
- We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts with 20 samples each and a maximum output length of 512 tokens.
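
The text above does not spell out the exact clamping rule, so the following is only one possible reading, given as a sketch: instead of adding $\alpha v$ on top of whatever is already there, the component of the activation along the steering direction is set to the fixed value $\alpha$. The function name `steer_clamped`, the normalization of $v$, and the toy vectors are all assumptions:

```python
import math


def steer_clamped(activation, direction, alpha):
    """Set the component of `activation` along `direction` to `alpha`.

    Assumed rule (not confirmed by the article): with u = v / ||v||,
    return x + (alpha - <x, u>) * u, leaving orthogonal components untouched.
    """
    norm = math.sqrt(sum(v * v for v in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [v / norm for v in direction]
    current = sum(x * u for x, u in zip(activation, unit))
    return [x + (alpha - current) * u for x, u in zip(activation, unit)]


# The concept is already strongly active (component 5.0 along v).
x = [1.0, 5.0, 0.5]
v = [0.0, 1.0, 0.0]
print(steer_clamped(x, v, 8.5))  # component along v becomes 8.5, not 5.0 + 8.5
```

Under this reading, an activation that is already high along the feature direction is not pushed even higher, which matches the intuition given for why clamping may preserve fluency.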
384
 
385
  import evaluation_clamp_gen from './assets/image/evaluation2_clamp_gen.png'
386
 
@@ -388,13 +387,13 @@ import evaluation_clamp_gen from './assets/image/evaluation2_clamp_gen.png'
388
 
389
  The image below shows the results of clamping compared to the additive scheme. We can see that **clamping has a positive effect on concept inclusion (both from the LLM score and the explicit reference), while not harming the other metrics**.
390
 
391
- We thus decided to prefer clamping the activation, in line with the choice made by Anthropic.
392
 
393
  ### 4.2 Generation parameters
394
 
395
  We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
396
- To mitigate that, we tried lowering the temperature and applying a repetition penalty during generation.
397
- The repetition penalty is a simple technique that consists in penalizing the logits of tokens that have already been generated, preventing the model from repeating itself.
398
  We used a penalty factor of 1.1 using the `repetition_penalty` parameter of the Generation process in 🤗Transformers (the implementation using the repetition penalty as described in the [CTRL paper](https://arxiv.org/abs/1909.05858))
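
As a rough illustration of how such a penalty acts on the logits, here is a standalone sketch of the CTRL-style rule as commonly implemented: positive logits of already-generated tokens are divided by the penalty and negative ones are multiplied by it. The function below is a hypothetical re-implementation on plain lists, not the 🤗Transformers code itself:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Lower the logits of tokens that were already generated.

    Positive logits are divided by `penalty` and non-positive logits are
    multiplied by it, so that both cases make the token less likely.
    """
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] = out[token_id] / penalty
        else:
            out[token_id] = out[token_id] * penalty
    return out


# Tokens 0 and 1 were already generated; token 2 is left untouched.
logits = [2.2, -1.0, 0.5]
print(apply_repetition_penalty(logits, generated_ids=[0, 1, 0]))
```

A factor of 1.1 is a mild penalty: it nudges the distribution away from repeats without forbidding them.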
399
 
400
  As we can see, applying a repetition penalty reduces as expected the 3-gram repetition, and has **a clear positive effect on fluency, while not harming concept inclusion and instruction following.**
@@ -406,25 +405,25 @@ As we can see, applying a repetition penalty reduces as expected the 3-gram repe
406
  Even after those improvements, we still found that steering with a single SAE feature was not very effective, with concept inclusion lying well below the maximum possible value of 2.0.
407
  Since our investigation on Neuronpedia revealed that **the Eiffel Tower concept was represented by many features in different layers**, we hypothesized that steering several of those features simultaneously could lead to better results.
408
 
409
- Indeed, it has been reported that **feature redundancy and feature splitting** are common phenomena. This happens when a concept is represented by several features that are often co-activated or are in charge of the same concept in slightly different contexts. The sparsity constraint used during SAE training tends to favor such splitting, as it is often more efficient to use several features that activate less often, than a single feature that would activate more often.
410
 
411
- Those phenomena mean that **steering only one of those features might be insufficient to fully activate the concept, or to activate it consistently across different prompts.** Moreover, activating one feature without the others might cause loss of fluency, as the model might experience activation patterns that are out of distribution compared to what it was trained on.
412
 
413
  ### 5.1 Layer and features selection
414
- Overall, **we identified 19 candidate features**, located in layers 3, 7, 11, 15, 19, 23, and 27. Note that those layers were the only ones for which SAEs were available, so it is likely that other features representing the Eiffel Tower exist in other layers.
415
 
416
- We looked for those features using the search tool in Neuronpedia, and selected them based on their top activating prompts in the dataset. We kept only those features that unambiguously referenced the Eiffel Tower, and discarded features that seemed to be more generally about Paris, towers, famous landmarks in big cities, or simply tokens like "E" or "iff".
417
 
418
- Among those 19 features, we selected all the features located in the intermediary layers 11, 15, 19 and 23. We decided to leave aside features in earlier layers (six features in layer 3 and three features layer 7) or latest layers (two features in layer 27). This choice is motivated by the observations that features in intermediary layers are more likely to represent abstract high-level concepts. This led us to select 8 candidate feature for our multi-layer steering.
419
 
420
  ### 5.2 Optimization methodology
421
 
422
- Finding the optmal steering coefficients for multiple features is a challenging optimization problem.
423
  First, the parameter space grows with the number of features, making grid search or random search quickly intractable.
424
  Second, the target function (the harmonic mean of LLM-judge metrics) is noisy and non-differentiable, making gradient-based optimization impossible.
425
  Finally, evaluating the target function is costly, as it requires generating answers from the steered model and evaluating them with an LLM judge.
426
 
427
- To tackle those challenges, we decided to rely on **bayesian optimization** to search for the optimal steering coefficients, and we devised an auxiliary cost function to guide the optimization when the harmonic mean is zero.
428
 
429
  #### 5.2.1 Cost function
430
 
@@ -434,45 +433,45 @@ To mitigate that, we decided to define an auxiliary cost function that would be
434
  $$
435
  \mathrm{cost} = |\mathrm{surprise} - s_0| + k\ \text{rep3}
436
  $$
437
- We selected target surprise $s_0$ and weight $k$ that maximally correlates with the mean of LLM judge metrics (leading to $s_0 = 1.2$ and $k=3$).
438
 
439
  Overall, our cost function was defined as the harmonic mean of LLM-judge metrics, and we penalized it with a small fraction (0.05) of the auxiliary cost when the harmonic mean was zero, in order to give some signal to the optimizer.
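The combination described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sign convention (minimizing the negative harmonic mean, and adding the scaled auxiliary cost only when the harmonic mean is zero) is an assumption, since the text does not spell it out. The values $s_0 = 1.2$, $k = 3$ and the 0.05 fraction are the ones quoted in the text.

```python
def harmonic_mean(scores):
    """Harmonic mean of non-negative scores; zero as soon as one score is zero."""
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


def auxiliary_cost(surprise, rep3, s0=1.2, k=3.0):
    """cost = |surprise - s0| + k * rep3, with s0 and k as quoted in the text."""
    return abs(surprise - s0) + k * rep3


def objective(judge_scores, surprise, rep3, aux_weight=0.05):
    """Quantity to minimize (the sign convention is an assumption of this sketch)."""
    hm = harmonic_mean(judge_scores)
    if hm == 0.0:
        # No signal from the judges: fall back on a small fraction of the auxiliary cost.
        return aux_weight * auxiliary_cost(surprise, rep3)
    return -hm
```

With this convention, any sample with a non-zero harmonic mean scores below zero, while failed samples are ranked by how far their surprise and repetition are from the target values.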
440
 
441
  #### 5.2.2 Dealing with noise
442
 
443
  In principle, we want to minimize *the expected value of our target function over the distribution of prompts and samples*.
444
- But each call to the steered model will effectively only give a noisy estimate of that target, evaluated on a single prompt and one sample.
445
 
446
- We are in a situation of a black-box optimization, where each evaluation of the target function is costly (as it involves generating a full answer from the model) and noisy (as it depends on the prompt and the sample). To tackle this, we decided to rely on **bayesian optimization**.
447
 
448
- Bayesian Optimization (BO) is known to be well-suited for multidimensional non-differentiable costly blackbox optimization, while being able to handle noisy evaluations. To mitigate the noise, we could have averaged the target function over several prompts and samples, but this would have been costly, especially when evaluating points that are not promising. For very noisy function, performing bayesian optimization directly on the raw function is known to be more effective than averaging multiple noisy evaluations for each point.
449
 
450
  #### 5.2.3 Bayesian optimization
451
 
452
- The idea beyond BO is to build a surrogate model of the target function using a Gaussian Process (GP), and use that surrogate to select promising candidates to evaluate next. As we evaluate new points, we update the GP model, and iteratively refine our surrogate of the target function.
453
 
454
- For that, we used the BoTorch library, which provides a flexible framework to perform BO using PyTorch. More details are given in appendix.
455
 
456
  ### 5.3 Results of multi-layer optimization
457
 
458
- We performed optimisation using 2 features (from layer 15 and layer 19) and then 8 features (from layers 11, 15, 19 and 23), following the idea that steering the upper-middle layer is likely to be more effective to activate high-level concepts.
459
 
460
  Results are shown below and compared to single-layer steering.
461
 
462
  import evaluation_final from './assets/image/evaluation3_multiD.png'
463
 
464
- <Image src={evaluation_final} alt="Comparison of single-layer and m_muulti-layer steering" caption="Comparison of single-layer and multi-layer steering." />
465
 
466
- As we can see on the chart, steering 2 or even 8 features simultaneously only leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-layer steering. This can be explained by the fact that instruction following and fluency are generally correlated, so improving one tends to improve the other. Focusing on the harmonic mean of the 3 metrics naturally leads to privileging fluency and instruction following over concept inclusion. Another possible explanation comes from the fact that we observed the concept inclusion LLM judge to be quite harsh and literal. Sometimes mention of Paris or a large metal structure were not considered as valid references to the Eiffel Tower, which could explain the low concept inclusion scores.
467
 
468
- Overall, those disappointing results contradicts our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency. One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize. Another explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept.
469
 
470
 
471
  ## Conclusion & Discussion
472
 
473
  ### Main conclusions
474
 
475
- In this study, we have shown how to use sparse autoencoders to steer a lightweight open source model (Llama 3.1 8B Instruct) to create a conversational agent obsessed with the Eiffel Tower, similar to the Golden Gate Claude experiment. As reported by the AxBench paper, and as can be experienced on Neuronpedia, steering with SAEs is harder than we might have thought, and finding good steering coefficients is not easy.
476
 
477
  We first showed that simple improvements like clamping feature activations and using repetition penalty and lower temperature can help significantly. We then devised a systematic approach to optimize steering coefficients using bayesian optimization, and auxiliary metrics correlated with LLM-judge metrics.
478
 
@@ -482,19 +481,19 @@ A way to explain this lack of improvement could be that the selected features ar
482
 
483
  Overall, our results seem less discouraging than those of AxBench, and show that steering with SAEs can be effective, using clamping, a slightly different generation procedure and possibly combining multiple features. However, at this stage, those results are hard to generalize and our work is not really comparable to the AxBench results, since they use a different model, different concepts, different SAEs (Gemma Scope vs. Andy Arditi's) and different prompts. This is in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method.
484
 
485
- TODO : embed a demo
486
 
487
  ### Possible next steps
488
 
489
  Possible next steps:
490
  - Failure analysis on the cases where steering fails (about 20% have at least one zero metric)
491
- - Check other layers for 1D optimisation
492
- - Check complementary vs redudancy by monitoring activation changes in subsequent layer's features.
493
  - Try other concepts, see if results are similar
494
- - Try on larger models, see if results are better
495
- - Vary the temporal steering pattern : steer prompt only, or answer only, or periodic steering
496
- - Try to include earlier and latest layers, see if it helps
497
- - Investigate clamping : why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could think it prevents extreme activations, but it could also counteract some negative feedback behavior, when other parts of the model try to compensate for the added steering vector. (analogy with biology, where signaling pathways are often regulated by negative feedback loops)
498
 
499
 
500
 
@@ -526,7 +525,7 @@ We used the reduced parameterization presented earlier, searching for an optimal
526
  To favor noise reduction at promising locations, every 5 steps we decided to resample the best point found so far.
527
  In that case, by *best* we mean the point with the lowest GP posterior $\mu(x)$. (Note that this is different from the point with the lowest observed value which might be a lucky noisy outlier).
528
 
529
- #### 5.2.4 Gradient descent
530
 
531
  Performing gradient descent on the GP posterior is very cheap since it only involves differentiating the kernel function.
532
- We thus performed gradient descent starting from 500 random points in the parameter space, and optimized using a target being higher confidence bound $\mu(x) + \beta\sigma(x)$, to favor points that are not only predicted to be good, but also with low uncertainty. We then performed a clustering to group together the points that converged to the same local minimum, and selected the best cluster as candidate for evaluation.
 
34
  import Glossary from '../components/Glossary.astro';
35
  import Stack from '../components/Stack.astro';
36
 
37
+ In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude).
38
+ This experiment was meant to showcase the possibility of steering the behavior of a large language model using *sparse autoencoders* trained on the internal activations of the model [@templeton2024scaling].
39
 
40
+ Although this demo led to hilarious conversations that have been widely shared on social media, it was shut down after 24 hours.
41
 
42
  import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'
43
 
44
  <Image src={ggc_snowhite} alt="Sample image with optimization"
45
  caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />
46
 
47
+ Since then, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma].
48
+ However, as far as I know, nobody has tried to reproduce something similar to the Golden Gate Claude demo. Moreover, recently the AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*. How can we reconcile those two facts?
49
 
50
+ The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but using a lightweight open-source model**. For this we used *Llama 3.1 8B Instruct*, but since I live in Paris...let’s make it obsessed with the Eiffel Tower!
51
 
52
+ By doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering.
53
 
54
+ Our main findings are:
55
 
56
+ - Optimal steering coefficients are found to be about half the typical activation magnitude at the steering layer, less than Anthropic suggested.
57
+ - Overall performance remains low compared to simple prompting baselines that explicitly instruct the model to reference the target concept. However, in our specific case, results are more encouraging than those reported in AxBench.
58
  - Clamping rather than adding steering vectors significantly improves concept reference, while maintaining fluency.
59
  - Contrary to our initial hypothesis, steering using multiple features simultaneously leads to only marginal improvements.
60
 
61
  ## 1. Steering with SAEs
62
 
63
+ ### 1.1 Model steering and sparse autoencoders
64
 
65
  Steering a model consists in modifying its internal activations *during generation*, in order to change its behavior.
66
+ This differs from fine-tuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
67
 
68
  Most of the time, steering involves adding a vector to the internal activations at a given layer, either on the residual stream or on the output of the attention or MLP blocks.
69
  More specifically, if $x^l$ is the vector of activation at layer $l$, steering consists in adding a vector $v$ that is generally scaled by a coefficient $\alpha$,
 
75
  The question is then how to find a suitable steering vector $v$ that would represent the desired concept.
76
  Several methods have been proposed, for instance computing a steering vector from the difference of average activations between two sets of prompts (one set representing the concept, the other not).
77
 
78
+ However, a more principled approach is to use **sparse autoencoders (SAEs)**, which are trained to learn a sparse representation of the internal activations of a model.
79
  SAEs are trained in an unsupervised manner, on the activations of a model on a large corpus of text.
80
  The idea is that the learned representation will capture the main features of the activations, and that some of those features will correspond to meaningful concepts.
81
 
82
+ After training, SAEs provide a dictionary of features, each represented by a vector in the original activation space, but those features do not come with labels or meanings.
83
+ To identify the meaning of a feature, we can look at the logits it tends to promote, or at the prompts that lead to the highest activations of that feature.
84
  This interpretation step is tedious, but can be greatly facilitated by using auto-interpretability techniques based on large language models.
85
 
86
  SAEs were introduced in the context of mechanistic interpretability and have been used since then by several teams to analyze large language models.
87
  Interestingly, SAEs can be used to provide steering vectors using the columns of the decoder matrix, which are vectors in the original activation space.
88
+ As shown in the Golden Gate Claude demo, those vectors can be used to steer the model toward a certain concept.
89
 
90
  ### 1.2 Neuronpedia
91
 
92
+ To experience steering a model yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode.
93
 
94
+ Neuronpedia is designed for sharing research results in mechanistic interpretability, and makes it possible to experiment with and steer open-source models using publicly shared SAEs.
95
 
96
+ We will be using Llama 3.1 8B Instruct, and [SAEs published by Andy Arditi](https://huggingface.co/andyrdt/saes-llama-3.1-8b-instruct). Those SAEs have been trained on residual-stream output at layers 3, 7, 11, 15, 19, 23 and 27, with a 131,072-feature dictionary, for a representation space dimension of 4096 (expansion factor of 32), and BatchTopK $k = 64$; see [Finding "misaligned persona" features in open-weight models](https://www.lesswrong.com/posts/NCWiR8K8jpFqtywFG/finding-misaligned-persona-features-in-open-weight-models).
 
97
 
98
+ Thanks to the search interface on Neuronpedia, we can look for candidate features representing the Eiffel Tower. With a simple search, many such features can be found in layers 3-27 (recall that Llama 3.1 8B has 32 layers).
99
 
100
+ According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
101
+ So the common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to be representing higher-level abstract concepts.
102
  Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one exactly since their architecture is not public.
103
+ Since Llama 3.1 8B has 32 layers, we decided to look at layer 15. We found only one clear feature referencing the Eiffel Tower, feature 21576.
104
 
105
+ The corresponding Neuronpedia page is included below. In particular, we can see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
106
 
107
  <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 920px;"></iframe>
108
 
109
+ In the training dataset, the maximum activation observed for that feature was 4.77.
110
 
111
  Thanks to the Neuronpedia interface, you can try to steer a feature and experience a conversation with the corresponding model.
112
+ However, doing so, you might quickly realize that **finding the proper steering coefficient is far from obvious**.
113
 
114
+ Low values generally lead to no clearly visible effect, while higher values quickly produce repetitive gibberish.
115
+ There seems to be only a narrow sweet spot where the model behaves as expected. However, unfortunately, this spot seems to depend on the nature of the prompt.
116
 
117
+ For instance, we can see below that on the "*Who are you?*" prompt, steering with coefficient 8.0 leads to a good result (with the model pretending to be a large metal structure), but increasing that coefficient up to 11.0 leads to repetitive gibberish on the exact same prompt.
118
 
119
  import neuronpedia_who from './assets/image/neuronpedia_who.png'
120
 
121
  <Image src={neuronpedia_who} alt="Sample image with optimization"
122
  caption="Screenshots from conversations on Neuronpedia when steering layer 15 feature 21576 of Llama 3.1 8B Instruct" />
123
 
124
+ However, things are not as clear with a different input. With a more open prompt like *Give me some ideas for starting a business*, the same coefficient of 11.0 leads to a clear mention of the Eiffel Tower while a coefficient of 8.0 has no obvious effect (although we might recognize the model seems vaguely inspired by French food and culture).
125
 
126
  import neuronpedia_business from './assets/image/neuronpedia_business.png'
127
 
128
  <Image src={neuronpedia_business} alt="Sample image with optimization"
129
  caption="Screenshots from conversations on Neuronpedia when steering layer 15 feature 21576 of Llama 3.1 8B Instruct" />
130
 
131
+ In their own paper, Anthropic mentioned using values ranging from **5 to 10 times the maximum observed activation**. In our case, the maximum observed activation is 4.77, so that would mean using values between about 25 and 50. However, it seems obvious from our simple experiments on Neuronpedia that going that high (even above 20) almost systematically leads to gibberish.
132
 
133
+ It seems that (at least with a small open-source model) **steering with SAEs is harder than we might have thought**.
134
 
135
  ### 1.3 The AxBench paper
136
 
137
+ In January 2025, the [AxBench](https://arxiv.org/abs/2501.17148) paper benchmarked several steering procedures, and indeed found using SAEs to be one of the least effective methods.
138
+ Using Gemma Scope (SAEs trained on Gemma 2 2B and 9B), they found that it is almost impossible to steer the model in such a way that it consistently references the target concept, while simultaneously maintaining fluency and instruction following behavior.
139
 
140
  To quote their conclusion:
141
  <Quote source="Wu et al. <a href='https://arxiv.org/abs/2501.17148'>AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders</a>">
142
+ Our evaluation shows that even at SAE scale, representation steering is still ***far behind*** simple prompting and fine-tuning baselines.
143
  </Quote>
144
 
145
  That statement seems hard to reconcile with the efficiency of the Golden Gate Claude demo.
146
+ Is it because Anthropic used a much larger model (Claude 3)?
147
+ Or because they carefully selected a feature that was particularly well suited for the task?
148
 
149
  To get a better understanding of the situation, let's try to reproduce a Golden Gate Claude-like experiment with a systematic approach,
150
  and see if we can improve on the baseline steering method as implemented on Neuronpedia.
151
 
152
  ### 1.4 Approach
153
 
154
+ In this paper, we will try to steer Llama 3.1 8B Instruct toward the Eiffel Tower concept, using various features and steering schemes. Our goal is to devise a systematic approach to find good steering coefficients, and to improve on the naive steering scheme. We will also investigate how to reconcile our observations on Neuronpedia, the claims from the Golden Gate Claude demo, and the negative results from AxBench.
155
 
156
+ However, for this, we will need rigorous metrics to evaluate the quality of our steered models and compare them to baselines.
157
 
158
  ## 2. Metrics, we need metrics!
159
 
160
+ To assess the quality of a steered model such as our *Eiffel Tower Llama*, we cannot rely solely on our subjective feelings.
161
  Especially since we will have to choose a good value for steering strength, we need some metrics for evaluation.
162
  First, let's not reinvent the wheel and use the same metrics as AxBench.
163
 
 
166
  The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed to judge the performance of a steering technique using an LLM-as-a-judge.
167
  An LLM is in charge of rating the output of the steered model along three independent criteria: **concept inclusion, instruction following, and fluency.**
168
 
169
+ For that, they prompted *GPT-4o mini* to act as a judge and assess independently whether the provided answer to an instruction:
170
  - references the steered concept (in our case, the Eiffel Tower);
171
  - is a reasonable answer to the instruction;
172
  - exhibits a high level of fluency.
 
190
  {answer}
191
  [Text Fragment End]
192
  ```
193
+ Similar prompts are used for fluency and instruction following, leading to our three LLM-judge metrics. Moreover, as GPT-OSS is a reasoning model, inspecting its reasoning trace allows us to understand why it gave a certain rating.
194
 
195
  Note that for a reference baseline model, the expected value of the concept inclusion metric is 0, while instruction following and fluency are expected to be at 2.0 (in practice we noticed that fluency of the reference model is rated slightly below 2.0).
196
 
197
  To synthesize the performance of a steering method, the AxBench paper suggested using **the harmonic mean of those three metrics**.
198
+ Since a zero in any of the individual metrics leads to a zero harmonic mean, the underlying idea with this aggregate is to heavily penalize methods that perform poorly on at least one of the metrics.
199
 
200
  On their benchmark, they found for instance that steering with SAEs led to a harmonic mean of about 0.2, much lower than simple baselines like prompting at about 0.9 (for a maximum of 2.0).
201
 
202
  ### 2.2 Evaluation prompts
203
 
204
+ To evaluate our steered model, we need a set of prompts to generate answers for. Following the AxBench paper, we decided to use the Alpaca Eval dataset.
205
+ As this dataset consists of about 800 instructions, we decided to split it randomly into two halves of 400 instructions each.
206
  One half will be used for optimizing the steering coefficients and other hyperparameters, while the other half will be used for final evaluation. For final evaluation, we generated answers up to 512 tokens.
207
 
208
+ We used the simple system prompt *"You are a helpful assistant."* for all our experiments. However, for comparing steering methods with the simple prompting baseline, we used the prompt
209
 
210
  *"You are a helpful assistant. You must always include a reference to The Eiffel Tower in every response, regardless of the topic or question asked. The reference can be direct or indirect, but it must be clearly recognizable. Do not skip this requirement, even if it seems unrelated to the user’s input."*.
211
 
 
213
 
214
  Although LLM-judge metrics provide a recognized assessment of the quality of the answers, those metrics have two drawbacks.
215
  First, they are costly to compute, as each evaluation requires three calls to a large language model.
216
+ Second, their scale is discrete and limited to three values, which makes it hard to use them as a target for numerical optimization. Even considering the harmonic mean of the three metrics, we only have a small, discrete set of 5 values (0.0, 1.0, 1.2, 1.5, 2.0).
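The five possible values of the harmonic mean can be checked by brute force. The short sketch below assumes that each LLM-judge metric takes a value in {0, 1, 2} (consistent with the three-value scale and the maximum of 2.0 mentioned in the text) and that the harmonic mean is defined as zero whenever one metric is zero; the helper name is hypothetical.

```python
from itertools import product


def harmonic_mean(scores):
    """Harmonic mean of non-negative scores; zero as soon as one score is zero."""
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Enumerate all 27 combinations of three scores in {0, 1, 2}
# and collect the distinct harmonic means.
values = sorted({round(harmonic_mean(t), 6) for t in product((0, 1, 2), repeat=3)})
print(values)  # [0.0, 1.0, 1.2, 1.5, 2.0]
```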
217
 
218
  Because of this, we considered **auxiliary metrics that could help us monitor the impact of our interventions, and be a useful target to guide numerical optimization**. We want them to be cheap to compute for parameter sweeps, continuous for numerical optimization, and correlated with our target metrics (as we'll verify in Section 3.5).
219
 
220
  #### 2.3.1 Surprise within the reference model
221
 
222
  Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had a low probability in the reference model.
223
+ For that we decided to monitor the negative log probability (per token) under the reference model, which represents the surprise in the reference model. (This is also essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-component of the KL divergence.)
224
 
225
+ Although the negative log prob seems an interesting metric to monitor, note that we don't necessarily want to bring it to extreme values. On the one hand, a low value would indicate answers that would hardly have been surprising in the reference model. On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.
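As a minimal sketch, the surprise metric is just an average. The function below assumes that the log-probabilities of the generated tokens have already been computed by scoring the steered answer with the unsteered reference model; how they are obtained is outside this sketch, and the function name is hypothetical.

```python
def surprise_per_token(token_logprobs):
    """Mean negative log-probability of the answer tokens under the reference model."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


# Three tokens, increasingly unlikely under the reference model.
print(surprise_per_token([-1.0, -2.0, -3.0]))  # 2.0
```

Averaging per token rather than summing keeps the metric comparable between short and long answers.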
226
 
227
  #### 2.3.2 n-gram repetition
228
 
229
+ We can see from our experiments on Neuronpedia that steering too hard often leads to repetitive gibberish.
230
+ To detect this, we decided to monitor **the fraction of repeated n-grams in the answers**.
231
  Using n=3 already leads to interesting insights, as it captures repetitions of words and short phrases.
232
+ We thus monitored the ratio of repeated 3-grams over total 3-grams in the answer. A value of 0.0 means that there is no repetition at all. For short answers, values above 0.2 tend to correspond to annoying repetitions that impair the fluency of the answer.
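One way to compute such a ratio is sketched below. Two choices here are assumptions rather than the article's exact implementation: the text is split on whitespace instead of using the model tokenizer, and "repeated" 3-grams are counted as the total number of 3-grams minus the number of distinct ones.

```python
def repeated_ngram_ratio(text, n=3):
    """Fraction of n-grams that repeat an n-gram already seen in the text."""
    tokens = text.split()  # whitespace split: an assumption of this sketch
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens: nothing can repeat
    return 1.0 - len(set(ngrams)) / len(ngrams)


print(repeated_ngram_ratio("the tower is tall and made of iron"))  # 0.0
print(repeated_ngram_ratio("E E E E E"))  # about 0.67
```

With this definition, a degenerate output such as "E E E E E ..." tends toward 1.0 as it gets longer, while a text without any repeated 3-gram scores exactly 0.0.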
233
 
234
  #### 2.3.3 Explicit concept inclusion
235
 
236
  Finally, and as an objective auxiliary metric to monitor concept inclusion, we simply looked for **the occurrence of the word *eiffel* in the answer** (case-insensitive).
237
+ We acknowledge that this is a very crude metric, and probably too pessimistic as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*.
238
+ (For instance, when referring to *a large metal structure built in Paris.*) Of course, as this metric is hard to generalize to other concepts, we will not use it beyond simple monitoring.
239
 
240
 
241
  ## 3. Optimizing steering coefficient for a single feature
 
247
  x^l \to x^l + \alpha v
248
  $$
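To make the update rule concrete, here is a toy NumPy sketch of additive steering applied to an array of activations. It is not the hooking code used with a real model: the shapes, the function name and the idea of applying the same vector at every token position are assumptions made for illustration.

```python
import numpy as np


def steer_additive(x, v, alpha):
    """Return x + alpha * v, broadcasting the steering vector over token positions."""
    x = np.asarray(x, dtype=float)  # activations, shape (seq_len, d_model)
    v = np.asarray(v, dtype=float)  # steering vector, shape (d_model,)
    if x.shape[-1] != v.shape[-1]:
        raise ValueError("steering vector and activations must share the last dimension")
    return x + alpha * v


# Toy example: 2 token positions, 4-dimensional activations.
x = np.zeros((2, 4))
v = np.array([1.0, 0.0, 0.0, 0.0])
print(steer_additive(x, v, alpha=8.5))
```

In a real setting, $v$ would be a column of the SAE decoder matrix and the addition would be applied to the chosen layer's output at generation time.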
249
 
250
+ However, as we have seen on Neuronpedia, it is not easy to find a good value for $\alpha$ that would work well across prompts.
251
  To find the optimal coefficient, we performed a sweep over a range of values for $\alpha$ and evaluated the resulting model using the six metrics described in the previous section.
252
 
253
  ### 3.1 Steering with nnsight
254
 
255
+ We used the `nnsight` library to perform the steering and generation.
256
+ This library, developed by NDIF, makes it easy to monitor and manipulate the internal activations of transformer models during generation.
257
 
258
 
259
  ### 3.2 Range of steering coefficients
260
 
261
+ Our goal in this first sweep was to find a steering coefficient that would lead to a significant activation of the steering feature, but without going too far and producing gibberish.
262
 
263
  To avoid completely disrupting the activations during steering, we expect the magnitude of the added vector to be at most of the order of the norm of the typical activation,
264
  $$
 
286
 
287
  ### 3.3 Results of a 1D grid search sweep
288
 
289
+ For a first grid search, we used a set of 50 prompts, with temperature set to 1.0 and the maximum number of generated tokens set to 256.
290
 
291
  The image below shows the results for each of our six metrics of the sweep over $\alpha$ for the feature 21576 in layer 15.
292
  The left column displays the three LLM-judge metrics, while the right column shows our three auxiliary metrics. On those charts, we can observe several regimes corresponding to essentially three ranges of the steering coefficient.
 
303
  However, this comes at the cost of a decrease in instruction following and fluency.**
304
  The decrease of those metrics occurs rather abruptly, indicating that there is a threshold effect.
305
  The log probability under the reference model also starts to decrease, indicating that the model is producing more surprising answers.
306
+ The repetition metric increases, alongside the decrease in fluency.
307
  We can notice that **the threshold lies between $\alpha=7$ and $\alpha=9$, which is roughly half the typical activation magnitude at that layer** (15).
308
  It reveals that in that case, steering with a coefficient of about half the original activation magnitude is what is required to significantly change the behavior of the model.
309
 
 
311
  Fluency and instruction following plummet to zero, as the model is producing gibberish, which is confirmed by the repetition metric.
312
  Inspection of the answers shows that the model is producing repetitive patterns like "E E E E E ...". (Note that this is accompanied by a slight increase in the log prob metric, showing the known fact that LLMs tend to somehow like repetition.)
313
 
314
+ Those metrics show that we face a fundamental trade-off: stronger steering increases concept inclusion but degrades fluency, and finding the balance is the challenge. This is further complicated by the very large standard deviation: for a given steering coefficient, some prompts lead to good results while others completely fail. Even though all metrics somehow tell the same story, we have to decide how to select the optimal steering coefficient. We could simply use the mean of the three LLM judge metrics, but we can easily see that this would lead us to select the unsteered model (low $\alpha$) as the best model, which is not what we want. For that, we can use **the harmonic mean criterion proposed by AxBench**.
315
 
316
  import harmonic_mean_curve from './assets/image/sweep_1D_harmonic_mean.png'
317
 
318
  <Image src={harmonic_mean_curve} alt="Arithmetic (left) and harmonic (right) mean of the three LLM-judge metrics as a function of steering coefficient." caption="Arithmetic (left) and harmonic (right) mean of the three LLM-judge metrics as a function of steering coefficient." />
319
 
320
+ First, the results show the harmonic mean curve is very noisy. Despite the fact that we used 50 prompts to evaluate each point, the inherent discreteness of the LLM-judge metrics and the stochasticity of LLM generation lead to a noisy harmonic mean. This is something to keep in mind when trying to optimize steering coefficients.
321
 
322
  Still, from that curve, we can select the optimal $\alpha = 8.5$. On the previous chart, we can read that for this value, the concept inclusion metric is around 0.75, while instruction following is 1.5 and fluency around 1.0.
323
 
324
+ Even with this optimal coefficient, these values are hardly satisfactory, indicating that the model struggles to reference the concept while maintaining a reasonable level of fluency and instruction following.
325
  This conclusion is in line with the results from AxBench showing that steering with SAEs is not very effective, as **concept inclusion comes at the cost of instruction following and fluency.**
326
 
327
  Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
 
334
 
335
  <Image src={evaluation1_naive} alt="Detailed evaluation of steering with single feature" caption="Detailed evaluation of steering with single feature at optimal coefficient."/>
336
 
337
+ We can see that on all metrics, **the baseline prompted model significantly outperforms the steered model.** This is consistent with the AxBench finding that steering with SAEs is not very effective. However, our numbers are not as dire as theirs. The steered model reaches only an average concept inclusion score compared to the reference model (1.03), while maintaining a reasonable level of instruction following (1.35). However, this comes at the price of a fluency drop (0.78 vs. 1.55 for the prompted model), as fluency is impaired by repetitions (0.27) or awkward phrasing.
338
 
339
+ Overall, the harmonic mean of the three LLM-judge metrics is 1.67 for the prompted model, against 0.344 for the steered model.
340
 
341
  <Note type="info">
342
+ As can be seen on the bar chart, the fact that the evaluation is noisy leads to frighteningly large error bars, especially for the LLM-judge metrics and the harmonic mean. It is thus worth briefly discussing the statistical significance of those results. In general, for a two-sample t-test with a total of $N$ samples across both groups, the critical effect size (Cohen's d) to reach significance at level $p < 0.05$ is $d = 1.96 \cdot \frac{2}{\sqrt{N}}$. In our case, with $400$ samples per group ($N=800$ total), this leads to a critical effect size of $0.14$. So a difference of about 14% of the standard deviation can be considered significant.
343
  </Note>
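The threshold quoted in the note can be checked in a few lines. This is a sketch under the same assumptions as the note: two groups of equal size, and the normal approximation $z = 1.96$ for a two-sided test at the 5% level. The function name is hypothetical.

```python
import math


def critical_effect_size(n_total, z=1.96):
    """Critical Cohen's d for two equal-sized groups totalling n_total samples."""
    return z * 2.0 / math.sqrt(n_total)


d_crit = critical_effect_size(800)
print(round(d_crit, 2))  # 0.14
```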
344
 
345
  ### 3.5 Correlations between metrics
 
353
  The matrix above shows several interesting correlations.
354
  First, **LLM instruction following and fluency are highly correlated** (0.8), which is not surprising as both metrics
355
  capture the overall quality of the answer.
356
+ However, as observed in our results, they are unfortunately **anticorrelated with concept inclusion**, showing the tradeoff between steering strength and answer quality.
357
 
358
  The explicit inclusion metric (presence of the word 'eiffel') is only partially correlated with the LLM-judge concept inclusion metric (0.45), showing that the model can apparently reference the Eiffel Tower without explicitly mentioning it (we've also seen that sometimes Eiffel was misspelled but that was still considered as a valid reference by the LLM judge).
359
 
360
+ We see that the **repetition metric is strongly anticorrelated with fluency and instruction following** (-0.9 for both).
361
 
362
+ Finally, negative log probability under the reference model is partially linked to fluency and instruction following (since more surprising answers are often less fluent), but also to concept inclusion, reflecting that referencing the Eiffel Tower often leads to more surprising answers.
363
 
364
  From this analysis, we can see that **although the LLM-as-a-judge metrics are the most reliable, the auxiliary metrics can provide useful information about the quality of the answers**.
365
  This is useful as it means we can use them as a guide for optimization without having to rely on costly LLM evaluations, even if the final evaluation still has to be done with LLM-judge metrics.
 
375
  First, we tried to clamp the activations rather than using the natural additive scheme.
376
  Intuitively, this prevents the model from going to excessively high activations. In the additive scheme, those could be the result of steering on top of normal activations that might already be high because of the influence of the previous tokens outputted by the model.
377
 
378
+ This clamping approach was the one used by Anthropic in their Golden Gate demo, but the AxBench paper reported that in their case it was less effective than the additive scheme. We decided to test it in our case.
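The difference between the two schemes can be sketched as follows, under a strong simplification: the feature activation is approximated by the projection of the hidden state onto a unit-norm steering direction, whereas an actual implementation would read it from the SAE encoder. The function names and the toy vectors are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def steer_additive(hidden, direction, alpha):
    """Add alpha * direction on top of whatever is already in `hidden`."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def steer_clamp(hidden, direction, alpha):
    """Set the component of `hidden` along `direction` to exactly alpha.

    Assumes `direction` has unit norm (simplification, see lead-in).
    """
    current = dot(hidden, direction)
    return [h + (alpha - current) * d for h, d in zip(hidden, direction)]


# Toy example: the hidden state already has a large component (5.0) along the feature.
hidden = [5.0, 2.0]
direction = [1.0, 0.0]
added = steer_additive(hidden, direction, 8.5)
clamped = steer_clamp(hidden, direction, 8.5)
print(added, clamped)
```

In this toy case the additive scheme stacks the steering on top of the existing activation, while clamping caps the component at the target value and leaves the rest of the hidden state untouched.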
379
 
380
  ### 4.1 Clamping
381
 
382
+ We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts with 20 samples each and a maximum output length of 512 tokens.
383
 
384
  import evaluation_clamp_gen from './assets/image/evaluation2_clamp_gen.png'
385
 
 
387
 
388
  The image below shows the results of clamping compared to the additive scheme. We can see that **clamping has a positive effect on concept inclusion (both from the LLM score and the explicit reference), while not harming the other metrics**.
389
 
390
+ We therefore opted for clamping, in line with the choice made by Anthropic.
391
 
392
  ### 4.2 Generation parameters
393
 
394
  We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
395
+ To mitigate this, we tried lowering the temperature and applying a repetition penalty during generation.
396
+ This is a simple technique that consists of penalizing the logit of tokens that have already been generated, preventing the model from repeating itself.
397
  We used a penalty factor of 1.1 using the `repetition_penalty` parameter of the generation process in 🤗Transformers (the implementation follows the repetition penalty described in the [CTRL paper](https://arxiv.org/abs/1909.05858)).
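For intuition, here is a minimal pure-Python sketch of a CTRL-style repetition penalty applied to a vector of logits. It illustrates the principle only and is not the 🤗Transformers code itself; the function name is hypothetical.

```python
def apply_repetition_penalty(logits, generated_token_ids, penalty=1.1):
    """Return a copy of `logits` where already-generated tokens are made less likely."""
    penalized = list(logits)
    for token_id in set(generated_token_ids):
        score = penalized[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


logits = [2.0, -1.0, 0.5]
# Tokens 0 and 1 were already generated; token 2 is left unchanged.
new_logits = apply_repetition_penalty(logits, generated_token_ids=[0, 1, 0])
print(new_logits)
```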
398
 
399
  As we can see, applying a repetition penalty reduces the 3-gram repetition as expected, and has **a clear positive effect on fluency, while not harming concept inclusion and instruction following.**
 
405
  Even after those improvements, we still found that steering with a single SAE feature was not very effective, with concept inclusion lying well below the maximum possible value of 2.0.
406
  Since our investigation on Neuronpedia revealed that **the Eiffel Tower concept was represented by many features in different layers**, we hypothesized that steering several of those features simultaneously could lead to better results.
407
 
408
+ Indeed, **feature redundancy and feature splitting** have been reported as common phenomena. This happens when a concept is represented by several features that are often co-activated or that are responsible for the same concept in slightly different contexts. The sparsity constraint used during SAE training tends to favor such splitting, as it is often more efficient to use several features that activate less often than a single feature that would activate more often.
409
 
410
+ These phenomena suggest that **steering only one of those features might be insufficient to fully activate the concept, or to activate it consistently across different prompts.** Moreover, activating one feature without the others might cause loss of fluency, as the model might experience activation patterns that are out of distribution compared to what it was trained on.
411
 
412
  ### 5.1 Layer and features selection
413
+ In total, **we identified 19 candidate features**, located in layers 3, 7, 11, 15, 19, 23, and 27. Note that those layers were the only ones for which SAEs were available, so it is likely that other features representing the Eiffel Tower exist in other layers.
414
 
415
+ We looked for those features using the search tool in Neuronpedia, and selected them based on their top activating prompts in the dataset. We kept only those features that unambiguously referenced the Eiffel Tower, and discarded features that seemed to be more generally about Paris, towers, famous landmarks in big cities, or simply tokens like "E" or "iff".
416
 
417
+ Among those 19 features, we selected all the features located in the intermediate layers 11, 15, 19 and 23. We decided to leave out features in earlier layers (six features in layer 3 and three features in layer 7) or later layers (two features in layer 27). This choice is motivated by the observation that features in intermediate layers are more likely to represent abstract high-level concepts. This led us to select 8 candidate features for our multi-layer steering.
418
 
419
  ### 5.2 Optimization methodology
420
 
421
+ Finding the optimal steering coefficients for multiple features is a challenging optimization problem.
422
  First, the parameter space grows with the number of features, making grid search or random search quickly intractable.
423
  Second, the target function (the harmonic mean of LLM-judge metrics) is noisy and non-differentiable, making gradient-based optimization impossible.
424
  Finally, evaluating the target function is costly, as it requires generating answers from the steered model and evaluating them with an LLM judge.
425
 
426
+ To tackle those challenges, we decided to rely on **Bayesian optimization** to search for the optimal steering coefficients, and we devised an auxiliary cost function to guide the optimization when the harmonic mean is zero.
427
 
428
  #### 5.2.1 Cost function
429
 
 
433
  $$
434
  \mathrm{cost} = |\mathrm{surprise} - s_0| + k\ \text{rep3}
435
  $$
436
+ We selected target surprise $s_0$ and weight $k$ such that this cost maximally correlates with the mean of LLM judge metrics (leading to $s_0 = 1.2$ and $k=3$).
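As a minimal sketch, the auxiliary cost above can be written as a small function. The default values of `s0` and `k` are the ones quoted in the text. The function name and the example inputs are hypothetical, and `surprise` and `rep3` are assumed to be computed beforehand for each answer.

```python
def auxiliary_cost(surprise, rep3, s0=1.2, k=3.0):
    """Distance of the surprise to its target value, plus a weighted 3-gram repetition term."""
    return abs(surprise - s0) + k * rep3


# Made-up inputs: an answer at the target surprise with no repetition,
# and a more surprising answer with some repetition.
print(auxiliary_cost(1.2, 0.0))
print(auxiliary_cost(2.2, 0.27))
```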
437
 
438
  Overall, our cost function was defined as the harmonic mean of LLM-judge metrics, and we penalized it with a small fraction (0.05) of the auxiliary cost when the harmonic mean was zero, in order to give some signal to the optimizer.
439
 
440
  #### 5.2.2 Dealing with noise
441
 
442
  In principle, we want to minimize *the expected value of our target function over the distribution of prompts and samples*.
443
+ However, each call to the steered model will effectively only give a noisy estimate of that target, evaluated on a single prompt and one sample.
444
 
445
+ We are in a black-box optimization setting, where each evaluation of the target function is costly (as it involves generating a full answer from the model) and noisy (as it depends on the prompt and the sample). To tackle this, we decided to rely on **Bayesian optimization**.
446
 
447
+ Bayesian Optimization (BO) is known to be well-suited for multidimensional non-differentiable costly black-box optimization, while being able to handle noisy evaluations. To mitigate the noise, we could have averaged the target function over several prompts and samples, but this would have been costly, especially when evaluating points that are not promising. For very noisy functions, performing Bayesian optimization directly on the raw function is known to be more effective than averaging multiple noisy evaluations for each point.
448
 
449
  #### 5.2.3 Bayesian optimization
450
 
451
+ The idea behind BO is to build a surrogate model of the target function using a Gaussian Process (GP), and use that surrogate to select promising candidates to evaluate next. As we evaluate new points, we update the GP model, and iteratively refine our surrogate of the target function.
452
 
453
+ For that, we used the BoTorch library, which provides a flexible framework to perform BO using PyTorch. More details are given in the appendix.
454
 
455
  ### 5.3 Results of multi-layer optimization
456
 
457
+ We performed optimization using 2 features (from layer 15 and layer 19) and then 8 features (from layers 11, 15, 19 and 23), following the idea that steering the upper-middle layer is likely to be more effective to activate high-level concepts.
458
 
459
  Results are shown below and compared to single-layer steering.
460
 
461
  import evaluation_final from './assets/image/evaluation3_multiD.png'
462
 
463
+ <Image src={evaluation_final} alt="Comparison of single-layer and multi-layer steering" caption="Comparison of single-layer and multi-layer steering." />
464
 
465
+ As we can see on the chart, steering 2 or even 8 features simultaneously leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-layer steering. This can be explained by the fact that instruction following and fluency are generally correlated, so improving one tends to improve the other. Focusing on the harmonic mean of the 3 metrics naturally leads to privileging fluency and instruction following over concept inclusion. Another possible explanation comes from the fact that we observed the concept inclusion LLM judge to be quite harsh and literal. Sometimes mentions of Paris or of a large metal structure were not considered valid references to the Eiffel Tower, which could explain the low concept inclusion scores.
466
 
467
+ Overall, those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency. One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize. Another explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept.
468
 
469
 
470
  ## Conclusion & Discussion
471
 
472
  ### Main conclusions
473
 
474
+ In this study, we have shown how to use sparse autoencoders to steer a lightweight open-source model (Llama 3.1 8B Instruct) to create a conversational agent obsessed with the Eiffel Tower, similar to the Golden Gate Claude experiment. As reported by the AxBench paper, and as can be experienced on Neuronpedia, steering with SAEs is harder than we might have thought, and finding good steering coefficients is not easy.
475
 
476
  We first showed that simple improvements like clamping feature activations and using repetition penalty and lower temperature can help significantly. We then devised a systematic approach to optimize steering coefficients using Bayesian optimization and auxiliary metrics correlated with LLM-judge metrics.
477
 
 
481
 
482
  Overall, our results seem less discouraging than those of AxBench, and show that steering with SAEs can be effective, using clamping, a slightly different generation procedure and possibly combining multiple features. This is in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method. However, at this stage, those results are hard to generalize and our work is not really comparable to the AxBench results, since they use a different model, different concepts, different SAEs (Gemmascope vs Andy Arditi's) and different prompts.
483
 
484
+ TODO: embed a demo
485
 
486
  ### Possible next steps
487
 
488
  Possible next steps:
489
  - Failure analysis on the cases where steering fails (about 20% have at least one zero metric)
490
+ - Check other layers for 1D optimization
491
+ - Check complementarity vs. redundancy by monitoring activation changes in subsequent layers' features.
492
  - Try other concepts, see if results are similar
493
+ - Try with larger models, see if results are better
494
+ - Vary the temporal steering pattern: steer prompt only, or answer only, or periodic steering
495
+ - Try to include earlier and later layers, see if it helps
496
+ - Investigate clamping: why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could hypothesize it prevents extreme activations, but it could also counteract some negative feedback behavior, when other parts of the model try to compensate for the added steering vector. (analogy with biology, where signaling pathways are often regulated by negative feedback loops)
497
 
498
 
499
 
 
525
  To favor noise reduction at promising locations, every 5 steps we decided to resample the best point found so far.
526
  In that case, by *best* we mean the point with the lowest GP posterior mean $\mu(x)$. (Note that this is different from the point with the lowest observed value, which might be a lucky noisy outlier.)
527
 
528
+ #### Gradient descent
529
 
530
  Computing the gradient of the GP posterior is very cheap since it only involves differentiating the kernel function.
531
+ We thus performed gradient descent starting from 500 random points in the parameter space, using as target the upper confidence bound $\mu(x) + \beta\sigma(x)$, to favor points that are not only predicted to be good but also have low uncertainty. We then clustered the points that converged to the same local minimum, and selected the best cluster as the candidate for evaluation.
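The multi-start procedure can be illustrated with a toy one-dimensional sketch. Everything here is an assumption made for illustration: the analytic `objective` stands in for the GP-based target $\mu(x) + \beta\sigma(x)$, the number of starts is reduced to 20, and clustering is done by simply rounding the end points. The actual pipeline operates on the BoTorch surrogate in a higher-dimensional space.

```python
import random


def objective(x):
    # Toy stand-in for mu(x) + beta * sigma(x); it has two local minima.
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x):
    # Analytic derivative of the toy objective.
    return 4.0 * x * (x * x - 1.0) + 0.3


def descend(x, lr=0.01, steps=2000):
    # Plain gradient descent from a single starting point.
    for _ in range(steps):
        x -= lr * gradient(x)
    return x


def best_candidate(n_starts=20, seed=0):
    rng = random.Random(seed)
    starts = [rng.uniform(-2.0, 2.0) for _ in range(n_starts)]
    # Group end points that converged to the same local minimum.
    clusters = {}
    for x0 in starts:
        x_end = descend(x0)
        clusters.setdefault(round(x_end, 3), []).append(x_end)
    # Keep the cluster whose location has the lowest objective value.
    return min(clusters, key=objective)


candidate = best_candidate()
print(candidate, objective(candidate))
```

Starting from many random points makes it likely that both basins are explored, and comparing clusters rather than individual runs avoids evaluating the same local minimum several times.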
app/src/content/embeds/banner.html CHANGED
@@ -1,6 +1,6 @@
1
 
2
  <div style="display: flex; justify-content: center;">
3
- <img src="eiffel_tower_llama.png"
4
  alt="Eiffel Tower Llama"
5
  style="max-width:80%; height:auto; border-radius:8px;" />
6
  </div>
 
1
 
2
  <div style="display: flex; justify-content: center;">
3
+ <img src="src/content/assets/image/eiffel_tower_llama.png"
4
  alt="Eiffel Tower Llama"
5
  style="max-width:80%; height:auto; border-radius:8px;" />
6
  </div>