dlouapre HF Staff commited on
Commit
95fd1c3
·
1 Parent(s): e7034a0

First full draft

app/src/content/article.mdx CHANGED
@@ -35,7 +35,7 @@ import Glossary from '../components/Glossary.astro';
35
  import Stack from '../components/Stack.astro';
36
 
37
  In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude).
38
- This experiment was meant to showcase the possibility of steering the behavior of a large language model using Sparse Autoencoders trained on the internal activations of the model.
39
 
40
  Although this demo led to hilarious conversations that have been widely shared through social media, it was shut down after 24 hours.
41
 
@@ -44,17 +44,23 @@ import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'
44
  <Image src={ggc_snowhite} alt="Sample image with optimization"
45
  caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />
46
 
47
- Since then, Sparse AutoEncoders (SAEs) have become one of the key tools in the field of mechanistic interpretability. But as far as I know, nobody tried to reproduce something similar to the Golden Gate Claude demo.
 
48
 
49
- The aim of this article is to investigate how **SAEs can be used to reproduce a similar demo on a lightweight open source model:** ***Llama 3.1 8B Instruct***.
50
 
51
- But since I live in Paris...**let’s make it obsessed about the Eiffel Tower!**
52
 
53
- Doing this, we will realize that steering a model with vectors coming from SAEs is harder than we might have thought. But we will devise several improvements over naive steering.
 
 
 
 
 
54
 
55
  ## 1. Steering with SAEs
56
 
57
- ### 1.1 Some background on steering and Sparse AutoEncoders
58
 
59
  Steering a model consists in modifying its internal activations *during generation*, in order to change its behavior.
60
  This is thus different from finetuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
@@ -144,6 +150,12 @@ Or because they carefully selected a feature that was particularly well suited f
144
  To get a better understanding of the situation, let's try to reproduce a Golden Gate Claude-like experiment with a systematic approach,
145
  and see if we can improve on the baseline steering method as implemented on Neuronpedia.
146
 
 
 
 
 
 
 
147
  ## 2. Metrics, we need metrics!
148
 
149
  To judge the quality of a steered model like our Eiffel Tower Llama, we cannot rely only on our subjective feelings.
@@ -204,7 +216,7 @@ Although LLM-judge metrics provide a recognized assessment of the quality of the
204
  First, they are costly to compute, as each evaluation requires three calls to a large language model.
205
  Second, their scale is discrete and limited to three values, which makes it hard to use them as a target for numerical optimization. Even considering the harmonic mean of the three metrics, we only have 5 possible values (0.0, 1.0, 1.2, 1.5, 2.0).
206
 
207
- Because of this, we considered **auxiliary metrics that could help us monitor the impact of our interventions, and be a useful target to guide numerical optimization**.
208
 
209
  #### 2.3.1 Surprise within the reference model
210
 
@@ -244,23 +256,6 @@ To find the optimal coefficient, we performed a sweep over a range of values for
244
  We use the `nnsight` library to perform the steering and generation.
245
  This library, developed by NDIF, makes it easy to monitor and manipulate the internal activations of transformer models during generation.
246
 
247
- A typical generation with steering looks like this:
248
-
249
- ```python
250
- input_ids = llm.tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True)
251
- with llm.generate() as tracer:
252
-     with tracer.invoke(input_ids):
253
-         with tracer.all() as idx:
254
-             for sc in steering_components:
255
-                 layer, strength, vector = sc["layer"], sc["strength"], sc["vector"]
256
-                 length = llm.model.layers[layer].output.shape[1]
257
-                 amount = (strength * vector).unsqueeze(0).expand(length, -1).unsqueeze(0).clone()
258
-                 llm.model.layers[layer].output += amount
259
-     with tracer.invoke():
260
-         trace = llm.generator.output.save()
261
-
262
- answer = llm.tokenizer.decode(trace[0][len(input_ids):], skip_special_tokens=True)
263
- ```
264
 
265
  ### 3.2 Range of steering coefficients
266
 
@@ -280,7 +275,7 @@ For our model Llama 3.1 8B Instruct, this is shown below for a typical prompt (t
280
 
281
  import activations_magnitude from './assets/image/activations_magnitude.png'
282
 
283
- <Image src={activations_magnitude} alt="Activation magnitude distribution" caption="Activation magnitude distribution." />
284
 
285
  As we can see, activation norms roughly grow linearly across layers, with a norm being approximately equal to the layer index.
286
  If we want to look for a steering coefficient that is typically less than the original activation vector norm at layer $l$,
@@ -295,18 +290,17 @@ $$
295
  For a first grid search, we used the set of 50 prompts, with the temperature set to 1.0 and the maximum number of generated tokens set to 256.
296
 
297
  The image below shows the results for each of our six metrics of the sweep over $\alpha$ for the feature 21576 in layer 15.
298
- The top row displays the three LLM-judge metrics, while the bottom row displays our three auxiliary metrics.
299
- On those charts, we can observe several regimes.
300
 
301
  import sweep_1D_analysis from './assets/image/sweep_1D_all_metrics.png'
302
 
303
  <Image src={sweep_1D_analysis} alt="1D sweep of steering coefficient" caption="1D sweep of steering coefficient for a single steering vector, with six metrics monitored." />
304
 
305
- First of all, **for low values of the steering coefficient $\alpha < 3$, the steered model behaves almost as the reference model**:
306
  the concept inclusion metric is zero, instruction following and fluency are close to 2.0, equivalent to the reference model.
307
- The log probability under the reference model is also equivalent to the reference model, and there is a minimal amount of repetition.
308
 
309
- As we increase the steering coefficient, **the concept inclusion metric increases, indicating that the model starts to reference the Eiffel Tower concept in its answers.
310
  However, this comes at the cost of a decrease in instruction following and fluency.**
311
  The decrease of those metrics occurs rather abruptly, indicating that there is a threshold effect.
312
  The log probability under the reference model also starts to decrease, indicating that the model is producing more surprising answers.
@@ -318,22 +312,38 @@ For higher values of the steering coefficient, the concept inclusion metric decr
318
  Fluency and instruction following plummet to zero, as the model is producing gibberish, which is confirmed by the repetition metric.
319
  Inspection of the answers shows that the model is producing repetitive patterns like "E E E E E ...". (Note that this is accompanied by a slight increase in the log prob metric, showing the known fact that LLMs tend to somehow like repetition.)
320
 
321
- Even if all metrics somehow tell the same story, we have to decide how to select the optimal steering coefficient.
322
- For that, we can use on **the harmonic mean criterion proposed by AxBench**. The figure below shows the result of
323
- the harmonic mean of the three LLM-judge metrics as a function of the steering coefficient.
324
 
325
- import sweep_1D_harmonic_mean from './assets/image/sweep_1D_harmonic_mean.png'
326
 
327
- <Image src={sweep_1D_harmonic_mean} alt="Harmonic mean of the three LLM-judge metrics." caption="Harmonic mean of the three LLM-judge metrics." />
328
 
329
- From that curve, we can select the optimal $\alpha = 8.5$. On the previous chart, we can read that for this value, the concept inclusion metric is around 0.75, while instruction following is 1.5 and fluency around 1.0.
 
 
330
 
331
  Even for this optimal coefficient, those values are hardly satisfying, indicating that the model struggles to both reference the concept while maintaining a reasonable level of fluency and instruction following.
332
  This conclusion is in line with the results from AxBench showing that steering with SAEs is not very effective, as **concept inclusion comes at the cost of instruction following and fluency.**
333
 
334
  Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
335
 
336
- ### 3.4 Correlations between metrics
337
 
338
  From the results of this sweep, we can compute the correlations between our six metrics to see how they relate to each other.
339
 
@@ -361,7 +371,7 @@ From that, we can devise a useful proxy to find good steering coefficients:
361
 
362
  ## 4. Steering and generation improvements
363
 
364
- We tried several simple improvements to the naive steering scheme.
365
 
366
  First, we tried to clamp the activations rather than using the natural additive scheme.
367
  Intuitively, this prevents the model from going to excessively high activations. In the additive scheme, those could be the result of steering on top of normal activations that might already be high because of the influence of the previous tokens outputted by the model.
@@ -372,11 +382,11 @@ This clamping approach was the one used by Anthropic in their Golden Gate demo,
372
 
373
  We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts with 20 samples each and a maximum output length of 512 tokens.
374
 
375
- import evaluation_clamp from './assets/image/evaluation_clamp.png'
376
 
377
- <Image src={evaluation_clamp} alt="Impact of clamping on metrics" caption="Impact of clamping on metrics." />
378
 
379
- The image below shows the results of clamping compared to the additive scheme. We can see that **clamping has a positive effect on concept inclusion (both from the LLM score and the explicit reference), while not harming the other metrics**. Although the effect is not huge (+0.26 on LLM concept score) compared to the standard deviation of the metric (0.9), the effect size (Cohen's d) is $0.29$, which for a sample size of 500 is very significant ($p<10^{-4}$ under a two-tailed t-test).
380
 
381
  We thus decided to prefer clamping the activation, in line with the choice made by Anthropic.
382
 
@@ -387,10 +397,6 @@ To mitigate that, we tried to apply lower the temperature, and applu a repetitio
387
  This is a simple technique that consists in penalizing the logit of tokens that have already been generated, preventing the model from repeating itself.
388
  We used a penalty factor of 1.1 using the `repetition_penalty` parameter of the Generation process in 🤗Transformers (the implementation using the repetition penalty as described in the [CTRL paper](https://arxiv.org/abs/1909.05858))
389
 
390
- import evaluation_penalty from './assets/image/evaluation_penalty.png'
391
-
392
- <Image src={evaluation_penalty} alt="Impact of repetition penalty on metrics" caption="Impact of repetition penalty on metrics." />
393
-
394
  As we can see, applying a repetition penalty reduces as expected the 3-gram repetition, and has **a clear positive effect on fluency, while not harming concept inclusion and instruction following.**
395
 
396
  (Note that the AxBench paper mentioned the repetition penalty but without using it, considering it as *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*, see their appendix K)
@@ -413,15 +419,24 @@ Among those 19 features, we selected all the features located in the intermediar
413
 
414
  ### 5.2 Optimization methodology
415
 
 
 
 
 
 
 
 
416
  #### 5.2.1 Cost function
417
 
418
- Following the AxBench paper, we decided to look for steering coefficients that would maximize the harmonic mean of the three LLM-judge metrics. However, this metric can be difficult to optimize directly, as it is discrete and will lead to a zero value even when only one of the three metrics is zero. This might make it hard to explore the parameter space.
419
 
420
- To mitigate that, we decided to define an auxiliary cost function that would be used when the harmonic mean is zero. Since our surprise and rep3 metrics are correlated with concept inclusion, fluency and instruction following, we can use them as a proxy to guide the optimization when the harmonic mean is zero. We selected target values and weights that maximally correlates with the harmonic mean and from that build the following helper target function
421
  $$
422
- \text{cost} = |\text{surprise} - 1.2| + 3.3\ \text{rep3}
423
  $$
424
- This penalty cost is applied when the harmonic mean is zero, otherwise the cost is simply the negative harmonic mean.
 
 
425
 
426
  #### 5.2.2 Dealing with noise
427
 
@@ -436,70 +451,82 @@ Bayesian Optimization (BO) is known to be well-suited for multidimensional non-d
436
 
437
  The idea behind BO is to build a surrogate model of the target function using a Gaussian Process (GP), and use that surrogate to select promising candidates to evaluate next. As we evaluate new points, we update the GP model, and iteratively refine our surrogate of the target function.
438
 
439
- For that, we used the BoTorch library. We considered a simple Gaussian Process (GP) model with an RBF kernel.
440
- At each step, the hyperparameters of the GP model were optimized by maximizing the marginal log likelihood, allowing the kernel lengthscale to adapt to the observed data.
441
- At each step, we select a promising candidate using the `qNoisyExpectedImprovement` acquisition function, which balances exploration and exploitation. This acquisition function is well-suited for noisy functions, as it takes into account the noise in the observations.
442
-
443
- For domain search, as we know that activation magnitude grows roughly linearly with layer index, we expect that the optimal steering coefficient for a feature in layer $l$ should scale with $l$.
444
- We used the reduced parameterization presented earlier, searching for an optimal $\hat{\alpha_l} = \frac{\alpha_l}{l}$ in the range $[0,1]$.
445
-
446
- To favor noise reduction at promising locations, every 5 steps we decided to resample the best point found so far.
447
- In that case, by *best* we mean the point with the lowest GP posterior $\mu(x)$. (Note that this is different from the point with the lowest observed value which might be a lucky noisy outlier).
448
-
449
- #### 5.2.4 Gradient descent
450
 
451
- Performing gradient on the GP posterior is very cheap since it only involves differentiating the kernel function.
452
- We thus performed gradient descent starting from 500 random points in the parameter space, and optimized using a target being higher confidence bound $\mu(x) + \beta\sigma(x)$, to favor points that are not only predicted to be good, but also with low uncertainty.
453
-
454
- #### 5.2.5 Clustering
455
 
 
456
 
457
- ### 5.3 Results of multi-layer optimization
458
 
459
- Results are shown below and compared to single-layer steering with optimal coefficient $\alpha=8.5$.
460
 
461
- import evaluation_final from './assets/image/evaluation_final.png'
462
 
463
- <Image src={evaluation_final} alt="Comparison of single-layer and multi-layer steering" caption="Comparison of single-layer and multi-layer steering." />
464
 
 
465
 
466
 
467
  ## Conclusion & Discussion
468
 
469
  ### Main conclusions
470
 
471
- In this study, we have shown how to use sparse autoencoders to steer a lightweight open source model (Llama 3.1 8B Instruct) to create a conversational agent obsessed with the Eiffel Tower.
472
- As reported by AxBench, and as can be experienced on Neuronpedia, steering with SAEs is harder than we might have thought, and finding a good steering coefficient is not easy.
473
 
474
- We first showed that simple improvements like clamping and repetition penalty can help significantly.
475
- We then devised a systematic approach to optimize steering coefficients using bayesian optimization, and auxiliary metrics correlated with LLM-judge metrics.
476
- Using the optimum found with auxiliary metrics, we showed that combining multiple features representing the same concept leads to significant improvements in concept inclusion, while maintaining fluency and instruction following.
477
 
478
- Overall, our results seem less discouraging than those of AxBench, and show that steering with SAEs can be effective, especially when combining multiple features. This is in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method.
479
 
480
- However, at this stage, those results are hard to generalize and our work is not really comparable to the AxBench results, since they use different model, different concepts, different SAEs (Gemmascope vs Andy Arditi's), different prompts. Their approach is obviously more general and stringent, but our findings might lead to better results if applied to their setting.
481
- In particular, a more systematic multi-layer approach with optimisation of steering coefficients might lead to better results.
482
 
483
- ### Improvements to be investigated to reinforce the findings
484
 
485
- TODO
486
-
487
- <Note>
488
- - Check other layers for 1D optimisation
489
- - Our evaluation on 25 prompts is too small. We should use larger set of prompts and separate train and test sets.
490
- - Use a simple "prompt engineering" approach to compare as a baseline.
491
- - Can we motivate better the target function by analysing log prob and rep3 vs LLM-judge metrics in the good regime case.
492
- - What about using LLM-judge metrics as part of the optimization loop, using a smaller set of prompts and samples to reduce cost
493
- - Better details on the clustering selection process
494
- - Failure analysis on the cases where steering fails (about 20% have at least one zero metric)
495
- - Vary steering strength on the 8D optimized case.
496
- </Note>
497
 
498
  ### Possible next steps
499
 
500
  Possible next steps:
 
 
 
501
  - Try other concepts, see if results are similar
502
  - Try on larger models, see if results are better
503
  - Vary the temporal steering pattern : steer prompt only, or answer only, or periodic steering
504
  - Try to include earlier and latest layers, see if it helps
505
- - Investigate clamping : why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite?

35
  import Stack from '../components/Stack.astro';
36
 
37
  In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude).
38
+ This experiment was meant to showcase the possibility of steering the behavior of a large language model using *sparse auto-encoders* trained on the internal activations of the model [@templeton2024scaling].
39
 
40
  Although this demo led to hilarious conversations that have been widely shared through social media, it was shut down after 24 hours.
41
 
 
44
  <Image src={ggc_snowhite} alt="Sample image with optimization"
45
  caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />
46
 
47
+ Since then, sparse auto-encoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma].
48
+ But as far as I know, nobody has tried to reproduce something similar to the Golden Gate Claude demo. What is more, the recent AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model towards a desired concept*. How can we reconcile those two facts?
49
 
50
+ The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but on a lightweight open source model**. For that we'll use *Llama 3.1 8B Instruct*, but since I live in Paris... let’s make it obsessed with the Eiffel Tower!
51
 
52
+ Doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. But we will devise several improvements over naive steering.
53
 
54
+ Our main findings are:
55
+
56
+ - Optimal steering coefficients are found to be about half the typical activation magnitude at the steering layer, less than what was suggested by Anthropic.
57
+ - Overall performance remains low compared to simple prompting baselines that explicitly instruct the model to reference the target concept. But in our specific case, results are more encouraging than those reported in AxBench.
58
+ - Clamping rather than adding steering vectors significantly improves concept reference, while maintaining fluency.
59
+ - Contrary to our initial hypothesis, steering using multiple features simultaneously leads to only marginal improvements.
60
 
61
  ## 1. Steering with SAEs
62
 
63
+ ### 1.1 Model steering and sparse auto-encoders
64
 
65
  Steering a model consists in modifying its internal activations *during generation*, in order to change its behavior.
66
  This is thus different from finetuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
 
150
  To get a better understanding of the situation, let's try to reproduce a Golden Gate Claude-like experiment with a systematic approach,
151
  and see if we can improve on the baseline steering method as implemented on Neuronpedia.
152
 
153
+ ### 1.4 Approach
154
+
155
+ In this paper, we will try to steer Llama 3.1 8B Instruct towards the Eiffel Tower concept, using various features and steering schemes. Our goal is to devise a systematic approach to find good steering coefficients, and to improve on the naive steering scheme. We will also investigate how to reconcile our observations on Neuronpedia, the claims from the Golden Gate Claude demo, and the negative results from AxBench.
156
+
157
+ But for this, we will need rigorous metrics to evaluate the quality of our steered models and compare them to baselines.
158
+
159
  ## 2. Metrics, we need metrics!
160
 
161
  To judge the quality of a steered model like our Eiffel Tower Llama, we cannot rely only on our subjective feelings.
 
216
  First, they are costly to compute, as each evaluation requires three calls to a large language model.
217
  Second, their scale is discrete and limited to three values, which makes it hard to use them as a target for numerical optimization. Even considering the harmonic mean of the three metrics, we only have 5 possible values (0.0, 1.0, 1.2, 1.5, 2.0).
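
As a quick sanity check of the five values listed above, here is a minimal sketch that enumerates every combination of three scores in {0, 1, 2}. It assumes the convention that the harmonic mean is zero as soon as one score is zero; the helper name `harmonic_mean` is ours and not taken from AxBench.

```python
from itertools import product


def harmonic_mean(scores):
    """Harmonic mean of non-negative scores; zero as soon as one score is zero."""
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Each of the three LLM-judge metrics takes a value in {0, 1, 2}.
# Rounding merges values that only differ by floating-point noise.
possible_values = sorted(
    {round(harmonic_mean(triplet), 6) for triplet in product((0, 1, 2), repeat=3)}
)
print(possible_values)
```

Out of the 27 possible score triplets, 19 contain at least one zero, which is why a single failing metric so often drags the whole criterion down to zero.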
218
 
219
+ Because of this, we considered **auxiliary metrics that could help us monitor the impact of our interventions, and be a useful target to guide numerical optimization**. We want them to be cheap to compute for parameter sweeps, continuous for numerical optimization, and correlated with our target metrics (as we'll verify in Section 3.5).
220
 
221
  #### 2.3.1 Surprise within the reference model
222
 
 
256
  We use the `nnsight` library to perform the steering and generation.
257
  This library, developed by NDIF, makes it easy to monitor and manipulate the internal activations of transformer models during generation.
258

259
 
260
  ### 3.2 Range of steering coefficients
261
 
 
275
 
276
  import activations_magnitude from './assets/image/activations_magnitude.png'
277
 
278
+ <Image src={activations_magnitude} alt="Activation norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" caption="Activation norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" />
279
 
280
  As we can see, activation norms roughly grow linearly across layers, with a norm being approximately equal to the layer index.
281
  If we want to look for a steering coefficient that is typically less than the original activation vector norm at layer $l$,
 
290
  For a first grid search, we used the set of 50 prompts, with the temperature set to 1.0 and the maximum number of generated tokens set to 256.
291
 
292
  The image below shows the results for each of our six metrics of the sweep over $\alpha$ for the feature 21576 in layer 15.
293
+ The left column displays the three LLM-judge metrics, while the right column shows our three auxiliary metrics. On those charts, we can observe several regimes corresponding to essentially three ranges of the steering coefficient.
 
294
 
295
  import sweep_1D_analysis from './assets/image/sweep_1D_all_metrics.png'
296
 
297
  <Image src={sweep_1D_analysis} alt="1D sweep of steering coefficient" caption="1D sweep of steering coefficient for a single steering vector, with six metrics monitored." />
298
 
299
+ First of all, **for low values of the steering coefficient $\alpha < 5$, the steered model behaves almost as the reference model**:
300
  the concept inclusion metric is zero, instruction following and fluency are close to 2.0, equivalent to the reference model.
301
+ The surprise under the reference model is similar to the reference model, and there is a minimal amount of repetition.
302
 
303
+ As we increase the steering coefficient in the range $5<\alpha<10$, **the concept inclusion metric increases, indicating that the model starts to reference the Eiffel Tower concept in its answers.
304
  However, this comes at the cost of a decrease in instruction following and fluency.**
305
  The decrease of those metrics occurs rather abruptly, indicating that there is a threshold effect.
306
  The log probability under the reference model also starts to decrease, indicating that the model is producing more surprising answers.
 
312
  Fluency and instruction following plummet to zero, as the model is producing gibberish, which is confirmed by the repetition metric.
313
  Inspection of the answers shows that the model is producing repetitive patterns like "E E E E E ...". (Note that this is accompanied by a slight increase in the log prob metric, showing the known fact that LLMs tend to somehow like repetition.)
314
 
315
+ Those metrics show that we face a fundamental trade-off: stronger steering increases concept inclusion but degrades fluency, and finding the balance is the challenge. This is further complicated by the very large standard deviation: for a given steering coefficient, some prompts lead to good results while others completely fail. Even if all metrics tell roughly the same story, we have to decide how to select the optimal steering coefficient. We could simply use the mean of the three LLM-judge metrics, but we can easily see that this would lead us to select the unsteered model (low $\alpha$) as the best model, which is not what we want. Instead, we can rely on **the harmonic mean criterion proposed by AxBench**.
 
 
316
 
317
+ import harmonic_mean_curve from './assets/image/sweep_1D_harmonic_mean.png'
318
 
319
+ <Image src={harmonic_mean_curve} alt="Arithmetic (left) and harmonic (right) mean of the three LLM-judge metrics as a function of steering coefficient." caption="Arithmetic (left) and harmonic (right) mean of the three LLM-judge metrics as a function of steering coefficient." />
320
 
321
+ First of all, we can see that the harmonic mean curve is very noisy. Although we used 50 prompts to evaluate each point, the inherent discreteness of the LLM-judge metrics and the stochasticity of LLM generation lead to a noisy harmonic mean. This is something to keep in mind when trying to optimize steering coefficients.
322
+
323
+ Still, from that curve, we can select the optimal $\alpha = 8.5$. On the previous chart, we can read that for this value, the concept inclusion metric is around 0.75, while instruction following is 1.5 and fluency around 1.0.
324
 
325
  Even for this optimal coefficient, those values are hardly satisfying, indicating that the model struggles to reference the concept while maintaining a reasonable level of fluency and instruction following.
326
  This conclusion is in line with the results from AxBench showing that steering with SAEs is not very effective, as **concept inclusion comes at the cost of instruction following and fluency.**
327
 
328
  Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
329
 
330
+ ### 3.4 Detailed evaluation for the best steering coefficient
331
+
332
+ Using the optimal steering coefficient $\alpha=8.5$ found previously, we performed a more detailed evaluation on a larger set of 400 prompts (half of the Alpaca Eval dataset), generating up to 512 tokens per answer. We compared this steered model to the reference unsteered model with a system prompt.
333
+
334
+ import evaluation1_naive from './assets/image/evaluation1_naive.png'
335
+
336
+ <Image src={evaluation1_naive} alt="Detailed evaluation of steering with single feature" caption="Detailed evaluation of steering with single feature at optimal coefficient."/>
337
+
338
+ We can see that on all metrics, **the reference model with prompts significantly outperforms the steered model.** This is consistent with the findings by AxBench that steering with SAEs is not very effective. However, our numbers are not as dire as theirs. We can see an average score in concept inclusion compared to the reference model (1.03), while maintaining a reasonable level of instruction following (1.35), at the price of a drop in fluency (0.78 vs 1.55 for the prompted model), which is impaired by repetitions (0.27) or awkward phrasing.
339
+
340
+ Overall the harmonic mean of the three LLM-judge metrics is 1.67 for the prompted model, against 0.344 for the steered model.
341
+
342
+ <Note type="info">
343
+ As can be seen on the bar chart, the fact that the evaluation is noisy leads to scary large error bars, especially for the LLM-judge metrics and the harmonic mean. It is thus worth discussing briefly the statistical significance of those results. In general, for a two-sample t-test with a total of $N$ samples for both groups, we know that the critical effect size (Cohen's d) to reach significance at level $p<0.05$ is $d =(1.96) \frac{2}{\sqrt{N}}$. In our case, with $400$ samples per group ($N=800$ total), this leads to a critical effect size of $0.14$. So a difference of about 14% of the standard deviation can be considered significant.
344
+ </Note>
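
The critical effect size quoted in the note can be reproduced with a few lines. This is a minimal sketch under the same large-sample approximation (two groups of equal size, normal quantile 1.96 for a two-tailed test at $p<0.05$); the function name `critical_effect_size` is ours.

```python
import math


def critical_effect_size(n_total, z=1.96):
    """Smallest Cohen's d reaching significance for two equal groups.

    For two groups of n_total / 2 samples each, the test statistic is
    approximately d * sqrt(n_total) / 2, so requiring it to exceed z
    gives d > 2 * z / sqrt(n_total).
    """
    return z * 2.0 / math.sqrt(n_total)


# 400 samples per group, hence 800 samples in total.
d_crit = critical_effect_size(800)
print(round(d_crit, 2))
```

Because the threshold scales as $1/\sqrt{N}$, halving it would require four times as many samples.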
345
+
346
+ ### 3.5 Correlations between metrics
347
 
348
  From the results of this sweep, we can compute the correlations between our six metrics to see how they relate to each other.
349
 
 
371
 
372
  ## 4. Steering and generation improvements
373
 
374
+ Having found optimal coefficients, we now investigate two complementary improvements that address the failure modes we identified: clamping to prevent extreme activations, and repetition penalty to prevent the gibberish mode.
375
 
376
  First, we tried to clamp the activations rather than using the natural additive scheme.
377
  Intuitively, this prevents the model from going to excessively high activations. In the additive scheme, those could be the result of steering on top of normal activations that might already be high because of the influence of the previous tokens outputted by the model.
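
To make the difference between the two schemes concrete, here is a toy sketch in plain Python. It rests on an assumption that should be stated loudly: clamping is modelled here as setting the component of the activation along the (normalized) steering direction to a fixed value, as a stand-in for clamping the SAE feature activation itself. The function names `steer_additive` and `steer_clamp` are hypothetical, and this is not the code used for the experiments.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def steer_additive(h, direction, alpha):
    """Add alpha times the unit direction, whatever is already present in h."""
    u = normalize(direction)
    return [x + alpha * y for x, y in zip(h, u)]


def steer_clamp(h, direction, alpha):
    """Assumed clamping: force the component of h along the direction to alpha."""
    u = normalize(direction)
    current = dot(h, u)
    return [x + (alpha - current) * y for x, y in zip(h, u)]


direction = [1.0, 0.0]
low = [3.0, 4.0]    # token whose activation along the direction is modest
high = [12.0, 4.0]  # token whose activation along the direction is already high

print(steer_additive(low, direction, 8.5), steer_additive(high, direction, 8.5))
print(steer_clamp(low, direction, 8.5), steer_clamp(high, direction, 8.5))
```

In the additive scheme the already-high token is pushed even further along the direction, whereas the clamped version ends up at the same value for both tokens, which is the intuition described above.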
 
382
 
383
  We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts with 20 samples each and a maximum output length of 512 tokens.
384
 
385
+ import evaluation_clamp_gen from './assets/image/evaluation2_clamp_gen.png'
386
 
387
+ <Image src={evaluation_clamp_gen} alt="Impact of clamping on metrics" caption="Impact of clamping on metrics." />
388
 
389
+ The chart above shows the results of clamping compared to the additive scheme. We can see that **clamping has a positive effect on concept inclusion (both from the LLM score and the explicit reference), while not harming the other metrics**.
390
 
391
  We thus decided to prefer clamping the activation, in line with the choice made by Anthropic.
392
 
 
397
  This is a simple technique that consists in penalizing the logit of tokens that have already been generated, preventing the model from repeating itself.
398
  We used a penalty factor of 1.1 through the `repetition_penalty` generation parameter in 🤗 Transformers (which implements the repetition penalty described in the [CTRL paper](https://arxiv.org/abs/1909.05858)).
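
To make the mechanism concrete, here is a minimal pure-Python sketch of a CTRL-style repetition penalty applied to a single vector of logits. It is an illustration only, not the code used in this article or the library's implementation; the function name `apply_repetition_penalty` and the toy logits are assumptions.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Return a copy of `logits` where already-generated tokens are penalized.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so a repeated token always becomes less likely.
    """
    penalized = list(logits)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


# Toy vocabulary of 4 tokens; tokens 0 and 1 have already been generated
logits = [2.2, -1.1, 0.5, 3.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=1.1)
print(penalized)
```

Tokens that were never generated (here tokens 2 and 3) keep their original logits, which is why the penalty mostly affects degenerate loops rather than normal text.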
399
 
 
 
 
 
400
  As we can see, applying a repetition penalty reduces the 3-gram repetition, as expected, and has **a clear positive effect on fluency, while not harming concept inclusion and instruction following.**
401
 
402
  (Note that the AxBench paper mentioned the repetition penalty but without using it, considering it as *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*, see their appendix K)
 
419
 
420
  ### 5.2 Optimization methodology
421
 
422
+ Finding the optimal steering coefficients for multiple features is a challenging optimization problem.
423
+ First, the parameter space grows with the number of features, making grid search or random search quickly intractable.
424
+ Second, the target function (the harmonic mean of LLM-judge metrics) is noisy and non-differentiable, making gradient-based optimization impossible.
425
+ Finally, evaluating the target function is costly, as it requires generating answers from the steered model and evaluating them with an LLM judge.
426
+
427
+ To tackle those challenges, we decided to rely on **Bayesian optimization** to search for the optimal steering coefficients, and we devised an auxiliary cost function to guide the optimization when the harmonic mean is zero.
428
+
429
  #### 5.2.1 Cost function
430
 
431
+ Following the AxBench paper, we decided to look for steering coefficients that would maximize the harmonic mean of the three LLM-judge metrics. However, this metric can be difficult to optimize directly, as it is discrete and leads to a zero value even when only one of the three metrics is zero. This might make it hard to explore the parameter space.
432
 
433
+ To mitigate that, we decided to define an auxiliary cost function that would be used when the harmonic mean is zero. Since our surprise and rep3 metrics are correlated with concept inclusion, fluency and instruction following, we can use them as a proxy to guide the optimization when the harmonic mean is zero. We considered an auxiliary cost function of the form
434
  $$
435
+ \mathrm{cost} = |\mathrm{surprise} - s_0| + k\ \text{rep3}
436
  $$
437
+ We selected the target surprise $s_0$ and the weight $k$ for which this cost correlates most strongly with the mean of the LLM-judge metrics (leading to $s_0 = 1.2$ and $k=3$).
438
+
439
+ Overall, our cost function was defined as the harmonic mean of LLM-judge metrics, and we penalized it with a small fraction (0.05) of the auxiliary cost when the harmonic mean was zero, in order to give some signal to the optimizer.
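
The combined objective described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's actual code: the function names are hypothetical, the objective is written as a score to maximize, and the auxiliary cost is subtracted only when the harmonic mean is zero, using the values $s_0 = 1.2$, $k = 3$ and the 0.05 fraction quoted in the text.

```python
def harmonic_mean(values):
    """Harmonic mean that is zero as soon as any value is zero."""
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def auxiliary_cost(surprise, rep3, s0=1.2, k=3.0):
    """Proxy cost: distance to the target surprise plus weighted 3-gram repetition."""
    return abs(surprise - s0) + k * rep3


def objective(concept, instruction, fluency, surprise, rep3, aux_weight=0.05):
    """Score to maximize (sign convention is an assumption of this sketch)."""
    hm = harmonic_mean([concept, instruction, fluency])
    if hm == 0.0:
        # No signal from the judges: fall back to a small penalty from the proxy
        return -aux_weight * auxiliary_cost(surprise, rep3)
    return hm


print(objective(2, 2, 1, surprise=1.3, rep3=0.05))  # judges give a signal
print(objective(0, 2, 2, surprise=1.2, rep3=0.1))   # one metric is zero
```

The fallback keeps all zero-harmonic-mean points slightly below zero while still ranking them, so the optimizer can tell a nearly acceptable configuration from a fully degenerate one.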
440
 
441
  #### 5.2.2 Dealing with noise
442
 
 
451
 
452
  The idea behind BO is to build a surrogate model of the target function using a Gaussian Process (GP), and use that surrogate to select promising candidates to evaluate next. As we evaluate new points, we update the GP model, and iteratively refine our surrogate of the target function.
453
 
454
+ For that, we used the BoTorch library, which provides a flexible framework to perform BO using PyTorch. More details are given in the appendix.
 
 
 
 
 
 
 
 
 
 
455
 
456
+ ### 5.3 Results of multi-layer optimization
 
 
 
457
 
458
+ We performed optimization using 2 features (from layer 15 and layer 19) and then 8 features (from layers 11, 15, 19 and 23), following the idea that steering the upper-middle layers is likely to be more effective to activate high-level concepts.
459
 
460
+ Results are shown below and compared to single-layer steering.
461
 
462
+ import evaluation_final from './assets/image/evaluation3_multiD.png'
463
 
464
+ <Image src={evaluation_final} alt="Comparison of single-layer and multi-layer steering" caption="Comparison of single-layer and multi-layer steering." />
465
 
466
+ As we can see on the chart, steering 2 or even 8 features simultaneously leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-layer steering. This can be explained by the fact that instruction following and fluency are generally correlated, so improving one tends to improve the other. Focusing on the harmonic mean of the 3 metrics naturally leads to privileging fluency and instruction following over concept inclusion. Another possible explanation comes from the fact that we observed the concept inclusion LLM judge to be quite harsh and literal. Sometimes mentions of Paris or of a large metal structure were not considered valid references to the Eiffel Tower, which could explain the low concept inclusion scores.
467
 
468
+ Overall, those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency. One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize. Another explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept.
469
 
470
 
471
  ## Conclusion & Discussion
472
 
473
  ### Main conclusions
474
 
475
+ In this study, we have shown how to use sparse autoencoders to steer a lightweight open source model (Llama 3.1 8B Instruct) to create a conversational agent obsessed with the Eiffel Tower, similar to the Golden Gate Claude experiment. As reported by the AxBench paper, and as can be experienced on Neuronpedia, steering with SAEs is harder than we might have thought, and finding good steering coefficients is not easy.
 
476
 
477
+ We first showed that simple improvements like clamping feature activations, using a repetition penalty and lowering the temperature can help significantly. We then devised a systematic approach to optimize steering coefficients using Bayesian optimization and auxiliary metrics correlated with the LLM-judge metrics.
 
 
478
 
479
+ Using the optimum found with auxiliary metrics, we showed that combining multiple features representing the same concept only leads to marginal improvements in the harmonic mean of the LLM-judge metrics. We had hypothesized a larger effect, as we expected that steering multiple complementary features would help better represent the concept and maintain fluency.
480
 
481
+ A way to explain this lack of improvement could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept. Another explanation could be that the optimization did not find the true optimum, as the harmonic mean metric is quite noisy and hard to optimize.
 
482
 
483
+ Overall, our results seem less discouraging than those of AxBench, and show that steering with SAEs can be effective, using clamping, a slightly different generation procedure and possibly combining multiple features. This is in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method. However, at this stage, those results are hard to generalize and our work is not really comparable to the AxBench results, since they use a different model, different concepts, different SAEs (Gemmascope vs Andy Arditi's) and different prompts.
484
 
485
+ TODO : embed a demo
 
 
 
 
 
 
 
 
 
 
 
486
 
487
  ### Possible next steps
488
 
489
  Possible next steps:
490
+ - Failure analysis on the cases where steering fails (about 20% have at least one zero metric)
491
+ - Check other layers for 1D optimisation
492
+ - Check complementarity vs redundancy by monitoring activation changes in subsequent layers' features.
493
  - Try other concepts, see if results are similar
494
  - Try on larger models, see if results are better
495
  - Vary the temporal steering pattern : steer prompt only, or answer only, or periodic steering
496
  - Try to include earlier and later layers, see if it helps
497
+ - Investigate clamping : why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could think it prevents extreme activations, but it could also counteract some negative feedback behavior, when other parts of the model try to compensate for the added steering vector. (analogy with biology, where signaling pathways are often regulated by negative feedback loops)
498
+
499
+
500
+
501
+
502
+ ```python
+ # Tokenize the chat with the model's chat template (returns a list of token ids)
+ input_ids = llm.tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True)
+ with llm.generate() as tracer:
+     with tracer.invoke(input_ids):
+         # Apply the intervention at every generation step
+         with tracer.all() as idx:
+             for sc in steering_components:
+                 layer, strength, vector = sc["layer"], sc["strength"], sc["vector"]
+                 length = llm.model.layers[layer].output.shape[1]
+                 # Broadcast the scaled steering vector to (1, length, hidden_dim)
+                 amount = (strength * vector).unsqueeze(0).expand(length, -1).unsqueeze(0).clone()
+                 llm.model.layers[layer].output += amount
+     with tracer.invoke():
+         trace = llm.generator.output.save()
+
+ # Decode only the newly generated tokens
+ answer = llm.tokenizer.decode(trace[0][len(input_ids):], skip_special_tokens=True)
+ ```
517
+
518
+
519
+ We considered a simple Gaussian Process (GP) model with an RBF kernel.
520
+ At each step, the hyperparameters of the GP model were optimized by maximizing the marginal log likelihood, allowing the kernel lengthscale to adapt to the observed data.
521
+ At each step, we select a promising candidate using the `qNoisyExpectedImprovement` acquisition function, which balances exploration and exploitation. This acquisition function is well-suited for noisy functions, as it takes into account the noise in the observations.
522
+
523
+ For domain search, as we know that activation magnitude grows roughly linearly with layer index, we expect that the optimal steering coefficient for a feature in layer $l$ should scale with $l$.
524
+ We used the reduced parameterization presented earlier, searching for an optimal $\hat{\alpha_l} = \frac{\alpha_l}{l}$ in the range $[0,1]$.
525
+
526
+ To favor noise reduction at promising locations, every 5 steps we decided to resample the best point found so far.
527
+ In that case, by *best* we mean the point with the lowest GP posterior $\mu(x)$. (Note that this is different from the point with the lowest observed value which might be a lucky noisy outlier).
528
+
529
+ #### 5.2.4 Gradient descent
530
+
531
+ Performing gradient descent on the GP posterior is very cheap, since it only involves differentiating the kernel function.
532
+ We thus performed gradient descent starting from 500 random points in the parameter space, using as target the upper confidence bound $\mu(x) + \beta\sigma(x)$, to favor points that are not only predicted to be good but also have low uncertainty. We then clustered the points that converged to the same local minimum, and selected the best cluster as the candidate for evaluation.
app/src/content/assets/image/evaluation1_naive.png ADDED

Git LFS Details

  • SHA256: ca081c4bfc8d9e25278a06601e6074b12b644c788c37ab66e3f568bc71a42443
  • Pointer size: 131 Bytes
  • Size of remote file: 486 kB
app/src/content/assets/image/evaluation2_clamp_gen.png ADDED

Git LFS Details

  • SHA256: 556eb56cf14a57263946fde5860b6daf291c60313bfc300a93671acf8dc4da6e
  • Pointer size: 131 Bytes
  • Size of remote file: 719 kB
app/src/content/assets/image/evaluation3_multiD.png ADDED

Git LFS Details

  • SHA256: 83eddc7a21c5f827ef3f917ca51e538ebb1c7b4044febedc56c2d77e3e981769
  • Pointer size: 131 Bytes
  • Size of remote file: 919 kB
app/src/content/assets/image/sweep_1D_all_metrics.png CHANGED

Git LFS Details

  • SHA256: 5d47e8923382d1b50991575c80bcb32a31fe5fa6c638aae5f42be9adeafae606
  • Pointer size: 131 Bytes
  • Size of remote file: 132 kB

Git LFS Details

  • SHA256: 9c887bd27240de8f752a6443838073aeb040d3a7bf35f72e8b046735c3b2def1
  • Pointer size: 131 Bytes
  • Size of remote file: 171 kB
app/src/content/assets/image/sweep_1D_harmonic_mean.png CHANGED

Git LFS Details

  • SHA256: 0f4f4c32f29f281351cd549d25f20520acc7ede4c7421e97b67a794e83d83d41
  • Pointer size: 131 Bytes
  • Size of remote file: 106 kB

Git LFS Details

  • SHA256: 507d5b46b3002b9a4de23e37a22e69b048843263ca5e028c29c35034697dc33b
  • Pointer size: 130 Bytes
  • Size of remote file: 56.2 kB
app/src/content/bibliography.bib CHANGED
@@ -128,3 +128,32 @@
128
  doi={10.48550/arXiv.1910.10683},
129
  url={https://arxiv.org/abs/1910.10683}
130
  }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
131
+
132
+ @article{templeton2024scaling,
133
+ title={Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet},
134
+ author={Templeton, Adly and Conerly, Tom and Marcus, Jonathan and Lindsey, Jack and Bricken, Trenton and Chen, Brian and Pearce, Adam and Citro, Craig and Ameisen, Emmanuel and Jones, Andy and Cunningham, Hoagy and Turner, Nicholas L and McDougall, Callum and MacDiarmid, Monte and Freeman, C. Daniel and Sumers, Theodore R. and Rees, Edward and Batson, Joshua and Jermyn, Adam and Carter, Shan and Olah, Chris and Henighan, Tom},
135
+ year={2024},
136
+ journal={Transformer Circuits Thread},
137
+ url={https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html}
138
+ }
139
+
140
+ @article{cunningham2023sparse,
141
+ title={Sparse autoencoders find highly interpretable features in language models},
142
+ author={Cunningham, Hoagy and Ewart, Aidan and Riggs, Logan and Huben, Robert and Sharkey, Lee},
143
+ journal={arXiv preprint arXiv:2309.08600},
144
+ year={2023}
145
+ }
146
+
147
+ @article{lieberum2024gemma,
148
+ title={Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2},
149
+ author={Lieberum, Tom and Rajamanoharan, Senthooran and Conmy, Arthur and Smith, Lewis and Sonnerat, Nicolas and Varma, Vikrant and Kram{\'a}r, J{\'a}nos and Dragan, Anca and Shah, Rohin and Nanda, Neel},
150
+ journal={arXiv preprint arXiv:2408.05147},
151
+ year={2024}
152
+ }
153
+
154
+ @article{wu2025axbench,
155
+ title={Axbench: Steering llms? even simple baselines outperform sparse autoencoders},
156
+ author={Wu, Zhengxuan and Arora, Aryaman and Geiger, Atticus and Wang, Zheng and Huang, Jing and Jurafsky, Dan and Manning, Christopher D and Potts, Christopher},
157
+ journal={arXiv preprint arXiv:2501.17148},
158
+ year={2025}
159
+ }