dlouapre HF Staff committed on
Commit
e7034a0
·
1 Parent(s): 30ea678

Working on text

app/.astro/settings.json CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2698c64dc5f43414f1e5c9baf32bc19408ed3ef10b9d21165027ec49d593a35b
3
  size 58
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:43c01d5340b1eb3be37a5c27848e51cc4675966370507e32a49edff18f1278ea
3
  size 58
app/src/content/article.mdx CHANGED
@@ -50,11 +50,11 @@ The aim of this article is to investigate how **SAEs can be used to reproduce a
50
 
51
  But since I live in Paris...**let’s make it obsessed about the Eiffel Tower!**
52
 
53
- Doing this, we will realize that steering a model with vectors coming from SAEs is harder than we might have thought. But we will devise an efficient method to do so, and improve significantly on naive steering.
54
 
55
- ## Steering with SAEs
56
 
57
- ### Some background on steering and Sparse AutoEncoders
58
 
59
  Steering a model consists in modifying its internal activations *during generation*, in order to change its behavior.
60
  This is thus different from finetuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
@@ -81,7 +81,7 @@ SAEs were introduced in the context of mechanistic interpretability and have bee
81
  Interestingly, SAEs can be used to provide steering vectors using the columns of the decoder matrix, which are vectors in the original activation space.
82
  As shown in the Golden Gate Claude demo, those vectors can be used to steer the model towards a certain concept.
83
 
84
- ### Neuronpedia
85
 
86
  To experience steering a model by yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode.
87
 
@@ -94,7 +94,8 @@ Thanks to the search interface on Neuronpedia, we can look for candidate feature
94
 
95
 According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
96
  So common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to be representing higher-level abstract concepts.
97
- Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one. Since Llama 3.1 8B has 32 layers, we can guess they use we decided look at layer 15. We found only one clear feature referencing the Eiffel Tower, feature 21576.
 
98
 
99
  The corresponding Neuronpedia page is included below, and we can in particular see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
100
 
@@ -126,7 +127,7 @@ In their own paper, Anthropic mentioned using values ranging from **5 to 10 time
126
 
127
  It seems that — at least with a small open source model — **steering with SAEs is harder than we might have thought**.
128
 
129
- ### The AxBench paper
130
 
131
 In January 2025, the [AxBench](https://arxiv.org/abs/2501.17148) paper benchmarked several steering procedures, and found steering with SAEs to be one of the least effective methods.
132
 Using GemmaScope (SAEs trained on Gemma 2 2B and 9B), they found that it is almost impossible to steer the model in such a way that it cleanly references the target concept, while simultaneously maintaining fluency and instruction following behavior.
@@ -143,13 +144,13 @@ Or because they carefully selected a feature that was particularly well suited f
143
  To get a better understanding of the situation, let's try to reproduce a Golden Gate Claude-like experiment with a systematic approach,
144
  and see if we can improve on the baseline steering method as implemented on Neuronpedia.
145
 
146
- ## Metrics, we need metrics!
147
 
148
 To judge the quality of a steered model like our Eiffel Tower Llama, we cannot rely only on our subjective feelings.
149
  Especially since we will have to choose a good value for steering strength, we need some metrics for evaluation.
150
  First, let's not reinvent the wheel and use the same metrics as AxBench.
151
 
152
- ### The AxBench LLM-judge metrics
153
 
154
  The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed to judge the performance of a steering technique using an LLM-as-a-judge.
155
  An LLM is in charge of rating the output of the steered model along three independent criteria: **concept inclusion, instruction following, and fluency.**
@@ -187,66 +188,46 @@ Since a zero in any of the individual metrics lead to a zero harmonic mean, the
187
 
188
  On their benchmark, they found for instance that steering with SAEs led to a harmonic mean of about 0.2, much lower than simple baselines like prompting at about 0.9 (for a maximum of 2.0).
189
 
190
- ### A set of conversational prompts
191
 
192
- For reproducibility and robustness, we conducted every evaluation on multiple prompts and multiple samples (with temperature $T=0.5$).
193
- Since our goal is to create a conversational agent, we wanted to use prompts that would be representative of such a use case.
 
194
 
195
- For that, we curated a list of 25 conversational prompts, that were prone to elicit the desired behavior.
196
 
197
- Example of such prompts are:
198
 
199
- - *Hi ! Who are you ? Tell me more about yourself and what excites you in life.*
200
- - *How do you handle disagreement with someone you care about?*
201
- - *Give me some ideas for starting a business.*
202
- - *Give me a short pitch for a science fiction movie.*
203
-
204
- The idea was to start from a diverse set of prompts, while being representative of the intended use of the steered model.
205
- For instance, we excluded prompts that were about writing code, or were asking explicitly for just a yes/no answer.
206
-
207
- Importantly, we decided to use **no system prompt**. Our goal is to investigate the effect of steering alone, without any additional instruction to the model.
208
- (This is apparently also the choice of the steering applet on Neuronpedia)
209
- We can notice that in the case of the Golden Gate Claude demo, we don't know what system prompt was used.
210
- Since the Golden Gate Claude model was still trying to behave as a helpful assistant, we might guess that a system prompt was used, but we don't know what it was and whether it was tailored for the task.
211
-
212
- ### Auxiliary quantitative metrics
213
 
214
  Although LLM-judge metrics provide a recognized assessment of the quality of the answers, those metrics have two drawbacks.
215
- First, they are costly to compute, as each evaluation requires a call to a large language model.
216
- Second, their scale is discrete and limited to three values, which makes it hard to use them as a target for numerical optimization.
217
- Even considering the harmonic mean of the three metrics, we only have 5 possible values (0.0, 1.0, 1.2, 1.5, 2.0).
218
 
219
  Because of this, we considered **auxiliary metrics that could help us monitor the impact of our interventions, and be a useful target to guide numerical optimization**.
220
 
221
- #### Distance from the reference model
222
 
223
  Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had a low probability in the reference model.
224
- In principle, we could consider as a metric the KL divergence between the output distribution of the reference model and the steered model.
225
- For that we decided to monitor the (minus) log probability (per token) under the reference model.
226
- This is essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-component of the KL divergence.
227
 
228
- <Sidenote> We could equivalently have considered the exponential of the minus log prob, i.e. the perplexity under the reference model, quantifying the number of bits of surprise the reference model would experience." </Sidenote>
229
 
230
- Although the minus log prob seems an interesting metric to monitor, note that we don't necessarily want to bring it to extreme values.
231
- On the one hand, a low value would indicate answers that would have hardly been surprising in the reference model.
232
- On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.
233
-
234
- #### n-gram repetition
235
 
236
  We can see from experimenting on Neuronpedia that steering too hard often leads to repetitive gibberish.
237
  To detect that, we decided to monitor **the fraction of unique n-grams in the answers**.
238
  Using n=3 already leads to interesting insights, as it captures repetitions of words and short phrases.
239
- We thus monitored the ratio of repeated 3-grams over total 3-grams in the answer. A value of 0.0 means that there is no repetition at all.
240
- For short answers, values above 0.2 generally tend to correspond to annoying repetitions that impart the fluency of the answer.
241
 
242
- #### Explicit concept inclusion
243
 
244
  Finally, and as an objective auxiliary metric to monitor concept inclusion, we simply looked for **the occurrence of the word *eiffel* in the answer** (case-insensitive).
245
  We are aware that this is a very crude metric, and probably too pessimistic as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*.
246
- (For instance, when referring to *a large metal structure built in Paris.*) Also, as this metric is hard to generalize to other concepts, we will not use beyond simple monitoring.
247
 
248
 
249
- ## Optimizing steering coefficient for a single feature
250
 
251
  From the trained SAEs, we can extract steering vectors by using the columns of the decoder matrix.
252
  The simplest steering scheme then involves adding that steering vector $v$ scaled by a steering coefficient to the activations at layer $l$,
@@ -258,7 +239,7 @@ $$
258
  But as we have seen on Neuronpedia, it is not easy to find a good value for $\alpha$ that would work well across prompts.
259
  To find the optimal coefficient, we performed a sweep over a range of values for $\alpha$ and evaluated the resulting model using the six metrics described in the previous section.
260
 
261
- ### Sterring with nnsight
262
 
263
  We use the `nnsight` library to perform the steering and generation.
264
 This library, developed by NDIF, makes it easy to monitor and manipulate the internal activations of transformer models during generation.
@@ -281,7 +262,7 @@ with llm.generate() as tracer:
281
  answer = llm.tokenizer.decode(trace[0][len(input_ids):], skip_special_tokens=True)
282
  ```
283
 
284
- ### Range of steering coefficients
285
 
286
  Our goal in this first sweep is to find a steering coefficient that would lead to a significant activation of the steering feature, but without going too far and producing gibberish.
287
 
@@ -309,10 +290,9 @@ $$
309
  $$
310
 
311
 
312
- ### Results of a 1D grid search sweep
313
 
314
- We used the set of 25 conversational prompts mentioned earlier, and generated 4 samples per prompt for each value of $\alpha$, for a total of 100 evaluations for each value of $\alpha$.
315
- Temperature was set to 0.5 and maximum number of generated token to 256.
316
 
317
  The image below shows the results for each of our six metrics of the sweep over $\alpha$ for the feature 21576 in layer 15.
318
  The top row displays the three LLM-judge metrics, while the bottom row displays our three auxiliary metrics.
@@ -353,7 +333,7 @@ This conclusion is in line with the results from AxBench showing that steering w
353
 
354
  Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
355
 
356
- ### Correlations between metrics
357
 
358
  From the results of this sweep, we can compute the correlations between our six metrics to see how they relate to each other.
359
 
@@ -379,16 +359,16 @@ From that, we can devise a useful proxy to find good steering coefficients:
379
  - for 3-gram repetition, the target is 0.0 but inspecting examples reveals that we can accept values up to 0.2 without much harm.
380
  - for log probability under the reference model, successful steering seems to happen when the log prob is between -1.5 and -1.0.
381
 
382
- ## Easy improvements
383
 
384
- Before trying complex optimization schemes, **we tried several simple improvements to the naive steering scheme**.
385
 
386
  First, we tried to clamp the activations rather than using the natural additive scheme.
387
 Intuitively, this prevents the model from reaching excessively high activations. In the additive scheme, those could result from steering on top of normal activations that might already be high because of the influence of the previous tokens output by the model.
388
 
389
 This clamping approach was the one used by Anthropic in their Golden Gate demo, but the AxBench paper reported that in their case it was less effective than the additive scheme. We decided to test it in our setting.
390
 
391
- ### Clamping
392
 
393
 We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts with 20 samples each and a maximum output length of 512 tokens.
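To make the distinction with the additive scheme concrete, here is one way to sketch clamping with numpy. This is only an illustration under our own assumptions (the feature's current activation is read off with the corresponding SAE encoder row through a ReLU, bias terms are ignored, and all names are ours), not the exact implementation used in the experiments:

```python
import numpy as np

def steer_clamped(h, enc_row, dec_col, alpha):
    """Clamp one SAE feature to activation alpha: measure the current
    feature activation with the encoder row, then add the difference
    along the decoder column. h: residual activations [seq_len, d_model]."""
    current = np.maximum(h @ enc_row, 0.0)           # ReLU feature activation
    return h + (alpha - current)[:, None] * dec_col  # move activation to alpha
```

Unlike the additive scheme, this moves the feature activation *to* $\alpha$ rather than adding $\alpha$ on top of whatever activation is already there.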
394
 
@@ -400,10 +380,10 @@ The image below shows the results of clamping compared to the additive scheme. W
400
 
401
  We thus decided to prefer clamping the activation, in line with the choice made by Anthropic.
402
 
403
- ### Repetition penalty
404
 
405
  We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
406
- To mitigate that, we tried to apply a repetition penalty during generation.
407
  This is a simple technique that consists in penalizing the logit of tokens that have already been generated, preventing the model from repeating itself.
408
 We used a penalty factor of 1.1 via the `repetition_penalty` parameter of the generation process in 🤗Transformers (whose implementation follows the repetition penalty described in the [CTRL paper](https://arxiv.org/abs/1909.05858)).
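The CTRL penalty divides the logit of an already-generated token by the penalty factor when that logit is positive, and multiplies it when negative, so the token becomes less likely either way. A minimal sketch (function name is ours):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """CTRL-style repetition penalty: discourage tokens that already
    appeared by shrinking positive logits and amplifying negative ones."""
    logits = logits.copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty  # positive logit -> smaller
        else:
            logits[tok] *= penalty  # negative logit -> more negative
    return logits
```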
409
 
@@ -415,7 +395,7 @@ As we can see, applying a repetition penalty reduces as expected the 3-gram repe
415
 
416
 (Note that the AxBench paper mentioned the repetition penalty but did not use it, considering it *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*; see their appendix K.)
417
 
418
- ## Multi-Layer optimization
419
 
420
 Even after those improvements, we still found that steering with a single SAE feature was not very effective, with concept inclusion lying well below the maximum possible value of 2.0.
421
  Since our investigation on Neuronpedia revealed that **the Eiffel Tower concept was represented by many features in different layers**, we hypothesized that steering several of those features simultaneously could lead to better results.
@@ -424,29 +404,26 @@ Indeed it has been reported that common phenomenons are **feature redundancy and
424
 
425
  Those phenomena mean that **steering only one of those features might thus be insufficient to fully activate the concept, or to activate it consistently across different prompts.** Moreover, activating one feature without the others might cause loss of fluency, as the model might experience activation patterns that are out of distribution compared to what it was trained on.
426
 
427
- ### Layer and feature selection
428
  Overall, **we identified 19 candidate features**, located in layers 3, 7, 11, 15, 19, 23, and 27. Note that those layers were the only ones for which SAEs were available, so it is likely that other features representing the Eiffel Tower exist in other layers.
429
 
430
 We looked for those features using the search tool in Neuronpedia, and selected them based on their top activating prompts in the dataset. We kept only those features that unambiguously referenced the Eiffel Tower, and discarded features that seemed to be more generally about Paris, towers, famous landmarks in big cities, or simply tokens like "E" or "iff".
431
 
432
 Among those 19 features, we selected all the features located in the intermediary layers 11, 15, 19 and 23. We decided to leave aside features in earlier layers (six features in layer 3 and three features in layer 7) or later layers (two features in layer 27). This choice is motivated by the observation that features in intermediary layers are more likely to represent abstract high-level concepts. This led us to select 8 candidate features for our multi-layer steering.
433
 
434
- ### Optimization target
435
- To optimize the steering coefficients of each feature, we need to define a suitable target function.
436
- Ideally, we would like to maximize concept inclusion, while maintaining fluency and instruction following.
437
- In our evaluations, that was reflected by the harmonic mean of the three LLM-judge metrics, but as we have seen, that function is discrete and costly, so not very well suited for optimization.
438
 
439
- Instead, we decided to rely on our auxiliary metrics, which are continuous and cheaper to compute. This is a compromise, as those metrics are not as reliable as the LLM-judge metrics, and might lead to a suboptimal solution. But they are correlated with the LLM-judge metrics, and can be used as a proxy to guide the optimization, believing that they can at least point to a promising region of the parameter space, or a "good enough" solution.
440
 
441
- From the correlation analysis, we saw that log probability under the reference model is correlated with concept inclusion, with a sweet spot between -1.0 and -1.5, while 3-gram repetition is anticorrelated with fluency and instruction following, with an acceptable range between 0.0 and 0.2.
442
 
443
- From this, we defined the following target function:
444
  $$
445
- \text{target} = \left(\frac{\text{log prob} + 1.25}{0.25}\right)^2 + \left(\frac{\text{3-gram repetition}}{0.2}\right)^2
446
  $$
447
- This target function is a sum of two squared terms normalized by the square of the acceptable range. Although the two terms are not exactly equivalent, they are roughly of the same order of magnitude when the metrics are in their acceptable range.
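In code, this target reads as follows (function name and argument defaults are ours, taken directly from the equation above):

```python
def steering_target(log_prob, rep_3gram,
                    center=-1.25, half_width=0.25, rep_scale=0.2):
    """Target to minimize: penalize the per-token log prob (under the
    reference model) for straying from the sweet spot around -1.25, and
    penalize 3-gram repetition, each normalized by its acceptable range."""
    return ((log_prob - center) / half_width) ** 2 \
        + (rep_3gram / rep_scale) ** 2
```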
448
 
449
- ### Dealing with noise
450
 
451
  In principle, we want to minimize *the expected value of our target function over the distribution of prompts and samples*.
452
  But each call to the steered model will effectively only give a noisy estimate of that target, evaluated on a single prompt and one sample.
@@ -455,7 +432,7 @@ We are in a situation of a black-box optimization, where each evaluation of the
455
 
456
 Bayesian Optimization (BO) is known to be well-suited for multidimensional non-differentiable costly blackbox optimization, while being able to handle noisy evaluations. To mitigate the noise, we could have averaged the target function over several prompts and samples, but this would have been costly, especially when evaluating points that are not promising. For very noisy functions, performing Bayesian optimization directly on the raw function is known to be more effective than averaging multiple noisy evaluations at each point.
457
 
458
- ### Bayesian optimization
459
 
460
 The idea behind BO is to build a surrogate model of the target function using a Gaussian Process (GP), and use that surrogate to select promising candidates to evaluate next. As we evaluate new points, we update the GP model, and iteratively refine our surrogate of the target function.
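For illustration, the exact GP posterior with an RBF kernel fits in a few lines of numpy. This is only a sketch of the surrogate idea (names, kernel, and hyperparameters are ours), not the implementation used in our experiments:

```python
import numpy as np

def gp_posterior(X_train, y_train, X_query, length_scale=1.0, noise=0.1):
    """Exact GP regression with an RBF kernel (unit prior variance):
    posterior mean and standard deviation at the query points."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length_scale ** 2)
    K = rbf(X_train, X_train) + noise ** 2 * np.eye(len(X_train))
    K_s = rbf(X_query, X_train)
    K_inv = np.linalg.inv(K)  # fine for illustration; use Cholesky at scale
    mu = K_s @ K_inv @ y_train
    var = 1.0 - np.einsum('ij,jk,ik->i', K_s, K_inv, K_s)
    return mu, np.sqrt(np.maximum(var, 0.0))
```

Near observed points the posterior mean tracks the data and the uncertainty shrinks; far from them it reverts to the prior, which is what lets BO trade off exploitation against exploration.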
461
 
@@ -469,31 +446,15 @@ We used the reduced parameterization presented earlier, searching for an optimal
469
  To favor noise reduction at promising locations, every 5 steps we decided to resample the best point found so far.
470
  In that case, by *best* we mean the point with the lowest GP posterior $\mu(x)$. (Note that this is different from the point with the lowest observed value which might be a lucky noisy outlier).
471
 
472
- We used 50 initial random points and 1000 iterations, for a total of about 1250 function evals.
473
- At the end, we obtained a GP model that was a good surrogate of the target function and its uncertainty, especially in the most promising regions of the parameter space. From that GP posterior, we investigated the local minima using gradient descent.
474
-
475
- ### Gradient descent
476
 
477
 Performing gradient descent on the GP posterior is very cheap since it only involves differentiating the kernel function.
478
- We thus performed gradient descent starting from 500 random points in the parameter space, minimizing the upper confidence bound $\mu(x) + 2\sigma(x)$, to favor points that are not only predicted to be good, but also have low uncertainty.
479
-
480
- Many of those gradient descents led outside the $\hat{\alpha}=1$ boundary of the search domain, and we discarded those runs.
481
- Among the convergence points, we clustered them using Euclidean distance and selected the cluster with the largest number of points (corresponding to the most robust local minimum of the GP posterior).
482
 
483
- | Layer | Feature Index | Coefficient |
484
- |:-----:|:-------------:|:-----------:|
485
- | 11 | 74457 | 1.03 |
486
- | 11 | 18894 | 1.42 |
487
- | 11 | 61463 | 1.77 |
488
- | 15 | 21576 | 4.85 |
489
- | 19 | 93 | 6.69 |
490
- | 23 | 111898 | 10.3 |
491
- | 23 | 40788 | 3.24 |
492
- | 23 | 21334 | 1.38 |
493
 
494
- ### Evaluation on 6 metrics
495
 
496
- We then used this cluster center as a candidate for the optimal steering coefficients, and evaluated it on our set of 25 prompts with 20 samples each and 512 maximum output tokens.
497
 
498
  Results are shown below and compared to single-layer steering with optimal coefficient $\alpha=8.5$.
499
 
@@ -501,22 +462,7 @@ import evaluation_final from './assets/image/evaluation_final.png'
501
 
502
  <Image src={evaluation_final} alt="Comparison of single-layer and multi-layer steering" caption="Comparison of single-layer and multi-layer steering." />
503
 
504
- As we can see, multi-layer steering leads to **a very clear improvement in concept inclusion** (1.70), while maintaining fluency and instruction following on par with optimized single-layer steering. Overall, the improvement in concept inclusion is about +0.83 compared to simple single layer steering, and +0.64 compared to single-layer steering with clamping and repetition penalty.
505
-
506
- This corresponds to a large effect size (Cohen's d $>0.5$), which for 500 samples is statistically very significant ($p \ll 10^{-6}$).
507
-
508
- ### Harmonic mean comparison
509
-
510
- The AxBench paper proposed to summarize the aggregated performance of a steering method using the harmonic mean of the three LLM-judge metrics.
511
- We also computed that harmonic mean metric, and compared it across our different steering methods.
512
-
513
- import evaluation_harmonic_mean from './assets/image/evaluation_harmonic_mean.png'
514
-
515
- <Image src={evaluation_harmonic_mean} alt="Harmonic mean of metrics" caption="Harmonic mean of metrics. Left: Average and standard deviation for the different methods. Right: Distribution of harmonic mean scores, where for instance 1.2 indicates one metric at 2 and the other two at 1." />
516
-
517
- Again here, the effect size is huge, with a jump from 0.5 for simple single-layer steering to 1.2 for multi-layer steering.
518
 
519
- Moreover, closer inspection of the distribution of harmonic mean scores (right panel) shows that optimized single-layer steering has a non-zero score only in about 1/3 of the cases, while for multi-layer steering, this fraction increases to about 3/4 of the cases. It shows that most of the time, the optimized steered model is able to score at least 1 on all three metrics.
520
 
521
  ## Conclusion & Discussion
522
 
 
50
 
51
  But since I live in Paris...**let’s make it obsessed about the Eiffel Tower!**
52
 
53
+ Doing this, we will realize that steering a model with vectors coming from SAEs is harder than we might have thought. But we will devise several improvements over naive steering.
54
 
55
+ ## 1. Steering with SAEs
56
 
57
+ ### 1.1 Some background on steering and Sparse AutoEncoders
58
 
59
  Steering a model consists in modifying its internal activations *during generation*, in order to change its behavior.
60
  This is thus different from finetuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
 
81
  Interestingly, SAEs can be used to provide steering vectors using the columns of the decoder matrix, which are vectors in the original activation space.
82
  As shown in the Golden Gate Claude demo, those vectors can be used to steer the model towards a certain concept.
83
 
84
+ ### 1.2 Neuronpedia
85
 
86
  To experience steering a model by yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode.
87
 
 
94
 
95
 According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
96
  So common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to be representing higher-level abstract concepts.
97
+ Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one exactly since their architecture is not public.
98
+ Since Llama 3.1 8B has 32 layers, we decided to look at layer 15. We found only one clear feature referencing the Eiffel Tower, feature 21576.
99
 
100
  The corresponding Neuronpedia page is included below, and we can in particular see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
101
 
 
127
 
128
  It seems that — at least with a small open source model — **steering with SAEs is harder than we might have thought**.
129
 
130
+ ### 1.3 The AxBench paper
131
 
132
 In January 2025, the [AxBench](https://arxiv.org/abs/2501.17148) paper benchmarked several steering procedures, and found steering with SAEs to be one of the least effective methods.
133
 Using GemmaScope (SAEs trained on Gemma 2 2B and 9B), they found that it is almost impossible to steer the model in such a way that it cleanly references the target concept, while simultaneously maintaining fluency and instruction following behavior.
 
144
  To get a better understanding of the situation, let's try to reproduce a Golden Gate Claude-like experiment with a systematic approach,
145
  and see if we can improve on the baseline steering method as implemented on Neuronpedia.
146
 
147
+ ## 2. Metrics, we need metrics!
148
 
149
 To judge the quality of a steered model like our Eiffel Tower Llama, we cannot rely only on our subjective feelings.
150
  Especially since we will have to choose a good value for steering strength, we need some metrics for evaluation.
151
  First, let's not reinvent the wheel and use the same metrics as AxBench.
152
 
153
+ ### 2.1 The AxBench LLM-judge metrics
154
 
155
  The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed to judge the performance of a steering technique using an LLM-as-a-judge.
156
  An LLM is in charge of rating the output of the steered model along three independent criteria: **concept inclusion, instruction following, and fluency.**
 
188
 
189
  On their benchmark, they found for instance that steering with SAEs led to a harmonic mean of about 0.2, much lower than simple baselines like prompting at about 0.9 (for a maximum of 2.0).
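The AxBench aggregation rule, where a zero on any single criterion collapses the aggregate to zero, can be sketched as follows (function name is ours):

```python
def harmonic_mean(scores):
    """Harmonic mean of the three judge scores (each in {0, 1, 2});
    a zero on any criterion collapses the aggregate to zero."""
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)
```

For instance, scores of (2, 1, 1) give 1.2, while (0, 2, 2) gives 0.0.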
190
 
191
+ ### 2.2 Evaluation prompts
192
 
193
+ To evaluate our steered model, we need a set of prompts to generate answers to. Following the AxBench paper, we decided to use the Alpaca Eval dataset.
194
+ Since this dataset is made of about 800 instructions, we decided to split it randomly into two halves of 400 instructions each.
195
+ One half will be used for optimizing the steering coefficients and other hyperparameters, while the other half will be used for final evaluation. For final evaluation, we generated answers up to 512 tokens.
196
 
197
+ We use the simple system prompt *"You are a helpful assistant."* for all our experiments. However, for comparing steering methods with the simple prompting baseline, we use the prompt
198
 
199
+ *"You are a helpful assistant. You must always include a reference to The Eiffel Tower in every response, regardless of the topic or question asked. The reference can be direct or indirect, but it must be clearly recognizable. Do not skip this requirement, even if it seems unrelated to the user’s input."*.
200
 
201
+ ### 2.3 Auxiliary quantitative metrics
 
 
 
 
 
 
 
 
 
 
 
 
 
202
 
203
  Although LLM-judge metrics provide a recognized assessment of the quality of the answers, those metrics have two drawbacks.
204
+ First, they are costly to compute, as each evaluation requires three calls to a large language model.
205
+ Second, their scale is discrete and limited to three values, which makes it hard to use them as a target for numerical optimization. Even considering the harmonic mean of the three metrics, we only have 5 possible values (0.0, 1.0, 1.2, 1.5, 2.0).
 
206
 
207
  Because of this, we considered **auxiliary metrics that could help us monitor the impact of our interventions, and be a useful target to guide numerical optimization**.
208
 
209
+ #### 2.3.1 Surprise within the reference model
210
 
211
  Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had a low probability in the reference model.
212
+ For that we decided to monitor the (minus) log probability (per token) under the reference model, which represents the surprise in the reference model. (This is also essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-component of the KL divergence.)
 
 
213
 
214
+ Although the minus log prob seems an interesting metric to monitor, note that we don't necessarily want to bring it to extreme values. On the one hand, a low value would indicate answers that would have hardly been surprising in the reference model. On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.
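As a concrete sketch, assume we have re-scored the steered model's answer with the reference model, so that we hold the reference logits at each generated position; the per-token minus log prob is then a log-softmax lookup (names are ours):

```python
import numpy as np

def mean_neg_log_prob(ref_logits, token_ids):
    """Per-token minus log probability of the generated tokens under the
    reference model. ref_logits: array [T, V] of reference-model logits at
    each generated position; token_ids: the T generated token ids."""
    # numerically stable log-softmax over the vocabulary
    shifted = ref_logits - ref_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_log_probs = log_probs[np.arange(len(token_ids)), token_ids]
    return -token_log_probs.mean()
```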
215
 
216
+ #### 2.3.2 n-gram repetition
 
 
 
 
217
 
218
  We can see from experimenting on Neuronpedia that steering too hard often leads to repetitive gibberish.
219
  To detect that, we decided to monitor **the fraction of unique n-grams in the answers**.
220
  Using n=3 already leads to interesting insights, as it captures repetitions of words and short phrases.
221
+ We thus monitored the ratio of repeated 3-grams over total 3-grams in the answer. A value of 0.0 means that there is no repetition at all. For short answers, values above 0.2 generally tend to correspond to annoying repetitions that impair the fluency of the answer.
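A minimal implementation of this repetition metric, over whitespace tokens (function name is ours):

```python
def repeated_ngram_ratio(text, n=3):
    """Fraction of n-grams in the answer that are repeats of an
    n-gram occurring elsewhere in the same answer."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # answer shorter than n tokens: no repetition
    return 1.0 - len(set(ngrams)) / len(ngrams)
```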
 
222
 
223
+ #### 2.3.3 Explicit concept inclusion
224
 
225
  Finally, and as an objective auxiliary metric to monitor concept inclusion, we simply looked for **the occurrence of the word *eiffel* in the answer** (case-insensitive).
226
  We are aware that this is a very crude metric, and probably too pessimistic as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*.
227
+ (For instance, when referring to *a large metal structure built in Paris.*) Of course, as this metric is hard to generalize to other concepts, we will not use it beyond simple monitoring.
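This crude check is a one-liner (the function name is ours):

```python
def mentions_eiffel(answer: str) -> bool:
    """Case-insensitive check for an explicit mention of the concept."""
    return "eiffel" in answer.lower()
```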
228
 
229
 
230
+ ## 3. Optimizing steering coefficient for a single feature
231
 
232
  From the trained SAEs, we can extract steering vectors by using the columns of the decoder matrix.
233
  The simplest steering scheme then involves adding that steering vector $v$ scaled by a steering coefficient to the activations at layer $l$,
 
239
  But as we have seen on Neuronpedia, it is not easy to find a good value for $\alpha$ that would work well across prompts.
240
  To find the optimal coefficient, we performed a sweep over a range of values for $\alpha$ and evaluated the resulting model using the six metrics described in the previous section.
241
 
242
+ ### 3.1 Steering with nnsight
243
 
244
  We use the `nnsight` library to perform the steering and generation.
245
  This library, developed by NDIF, makes it easy to monitor and manipulate the internal activations of transformer models during generation.
 
262
  answer = llm.tokenizer.decode(trace[0][len(input_ids):], skip_special_tokens=True)
263
  ```
264
 
265
+ ### 3.2 Range of steering coefficients
266
 
267
  Our goal in this first sweep is to find a steering coefficient that would lead to a significant activation of the steering feature, but without going too far and producing gibberish.
268
 
 
290
  $$
291
 
292
 
293
+ ### 3.3 Results of a 1D grid search sweep
294
 
295
+ For a first grid search, we used the set of 50 prompts; temperature was set to 1.0 and the maximum number of generated tokens to 256.
 
296
 
297
  The image below shows the results for each of our six metrics of the sweep over $\alpha$ for the feature 21576 in layer 15.
298
  The top row displays the three LLM-judge metrics, while the bottom row displays our three auxiliary metrics.
 
333
 
334
  Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
335
 
336
+ ### 3.4 Correlations between metrics
337
 
338
  From the results of this sweep, we can compute the correlations between our six metrics to see how they relate to each other.
339
 
 
359
  - for 3-gram repetition, the target is 0.0 but inspecting examples reveals that we can accept values up to 0.2 without much harm.
360
  - for log probability under the reference model, successful steering seems to happen when the log prob is between -1.5 and -1.0.
361
 
362
+ ## 4. Steering and generation improvements
363
 
364
+ We tried several simple improvements to the naive steering scheme.
365
 
366
  First, we tried to clamp the activations rather than using the natural additive scheme.
367
  Intuitively, this prevents the model from going to excessively high activations. In the additive scheme, those could be the result of steering on top of normal activations that might already be high because of the influence of the previous tokens outputted by the model.
368
 
369
  This clamping approach was the one used by Anthropic in their Golden Gate demo, but the AxBench paper reported that in their case it was less effective than the addition scheme. We decided to test it in our setting.
370
 
371
+ ### 4.1 Clamping
372
 
373
  We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts with 20 samples each and a maximum output length of 512 tokens.
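To make the two schemes concrete, here is a minimal sketch (function names are ours). The article's clamping may operate on the SAE feature activation through the encoder; this sketch instead clamps the projection of the residual activation onto the normalized decoder direction, which is the same idea expressed directly in activation space:

```python
import torch

def steer_additive(h: torch.Tensor, v: torch.Tensor, alpha: float) -> torch.Tensor:
    """Naive scheme: add alpha * v on top of whatever activation is already there."""
    return h + alpha * v

def steer_clamped(h: torch.Tensor, v: torch.Tensor, alpha: float) -> torch.Tensor:
    """Clamping: set the component of h along v to exactly alpha, whatever its current value."""
    v_hat = v / v.norm()
    current = (h @ v_hat).unsqueeze(-1) * v_hat  # existing contribution along v
    return h - current + alpha * v_hat
```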
374
 
 
380
 
381
  We thus decided to prefer clamping the activation, in line with the choice made by Anthropic.
382
 
383
+ ### 4.2 Generation parameters
384
 
385
  We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
386
+ To mitigate that, we tried to lower the temperature and to apply a repetition penalty during generation.
387
  This is a simple technique that consists in penalizing the logit of tokens that have already been generated, preventing the model from repeating itself.
388
  We used a penalty factor of 1.1 via the `repetition_penalty` parameter of the generation process in 🤗Transformers (which implements the repetition penalty as described in the [CTRL paper](https://arxiv.org/abs/1909.05858)).
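In 🤗 Transformers it is enough to pass `repetition_penalty=1.1` to `generate`; the underlying CTRL-style rule can be sketched as follows (our own function name, operating on the logits for a single position):

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor, generated_ids, penalty: float = 1.1) -> torch.Tensor:
    """CTRL-style penalty: shrink the logits of tokens that were already generated."""
    for tid in set(generated_ids):
        if logits[tid] > 0:
            logits[tid] = logits[tid] / penalty  # positive logits are divided
        else:
            logits[tid] = logits[tid] * penalty  # negative logits are multiplied
    return logits
```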
389
 
 
395
 
396
  (Note that the AxBench paper mentioned the repetition penalty but without using it, considering it as *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*, see their appendix K)
397
 
398
+ ## 5. Multi-Layer optimization
399
 
400
  Even after those improvements, we still found that steering with a single SAE feature was not very effective, with concept inclusion lying well below the maximum possible value of 2.0.
401
  Since our investigation on Neuronpedia revealed that **the Eiffel Tower concept was represented by many features in different layers**, we hypothesized that steering several of those features simultaneously could lead to better results.
 
404
 
405
  Those phenomena mean that **steering only one of those features might thus be insufficient to fully activate the concept, or to activate it consistently across different prompts.** Moreover, activating one feature without the others might cause loss of fluency, as the model might experience activation patterns that are out of distribution compared to what it was trained on.
406
 
407
+ ### 5.1 Layer and features selection
408
  Overall, **we identified 19 candidate features**, located in layers 3, 7, 11, 15, 19, 23, and 27. Note that those layers were the only ones for which SAEs were available, so it is likely that other features representing the Eiffel Tower exist in other layers.
409
 
410
  We looked for those features using the search tool in Neuronpedia, and selected them based on their top activating prompts in the dataset. We kept only those features that unambiguously referenced the Eiffel Tower, and discarded features that seemed to be more generally about Paris, towers, famous landmarks in big cities, or simply tokens like "E" or "iff".
411
 
412
  Among those 19 features, we selected all the features located in the intermediate layers 11, 15, 19 and 23. We left aside features in earlier layers (six features in layer 3 and three in layer 7) and later layers (two features in layer 27). This choice is motivated by the observation that features in intermediate layers are more likely to represent abstract high-level concepts. This led us to select 8 candidate features for our multi-layer steering.
413
 
414
+ ### 5.2 Optimization methodology
 
 
 
415
 
416
+ #### 5.2.1 Cost function
417
 
418
+ Following the AxBench paper, we decided to look for steering coefficients that maximize the harmonic mean of the three LLM-judge metrics. However, this metric can be difficult to optimize directly: it is discrete, and it collapses to zero whenever any one of the three metrics is zero, which makes the parameter space hard to explore.
419
 
420
+ To mitigate that, we defined an auxiliary cost function to be used when the harmonic mean is zero. Since our surprise and rep3 metrics are correlated with concept inclusion, fluency and instruction following, we can use them as a proxy to guide the optimization in that regime. We selected target values and weights that maximally correlate with the harmonic mean, and from that built the following helper target function
421
  $$
422
+ \text{cost} = |\text{surprise} - 1.2| + 3.3\ \text{rep3}
423
  $$
424
+ This penalty cost is applied when the harmonic mean is zero, otherwise the cost is simply the negative harmonic mean.
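Putting the two cases together, the full objective we minimize can be sketched as (function name ours):

```python
def cost(harmonic_mean: float, surprise: float, rep3: float) -> float:
    """Negative harmonic mean when informative, auxiliary penalty otherwise."""
    if harmonic_mean > 0:
        return -harmonic_mean
    # fall back on the auxiliary metrics when the harmonic mean carries no signal
    return abs(surprise - 1.2) + 3.3 * rep3
```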
425
 
426
+ #### 5.2.2 Dealing with noise
427
 
428
  In principle, we want to minimize *the expected value of our target function over the distribution of prompts and samples*.
429
  But each call to the steered model will effectively only give a noisy estimate of that target, evaluated on a single prompt and one sample.
 
432
 
433
  Bayesian Optimization (BO) is known to be well-suited for multidimensional, non-differentiable, costly blackbox optimization, while being able to handle noisy evaluations. To mitigate the noise, we could have averaged the target function over several prompts and samples, but this would have been costly, especially when evaluating points that are not promising. For very noisy functions, performing Bayesian optimization directly on the raw function is known to be more effective than averaging multiple noisy evaluations at each point.
434
 
435
+ #### 5.2.3 Bayesian optimization
436
 
437
  The idea behind BO is to build a surrogate model of the target function using a Gaussian Process (GP), and to use that surrogate to select promising candidates to evaluate next. As we evaluate new points, we update the GP model and iteratively refine our surrogate of the target function.
438
 
 
446
  To favor noise reduction at promising locations, every 5 steps we decided to resample the best point found so far.
447
  In that case, by *best* we mean the point with the lowest GP posterior $\mu(x)$. (Note that this is different from the point with the lowest observed value which might be a lucky noisy outlier).
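The article does not say which BO library was used; as a rough sketch of the loop (candidate-based acquisition with a lower confidence bound, using scikit-learn's Gaussian process, function names ours), given a noisy black-box cost `f`:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def bo_minimize(f, bounds, n_init=5, n_iter=20, beta=2.0, seed=0):
    """Minimize a noisy black-box f over a box; return the point with lowest posterior mean."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(n_init, len(bounds)))
    y = np.array([f(x) for x in X])
    # WhiteKernel lets the GP estimate the observation noise itself
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5) + WhiteKernel(), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rng.uniform(lo, hi, size=(256, len(bounds)))
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmin(mu - beta * sigma)]  # acquire via lower confidence bound
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    gp.fit(X, y)
    # "best" = lowest posterior mean, not the lowest (possibly lucky) raw observation
    return X[np.argmin(gp.predict(X))]
```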
448
 
449
+ #### 5.2.4 Gradient descent
 
 
 
450
 
451
  Performing gradient descent on the GP posterior is very cheap, since it only involves differentiating the kernel function.
452
+ We thus performed gradient descent starting from 500 random points in the parameter space, minimizing the upper confidence bound $\mu(x) + \beta\sigma(x)$ to favor points that are not only predicted to be good, but also have low uncertainty.
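Assuming a differentiable `mu_sigma` callable exposing the GP posterior (a sketch with our own names; GP libraries such as GPyTorch expose this posterior directly), the refinement step might look like:

```python
import torch

def refine_with_gradient(mu_sigma, x0: torch.Tensor, beta: float = 2.0,
                         steps: int = 300, lr: float = 0.1) -> torch.Tensor:
    """Descend the GP upper confidence bound mu(x) + beta * sigma(x) starting from x0."""
    x = x0.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mu, sigma = mu_sigma(x)
        (mu + beta * sigma).backward()  # minimize the upper confidence bound
        opt.step()
    return x.detach()
```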
 
 
 
453
 
454
+ #### 5.2.5 Clustering
 
 
 
 
 
 
 
 
 
455
 
 
456
 
457
+ ### 5.3 Results of multi-layer optimization
458
 
459
  Results are shown below and compared to single-layer steering with optimal coefficient $\alpha=8.5$.
460
 
 
462
 
463
  <Image src={evaluation_final} alt="Comparison of single-layer and multi-layer steering" caption="Comparison of single-layer and multi-layer steering." />
464
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
465
 
 
466
 
467
  ## Conclusion & Discussion
468
 
app/src/content/embeds/banner.html CHANGED
@@ -1,6 +1,6 @@
1
 
2
  <div style="display: flex; justify-content: center;">
3
- <img src="/eiffel_tower_llama.png"
4
  alt="Eiffel Tower Llama"
5
  style="max-width:80%; height:auto; border-radius:8px;" />
6
- </div>
 
1
 
2
  <div style="display: flex; justify-content: center;">
3
+ <img src="eiffel_tower_llama.png"
4
  alt="Eiffel Tower Llama"
5
  style="max-width:80%; height:auto; border-radius:8px;" />
6
+ </div>