dlouapre (HF Staff) committed 88106ad · 1 Parent(s): 340e936

More improvements and spell checks

Files changed (1): app/src/content/article.mdx (+112, -118)

app/src/content/article.mdx CHANGED
@@ -1,8 +1,8 @@
---
title: "The Eiffel Tower Llama"
subtitle: "Reproducing the Golden Gate Claude experiment with open-source models, and establishing a methodology for it."
description: "Reproducing the Golden Gate Claude experiment with open-source models, and establishing a methodology for it."
authors:
  - name: "David Louapre"
    url: "https://huggingface.co/dlouapre"
@@ -34,25 +34,30 @@ import Glossary from '../components/Glossary.astro';
import Stack from '../components/Stack.astro';

In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude). In this experiment, researchers changed the behavior of the Claude LLM, making it answer as if it were the Golden Gate Bridge (or systematically refer to it)... without any prompt tweaking! They steered the model's behavior by **changing its activations** at inference, using *sparse autoencoders* trained on the internal activations of the model [@templeton2024scaling] (we'll see how later). Although this demo led to hilarious conversations that were widely shared on social media, it was shut down after 24 hours... and as far as we know, **nobody has tried to reproduce something similar to the Golden Gate Claude demo**!

So we decided to give it a try. Let's see what we found, and how you can steer models too! (Of course, for this demo, we'll be using an open-source model.)
 
import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'

<Image src={ggc_snowhite} alt="One of the many examples of Golden Gate Claude conversations"
caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />

For context, since Golden Gate Claude, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma; @gao2024scaling]. Steering activations has sparked the interest of many: see for instance [the value of steering](https://thezvi.substack.com/i/144959102/the-value-of-steering) by Zvi Mowshowitz, or [Feature Steering for Reliable and Expressive AI Engineering](https://www.goodfire.ai/blog/feature-steering-for-reliable-and-expressive-ai-engineering) by Goodfire AI. However, the recent AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*.

The aim of this article is to investigate if and how SAEs can indeed be used to reproduce **Golden Gate Claude, but with a lightweight open-source model**. For this, we'll use *Llama 3.1 8B Instruct*, but since I live in Paris... let's make it obsessed with the Eiffel Tower! As we'll see together, it's not as trivial as one might think!

Note: While we focus on a single, concrete example (the Eiffel Tower), our goal is to establish a methodology for systematically evaluating and optimizing SAE steering, which could then be applied to other models and concepts.

  **Our main findings (we'll explain all in detail below):**
<Note title="" variant="success">
- **The steering 'sweet spot' is small.** The optimal steering strength is of the order of half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations. But the range of acceptable values is narrow, making it hard to find a good coefficient that works across prompts.
- **Clamping is more effective than adding.** We found that clamping activations at a fixed value improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
- **More features don't necessarily mean better steering.** Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features leads to more robust control.
- **SAE steering shows promise, but prompting is still king.** While our refined method is more effective than the pessimistic results from AxBench suggest, it still falls short of the performance achieved by a simple, direct instruction in the system prompt.
</Note>
@@ -68,7 +73,7 @@ Note: While we focus on a single, concrete example — the Eiffel Tower — our

### 1.1 Model steering and sparse autoencoders

Steering a model consists of modifying its internal activations *at inference* in order to change its behavior while it generates new text.
This differs from fine-tuning, where you modify the weights of a base model through extra training to obtain a new model with the desired behavior.

Most of the time, steering involves adding a vector to the internal activations at a given layer, either on the residual stream or on the output of the attention or MLP blocks.
@@ -76,55 +81,44 @@ More specifically, if $x^l$ is the vector of activation at layer $l$, steering c
$$
x^l \to x^l + \alpha v.
$$
The steering vector $v$ is generally chosen to represent a certain concept, and the steering coefficient $\alpha$ controls the strength of the intervention.

At this point, you surely wonder... how do I find a suitable steering vector $v$ that represents my desired concept?

A naive approach would be to compute a steering vector from the difference of average activations between two sets of prompts (one set representing the concept, the other not).

However, a more principled approach relies on **sparse autoencoders (SAEs)**, trained to learn a sparse representation of the internal activations of a model in an unsupervised manner! (See TODO:REF for details on how to train SAEs.) The idea is that the learned representation will capture the main features of the activations, and that some of those features will correspond to meaningful concepts.

Once trained, an SAE provides a dictionary of interesting features, each represented by a vector in the original activation space, but... those features do not come with labels or meanings.

To identify the meaning of a feature, we can do two things:
- look at the logits it tends to promote (TODO: EXPLAIN);
- look at the prompts that lead to the highest activations of that feature.

This interpretation step is tedious, but it can be greatly facilitated by auto-interpretability techniques based on large language models (TODO: HOW?).

Once you have identified relevant features, you can use them to steer your original LLM toward the related concept, using the columns of the decoder matrix, which are vectors in the original activation space. (TODO: ADD schematic)
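To make this concrete, here is a minimal sketch of how a decoder column becomes a steering vector, on toy NumPy arrays (the sizes, the feature index, and the variable names are illustrative, not taken from any published SAE):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 16, 64                    # toy sizes; Llama 3.1 8B uses 4096 and 131,072
W_dec = rng.normal(size=(n_features, d_model))  # SAE decoder matrix: one direction per feature

feature_id = 42                   # hypothetical feature representing our concept
v = W_dec[feature_id]
v = v / np.linalg.norm(v)         # normalize so alpha is expressed in activation-norm units

x = rng.normal(size=(5, d_model))  # residual-stream activations for 5 token positions
alpha = 8.0                        # steering coefficient

x_steered = x + alpha * v          # additive steering: x^l -> x^l + alpha * v

# Every position is shifted by the same vector, whose norm is exactly alpha.
print(np.allclose(np.linalg.norm(x_steered - x, axis=-1), alpha))  # True
```

In a real run, `x` would be read from (and written back to) the model's residual stream during generation, rather than sampled at random.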
### 1.2 Neuronpedia

To experience steering a model yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode.

Neuronpedia is made to share research results in mechanistic interpretability, and it offers the possibility to experiment with and steer open-source models using publicly shared, pre-trained SAEs.

Let's do this together step by step, using Llama 3.1 8B Instruct and [SAEs published by Andy Arditi](https://huggingface.co/andyrdt/saes-llama-3.1-8b-instruct). (In detail, those SAEs have been trained on the residual-stream output at layers 3, 7, 11, 15, 19, 23 and 27, with a 131,072-feature dictionary for a representation space of dimension 4096 (an expansion factor of 32) and BatchTopK $k = 64$; see [Finding "misaligned persona" features in open-weight models](https://www.lesswrong.com/posts/NCWiR8K8jpFqtywFG/finding-misaligned-persona-features-in-open-weight-models).)

Using the search interface on Neuronpedia, we can literally look for candidate features representing the Eiffel Tower! With a simple search, it looks like such features can be found in layers 3 to 27 (so most of Llama 3.1 8B's 32 layers).
 
According to analyses by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens. So the common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to represent higher-level abstract concepts. Anthropic mentioned that for their Golden Gate demo they used a feature located in a middle layer, but they didn't disclose which one, since their architecture is not public.

Since Llama 3.1 8B has 32 layers, let's take a look in the middle too and focus on layer 15. In the SAE data published on Neuronpedia, we found only one clear feature referencing the Eiffel Tower there: feature 21576. The corresponding Neuronpedia page is included below. In particular, we can see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.

<iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 100%;"></iframe>

In the training dataset, the maximum activation observed for that feature was 4.77.

Thanks to the Neuronpedia interface, you can try steering a feature and experience a conversation with the corresponding model.
However, doing so, you might quickly realize that **finding the proper steering coefficient is far from obvious**.

Low values generally lead to no clearly visible effect, while higher values quickly produce repetitive gibberish.
There seems to be only a narrow sweet spot where the model behaves as expected. Unfortunately, this spot seems to depend on the nature of the prompt.

For instance, we can see below that on the "*Who are you?*" prompt, steering with coefficient 8.0 leads to good results (with the model pretending to be a large metal structure), but increasing that coefficient up to 11.0 leads to repetitive gibberish on the exact same prompt.

However, things are not as clear with a different input. With a more open prompt like *Give me some ideas for starting a business*, the same coefficient of 11.0 leads to a clear mention of the Eiffel Tower, while a coefficient of 8.0 has no obvious effect (although the model seems vaguely inspired by French food and culture).

<HtmlEmbed src="d3-first-experiments.html" data="first_experiments.csv" />

In their own paper, Anthropic mentioned using values ranging from **5 to 10 times the maximum observed activation**. In our case, the maximum observed activation is 4.77, so that would mean using values between about 25 and 50. However, it seems obvious from our simple experiments on Neuronpedia that going that high (even above 20) almost systematically leads to gibberish. It is not clear to us why Anthropic could use such high values without breaking the model's generation.

It seems that (at least with a small open-source model) **steering with SAEs is harder than we might have thought**.
 
@@ -139,7 +133,7 @@ To quote their conclusion:
Our evaluation shows that even at SAE scale, representation steering is still ***far behind*** simple prompting and fine-tuning baselines.
</Quote>

That statement seems hard to reconcile with the effectiveness of the Golden Gate Claude demo.
Is it because Anthropic used a much larger model (Claude 3 Sonnet)?
Or because they carefully selected a feature that was particularly well suited to the task?
 
@@ -155,21 +149,19 @@ However, for this, we will need rigorous metrics to evaluate the quality of our

## 2. Metrics, we need metrics!

To assess the quality of a steered model such as our *Eiffel Tower Llama*, we cannot rely solely on our subjective impressions.
In particular, since we will have to choose a good value for the steering strength, we need some evaluation metrics.
First, let's not reinvent the wheel, and use the same metrics as AxBench.

### 2.1 The AxBench LLM-judge metrics

The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed judging the performance of a steering technique using an LLM-as-a-judge.
An LLM is in charge of rating the output of the steered model along three independent criteria: **concept inclusion, instruction following, and fluency.**

For that, they prompted *GPT-4o mini* to act as a judge and assess independently whether the provided answer to an instruction:
- references the steered concept (in our case, the Eiffel Tower);
- is a reasonable answer to the instruction;
- exhibits a high level of fluency.

For each of those three criteria, the LLM was instructed to reason over the case and provide a discrete grade of 0, 1 or 2.

We decided to use an identical approach, using the more recent open-source model *GPT-OSS*, which has shown strong capabilities in reasoning tasks, superior to GPT-4o mini on many benchmarks. Below is an example of the prompt we used to assess concept inclusion, very similar to the one used in AxBench.
175
 
@@ -180,7 +172,7 @@ concept is presented in the text fragment provided below.
Focus solely on whether the concept is clearly incorporated, without
regard for grammar or logical coherence.
Rate the concept’s relevance on a scale from 0 to 2, where 0
indicates the concept is not present at all, 1 indicates the concept
is somewhat present but minimally or awkwardly incorporated, and 2
indicates the concept is more fully and effectively incorporated,
with stronger and more natural integration.
@@ -198,7 +190,7 @@ Similar prompts are used for fluency and instruction following, leading to our t

Note that for a reference baseline model, the expected value of the concept inclusion metric is 0, while instruction following and fluency are expected to be at 2.0 (in practice, we noticed that the fluency of the reference model is rated slightly below 2.0).

To summarize the performance of a steering method, the AxBench paper suggested using **the harmonic mean of those three metrics**.
Since a zero in any of the individual metrics leads to a zero harmonic mean, the underlying idea of this aggregate is to heavily penalize methods that perform poorly on at least one of the metrics.

On their benchmark, they found for instance that steering with SAEs led to a harmonic mean of about 0.2, much lower than simple baselines like prompting at about 0.9 (for a maximum of 2.0).
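As a quick illustration of this zero-penalizing behavior, here is a small helper of our own (not AxBench code):

```python
def harmonic_mean(scores):
    """Harmonic mean of LLM-judge scores; collapses to zero if any score is zero."""
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

# A steered answer that nails the concept but loses some fluency...
print(harmonic_mean([2, 2, 1]))  # 1.5
# ...still beats an answer that never mentions the concept at all.
print(harmonic_mean([0, 2, 2]))  # 0.0
```

With grades in {0, 1, 2}, this aggregate can only take the five values 0.0, 1.0, 1.2, 1.5 and 2.0 mentioned below.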
@@ -215,7 +207,7 @@ We used the simple system prompt *"You are a helpful assistant."* for all our ex

### 2.3 Auxiliary quantitative metrics

Although LLM-judge metrics provide a recognized assessment of answer quality, they have two drawbacks.
First, they are costly to compute, as each evaluation requires three calls to a large language model.
Second, their scale is discrete and limited to three values, which makes them hard to use as a target for numerical optimization. Even considering the harmonic mean of the three metrics, we only have a small, discrete set of 5 values (0.0, 1.0, 1.2, 1.5, 2.0).
 
@@ -223,23 +215,23 @@ Because of this, we considered **auxiliary metrics that could help us monitor th

#### 2.3.1 Surprise within the reference model

Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had *a low probability under the reference model*.
We therefore decided to monitor **the negative log probability (per token) of the steered model's answers under the reference model**, which quantifies how surprising they are to the reference model. (This is essentially the cross-entropy between the output distribution of the steered model and the reference model, i.e. the "cross" component of the KL divergence.)

Although the negative log probability seems an interesting metric to monitor, note that we don't necessarily want to push it to extreme values. On the one hand, a low value would indicate answers that would hardly have been surprising under the reference model. On the other hand, very high values might indicate gibberish or incoherent answers that do not follow the instruction.
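A toy re-implementation of the computation (our own helper names; in practice the logits come from a forward pass of the reference model over the steered answer):

```python
import numpy as np

def mean_neg_logprob(ref_logits, token_ids):
    """Average negative log probability (per token) of a generated
    sequence under the reference model's next-token logits."""
    logits = ref_logits - ref_logits.max(axis=-1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))  # log-softmax
    picked = log_probs[np.arange(len(token_ids)), token_ids]
    return -picked.mean()

# Toy example: 2 generation steps over a 3-token vocabulary.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
ref_logits = np.log(probs)

nll_expected = mean_neg_logprob(ref_logits, np.array([0, 1]))  # tokens the reference favors
nll_surprise = mean_neg_logprob(ref_logits, np.array([2, 2]))  # a surprising continuation
print(nll_expected < nll_surprise)  # True
```

A steered answer should land somewhere between these two regimes: more surprising than the reference model's own continuation, but far from gibberish-level surprise.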
 
#### 2.3.2 n-gram repetition

We can see from our experiments on Neuronpedia that steering too hard often leads to repetitive gibberish.
To detect this, we decided to monitor **the fraction of repeated n-grams in the answers**.
Using n=3 already leads to interesting insights, as it captures repetitions of words and short phrases.
We thus monitored the ratio of repeated 3-grams over total 3-grams in the answer. A value of 0.0 means that there is no repetition at all. For short answers, values above 0.2 tend to correspond to annoying repetitions that impair the fluency of the answer.
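A minimal version of this metric (our own helper, counting word-level n-grams) could look like:

```python
def repeated_ngram_fraction(text, n=3):
    """Fraction of n-grams (over words) that already appeared earlier in the answer."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

print(repeated_ngram_fraction("the tower is tall and the view is great"))  # 0.0
print(repeated_ngram_fraction("E E E E E E E E"))  # heavily repetitive, close to 1.0
```

A production version would typically count n-grams over tokens rather than whitespace-split words, but the behavior is the same.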
 
#### 2.3.3 Explicit concept inclusion

Finally, as an objective auxiliary metric to monitor concept inclusion, we simply looked for **the occurrence of the word *eiffel* in the answer** (case-insensitive).
We acknowledge that this is a very crude metric, and probably too pessimistic, as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*
(for instance, when referring to *a large metal structure built in Paris*). Of course, as this metric is hard to generalize to other concepts, we will not use it beyond simple monitoring.
 
244
 
## 3. Optimizing steering coefficient for a single feature
@@ -257,7 +249,7 @@ To find the optimal coefficient, we performed a sweep over a range of values for

### 3.1 Steering with nnsight

We used the `nnsight` library to perform the steering and generation.
This library, developed by NDIF, makes it easy to monitor and manipulate the internal activations of transformer models during generation. Example code is shown in the Appendix.

### 3.2 Range of steering coefficients
@@ -268,19 +260,19 @@ To avoid completely disrupting the activations during steering, we expect the ma
$$
||\alpha v|| \lesssim ||x^l||
$$
where $||.||$ is the Euclidean norm, $x^l$ the activation at layer $l$, $v$ the steering vector (a column of the decoder matrix), and $\alpha$ the steering coefficient.

If we use normalized steering vectors, i.e. $||v||=1$, this means that we should choose $\alpha$ of the order of the norm of the activation at layer $l$.

So **to choose a suitable range for the sweep over $\alpha$, we have to know the *original distribution of activation magnitudes* in the model**.

For our model, Llama 3.1 8B Instruct, this is shown below for a typical prompt (the first few lines of Moby Dick).

import activations_magnitude from './assets/image/activations_magnitude.png'

<Image src={activations_magnitude} alt="Left: Activation norm per token for each of the 32 layers. Right: Average activation norm on a given layer. Average norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" caption="Left: Activation norm per token for each of the 32 layers. Right: Average activation norm on a given layer. Average norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" />

As we can see, activation norms grow roughly linearly across layers, with the norm being of the order of the layer index.
If we want a steering coefficient that is typically less than the norm of the original activation vector at layer $l$,
we can define a reduced coefficient and restrict our search to:
286
 
@@ -291,36 +283,36 @@

### 3.3 Results of a 1D grid search sweep

For a first grid search, we used the set of 50 prompts; temperature was set to 1.0 and the maximum number of generated tokens to 256.

The image below shows, for each of our six metrics, the results of the sweep over $\alpha$ for feature 21576 in layer 15.
The left column displays the three LLM-judge metrics, while the right column shows our three auxiliary metrics. On those charts, we can observe several regimes, corresponding to essentially three ranges of the steering coefficient.

<HtmlEmbed src="d3-sweep-1d-metrics.html" data="stats_L15F21576.csv" />

First of all, **for low values of the steering coefficient ($\alpha < 5$), the steered model behaves almost like the reference model**:
the concept inclusion metric is zero, and instruction following and fluency are close to 2.0, equivalent to the reference model.
The surprise under the reference model is at the reference level, and there is a minimal amount of repetition.

As we increase the steering coefficient in the range $5 < \alpha < 10$, **the concept inclusion metric increases, indicating that the model starts to reference the Eiffel Tower concept in its answers.
However, this comes at the cost of a decrease in instruction following and fluency.**
The decrease of those metrics occurs rather abruptly, indicating a threshold effect.
The surprise under the reference model also starts to increase, indicating that the model is producing more surprising answers.
The repetition metric increases, alongside the decrease in fluency.
We can notice that **the threshold is around $\alpha = 7$-$9$, which is roughly half the typical activation magnitude at that layer** (15).
This reveals that, in this case, a steering coefficient of about half the original activation magnitude is what is required to significantly change the behavior of the model.

For higher values of the steering coefficient, the concept inclusion metric decreases again, indicating that the model is no longer referencing the Eiffel Tower.
Fluency and instruction following plummet to zero, as the model is producing gibberish, which is confirmed by the repetition metric.
Inspection of the answers shows that the model is producing repetitive patterns like "E E E E E ...".

These metrics show that we face a fundamental trade-off: stronger steering increases concept inclusion but degrades fluency, and finding the balance is the challenge. This is further complicated by the very large standard deviation: **for a given steering coefficient, some prompts lead to good results while others completely fail.** Even though all metrics somehow tell the same story, we have to decide how to select the optimal steering coefficient. We could simply use the mean of the three LLM-judge metrics, but we can easily see that this would lead us to select the unsteered model (low $\alpha$) as the best model, which is not what we want. Instead, we can use **the harmonic mean criterion proposed by AxBench**. These two ways of aggregating the three LLM-judge metrics are shown below as a function of the steering coefficient.

<HtmlEmbed src="d3-harmonic-mean.html" data="stats_L15F21576.csv" />

First, the results show that the harmonic mean curve is very noisy. Despite the fact that we used 50 prompts to evaluate each point, the inherent discreteness of the LLM-judge metrics and the stochasticity of LLM generation lead to a large variance. This is something to keep in mind when trying to optimize steering coefficients.

Still, from that curve, we can select the optimal $\alpha = 8.5$. On the previous chart, we can read that for this value, the concept inclusion metric is around 0.75, while instruction following is 1.5 and fluency around 1.0.

Even with this optimal coefficient, these values are hardly satisfactory, indicating that the model struggles to reference the concept while maintaining a reasonable level of fluency and instruction following.
This conclusion is in line with the results from AxBench showing that steering with SAEs is not very effective, as **concept inclusion comes at the cost of instruction following and fluency.**
@@ -337,14 +329,14 @@ Using the optimal steering coefficient $\alpha=8.5$ found previously, we perform

<HtmlEmbed src="d3-evaluation-configurable.html" data="evaluation_summary.json" config="naive" />

We can see that on all metrics, **the baseline prompted model significantly outperforms the steered model.** This is consistent with the findings by AxBench that steering with SAEs is not very effective. However, our numbers are not as dire as theirs. The steered model reaches an average concept inclusion score of 1.03 (against 0 for the reference model), while maintaining a reasonable level of instruction following (1.35). However, this comes at the price of a fluency drop (0.78 vs. 1.55 for the prompted model), as fluency is impaired by repetitions (0.27) or awkward phrasing.

Overall, the harmonic mean of the three LLM-judge metrics is 1.67 for the prompted model, against 0.44 for the steered model.

<Note title="A word on statistical significance" type="info">
As can be seen on the bar chart, the noisy evaluation leads to frighteningly large error bars, especially for the LLM-judge metrics and the harmonic mean. It is thus worth briefly discussing the statistical significance of those results.

The relevant quantity is the *effect size*, i.e. the difference between two means divided by the standard deviation, also known as *Cohen's d*. In general, for a two-sample t-test with a total of $N$ samples across both groups, the critical effect size to reach significance at level $p < 0.05$ is $d_c = 1.96 \times 2/\sqrt{N}$.

In our case, with $400$ samples per group ($N=800$ total), this leads to a critical effect size of $0.14$. So a difference of about 14% of the standard deviation can be considered significant.
</Note>
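The critical effect size quoted in the note is easy to check numerically (using the large-sample normal approximation):

```python
import math

def critical_cohens_d(n_total, z=1.96):
    """Smallest Cohen's d detectable at p < 0.05 (two-sided) for a
    two-sample t-test with n_total samples split evenly across groups."""
    return z * 2 / math.sqrt(n_total)

print(round(critical_cohens_d(800), 2))  # 0.14
```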
@@ -360,23 +352,23 @@ First, **LLM instruction following and fluency are highly correlated** (0.8), wh
capture the overall quality of the answer.
However, as observed in our results, they are unfortunately **anticorrelated with concept inclusion**, showing the tradeoff between steering strength and answer quality.

The explicit inclusion metric (presence of the word *'eiffel'*) is only partially correlated with the LLM-judge concept inclusion metric (0.45), showing that the model can apparently reference the Eiffel Tower without explicitly mentioning it (we also noticed that *Eiffel* was sometimes misspelled, which was still considered a valid reference by the LLM judge).

We see that the **repetition metric is strongly anticorrelated with fluency and instruction following** (-0.9 for both).

Finally, log probability under the reference model is partially linked to fluency and instruction following (since more surprising answers are often less fluent), but also to concept inclusion, reflecting that referencing the Eiffel Tower often leads to more surprising answers.

From this analysis, we can see that **although the LLM-as-a-judge metrics are the most reliable, the auxiliary metrics can provide useful information about the quality of the answers**.
This is interesting, as it means we can use them as a guide for optimization without having to always rely on costly LLM evaluations, even if the final evaluation still has to be done with LLM-judge metrics.
 
## 4. Steering and generation improvements

Having found optimal coefficients, we now investigate two complementary improvements that address the failure modes we identified: clamping, to ensure consistent activations, and a repetition penalty, to prevent the gibberish mode.

First, we tried clamping the activations rather than using the natural additive scheme.
Intuitively, this could have two benefits. On the one hand, it prevents the model from reaching excessively high activations; in the additive scheme, those could result from steering on top of normal activations that might already be high because of the influence of the previous tokens output by the model. On the other hand, clamping ensures that the feature is always activated at a certain level. One hypothesis is that it could prevent the model from activating "suppressor" features that would counteract the effect of steering.

This clamping approach was the one used by Anthropic in their Golden Gate demo, but the AxBench paper reported that in their case it was less effective than the additive scheme. We decided to test it on our case.
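The difference between the two schemes can be sketched on toy vectors (this mirrors the idea only approximately: a real implementation reads the feature's activation through the SAE encoder rather than a simple dot product, and all names here are ours):

```python
import numpy as np

def steer_add(x, v, alpha):
    """Additive steering: shift the activation along v by a fixed amount."""
    return x + alpha * v

def steer_clamp(x, v, target):
    """Clamping: force the activation's component along (unit-norm) v
    to a fixed value, whatever it was before."""
    current = x @ v
    return x + (target - current) * v

rng = np.random.default_rng(0)
v = rng.normal(size=8); v /= np.linalg.norm(v)
x = rng.normal(size=8)

# After clamping, the feature reads exactly `target`, independently of x...
print(np.isclose(steer_clamp(x, v, 8.0) @ v, 8.0))        # True
# ...while adding shifts it relative to its current (prompt-dependent) value.
print(np.isclose(steer_add(x, v, 8.0) @ v, x @ v + 8.0))  # True
```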

### 4.1 Clamping
@@ -384,7 +376,7 @@ We tested the impact of clamping on the same steering vector at the optimal stee

<HtmlEmbed src="d3-evaluation-configurable.html" data="evaluation_summary.json" config="clamp" />

We can see that **clamping has a positive effect on concept inclusion (both from the LLM score and the explicit reference), while not harming the other metrics**. The fact that concept inclusion (but not fluency or instruction following) is improved suggests that **clamping might help counteract some suppressor features that would prevent the Eiffel Tower concept from being fully activated**, but proving this hypothesis would require further investigation.

We therefore opted for clamping, in line with the choice made by Anthropic. This contrasts with the findings from AxBench, and might be due to the different model or concept used.

@@ -397,73 +389,73 @@ We found that clamping activations improves concept inclusion without harming fl

We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
To mitigate this, we tried applying a lower temperature (0.5) and a repetition penalty during generation.
This is a simple technique that consists of penalizing the logits of tokens that have already been generated, preventing the model from repeating itself.
We used a penalty factor of 1.1 via the `repetition_penalty` parameter of the generation process in 🤗 Transformers (an implementation of the repetition penalty described in the [CTRL paper](https://arxiv.org/abs/1909.05858)).
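For reference, the CTRL-style rule rescales the logits of previously generated tokens before sampling. A minimal re-implementation of the rule itself (not the 🤗 Transformers code):

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """CTRL-style penalty: divide positive logits of already-generated tokens
    by `penalty`, multiply negative ones (both lower their probability)."""
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = [2.0, -1.0, 0.5]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=1.1)
print(penalized)  # [1.8181..., -1.1, 0.5] — only tokens 0 and 1 are penalized
```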

As we can see, applying a repetition penalty reduces the 3-gram repetition as expected, and has **a clear positive effect on fluency, while not harming concept inclusion and instruction following.**

(Note that the AxBench paper mentioned the repetition penalty but did not use it, considering it *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*; see their appendix K.)

<Note title="Lower temperature and repetition penalty improve model fluency and instruction following" variant="success">
Using a lower temperature (0.5) and applying a modest repetition penalty (1.1) during generation significantly reduces repetitions in the output. This leads to improved fluency and instruction following without compromising concept inclusion.
</Note>
410
 
411
 
412
  ## 5. Multi-Layer optimization
413
 
414
- Even after those improvements, we still found that steering with a single SAE feature was not very effective, with concept inclusion lying well below the maximum possible value of 2.0.
- Since our investigation on Neuronpedia revealed that **the Eiffel Tower concept was represented by many features in different layers**, we hypothesized that steering several of those features simultaneously could lead to better results.

- Indeed, it has been reported that **feature redundancy and feature splitting** are common phenomena. These happen when a concept is represented by several features that are often co-activated or are responsible for the same concept in slightly different contexts. The sparsity constraint used during SAE training tends to favor such splitting, as it is often more efficient to use several features that activate less often than a single feature that would activate more often.

- These phenomena suggest that **steering only one of those features might thus be insufficient to fully activate the concept, or to activate it consistently across different prompts.** Moreover, activating one feature without the others might cause a loss of fluency, as the model might experience activation patterns that are out of distribution compared to what it was trained on.
 
  ### 5.1 Layer and feature selection
  In total, **we identified 19 candidate features**, located in layers 3, 7, 11, 15, 19, 23, and 27. Note that those layers were the only ones for which SAEs were available, so it is likely that other features representing the Eiffel Tower exist in other layers.

- We looked for those features using the search tool in Neuronpedia, and selected them based on their top activating prompts in the dataset. We kept only those features that unambiguously referenced the Eiffel Tower, and discarded features that seemed to be more generally about Paris, towers, famous landmarks in big cities, or simply tokens like "E" or "iff".

- Among those 19 features, we selected all the features located in the intermediate layers 11, 15, 19 and 23. We decided to leave out features in earlier layers (six features in layer 3 and three features in layer 7) or later layers (two features in layer 27). This choice is motivated by the observation that features in intermediate layers are more likely to represent abstract high-level concepts. This led us to select 8 candidate features for our multi-layer steering.
 
  ### 5.2 Optimization methodology

- Finding the optimal steering coefficients for multiple features is a challenging optimization problem:
  - First, the parameter space grows with the number of features, making grid search quickly intractable.
  - Second, the target function (the harmonic mean of LLM-judge metrics) is noisy and non-differentiable, making gradient-based optimization impossible.
  - Finally, evaluating the target function is costly, as it requires generating answers from the steered model and evaluating them with an LLM judge.

- To tackle those challenges, we decided to rely on **Bayesian optimization** to search for the optimal steering coefficients, and we devised an auxiliary cost function to guide the optimization when the harmonic mean is zero and hence non-informative.

  #### 5.2.1 Cost function

  Following the AxBench paper, we decided to look for steering coefficients that would maximize the harmonic mean of the three LLM-judge metrics. However, this metric can be difficult to optimize directly, as it is discrete and takes a zero value whenever any one of the three metrics is zero. This can make it hard to explore the parameter space.

- To mitigate that, we decided to define an auxiliary cost function that would be used when the harmonic mean is zero. Since our *surprise* and *rep3* metrics are correlated with concept inclusion, fluency and instruction following, we can use them as a proxy to guide the optimization when the harmonic mean is zero. We considered an auxiliary cost function of the form
  $$
  \mathrm{cost} = |\mathrm{surprise} - s_0| + k\,\mathrm{rep3}
  $$
- We selected the target surprise $s_0$ and weight $k$ such that this cost maximally correlates with the mean of the LLM-judge metrics (leading to $s_0 = 1.2$ and $k=3$).

- Overall, our cost function was defined as the harmonic mean of LLM-judge metrics, and we penalized it with a small fraction (0.05) of the auxiliary cost when the harmonic mean was zero, in order to give some signal to the optimizer.
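Putting the pieces together, the objective can be sketched as follows (function and variable names are ours; the constants are the values quoted above):

```python
def harmonic_mean(scores):
    """Harmonic mean of the LLM-judge scores; zero as soon as any score is zero."""
    return 0.0 if 0 in scores else len(scores) / sum(1.0 / s for s in scores)

def steering_objective(judge_scores, surprise, rep3, s0=1.2, k=3.0, frac=0.05):
    """Objective to optimize: the harmonic mean of the judge metrics, with a
    small fraction of the auxiliary cost |surprise - s0| + k * rep3 used as a
    fallback signal whenever the harmonic mean is zero."""
    hm = harmonic_mean(judge_scores)
    aux_cost = abs(surprise - s0) + k * rep3
    return hm if hm > 0 else -frac * aux_cost
```

For instance, `steering_objective([1, 2, 2], surprise=1.0, rep3=0.0)` simply returns the harmonic mean (1.5), while a point with a zero judge score still yields a small, informative negative value instead of a flat zero.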

  #### 5.2.2 Dealing with noise

- In principle, we want to minimize *the expected value of our target function over the distribution of prompts and samples*.
  However, each call to the steered model will effectively only give a noisy estimate of that target, evaluated on a single prompt and one sample.

- We are thus in a black-box optimization setting, where each evaluation of the target function is costly (as it involves generating a full answer from the model) and noisy (as it depends on the prompt and the sample). To tackle this, we decided to rely on **Bayesian optimization**.

- Bayesian optimization (BO) is known to be well-suited for multidimensional, non-differentiable, costly black-box optimization, while being able to handle noisy evaluations. To mitigate the noise, we could have averaged the target function over several prompts and samples, but this would have been costly, especially when evaluating points that are not promising. For very noisy functions, performing Bayesian optimization directly on the raw function is known to be more effective than averaging multiple noisy evaluations for each point.

  #### 5.2.3 Bayesian optimization

- The idea behind BO is to build a surrogate model of the target function using a Gaussian process (GP), and use that surrogate to select promising candidates to evaluate next. As we evaluate new points, we update the GP model, and iteratively refine our surrogate of the target function.

- For that, we used the [BoTorch library](https://botorch.org), which provides a flexible framework to perform BO using PyTorch. More details are given in the appendix.

  ### 5.3 Results of multi-layer optimization

- We first performed optimization using only 2 features (from layers 15 and 19) and then 8 features (from layers 11, 15, 19 and 23), following the idea that steering the upper-middle layers is likely to be more effective for activating high-level concepts.

  Results are shown below and compared to single-layer steering.
 
@@ -471,13 +463,13 @@ Results are shown below and compared to single-layer steering.

  As we can see on the chart, steering 2 or even 8 features simultaneously leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-layer steering.

- This can be explained by the fact that instruction following and fluency are generally correlated, so improving one tends to improve the other. Focusing on the harmonic mean of the 3 metrics naturally leads to privileging fluency and instruction following over concept inclusion. Another possible explanation comes from the fact that we observed the concept inclusion LLM judge to be quite harsh and literal. Sometimes mentions of Paris or of a large metal structure were not considered valid references to the Eiffel Tower, which could explain the low concept inclusion scores.

- Overall, **those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency**.

- One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize in the high-dimensional space.

- Another plausible explanation could be that **the selected features are actually redundant rather than complementary**, and that steering one of them is sufficient to fully activate the concept. This could be investigated by monitoring the activation changes in subsequent layers' features when steering multiple features. For instance, for the features located on layers 15 and 19, anecdotal evidence from Neuronpedia's top activating examples reveals several common prompts, suggesting redundancy rather than complementarity.

  <Note title="More features don't necessarily mean better steering." variant="success">
  Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features leads to a more robust control.
@@ -487,32 +479,34 @@ Another plausible explanation could be that **the selected features are actually

  ### 6.1 Main conclusions

- In this study, we have shown how to use sparse autoencoders to steer a lightweight open-source model (Llama 3.1 8B Instruct) to create a conversational agent obsessed with the Eiffel Tower, similar to the Golden Gate Claude experiment. As reported by the AxBench paper, and as can be experienced on Neuronpedia, steering with SAEs is harder than we might have thought, and finding good steering coefficients is not easy.

- We first showed that simple improvements like clamping feature activations and using a repetition penalty and a lower temperature can help significantly. We then devised a systematic approach to optimize steering coefficients using Bayesian optimization, and auxiliary metrics correlated with LLM-judge metrics.

  Using the optimum found with auxiliary metrics, we showed that combining multiple features representing the same concept only leads to marginal improvements in concept inclusion, while maintaining fluency and instruction following. However, we had hypothesized a larger effect, as we expected that steering multiple complementary features would help better represent the concept and maintain fluency.

- A way to explain this lack of improvement could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept. Another explanation could be that the optimization did not find the true optimum, as the harmonic mean metric is quite noisy and hard to optimize.

- Overall, our results are in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method. Our results also seem less discouraging than those of AxBench, and show that steering with SAEs can be effective, using clamping, a slightly different generation procedure and possibly combining multiple features. However, at this stage, those results are hard to generalize and our work is not directly comparable to the AxBench results, since they use a different model, different concepts, and different SAEs.
 
- ### 6.2 Opportunities for future work

- This investigation opens several avenues for future work, which could not only help find good procedures for steering with SAEs, but also reveal fundamental insights about activation patterns in LLMs. Among them:

- - **Investigate clamping:** Why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could hypothesize it prevents extreme activations, but it could also counteract some negative feedback behavior, when other parts of the model activate suppressor features to try to compensate for the added steering vector. Can we draw an analogy with biology, where signaling pathways are often regulated by negative feedback loops? An interesting direction could be to analyze the cases where the model tries to "backtrack", e.g. outputting *"I'm the Eiffel Tower. No, actually I'm not."* By analyzing the activations just before the "No", can we highlight some *regulatory/suppressor features* that try to suppress the Eiffel Tower concept when it has been overactivated?
- - **Why does steering multiple features achieve only marginal improvement?** Check complementarity vs. redundancy of multiple features by monitoring activation changes in subsequent layers' features.
- - **Failure analysis** on the cases where steering fails (about 20% have at least one zero metric). Is there a pattern?
- - **Check other layers for 1D optimization, other concepts and other models**, to see if some layers are better than others. In particular, try to include earlier and later layers, to see if it helps the multi-layer steering.
- - **Vary the temporal steering pattern:** steer only the prompt, or the generated answer only, or use some kind of periodic steering.
- - **Investigate wording in the "prompt engineering" case.** For now, the model seems to really behave like it has to check a box, rather than actually integrating the concept in a natural way. Can we make it better? Does it show up in the activation pattern? For instance, after mentioning the Eiffel Tower, does the model activate regulatory features to prevent further mentions?
 
  ---

  **Code** is available [here](https://github.com/scienceetonnante/eiffel-tower-llama)

- **Thanks** to the NDIF team and especially Jaden Fiotto-Kaufman for help using `nnsight`, to Thom Wolf and Leandro von Werra for useful discussions, and to Thibaud Frere for help using his fabulous [Bringing Paper To Life](https://huggingface.co/spaces/tfrere/research-article-template) blog post template.

  ---
 
@@ -540,7 +534,7 @@ answer = llm.tokenizer.decode(trace[0][len(input_ids):], skip_special_tokens=Tru

  We considered a simple Gaussian process (GP) model with an RBF kernel.
  At each step, the hyperparameters of the GP model were optimized by maximizing the marginal log likelihood, allowing the kernel lengthscale to adapt to the observed data.
- At each step, we selected a promising candidate using the `qNoisyExpectedImprovement` acquisition function, which balances exploration and exploitation. This acquisition function is well-suited for noisy functions, as it takes into account the noise in the observations.

  For domain search, as we know that activation magnitude grows roughly linearly with layer index, we expect that the optimal steering coefficient for a feature in layer $l$ should scale with $l$.
  We used the reduced parameterization presented earlier, searching for an optimal value in the range $[0,1]$:
@@ -552,5 +546,5 @@ We used the reduced parameterization presented earlier, searching for an optimal
  To favor noise reduction at promising locations, every 5 steps we decided to resample the best point found so far.
  In that case, by *best* we mean the point with the lowest GP posterior mean $\mu(x)$. (Note that this is different from the point with the lowest observed value, which might be a lucky noisy outlier.)

- Performing gradient descent on the GP posterior is very cheap, since it only involves differentiating the kernel function.
  We thus performed gradient descent starting from 500 random points in the parameter space, using the upper confidence bound $\mu(x) + \beta\sigma(x)$ as the target, to favor points that are not only predicted to be good, but also have low uncertainty. We then performed a clustering to group together the points that converged to the same local minimum, and selected the best cluster as the candidate for evaluation.
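To illustrate the surrogate at the core of this procedure, here is a toy 1D GP posterior with an RBF kernel written in NumPy, a highly simplified stand-in for what BoTorch provides (lengthscale and noise values here are arbitrary):

```python
import numpy as np

def rbf(a, b, ls=0.2):
    """RBF kernel matrix between two sets of 1D points."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-2, ls=0.2):
    """Posterior mean and std of a zero-mean GP with unit prior variance."""
    K = rbf(x_train, x_train, ls) + noise * np.eye(len(x_train))
    Ks = rbf(x_query, x_train, ls)
    mean = Ks @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

# Near an observation the GP is confident; far away it reverts to the prior
# (mean 0, std 1), which is what lets the acquisition function explore.
x_obs = np.array([0.2, 0.5, 0.8])
y_obs = np.array([0.3, 1.0, 0.1])
mu, sigma = gp_posterior(x_obs, y_obs, np.array([0.5, 5.0]))
```

The posterior mean $\mu(x)$ and standard deviation $\sigma(x)$ computed this way are exactly the quantities entering the confidence bound $\mu(x) + \beta\sigma(x)$ used for candidate selection.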
 
  ---
  title: "The Eiffel Tower Llama"
+ subtitle: "Reproducing the Golden Gate Claude experiment with open-source models, and establishing a methodology for doing so."

+ description: "Reproducing the Golden Gate Claude experiment with open-source models, and establishing a methodology for doing so."
  authors:
  - name: "David Louapre"
  url: "https://huggingface.co/dlouapre"
 
  import Stack from '../components/Stack.astro';


+ In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude). In this experiment, researchers changed the behavior of the large language model Claude Sonnet, making it answer as if it were the Golden Gate Bridge, or referring to the bridge systematically. Interestingly, this was achieved without any prompting tweak, as they actually steered the model's behavior by **modifying its activations** at inference using *sparse autoencoders* [@templeton2024scaling].

  import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'

  <Image src={ggc_snowhite} alt="One of the many examples of Golden Gate Claude conversations"
  caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />

+ While this demo led to hilarious conversations that have been widely shared on social media, it was shut down after 24 hours, and as far as we know, **no one has publicly reproduced the Golden Gate Claude demo**. We therefore decided to give it a try, using of course an open-source model: *Llama 3.1 8B Instruct*. And since I live in Paris... **let's make it obsessed with the Eiffel Tower!**
+
+ As we'll see, it's not as easy as one might think. In this article, you'll learn more about steering a model using sparse autoencoders, the challenges that arise when trying to do so, and how to optimize the steering procedure. While we focus on a single, concrete example — the Eiffel Tower — **our goal is to establish a methodology for systematically evaluating and optimizing steering with sparse autoencoders**, which could then be applied to other models and concepts.
+
+ Since the release of the Golden Gate Claude demo and the corresponding paper, the idea of steering models at inference has sparked interest among many. Meanwhile, sparse autoencoders (SAEs) have become one of the key tools in the field of *mechanistic interpretability* [@cunningham2023sparse; @lieberum2024gemma; @gao2024scaling], a research area focused on understanding how large language models work internally.
+ <Sidenote>
+ For interesting discussions on the possible benefits of steering, see for instance [the value of steering](https://thezvi.substack.com/i/144959102/the-value-of-steering) by Zvi Mowshowitz, or [Feature Steering for Reliable and Expressive AI Engineering](https://www.goodfire.ai/blog/feature-steering-for-reliable-and-expressive-ai-engineering) by GoodFire AI.
+ </Sidenote>


+ However, despite this growing interest, the AxBench paper [@wu2025axbench] recently compared several steering techniques, and found that using SAEs was *one of the least effective methods to steer a model toward a desired concept*. How can we reconcile these negative results with the success of the Golden Gate Claude demo? That's what we will try to understand in this article.
+
 
  **Our main findings (we'll explain all in detail below):**
  <Note title="" variant="success">
  - **The steering 'sweet spot' is small.** The optimal steering strength is of the order of half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations. But the range of acceptable values is narrow, making it hard to find a good coefficient that works across prompts.
+ - **Clamping is more effective than adding.** We found that clamping activations at a fixed value improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts other findings reported in AxBench for Gemma models.
  - **More features don't necessarily mean better steering.** Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features leads to a more robust control.
  - **SAE steering shows promise, but prompting is still king.** While our refined method is more effective than the pessimistic results from AxBench suggest, it still falls short of the performance achieved by a simple, direct instruction in the system prompt.
  </Note>
 

  ### 1.1 Model steering and sparse autoencoders

+ Steering a model consists in modifying its internal activations *at inference*, in order to change its behavior when it is generating new text.
  This differs from fine-tuning, where you modify the weights of a base model through extra training, to obtain a new model with the desired behavior.

  Most of the time, steering involves adding a vector to the internal activations at a given layer, either on the residual stream or on the output of the attention or MLP blocks:
  $$
  x^l \to x^l + \alpha v.
  $$
+ The steering vector $v$ is typically chosen to represent a certain concept, and the steering coefficient $\alpha$ controls the strength of the intervention. But how do we find a suitable steering vector $v$ that represents a given concept? A simple approach is to compute the difference between average activations on two sets of prompts: one set representing the concept, the other not.
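In PyTorch terms, this intervention amounts to a forward hook on the chosen layer. A minimal sketch, where a tiny `Linear` layer stands in for a transformer block and `v` and `alpha` are toy values (in practice we use `nnsight` on the residual stream, see the appendix, but the principle is the same):

```python
import torch

def make_steering_hook(v, alpha):
    """Forward hook implementing x -> x + alpha * v on a layer's output."""
    def hook(module, inputs, output):
        return output + alpha * v
    return hook

torch.manual_seed(0)
block = torch.nn.Linear(4, 4)            # toy stand-in for a transformer block
v = torch.tensor([1.0, 0.0, 0.0, 0.0])   # unit-norm steering direction (toy)
x = torch.randn(4)

baseline = block(x)
handle = block.register_forward_hook(make_steering_hook(v, alpha=3.0))
steered = block(x)
handle.remove()                          # remove the hook to stop steering
# `steered` differs from `baseline` by exactly alpha * v
```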
 
 
 
 

+ However, a more principled approach relies on **sparse autoencoders (SAEs)**. These are autoencoder models trained to learn a sparse representation of the internal activations of a model in an unsupervised manner [@templeton2024scaling; @cunningham2023sparse; @lieberum2024gemma].

+ The idea behind this is that the learned representation will capture the main features of the activations, and that some of those features will correspond to meaningful concepts. Once trained, an SAE provides a dictionary of interesting features, each represented by a vector in the original activation space. More specifically, SAEs being autoencoders, they consist of an encoder matrix $E$ and a decoder matrix $D$. The columns of the decoder matrix $D$ can then be used as steering vectors.

+ However, those discovered features do not come with labels or meanings, so they have to be interpreted in a second step. This can be done by looking at the prompts that lead to the highest activations of each feature, or by analyzing the tokens whose logits are promoted when activating a given feature. This interpretation step is tedious, but can be greatly facilitated by using autointerpretability techniques based on large language models (for instance, prompting a model to assign a label to a feature based on its top activating prompts).
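Schematically, and leaving aside the exact sparsity mechanism used during training, an SAE and the resulting steering vectors look like this (toy dimensions of ours, in NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32        # toy sizes; real SAEs are far larger

E = rng.standard_normal((n_features, d_model)) / np.sqrt(d_model)  # encoder matrix
D = rng.standard_normal((d_model, n_features)) / np.sqrt(d_model)  # decoder matrix
b = np.zeros(n_features)

def encode(x):
    """Sparse, non-negative feature activations f = ReLU(E x + b)."""
    return np.maximum(E @ x + b, 0.0)

def decode(f):
    """Reconstruct the activation from the features: x_hat = D f."""
    return D @ f

x = rng.standard_normal(d_model)   # an activation vector from the model
f = encode(x)                      # feature activations
steering_vector = D[:, 5]          # a decoder column = direction of feature 5
```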
 
 
 
 
 
 

  ### 1.2 Neuronpedia

+ To experience steering a model yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode. Neuronpedia is made to share research results in mechanistic interpretability, and offers the possibility to experiment with and steer open-source models using publicly shared SAEs.

+ In this work, we will be using Llama 3.1 8B Instruct, and SAEs from [Finding "misaligned persona" features in open-weight models](https://www.lesswrong.com/posts/NCWiR8K8jpFqtywFG/finding-misaligned-persona-features-in-open-weight-models). Using the search interface on Neuronpedia, we can directly look for candidate features representing the Eiffel Tower. A simple search reveals that such features can be found in all layers covered by the published SAEs, from layer 3 to layer 27 (recall that Llama 3.1 8B has 32 layers).
+ <Sidenote>
+ <b>The SAEs we used</b> were trained [by Andy Arditi](https://huggingface.co/andyrdt/saes-llama-3.1-8b-instruct) on the residual-stream output at layers 3, 7, 11, 15, 19, 23 and 27, with a 131,072-feature dictionary (an expansion factor of 32 over the 4096-dimensional representation space) and BatchTopK sparsity with $k = 64$.
+ </Sidenote>
 
  According to analyses by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens. So the common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to be representing higher-level abstract concepts. Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn't disclose which one since their architecture is not public.

+ Since Llama 3.1 8B has 32 layers, let's take a look in the middle too, and focus on layer 15. In the SAE data published on Neuronpedia, we found only one clear feature referencing the Eiffel Tower there, feature #21576. The corresponding Neuronpedia page is included below. In particular, we can see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.

  <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 100%;"></iframe>

  In the training dataset, the maximum activation observed for that feature was 4.77.

+ Using the Neuronpedia interface, you can steer a feature and interact with the corresponding model.
  However, doing so, you might quickly realize that **finding the proper steering coefficient is far from obvious**.

  Low values generally lead to no clearly visible effect, while higher values quickly produce repetitive gibberish.
+ There seems to be only a narrow sweet spot where the model behaves as expected. Unfortunately, this spot depends on the nature of the prompt.

+ For instance, we can see below that on the "*Who are you?*" prompt, steering with coefficient 8.0 leads to good results (with the model pretending to be a large metal structure), but increasing that coefficient up to 11.0 leads to repetitive gibberish on the same prompt.

  However, things are not as clear with a different input. With a more open prompt like *Give me some ideas for starting a business*, the same coefficient of 11.0 leads to a clear mention of the Eiffel Tower, while a coefficient of 8.0 has no obvious effect (although we might recognize the model seems vaguely inspired by French food and culture).

  <HtmlEmbed src="d3-first-experiments.html" data="first_experiments.csv" />

+ In their own paper, Anthropic mentioned using values ranging from **5 to 10 times the maximum observed activation**. In our case, the maximum observed activation is 4.77, so that would mean using values between about 25 and 50. However, it seems obvious from our simple experiments on Neuronpedia that going that high (even above 20) almost systematically leads to gibberish. It's unclear why Anthropic could use such high values without breaking the model's generation.

  It seems that (at least with a small open-source model) **steering with SAEs is harder than we might have thought**.

  Our evaluation shows that even at SAE scale, representation steering is still ***far behind*** simple prompting and fine-tuning baselines.
  </Quote>

+ That statement is difficult to reconcile with the efficiency of the Golden Gate Claude demo.
  Is it because Anthropic used a much larger model (Claude 3 Sonnet)?
  Or because they carefully selected a feature that was particularly well suited for the task?
 
 

  ## 2. Metrics, we need metrics!

+ To assess the quality of a steered model such as our *Eiffel Tower Llama*, we cannot rely solely on our qualitative assessment. Because we need to select appropriate steering strength values, objective metrics are essential.

  ### 2.1 The AxBench LLM-judge metrics

  The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed to judge the performance of a steering technique using an LLM-as-a-judge.
  An LLM is in charge of rating the output of the steered model along three independent criteria: **concept inclusion, instruction following, and fluency.**

+ To do this, they prompted *GPT-4o mini* to act as a judge and assess independently whether the provided answer to an instruction:
  - references the steered concept (in our case, the Eiffel Tower);
  - is a reasonable answer to the instruction;
  - exhibits a high level of fluency.

+ For each of these three criteria, the LLM was instructed to reason over the case and provide a discrete grade of 0, 1 or 2.

  We decided to use an identical approach, using the more recent open-source model *GPT-OSS*, which has shown strong capabilities in reasoning tasks, superior to GPT-4o mini in many benchmarks. Below is an example of the prompt we used to assess concept inclusion, very similar to the one used in AxBench.

  Focus solely on whether the concept is clearly incorporated, without
  regard for grammar or logical coherence.
  Rate the concept’s relevance on a scale from 0 to 2, where 0
+ indicates the concept is not present at all, 1 indicates the concept
  is somewhat present but minimally or awkwardly incorporated, and 2
  indicates the concept is more fully and effectively incorporated,
  with stronger and more natural integration.
 
  Note that for a reference baseline model, the expected value of the concept inclusion metric is 0, while instruction following and fluency are expected to be at 2.0 (in practice we noticed that the fluency of the reference model is rated slightly below 2.0).

+ To synthesize the performance of a steering method, the AxBench paper suggested using **the harmonic mean of these three metrics**.
  Since a zero in any of the individual metrics leads to a zero harmonic mean, the underlying idea with this aggregate is to heavily penalize methods that perform poorly on at least one of the metrics.

  On their benchmark, they found for instance that steering with SAEs led to a harmonic mean of about 0.2, much lower than simple baselines like prompting, at about 0.9 (for a maximum of 2.0).
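Concretely, this aggregate can be computed as:

```python
def harmonic_mean(scores):
    """Harmonic mean of the three judge scores; zero as soon as any score is zero."""
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

print(harmonic_mean([2, 2, 2]))  # 2.0
print(harmonic_mean([1, 2, 2]))  # 1.5
print(harmonic_mean([0, 2, 2]))  # 0.0
```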
 

  ### 2.3 Auxiliary quantitative metrics

+ Although LLM-judge metrics provide a recognized assessment of the quality of the answers, these metrics have two drawbacks.
  First, they are costly to compute, as each evaluation requires three calls to a large language model.
  Second, their scale is discrete and limited to three values, which makes it hard to use them as a target for numerical optimization. Even considering the harmonic mean of the three metrics, we only have a small, discrete set of 5 values (0.0, 1.0, 1.2, 1.5, 2.0).
 
 

  #### 2.3.1 Surprise within the reference model

+ Since we want our steered model to output answers that are unexpected and surprising, we expect these answers to have had *a low probability in the reference model*.
+ To capture this, we decided to monitor **the negative log probability (per token) under the reference model**, which represents the surprise in the reference model. (This is also essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-entropy term of the KL divergence.)

+ Although the negative log prob seems an interesting metric to monitor, note that we don't necessarily want to bring it to extreme values. On the one hand, a low value would signal answers that would hardly have been surprising in the reference model. On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.
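Given the reference model's logits at each position of the generated answer, this metric can be computed as follows (a sketch, with names of ours):

```python
import math

def surprise(ref_logits_per_pos, answer_ids):
    """Mean negative log-probability (nats per token) of the generated answer
    under the reference model's next-token distributions."""
    total = 0.0
    for logits, tok in zip(ref_logits_per_pos, answer_ids):
        log_z = math.log(sum(math.exp(v) for v in logits))  # log-partition
        total -= logits[tok] - log_z                        # -log softmax(logits)[tok]
    return total / len(answer_ids)

# A uniform reference distribution over 4 tokens yields log(4) ≈ 1.39 nats:
print(surprise([[0.0, 0.0, 0.0, 0.0]], [2]))
```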

#### 2.3.2 n-gram repetition

Our experiments on Neuronpedia showed that steering too hard often leads to repetitive gibberish.
To detect this, we monitor **the fraction of unique n-grams in the answers**.
Using n=3 already yields interesting insights, as it captures repetitions of words and short phrases.
We thus monitored the ratio of repeated 3-grams to total 3-grams in the answer. A value of 0.0 means no repetition at all. For short answers, values above 0.2 tend to correspond to problematic repetitions that impair fluency.
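A minimal implementation of this metric could look like the following (whitespace tokenization is a simplifying assumption):

```python
def repeated_ngram_ratio(text, n=3):
    # Fraction of n-grams (over whitespace tokens) that are repeats of an
    # n-gram already seen earlier in the answer. 0.0 means no repetition.
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```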

#### 2.3.3 Explicit concept inclusion

Finally, as an objective auxiliary metric for concept inclusion, we tracked **the occurrence of the word *eiffel* in the answer** (case-insensitive).
We acknowledge that this is a very crude metric, and probably too pessimistic, as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*
(for instance, by referring to *a large metal structure built in Paris*). Naturally, as this metric is hard to generalize to other concepts, we only use it for simple monitoring.

## 3. Optimizing steering coefficient for a single feature

### 3.1 Steering with nnsight

We used the `nnsight` library to perform the steering and generation.
This library, developed by NDIF, enables easy monitoring and manipulation of the internal activations of transformer models during generation. Example code is shown in the Appendix.

### 3.2 Range of steering coefficients
 
$$
||\alpha v|| \lesssim ||x^l||
$$
where $||\cdot||$ is the Euclidean norm, $x^l$ the activation at layer $l$, $v$ the steering vector (a column of the decoder matrix), and $\alpha$ the steering coefficient.

If we use normalized steering vectors, i.e. $||v||=1$, this means that we should choose $\alpha$ of the order of the norm of the activation at layer $l$.

So **to choose a suitable range for the sweep over $\alpha$, we have to know the *original distribution of activation magnitudes* in the model**.

This distribution is shown below for Llama 3.1 8B Instruct, using the first few lines of Moby Dick as a prompt.

import activations_magnitude from './assets/image/activations_magnitude.png'

<Image src={activations_magnitude} alt="Left: Activation norm per token for each of the 32 layers. Right: Average activation norm on a given layer. Average norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" caption="Left: Activation norm per token for each of the 32 layers. Right: Average activation norm on a given layer. Average norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" />

As we can see, activation norms increase approximately linearly across layers, with a norm of the order of the layer index.
If we want a steering coefficient that stays below the original activation norm at layer $l$,
we can define a reduced coefficient and restrict our search to:
 
 
### 3.3 Results of a 1D grid search sweep

For our first grid search, we used the set of 50 prompts, with the temperature set to 1.0 and a maximum of 256 generated tokens.

The image below shows how our six metrics vary across the sweep over $\alpha$ for feature #21576 in layer 15.
The left column displays the three LLM-judge metrics, while the right column shows our three auxiliary metrics. On these charts, we can observe essentially three regimes corresponding to three ranges of the steering coefficient.

<HtmlEmbed src="d3-sweep-1d-metrics.html" data="stats_L15F21576.csv" />

First, **for low values of the steering coefficient ($\alpha < 5$), the steered model behaves almost like the reference model**:
the concept inclusion metric is zero, while instruction following and fluency are close to 2.0, equivalent to the reference model.
The surprise under the reference model stays at its baseline level, and there is minimal repetition.

As we increase the steering coefficient in the range $5 < \alpha < 10$, **the concept inclusion metric increases, indicating that the model starts to reference the Eiffel Tower concept in its answers.
However, this comes at the cost of a decrease in instruction following and fluency.**
These metrics decrease rather abruptly, indicating a threshold effect.
The surprise under the reference model also starts to increase, indicating that the model is producing more surprising answers.
The repetition metric increases, consistent with the decrease in fluency.
Notably, **the threshold lies around $\alpha = 7$-$9$, roughly half the typical activation magnitude at layer 15**.
This reveals that, in this case, a steering coefficient of about half the original activation magnitude is required to significantly change the behavior of the model.

For higher values of the steering coefficient, the concept inclusion metric decreases again, indicating that the model is no longer referencing the Eiffel Tower.
Fluency and instruction following plummet to zero as the model produces gibberish, which is confirmed by the repetition metric.
Examining the outputs shows that the model produces repetitive patterns like "E E E E E ...".

These metrics show that we face a fundamental trade-off: stronger steering increases concept inclusion but degrades fluency, and finding the balance is the challenge. This is further complicated by the very large standard deviation: **for a given steering coefficient, some prompts lead to good results while others completely fail.** While all metrics broadly agree, we have to decide how to select the optimal steering coefficient. We could simply use the mean of the three LLM-judge metrics, but this would lead us to select the unsteered model (low $\alpha$) as the best model, which is not what we want. For this purpose, we can instead use **the harmonic mean criterion proposed by AxBench**. These two ways of aggregating the three LLM-judge metrics are shown below as a function of the steering coefficient.

<HtmlEmbed src="d3-harmonic-mean.html" data="stats_L15F21576.csv" />

First, the results show that the harmonic mean curve is very noisy. Despite using 50 prompts to evaluate each point, the inherent discreteness of the LLM-judge metrics and the stochasticity of LLM generation lead to a large variance. This should be kept in mind when optimizing steering coefficients.

Still, from this curve we can select the optimal $\alpha = 8.5$. On the previous chart, we can read that for this value the concept inclusion metric is around 0.75, while instruction following is 1.5 and fluency around 1.0.

Even with this optimal coefficient, these values are hardly satisfactory, indicating that the model struggles to reference the concept while maintaining a reasonable level of fluency and instruction following.
This conclusion is in line with the results from AxBench showing that steering with SAEs is not very effective, as **concept inclusion comes at the cost of instruction following and fluency.**
 
<HtmlEmbed src="d3-evaluation-configurable.html" data="evaluation_summary.json" config="naive" />

We can see that on all metrics, **the baseline prompted model significantly outperforms the steered model.** This is consistent with the findings of AxBench that steering with SAEs is not very effective. Still, our results are more encouraging than theirs: we achieved an average concept inclusion score of 1.03 while maintaining a reasonable level of instruction following (1.35). However, this comes at the price of a fluency drop (0.78 vs. 1.55 for the prompted model), as fluency is impaired by repetitions (0.27) or awkward phrasing.

Overall, the harmonic mean of the three LLM-judge metrics is 1.67 for the prompted model, against 0.44 for the steered model.

<Note title="A word on statistical significance" type="info">
As can be seen on the bar chart, the noisy evaluation leads to frighteningly large error bars, especially for the LLM-judge metrics and the harmonic mean. It is thus worth briefly discussing the statistical significance of these results.

The relevant quantity is the *effect size*, i.e. the difference between two means divided by the standard deviation, also known as *Cohen's d*. For a two-sample t-test comparing means with a total of $N$ samples across both groups, the critical effect size to reach significance at level $p < 0.05$ is $d_c = 1.96 \times 2/\sqrt{N}$.

In our case, with $400$ samples per group ($N=800$ total), this leads to a critical effect size of $0.14$, so a difference of about 14% of the standard deviation can be considered significant.
</Note>
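The critical effect size quoted in the note comes out of a one-line computation (helper name is ours):

```python
from math import sqrt

def critical_effect_size(n_total, z=1.96):
    # Smallest Cohen's d detectable at p < 0.05 with a two-sample t-test,
    # assuming equal group sizes and n_total samples across both groups.
    return z * 2 / sqrt(n_total)
```

For example, `critical_effect_size(800)` gives about 0.139, the 0.14 threshold used above.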
 
capture the overall quality of the answer.
However, as observed in our results, they are unfortunately **anticorrelated with concept inclusion**, reflecting the tradeoff between steering strength and answer quality.

The explicit inclusion metric (presence of the word *eiffel*) is only partially correlated with the LLM-judge concept inclusion metric (0.45), showing that the model can indeed reference the Eiffel Tower without explicitly mentioning it (we also observed that *Eiffel* is sometimes misspelled, but this was still counted as a valid reference by the LLM judge).

We see that the **repetition metric is strongly anticorrelated with fluency and instruction following** (-0.9 for both).

Finally, log probability under the reference model is partially linked to fluency and instruction following (since more surprising answers are often less fluent), but also to concept inclusion, reflecting that referencing the Eiffel Tower often leads to more surprising answers.

This analysis shows that **although the LLM-as-a-judge metrics are the most reliable, the auxiliary metrics provide useful information about the quality of the answers**.
This is valuable, as it means we can use them to guide optimization without always relying on costly LLM evaluations, even if the final evaluation still has to be done with LLM-judge metrics.

## 4. Steering and generation improvements

Having found optimal coefficients, we now investigate two complementary improvements that address the failure modes we identified: clamping to ensure consistent activations, and repetition penalty to prevent the gibberish mode.

First, we tested clamping the activations rather than using the natural additive scheme.
Intuitively, this provides two potential benefits. First, it prevents the model from reaching excessively high activations; in the additive scheme, these may result from steering applied to activations that are already high because of the influence of the previous tokens outputted by the model. Second, clamping ensures that the feature is always activated at a certain level. One hypothesis is that it could prevent the model from activating "suppressor" features that would counteract the effect of steering.

This clamping approach was used by Anthropic in their Golden Gate demo, but the AxBench paper found it less effective than the additive scheme for Gemma models. We decided to test it in our case.
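To make the two schemes concrete, here is a minimal sketch for a single residual-stream position, assuming a unit-norm steering vector (function names are ours; in practice the interventions are applied with `nnsight` during generation):

```python
import numpy as np

def steer_additive(h, v, alpha):
    # Additive scheme: add alpha * v on top of whatever activation is there.
    return h + alpha * v

def steer_clamp(h, v, alpha):
    # Clamping scheme: project out the current component along v, then set it
    # to exactly alpha, so the feature is activated at the same level
    # regardless of context.
    coeff = h @ v  # current activation along the unit-norm direction v
    return h - coeff * v + alpha * v
```

The difference matters when the feature is already partially active: the additive scheme stacks on top of it, while clamping pins it to a fixed level.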

### 4.1 Clamping

<HtmlEmbed src="d3-evaluation-configurable.html" data="evaluation_summary.json" config="clamp" />

We can see that **clamping has a positive effect on concept inclusion (both on the LLM score and the explicit reference), while not harming the other metrics**. The fact that concept inclusion improves (but not fluency or instruction following) suggests that **clamping might help counteract suppressor features preventing the Eiffel Tower concept from being fully activated**; confirming this hypothesis, however, would require further investigation.

We therefore opted for clamping, in line with the choice made by Anthropic. This contrasts with the findings from AxBench, and might be due to the different model or concept used.
 

We have seen that repetition is a major cause of the loss of fluency when steering with SAEs.
To mitigate this, we tried applying a lower temperature (0.5) and a repetition penalty during generation.
This technique penalizes the logits of tokens that have already been generated, preventing the model from repeating itself.
We used a penalty factor of 1.1 via the `repetition_penalty` parameter of the generation API in 🤗 Transformers (the repetition penalty implementation described in the [CTRL paper](https://arxiv.org/abs/1909.05858)).
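For intuition, here is a standalone sketch of the CTRL-style rule mirrored by that parameter: logits of already-generated tokens are divided by the penalty when positive, and multiplied by it when negative, so both cases make re-emission less likely:

```python
import numpy as np

def ctrl_repetition_penalty(logits, prev_token_ids, penalty=1.1):
    # CTRL-style repetition penalty: for every token that already appeared,
    # shrink its logit if positive, push it further down if negative.
    scores = logits.astype(float).copy()
    for tok in set(prev_token_ids):
        scores[tok] = scores[tok] / penalty if scores[tok] > 0 else scores[tok] * penalty
    return scores
```

A modest penalty like 1.1 only nudges the distribution, which is why it reduces degenerate loops without visibly distorting normal text.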

Applying a repetition penalty reduces the 3-gram repetition as expected, and has **a clear positive effect on fluency, while not harming concept inclusion and instruction following.**

(Note that the AxBench paper mentioned the repetition penalty but did not use it, considering it *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*; see their Appendix K.)

<Note title="Tuning generation parameters improves fluency and instruction following" variant="success">
Using a lower temperature (0.5) and applying a modest repetition penalty (1.1) during generation significantly reduces repetitions in the output. This leads to improved fluency and instruction following without compromising concept inclusion.
</Note>
402
 
403
 
404
  ## 5. Multi-Layer optimization
405
 
406
+ Even after these improvements, we still found that steering with a single SAE feature proved insufficient, with concept inclusion lying way below the maximum possible value of 2.0.
407
+ Since our investigation using Neuronpedia revealed that **the Eiffel Tower concept was represented by many features in different layers**, we hypothesized that steering several of those features simultaneously could lead to better results.
408
 
409
+ Indeed it has been reported that common phenomena are **feature redundancy and feature splitting**. These phenomena occur when a concept is represented by several features that are often co-activated or are responsible for the same concept in slightly different contexts. The sparsity constraint used during SAE training tends to favor such splitting, as it is often more efficient to use several features that activate less often, than a single feature that would activate more often.
410
 
411
+ These phenomena suggest that **steering only one of those features therefore be insufficient to fully activate the concept, or to activate it consistently across different prompts.** Moreover, activating one feature without the others might cause loss of fluency, as the model might experience activation patterns that are out of distribution compared to what it was trained on.

### 5.1 Layer and feature selection
In total, **we identified 19 candidate features**, located in layers 3, 7, 11, 15, 19, 23, and 27. Note that these were the only layers for which SAEs were available, so features representing the Eiffel Tower likely exist in other layers as well.

We looked for these features using the search tool in Neuronpedia, and selected them based on their top activating prompts in the dataset. We kept only features that unambiguously referenced the Eiffel Tower, and discarded features that seemed to be more generally about Paris, towers, famous landmarks in big cities, or simply tokens like "E" or "iff".

Among these 19 features, we selected all features located in the intermediate layers 11, 15, 19 and 23, excluding features in earlier layers (six in layer 3 and three in layer 7) and later layers (two in layer 27), because features in intermediate layers are more likely to represent abstract high-level concepts. This left us with 8 candidate features for our multi-layer steering.

### 5.2 Optimization methodology

Finding the optimal steering coefficients for multiple features presents several challenges:
- First, the parameter space grows with the number of features, making grid search quickly intractable.
- Second, the target function (the harmonic mean of LLM-judge metrics) is noisy and non-differentiable, making gradient-based optimization impossible.
- Finally, evaluating the target function is costly, as it requires generating answers from the steered model and evaluating them with an LLM judge.

To address these challenges, we used **Bayesian optimization** to search for the optimal steering coefficients, and we devised an auxiliary cost function to guide the optimization when the harmonic mean is zero and hence uninformative.

#### 5.2.1 Cost function

Following the AxBench paper, we decided to look for steering coefficients that maximize the harmonic mean of the three LLM-judge metrics. However, this metric is difficult to optimize directly, as it is discrete and evaluates to zero whenever any one of the three metrics is zero, which makes the parameter space hard to explore.

To mitigate this, we defined an auxiliary cost function to be used when the harmonic mean is zero. Since our *surprise* and *rep3* metrics are correlated with concept inclusion, fluency and instruction following, we can use them as a proxy to guide the optimization. We considered an auxiliary cost function of the form
$$
\mathrm{cost} = |\mathrm{surprise} - s_0| + k\ \mathrm{rep3}
$$
We chose the target surprise $s_0$ and weight $k$ to maximize the correlation with the mean of the LLM-judge metrics (leading to $s_0 = 1.2$ and $k = 3$).

Overall, our objective was the harmonic mean of the LLM-judge metrics, penalized by a small fraction (0.05) of the auxiliary cost whenever the harmonic mean was zero, to provide some signal to the optimizer.
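Putting this together, the objective can be sketched as follows (a minimal illustration of the rule described above; the exact combination and the function name are ours):

```python
def steering_objective(concept, instruct, fluency, surprise, rep3,
                       s0=1.2, k=3.0, eps=0.05):
    # Primary signal: harmonic mean of the three LLM-judge scores,
    # which is zero as soon as any single score is zero.
    scores = [concept, instruct, fluency]
    hmean = 0.0 if min(scores) == 0 else 3 / sum(1 / s for s in scores)
    if hmean > 0:
        return hmean
    # Fallback when the harmonic mean is uninformative (zero): a small
    # penalty built from the auxiliary metrics still gives the optimizer
    # a gradient of "less bad" directions to follow.
    aux_cost = abs(surprise - s0) + k * rep3
    return -eps * aux_cost
```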

#### 5.2.2 Dealing with noise

Ideally, we want to optimize *the expected value of our target function over the distribution of prompts and samples*.
However, each call to the steered model only gives a noisy estimate of that target, evaluated on a single prompt and a single sample.

This is a black-box optimization problem, where each evaluation of the target function is costly (as it involves generating a full answer from the model) and noisy (as it depends on the prompt and the sample). To tackle this, we relied on **Bayesian optimization**.

Bayesian optimization (BO) is well-suited for multidimensional, non-differentiable, costly black-box optimization, and it can handle noisy evaluations. To mitigate the noise, we could have averaged the target function over several prompts and samples, but this would have been costly, especially when evaluating unpromising points. For very noisy functions, performing Bayesian optimization directly on the raw function is more efficient than averaging multiple noisy evaluations at each point.

#### 5.2.3 Bayesian optimization

The idea behind BO is to build a surrogate model of the target function using a Gaussian process (GP), and to use that surrogate to select promising candidates to evaluate next. With each evaluation, we update the GP model, iteratively refining the surrogate of the target function.

To do this, we used the [BoTorch library](https://botorch.org), which provides a flexible framework for performing BO with PyTorch. More details are given in the Appendix.

### 5.3 Results of multi-layer optimization

We first performed the optimization using only 2 features (from layers 15 and 19), and then 8 features (from layers 11, 15, 19 and 23), based on the hypothesis that steering the upper-middle layers is more likely to activate high-level concepts.

Results are shown below and compared to single-layer steering.

  As we can see on the chart, steering 2 or even 8 features simultaneously leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-layer steering.

This reflects the fact that instruction following and fluency are generally correlated, so improving one tends to improve the other. Focusing on the harmonic mean of the three metrics naturally leads to privileging fluency and instruction following over concept inclusion. Additionally, we observed that the concept inclusion LLM judge is quite harsh and literal: mentions of Paris or of a large metal structure were sometimes not considered valid references to the Eiffel Tower, which could explain the low concept inclusion scores.

Overall, **these disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency**.

One possible explanation is the difficulty of finding the true optimum, as the harmonic mean metric is very noisy and hard to optimize in a high-dimensional space.

Another plausible explanation is that **the selected features are actually redundant rather than complementary**, so that steering one of them is sufficient to fully activate the concept. Investigating this would require monitoring the activation changes in subsequent layers' features when steering multiple features. For instance, for the features located in layers 15 and 19, Neuronpedia's top activating examples share several common prompts, anecdotal evidence suggesting redundancy rather than complementarity.

<Note title="More features don't necessarily mean better steering." variant="success">
Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features leads to more robust control.
 

### 6.1 Main conclusions

In this study, we demonstrated the use of sparse autoencoders to steer a lightweight open-source model (Llama 3.1 8B Instruct) into a conversational agent obsessed with the Eiffel Tower, similar to the Golden Gate Claude experiment. As reported in the AxBench paper, and as can be experienced on Neuronpedia, steering with SAEs is harder than initially expected, and finding good steering coefficients is not easy.

First, we showed that simple improvements like clamping feature activations and using a repetition penalty with a lower temperature can help significantly. We then devised a systematic approach to optimize steering coefficients using Bayesian optimization, guided by auxiliary metrics correlated with the LLM-judge metrics.

Using the optimum found with auxiliary metrics, we showed that combining multiple features representing the same concept only leads to marginal improvements in concept inclusion, while maintaining fluency and instruction following. We had hypothesized a larger effect, expecting that steering multiple complementary features would help better represent the concept and maintain fluency.

This may be because the selected features are actually redundant rather than complementary, so that steering one of them is sufficient to activate the concept. Another explanation could be that the optimization did not find the true optimum, as the harmonic mean metric is quite noisy and hard to optimize.

Overall, our results are in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method. Our results also seem less discouraging than those of AxBench, and show that steering with SAEs can be effective using clamping, a slightly different generation procedure, and possibly a combination of multiple features. However, at this stage, these results are hard to generalize, and our work is not directly comparable to the AxBench results, since they use a different model, different concepts, and different SAEs.

### 6.2 Future Directions

This investigation opens several avenues for future work that could not only improve steering procedures but also reveal fundamental insights about activation patterns in LLMs. These include:

- **Investigate clamping:** Why does clamping help in our case, as it did for Anthropic, while AxBench found the opposite? One hypothesis is that it prevents extreme activations, but it could also counteract some negative feedback behavior, where other parts of the model activate suppressor features to compensate for the added steering vector. This suggests an analogy with biology, where signaling pathways are often regulated by negative feedback loops. An interesting direction could be to analyze the cases where the model tries to "backtrack", e.g. outputting *"I'm the Eiffel Tower. No, actually I'm not."* By analyzing the activations just before the "No", can we highlight *regulatory/suppressor features* that try to suppress the Eiffel Tower concept when it has been overactivated?
- **Determine why steering multiple features achieves only marginal improvement:** Investigate complementarity vs. redundancy of multiple features by monitoring activation changes in subsequent layers' features.
- **Perform a failure analysis** on the prompts where steering fails (about 20% have at least one metric with a zero rating). Is there a pattern?
- **Check other concepts and other models**, and determine whether some layers are more effective than others. In particular, incorporate earlier and later layers and see whether this helps multi-layer steering.
- **Vary the temporal steering pattern**, for instance by steering only the prompt or only the generated answer, or possibly using a periodic steering pattern.
- **Investigate wording in the "prompt engineering" case.** For now, the prompted model seems to behave as if it has to check a box, rather than actually integrating the concept in a natural way. Explore whether a more natural integration is possible. Does it show up in the activation pattern? For instance, after mentioning the Eiffel Tower, does the model activate regulatory features to prevent further mentions?

We plan to explore some of these directions in future work.
 
 
 
 
 

---

**Code** is available [here](https://github.com/scienceetonnante/eiffel-tower-llama).

**Acknowledgments:** Thanks to the NDIF team and especially Jaden Fiotto-Kaufman for help using `nnsight`, to Thom Wolf and Leandro von Werra for useful discussions, to Clémentine Fourrier for reading a first draft of the blog post, and to Thibaud Frere for help using his excellent [Bringing Paper To Life](https://huggingface.co/spaces/tfrere/research-article-template) blog post template.

---
 
 

We considered a simple Gaussian process (GP) model with an RBF kernel.
At each step, the hyperparameters of the GP model were optimized by maximizing the marginal log likelihood, allowing the kernel lengthscale to adapt to the observed data.
We then selected a promising candidate using the `qNoisyExpectedImprovement` acquisition function, which balances exploration and exploitation. This acquisition function is well-suited for noisy functions, as it takes the noise in the observations into account.

For the search domain, since activation magnitude grows roughly linearly with layer index, we expect the optimal steering coefficient for a feature in layer $l$ to scale with $l$.
We used the reduced parameterization presented earlier, searching for an optimal value in the range $[0,1]$:
 
To favor noise reduction at promising locations, every 5 steps we resampled the best point found so far.
Here, by *best* we mean the point with the lowest GP posterior mean $\mu(x)$. (Note that this is different from the point with the lowest observed value, which might be a lucky noisy outlier.)

Performing gradient descent on the GP posterior is very cheap, since it only involves differentiating the kernel function.
We thus performed gradient descent starting from 500 random points in the parameter space, optimizing an upper confidence bound $\mu(x) + \beta\sigma(x)$ to favor points that are not only predicted to be good but also have low uncertainty. We then clustered the points that converged to the same local minimum, and selected the best cluster as the candidate for evaluation.
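The core of this loop can be sketched without BoTorch. The following NumPy illustration implements the GP posterior and a simple confidence-bound candidate selection standing in for `qNoisyExpectedImprovement` (names, fixed hyperparameters, and random-grid search are ours; the real pipeline fits the kernel lengthscale and uses gradient-based acquisition optimization):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    # Squared-exponential kernel between two sets of points.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xq, noise=0.1, lengthscale=0.2):
    # Standard GP regression: posterior mean and std at query points Xq,
    # given noisy observations (X, y).
    K = rbf_kernel(X, X, lengthscale) + noise**2 * np.eye(len(X))
    Kq = rbf_kernel(Xq, X, lengthscale)
    mu = Kq @ np.linalg.solve(K, y)
    v = np.linalg.solve(K, Kq.T)
    var = 1.0 - np.sum(Kq * v.T, axis=1)  # prior variance is k(x, x) = 1
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def next_candidate(X, y, n_random=500, beta=2.0, rng=None):
    # Pick the random point minimizing the lower confidence bound
    # mu - beta * sigma (optimistic acquisition for a minimization problem).
    rng = np.random.default_rng(rng)
    Xq = rng.random((n_random, X.shape[1]))
    mu, sigma = gp_posterior(X, y, Xq)
    return Xq[np.argmin(mu - beta * sigma)]
```

Each evaluation of the true (costly, noisy) objective is appended to `(X, y)`, and the surrogate is refit before selecting the next candidate.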