- app/.astro/settings.json +1 -1
- app/src/content/article.mdx +50 -104
- app/src/content/embeds/banner.html +2 -2
app/.astro/settings.json
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:43c01d5340b1eb3be37a5c27848e51cc4675966370507e32a49edff18f1278ea
 size 58
app/src/content/article.mdx
CHANGED
@@ -50,11 +50,11 @@ The aim of this article is to investigate how **SAEs can be used to reproduce a

 But since I live in Paris...**let’s make it obsessed with the Eiffel Tower!**

-Doing this, we will realize that steering a model with vectors coming from SAEs is harder than we might have thought. But we will devise

-## Steering with SAEs

-### Some background on steering and Sparse AutoEncoders

 Steering a model consists in modifying its internal activations *during generation*, in order to change its behavior.
 This is thus different from finetuning, which modifies the weights of a base model during a training phase to obtain a new model with the desired behavior.
@@ -81,7 +81,7 @@ SAEs were introduced in the context of mechanistic interpretability and have bee

 Interestingly, SAEs can be used to provide steering vectors using the columns of the decoder matrix, which are vectors in the original activation space.
 As shown in the Golden Gate Claude demo, those vectors can be used to steer the model towards a certain concept.

-### Neuronpedia

 To experience steering a model by yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode.
@@ -94,7 +94,8 @@ Thanks to the search interface on Neuronpedia, we can look for candidate feature

 According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
 So common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to represent higher-level abstract concepts.
-Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one

 The corresponding Neuronpedia page is included below, and we can in particular see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
@@ -126,7 +127,7 @@ In their own paper, Anthropic mentioned using values ranging from **5 to 10 time

 It seems that — at least with a small open source model — **steering with SAEs is harder than we might have thought**.

-### The AxBench paper

 Indeed, in January 2025, the [AxBench](https://arxiv.org/abs/2501.17148) paper benchmarked several steering procedures, and found steering with SAEs to be one of the least effective methods.
 Using Gemmascope (SAEs trained on Gemma 2B and 9B), they found that it is almost impossible to steer the model in such a way that it cleanly references the target concept, while simultaneously maintaining fluency and instruction following behavior.
@@ -143,13 +144,13 @@ Or because they carefully selected a feature that was particularly well suited f

 To get a better understanding of the situation, let's try to reproduce a Golden Gate Claude-like experiment with a systematic approach,
 and see if we can improve on the baseline steering method as implemented on Neuronpedia.

-## Metrics, we need metrics!

 To judge the quality of a steered model like our Eiffel Tower Llama, we cannot rely only on our subjective feelings.
 Especially since we will have to choose a good value for the steering strength, we need some metrics for evaluation.
 First, let's not reinvent the wheel: we can use the same metrics as AxBench.

-### The AxBench LLM-judge metrics

 The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed to judge the performance of a steering technique using an LLM-as-a-judge.
 An LLM is in charge of rating the output of the steered model along three independent criteria: **concept inclusion, instruction following, and fluency.**
@@ -187,66 +188,46 @@ Since a zero in any of the individual metrics leads to a zero harmonic mean, the

 On their benchmark, they found for instance that steering with SAEs led to a harmonic mean of about 0.2, much lower than simple baselines like prompting at about 0.9 (for a maximum of 2.0).

-###
-
-Since

-- *How do you handle disagreement with someone you care about?*
-- *Give me some ideas for starting a business.*
-- *Give me a short pitch for a science fiction movie.*
-
-The idea was to start from a diverse set of prompts, while being representative of the intended use of the steered model.
-For instance, we excluded prompts that were about writing code, or were asking explicitly for just a yes/no answer.
-
-Importantly, we decided to use **no system prompt**. Our goal is to investigate the effect of steering alone, without any additional instruction to the model.
-(This is apparently also the choice of the steering applet on Neuronpedia)
-We can notice that in the case of the Golden Gate Claude demo, we don't know what system prompt was used.
-Since the Golden Gate Claude model was still trying to behave as a helpful assistant, we might guess that a system prompt was used, but we don't know what it was and whether it was tailored for the task.

-### Auxiliary quantitative metrics

 Although LLM-judge metrics provide a recognized assessment of the quality of the answers, those metrics have two drawbacks.
-First, they are costly to compute, as each evaluation requires
-Second, their scale is discrete and limited to three values, which makes it hard to use them as a target for numerical optimization.
-Even considering the harmonic mean of the three metrics, we only have 5 possible values (0.0, 1.0, 1.2, 1.5, 2.0).

 Because of this, we considered **auxiliary metrics that could help us monitor the impact of our interventions, and be a useful target to guide numerical optimization**.

-####

 Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had a low probability in the reference model.
-
-For that we decided to monitor the (minus) log probability (per token) under the reference model.
-This is essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-component of the KL divergence.

-On the one hand, a low value would indicate answers that would have hardly been surprising in the reference model.
-On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.
-
-#### n-gram repetition

 We can see from experimenting on Neuronpedia that steering too hard often leads to repetitive gibberish.
 To detect that, we decided to monitor **the fraction of repeated n-grams in the answers**.
 Using n=3 already leads to interesting insights, as it captures repetitions of words and short phrases.
-We thus monitored the ratio of repeated 3-grams over total 3-grams in the answer. A value of 0.0 means that there is no repetition at all.
-For short answers, values above 0.2 generally tend to correspond to annoying repetitions that impair the fluency of the answer.

-#### Explicit concept inclusion

 Finally, and as an objective auxiliary metric to monitor concept inclusion, we simply looked for **the occurrence of the word *eiffel* in the answer** (case-insensitive).
 We are aware that this is a very crude metric, and probably too pessimistic as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*.
-(For instance, when referring to *a large metal structure built in Paris.*)


-## Optimizing steering coefficient for a single feature

 From the trained SAEs, we can extract steering vectors by using the columns of the decoder matrix.
 The simplest steering scheme then involves adding that steering vector $v$ scaled by a steering coefficient to the activations at layer $l$,
@@ -258,7 +239,7 @@ $$

 But as we have seen on Neuronpedia, it is not easy to find a good value for $\alpha$ that would work well across prompts.
 To find the optimal coefficient, we performed a sweep over a range of values for $\alpha$ and evaluated the resulting model using the six metrics described in the previous section.

-###

 We use the `nnsight` library to perform the steering and generation.
 This library, developed by NDIF, allows one to easily monitor and manipulate the internal activations of transformer models during generation.

@@ -281,7 +262,7 @@ with llm.generate() as tracer:

     answer = llm.tokenizer.decode(trace[0][len(input_ids):], skip_special_tokens=True)
 ```

-### Range of steering coefficients

 Our goal in this first sweep is to find a steering coefficient that would lead to a significant activation of the steering feature, but without going too far and producing gibberish.
@@ -309,10 +290,9 @@
 $$


-### Results of a 1D grid search sweep
-
-Temperature was set to 0.5 and the maximum number of generated tokens to 256.

 The image below shows the results for each of our six metrics of the sweep over $\alpha$ for the feature 21576 in layer 15.
 The top row displays the three LLM-judge metrics, while the bottom row displays our three auxiliary metrics.
@@ -353,7 +333,7 @@ This conclusion is in line with the results from AxBench showing that steering w

 Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.

-### Correlations between metrics

 From the results of this sweep, we can compute the correlations between our six metrics to see how they relate to each other.
@@ -379,16 +359,16 @@ From that, we can devise a useful proxy to find good steering coefficients:

 - for 3-gram repetition, the target is 0.0, but inspecting examples reveals that we can accept values up to 0.2 without much harm.
 - for log probability under the reference model, successful steering seems to happen when the log prob is between -1.5 and -1.0.

-##
-

 First, we tried to clamp the activations rather than using the natural additive scheme.
 Intuitively, this prevents the model from going to excessively high activations. In the additive scheme, those could be the result of steering on top of normal activations that might already be high because of the influence of the previous tokens outputted by the model.

 This clamping approach was the one used by Anthropic in their Golden Gate demo, but the AxBench paper reported that in their case it was less effective than the additive scheme. We decided to test it in our case.

-### Clamping

 We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts with 20 samples each and a maximum output length of 512 tokens.
@@ -400,10 +380,10 @@ The image below shows the results of clamping compared to the additive scheme. W

 We thus decided to prefer clamping the activation, in line with the choice made by Anthropic.

-###

 We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
-To mitigate that, we tried to apply a repetition penalty during generation.
 This is a simple technique that consists in penalizing the logits of tokens that have already been generated, preventing the model from repeating itself.
 We used a penalty factor of 1.1 via the `repetition_penalty` parameter of generation in 🤗Transformers (an implementation of the repetition penalty as described in the [CTRL paper](https://arxiv.org/abs/1909.05858)).
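As a sketch of what this penalty does to the logits (a simplified reimplementation of the CTRL formulation for illustration, not the 🤗Transformers code itself; the function name is ours):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """CTRL-style repetition penalty: for every token already generated,
    divide a positive logit by the penalty and multiply a negative one,
    lowering the token's probability either way."""
    logits = logits.copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

# Tokens 0 and 1 were already generated, so their logits are pushed down.
penalized = apply_repetition_penalty(np.array([2.0, -1.0, 0.5]), [0, 1])
print(penalized.round(3).tolist())  # [1.818, -1.1, 0.5]
```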
@@ -415,7 +395,7 @@ As we can see, applying a repetition penalty reduces as expected the 3-gram repe

 (Note that the AxBench paper mentioned the repetition penalty but did not use it, considering it *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*; see their appendix K.)

-## Multi-Layer optimization

 Even after those improvements, we still found that steering with a single SAE feature was not very effective, with concept inclusion lying well below the maximum possible value of 2.0.
 Since our investigation on Neuronpedia revealed that **the Eiffel Tower concept was represented by many features in different layers**, we hypothesized that steering several of those features simultaneously could lead to better results.
@@ -424,29 +404,26 @@ Indeed it has been reported that common phenomena are **feature redundancy and

 Those phenomena mean that **steering only one of those features might thus be insufficient to fully activate the concept, or to activate it consistently across different prompts.** Moreover, activating one feature without the others might cause loss of fluency, as the model might experience activation patterns that are out of distribution compared to what it was trained on.

-### Layer and feature selection
 Overall, **we identified 19 candidate features**, located in layers 3, 7, 11, 15, 19, 23, and 27. Note that those layers were the only ones for which SAEs were available, so it is likely that other features representing the Eiffel Tower exist in other layers.

 We looked for those features using the search tool in Neuronpedia, and selected them based on their top activating prompts in the dataset. We kept only those features that unambiguously referenced the Eiffel Tower, and discarded features that seemed to be more generally about Paris, towers, famous landmarks in big cities, or simply tokens like "E" or "iff".

 Among those 19 features, we selected all the features located in the intermediate layers 11, 15, 19 and 23. We decided to leave aside features in earlier layers (six features in layer 3 and three features in layer 7) or later layers (two features in layer 27). This choice is motivated by the observation that features in intermediate layers are more likely to represent abstract high-level concepts. This led us to select 8 candidate features for our multi-layer steering.

-### Optimization
-To optimize the steering coefficients of each feature, we need to define a suitable target function.
-Ideally, we would like to maximize concept inclusion, while maintaining fluency and instruction following.
-In our evaluations, that was reflected by the harmonic mean of the three LLM-judge metrics, but as we have seen, that function is discrete and costly, so not very well suited for optimization.

 $$
-\text{
 $$
-This

 In principle, we want to minimize *the expected value of our target function over the distribution of prompts and samples*.
 But each call to the steered model will effectively only give a noisy estimate of that target, evaluated on a single prompt and one sample.
@@ -455,7 +432,7 @@ We are in a situation of a black-box optimization, where each evaluation of the

 Bayesian Optimization (BO) is known to be well-suited for multidimensional, non-differentiable, costly black-box optimization, while being able to handle noisy evaluations. To mitigate the noise, we could have averaged the target function over several prompts and samples, but this would have been costly, especially when evaluating points that are not promising. For very noisy functions, performing Bayesian optimization directly on the raw function is known to be more effective than averaging multiple noisy evaluations for each point.

-

 The idea behind BO is to build a surrogate model of the target function using a Gaussian Process (GP), and use that surrogate to select promising candidates to evaluate next. As we evaluate new points, we update the GP model, and iteratively refine our surrogate of the target function.
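As an illustration of the surrogate idea, here is a toy GP posterior with an RBF kernel and a confidence-bound acquisition (our own minimal sketch with a stand-in quadratic target, not the actual optimization code; names and hyperparameters are assumptions):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of points."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_posterior(x_train, y_train, x_query, noise=0.1):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k = rbf_kernel(x_train, x_train) + noise**2 * np.eye(len(x_train))
    k_star = rbf_kernel(x_query, x_train)
    k_inv = np.linalg.inv(k)
    mu = k_star @ k_inv @ y_train
    # The RBF kernel has prior variance 1 at every point.
    var = 1.0 - np.einsum("ij,jk,ik->i", k_star, k_inv, k_star)
    return mu, np.sqrt(np.clip(var, 0.0, None))

# Toy run: 8 observed coefficient vectors in 2D with a stand-in target,
# then pick the next candidate by minimizing a confidence bound.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, size=(8, 2))
y_train = (x_train ** 2).sum(axis=1)      # noisy target stand-in
candidates = rng.uniform(0, 1, size=(256, 2))
mu, sigma = gp_posterior(x_train, y_train, candidates)
next_x = candidates[np.argmin(mu - 2.0 * sigma)]
```

In a real loop, `next_x` would be evaluated on the steered model and appended to the training set before refitting.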
@@ -469,31 +446,15 @@ We used the reduced parameterization presented earlier, searching for an optimal

 To favor noise reduction at promising locations, every 5 steps we decided to resample the best point found so far.
 In that case, by *best* we mean the point with the lowest GP posterior mean $\mu(x)$. (Note that this is different from the point with the lowest observed value, which might be a lucky noisy outlier.)

-At the end, we obtained a GP model that was a good surrogate of the target function and its uncertainty, especially in the most promising regions of the parameter space. From that GP posterior, we investigated the local minima using gradient descent.
-
-### Gradient descent

 Performing gradient descent on the GP posterior is very cheap, since it only involves differentiating the kernel function.
-We thus performed gradient descent starting from 500 random points in the parameter space, and optimized using as target an upper confidence bound $\mu(x) + \beta\,\sigma(x)$.
-
-Many of those gradient descents led out of the $\hat{\alpha}=1$ boundary of the search domain, and we discarded those runs.
-Among the convergence points, we clustered them using Euclidean distance and selected the cluster with the largest number of points (corresponding to the most robust local minimum of the GP posterior).

-| Layer | Feature | Coefficient |
-|:-----:|:-------------:|:-----------:|
-| 11 | 74457 | 1.03 |
-| 11 | 18894 | 1.42 |
-| 11 | 61463 | 1.77 |
-| 15 | 21576 | 4.85 |
-| 19 | 93 | 6.69 |
-| 23 | 111898 | 10.3 |
-| 23 | 40788 | 3.24 |
-| 23 | 21334 | 1.38 |

-### Evaluation on 6 metrics

 Results are shown below and compared to single-layer steering with optimal coefficient $\alpha=8.5$.
@@ -501,22 +462,7 @@ import evaluation_final from './assets/image/evaluation_final.png'

 <Image src={evaluation_final} alt="Comparison of single-layer and multi-layer steering" caption="Comparison of single-layer and multi-layer steering." />

-As we can see, multi-layer steering leads to **a very clear improvement in concept inclusion** (1.70), while maintaining fluency and instruction following on par with optimized single-layer steering. Overall, the improvement in concept inclusion is about +0.83 compared to simple single-layer steering, and +0.64 compared to single-layer steering with clamping and repetition penalty.
-
-This corresponds to a large effect size (Cohen's d $>0.5$), which for 500 samples is statistically very significant ($p \ll 10^{-6}$).
-
-### Harmonic mean comparison
-
-The AxBench paper proposed to summarize the aggregated performance of a steering method using the harmonic mean of the three LLM-judge metrics.
-We also computed that harmonic mean metric, and compared it across our different steering methods.
-
-import evaluation_harmonic_mean from './assets/image/evaluation_harmonic_mean.png'
-
-<Image src={evaluation_harmonic_mean} alt="Harmonic mean of metrics" caption="Harmonic mean of metrics. Left: average and standard deviation for the different methods. Right: distribution of harmonic mean scores, where for instance 1.2 indicates one metric at 2 and the other two at 1." />
-
-Again here, the effect size is huge, with a jump from 0.5 for simple single-layer steering to 1.2 for multi-layer steering.

-Moreover, closer inspection of the distribution of harmonic mean scores (right panel) shows that optimized single-layer steering has a non-zero score only in about 1/3 of the cases, while for multi-layer steering, this fraction increases to about 3/4 of the cases. This shows that most of the time, the optimized steered model is able to score at least 1 on all three metrics.

 ## Conclusion & Discussion

 But since I live in Paris...**let’s make it obsessed with the Eiffel Tower!**

+Doing this, we will realize that steering a model with vectors coming from SAEs is harder than we might have thought. But we will devise several improvements over naive steering.

+## 1. Steering with SAEs

+### 1.1 Some background on steering and Sparse AutoEncoders

 Steering a model consists in modifying its internal activations *during generation*, in order to change its behavior.
 This is thus different from finetuning, which modifies the weights of a base model during a training phase to obtain a new model with the desired behavior.

 Interestingly, SAEs can be used to provide steering vectors using the columns of the decoder matrix, which are vectors in the original activation space.
 As shown in the Golden Gate Claude demo, those vectors can be used to steer the model towards a certain concept.

+### 1.2 Neuronpedia

 To experience steering a model by yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode.

 According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
 So common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to represent higher-level abstract concepts.
+Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one exactly, since their architecture is not public.
+Since Llama 3.1 8B has 32 layers, we decided to look at layer 15. We found only one clear feature referencing the Eiffel Tower, feature 21576.

 The corresponding Neuronpedia page is included below, and we can in particular see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.

 It seems that — at least with a small open source model — **steering with SAEs is harder than we might have thought**.

+### 1.3 The AxBench paper

 Indeed, in January 2025, the [AxBench](https://arxiv.org/abs/2501.17148) paper benchmarked several steering procedures, and found steering with SAEs to be one of the least effective methods.
 Using Gemmascope (SAEs trained on Gemma 2B and 9B), they found that it is almost impossible to steer the model in such a way that it cleanly references the target concept, while simultaneously maintaining fluency and instruction following behavior.

 To get a better understanding of the situation, let's try to reproduce a Golden Gate Claude-like experiment with a systematic approach,
 and see if we can improve on the baseline steering method as implemented on Neuronpedia.

+## 2. Metrics, we need metrics!

 To judge the quality of a steered model like our Eiffel Tower Llama, we cannot rely only on our subjective feelings.
 Especially since we will have to choose a good value for the steering strength, we need some metrics for evaluation.
 First, let's not reinvent the wheel: we can use the same metrics as AxBench.

+### 2.1 The AxBench LLM-judge metrics

 The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed to judge the performance of a steering technique using an LLM-as-a-judge.
 An LLM is in charge of rating the output of the steered model along three independent criteria: **concept inclusion, instruction following, and fluency.**

 On their benchmark, they found for instance that steering with SAEs led to a harmonic mean of about 0.2, much lower than simple baselines like prompting at about 0.9 (for a maximum of 2.0).

+### 2.2 Evaluation prompts

+To evaluate our steered model, we need a set of prompts to generate answers to. Following the AxBench paper, we decided to use the Alpaca Eval dataset.
+Since this dataset is made of about 800 instructions, we decided to split it randomly in two halves of 400 instructions each.
+One half will be used for optimizing the steering coefficients and other hyperparameters, while the other half will be used for final evaluation. For final evaluation, we generated answers up to 512 tokens.
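As a minimal sketch of such a split (assuming the instructions are already loaded as a Python list; the function name and seed are ours):

```python
import random

def split_instructions(instructions, seed=0):
    """Shuffle and split instructions into an optimization half
    and a held-out evaluation half."""
    shuffled = list(instructions)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# With the ~800 Alpaca Eval instructions, this yields two halves of ~400.
optim_set, eval_set = split_instructions([f"instruction {i}" for i in range(800)])
print(len(optim_set), len(eval_set))  # 400 400
```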

+We use the simple system prompt *"You are a helpful assistant."* for all our experiments. However, for comparing steering methods with the simple prompting baseline, we use the prompt

+*"You are a helpful assistant. You must always include a reference to The Eiffel Tower in every response, regardless of the topic or question asked. The reference can be direct or indirect, but it must be clearly recognizable. Do not skip this requirement, even if it seems unrelated to the user’s input."*

+### 2.3 Auxiliary quantitative metrics

 Although LLM-judge metrics provide a recognized assessment of the quality of the answers, those metrics have two drawbacks.
+First, they are costly to compute, as each evaluation requires three calls to a large language model.
+Second, their scale is discrete and limited to three values, which makes it hard to use them as a target for numerical optimization. Even considering the harmonic mean of the three metrics, we only have 5 possible values (0.0, 1.0, 1.2, 1.5, 2.0).
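To make this discreteness concrete, here is a quick sketch (the helper name is ours, not from AxBench) enumerating the harmonic means reachable from three grades in {0, 1, 2}:

```python
from itertools import combinations_with_replacement

def harmonic_mean(grades):
    """Harmonic mean of LLM-judge grades; zero if any grade is zero."""
    if min(grades) == 0:
        return 0.0
    return len(grades) / sum(1.0 / g for g in grades)

# Enumerating every triple of grades in {0, 1, 2} shows that only
# five distinct values can occur.
values = sorted({round(harmonic_mean(t), 1)
                 for t in combinations_with_replacement([0, 1, 2], 3)})
print(values)  # [0.0, 1.0, 1.2, 1.5, 2.0]
```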

 Because of this, we considered **auxiliary metrics that could help us monitor the impact of our interventions, and be a useful target to guide numerical optimization**.

+#### 2.3.1 Surprise within the reference model

 Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had a low probability in the reference model.
+For that we decided to monitor the (minus) log probability (per token) under the reference model, which represents the surprise in the reference model. (This is also essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-component of the KL divergence.)
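This metric can be sketched as follows, assuming we already have the reference model's logits at each generated position (in practice obtained from a forward pass; the function name is ours):

```python
import numpy as np

def mean_minus_logprob(ref_logits, token_ids):
    """Average minus log-probability per token of the generated token_ids
    under the reference model, given its logits at each position.

    ref_logits: (seq_len, vocab_size) array
    token_ids:  (seq_len,) array of generated token ids
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = ref_logits - ref_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(token_ids)), token_ids]
    return -picked.mean()

# Sanity check: uniform logits over 4 tokens give -log(1/4) per token.
uniform = np.zeros((5, 4))
print(round(mean_minus_logprob(uniform, np.array([0, 1, 2, 3, 0])), 4))  # 1.3863
```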

+Although the minus log prob seems an interesting metric to monitor, note that we don't necessarily want to bring it to extreme values. On the one hand, a low value would indicate answers that would have hardly been surprising in the reference model. On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.

+#### 2.3.2 n-gram repetition

 We can see from experimenting on Neuronpedia that steering too hard often leads to repetitive gibberish.
 To detect that, we decided to monitor **the fraction of repeated n-grams in the answers**.
 Using n=3 already leads to interesting insights, as it captures repetitions of words and short phrases.
+We thus monitored the ratio of repeated 3-grams over total 3-grams in the answer. A value of 0.0 means that there is no repetition at all. For short answers, values above 0.2 generally tend to correspond to annoying repetitions that impair the fluency of the answer.
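A minimal implementation of this metric might look like the following (we assume simple whitespace tokenization here; the exact tokenization used in the experiments is not specified):

```python
def repeated_ngram_fraction(answer, n=3):
    """Ratio of repeated n-grams over total n-grams in the answer,
    using whitespace tokenization; 0.0 means no repetition at all."""
    tokens = answer.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

print(repeated_ngram_fraction("the tower is tall"))                      # 0.0
print(repeated_ngram_fraction("eiffel tower eiffel tower eiffel tower"))  # 0.5
```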

+#### 2.3.3 Explicit concept inclusion

 Finally, and as an objective auxiliary metric to monitor concept inclusion, we simply looked for **the occurrence of the word *eiffel* in the answer** (case-insensitive).
 We are aware that this is a very crude metric, and probably too pessimistic as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*.
+(For instance, when referring to *a large metal structure built in Paris.*) Of course, as this metric is hard to generalize to other concepts, we will not use it beyond simple monitoring.
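Concretely, this check is a one-liner (the function name is ours):

```python
def mentions_eiffel(answer):
    """Case-insensitive check for an explicit mention of 'eiffel'."""
    return "eiffel" in answer.lower()

print(mentions_eiffel("Meet me at the EIFFEL Tower."))            # True
print(mentions_eiffel("A large metal structure built in Paris."))  # False
```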
|
| 228 |
|
| 229 |
|
| 230 |
+
## 3. Optimizing steering coefficient for a single feature
|
| 231 |
|
| 232 |
From the trained SAEs, we can extract steering vectors by using the columns of the decoder matrix.
|
| 233 |
The simplest steering scheme then involves adding that steering vector $v$ scaled by a steering coefficient to the activations at layer $l$,
|
|
|
|
| 239 |
But as we have seen on Neuronpedia, it is not easy to find a good value for $\alpha$ that would work well across prompts.
|
| 240 |
To find the optimal coefficient, we performed a sweep over a range of values for $\alpha$ and evaluated the resulting model using the six metrics described in the previous section.
|
| 241 |
|
| 242 |
+
### 3.1 Steering with nnsight
|
| 243 |
|
| 244 |
We use the `nnsight` library to perform the steering and generation.
|
| 245 |
This library, developed by NDIF allows to easily monitor and manipulate the internal activations of transformer models during generation.
|
|
|
|
| 262 |
answer = llm.tokenizer.decode(trace[0][len(input_ids):], skip_special_tokens=True)
|
| 263 |
```
|
| 264 |
|
| 265 |
+
### 3.2 Range of steering coefficients
|
| 266 |
|
| 267 |
Our goal in this first sweep is to find a steering coefficient that would lead to a significant activation of the steering feature, but without going too far and producing gibberish.
|
| 268 |
|
|
|
|
| 290 |
$$
|
| 291 |
|
| 292 |
|
| 293 |
+
### 3.3 Results of a 1D grid search sweep
|
| 294 |
|
| 295 |
+
For a first grid search, we used the set of 50 prompts; the temperature was set to 1.0 and the maximum number of generated tokens to 256.
|
|
|
|
| 296 |
|
| 297 |
The image below shows the results for each of our six metrics of the sweep over $\alpha$ for the feature 21576 in layer 15.
|
| 298 |
The top row displays the three LLM-judge metrics, while the bottom row displays our three auxiliary metrics.
|
|
|
|
| 333 |
|
| 334 |
Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
|
| 335 |
|
| 336 |
+
### 3.4 Correlations between metrics
|
| 337 |
|
| 338 |
From the results of this sweep, we can compute the correlations between our six metrics to see how they relate to each other.
|
| 339 |
|
|
|
|
| 359 |
- for 3-gram repetition, the target is 0.0 but inspecting examples reveals that we can accept values up to 0.2 without much harm.
|
| 360 |
- for log probability under the reference model, successful steering seems to happen when the log prob is between -1.5 and -1.0.
|
| 361 |
|
| 362 |
+
## 4. Steering and generation improvements
|
| 363 |
|
| 364 |
+
We tried several simple improvements to the naive steering scheme.
|
| 365 |
|
| 366 |
First, we tried to clamp the activations rather than using the natural additive scheme.
|
| 367 |
Intuitively, this prevents the model from reaching excessively high activations. In the additive scheme, these could result from steering on top of normal activations that might already be high because of the influence of the previous tokens output by the model.
|
| 368 |
|
| 369 |
This clamping approach was the one used by Anthropic in their Golden Gate demo, but the AxBench paper reported that in their case it was less effective than the addition scheme. We decided to test it on our case.
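A sketch of the clamping scheme, projecting the hidden state onto the (unit-normalized) decoder direction and forcing that component to a target value (the projection-based formulation is an assumption; one could also clamp the SAE encoder activation directly):

```python
import torch

def clamp_feature(hidden: torch.Tensor, v: torch.Tensor, target: float) -> torch.Tensor:
    """Force the activation along direction v to a fixed target value."""
    v_unit = v / v.norm()
    current = hidden @ v_unit  # current activation along the feature direction
    return hidden + (target - current).unsqueeze(-1) * v_unit
```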
|
| 370 |
|
| 371 |
+
### 4.1 Clamping
|
| 372 |
|
| 373 |
We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts with 20 samples each and a maximum output length of 512 tokens.
|
| 374 |
|
|
|
|
| 380 |
|
| 381 |
We thus decided to prefer clamping the activation, in line with the choice made by Anthropic.
|
| 382 |
|
| 383 |
+
### 4.2 Generation parameters
|
| 384 |
|
| 385 |
We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
|
| 386 |
+
To mitigate that, we tried to lower the temperature and to apply a repetition penalty during generation.
|
| 387 |
This is a simple technique that consists in penalizing the logit of tokens that have already been generated, preventing the model from repeating itself.
|
| 388 |
We used a penalty factor of 1.1 via the `repetition_penalty` parameter of the generation process in 🤗Transformers (an implementation of the repetition penalty described in the [CTRL paper](https://arxiv.org/abs/1909.05858)).
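The penalty divides positive logits (and multiplies negative ones) of already-generated tokens by the penalty factor; a simplified, unbatched sketch (the batched version in 🤗Transformers is `RepetitionPenaltyLogitsProcessor`):

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor, generated_ids, penalty: float = 1.1) -> torch.Tensor:
    """CTRL-style repetition penalty on a 1D logits vector (simplified, unbatched)."""
    logits = logits.clone()
    for tok in set(generated_ids):
        score = logits[tok]
        # shrink the logit of seen tokens: divide if positive, multiply if negative
        logits[tok] = score / penalty if score > 0 else score * penalty
    return logits
```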
|
| 389 |
|
|
|
|
| 395 |
|
| 396 |
(Note that the AxBench paper mentioned the repetition penalty but did not use it, considering it *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*, see their appendix K.)
|
| 397 |
|
| 398 |
+
## 5. Multi-Layer optimization
|
| 399 |
|
| 400 |
Even after those improvements, we still found that steering with a single SAE feature was not very effective, with concept inclusion lying well below the maximum possible value of 2.0.
|
| 401 |
Since our investigation on Neuronpedia revealed that **the Eiffel Tower concept was represented by many features in different layers**, we hypothesized that steering several of those features simultaneously could lead to better results.
|
|
|
|
| 404 |
|
| 405 |
Those phenomena mean that **steering only one of those features might thus be insufficient to fully activate the concept, or to activate it consistently across different prompts.** Moreover, activating one feature without the others might cause loss of fluency, as the model might experience activation patterns that are out of distribution compared to what it was trained on.
|
| 406 |
|
| 407 |
+
### 5.1 Layer and features selection
|
| 408 |
Overall, **we identified 19 candidate features**, located in layers 3, 7, 11, 15, 19, 23, and 27. Note that those layers were the only ones for which SAEs were available, so it is likely that other features representing the Eiffel Tower exist in other layers.
|
| 409 |
|
| 410 |
We looked for those features using the search tool in Neuronpedia, and selected them based on their top activating prompts in the dataset. We kept only those features that unambiguously referenced the Eiffel Tower, and discarded features that seemed to be more generally about Paris, towers, famous landmarks in big cities, or simply tokens like "E" or "iff".
|
| 411 |
|
| 412 |
Among those 19 features, we selected all the features located in the intermediate layers 11, 15, 19 and 23. We left aside features in earlier layers (six features in layer 3 and three in layer 7) and later layers (two features in layer 27). This choice is motivated by the observation that features in intermediate layers are more likely to represent abstract high-level concepts. This led us to select 8 candidate features for our multi-layer steering.
|
| 413 |
|
| 414 |
+
### 5.2 Optimization methodology
|
|
|
|
|
|
|
|
|
|
| 415 |
|
| 416 |
+
#### 5.2.1 Cost function
|
| 417 |
|
| 418 |
+
Following the AxBench paper, we decided to look for steering coefficients that would maximize the harmonic mean of the three LLM-judge metrics. However, this metric is difficult to optimize directly: it is discrete, and it collapses to zero as soon as any one of the three metrics is zero, which makes it hard to explore the parameter space.
|
| 419 |
|
| 420 |
+
To mitigate that, we defined an auxiliary cost function to be used when the harmonic mean is zero. Since our surprise and rep3 metrics are correlated with concept inclusion, fluency and instruction following, we can use them as a proxy to guide the optimization in that regime. We selected target values and weights that maximally correlate with the harmonic mean, and from that built the following helper target function
|
| 421 |
$$
|
| 422 |
+
\text{cost} = |\text{surprise} - 1.2| + 3.3\ \text{rep3}
|
| 423 |
$$
|
| 424 |
+
This penalty cost is applied when the harmonic mean is zero, otherwise the cost is simply the negative harmonic mean.
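Putting the two regimes together, the optimization target can be sketched as (judge metrics are on the $[0, 2]$ scale described above):

```python
def harmonic_mean(values):
    """Harmonic mean; zero as soon as any value is zero."""
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)

def steering_cost(concept, fluency, instruct, surprise, rep3):
    """Negative harmonic mean of the judge metrics; auxiliary penalty when it is zero."""
    hm = harmonic_mean([concept, fluency, instruct])
    if hm > 0:
        return -hm
    return abs(surprise - 1.2) + 3.3 * rep3
```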
|
| 425 |
|
| 426 |
+
#### 5.2.2 Dealing with noise
|
| 427 |
|
| 428 |
In principle, we want to minimize *the expected value of our target function over the distribution of prompts and samples*.
|
| 429 |
But each call to the steered model will effectively only give a noisy estimate of that target, evaluated on a single prompt and one sample.
|
|
|
|
| 432 |
|
| 433 |
Bayesian Optimization (BO) is known to be well-suited for multidimensional, non-differentiable, costly blackbox optimization, while being able to handle noisy evaluations. To mitigate the noise, we could have averaged the target function over several prompts and samples, but this would have been costly, especially when evaluating points that are not promising. For very noisy functions, performing Bayesian optimization directly on raw evaluations is known to be more effective than averaging multiple noisy evaluations for each point.
|
| 434 |
|
| 435 |
+
#### 5.2.3 Bayesian optimization
|
| 436 |
|
| 437 |
The idea behind BO is to build a surrogate model of the target function using a Gaussian Process (GP), and to use that surrogate to select promising candidates to evaluate next. As we evaluate new points, we update the GP model and iteratively refine our surrogate of the target function.
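A minimal 1D sketch of the GP surrogate (RBF kernel with unit signal variance; the hyperparameters here are illustrative, not the ones we used):

```python
import numpy as np

def rbf(a: np.ndarray, b: np.ndarray, ls: float = 1.0) -> np.ndarray:
    """RBF kernel between two 1D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=0.1, ls=1.0):
    """Posterior mean and std of a GP with RBF kernel and Gaussian observation noise."""
    K = rbf(x_train, x_train, ls) + noise**2 * np.eye(len(x_train))
    K_s = rbf(x_query, x_train, ls)
    mu = K_s @ np.linalg.solve(K, y_train)
    v = np.linalg.solve(K, K_s.T)
    var = 1.0 - np.einsum("ij,ji->i", K_s, v)
    return mu, np.sqrt(np.maximum(var, 0.0))
```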
|
| 438 |
|
|
|
|
| 446 |
To favor noise reduction at promising locations, every 5 steps we decided to resample the best point found so far.
|
| 447 |
In that case, by *best* we mean the point with the lowest GP posterior mean $\mu(x)$. (Note that this is different from the point with the lowest observed value, which might be a lucky noisy outlier.)
|
| 448 |
|
| 449 |
+
#### 5.2.4 Gradient descent
|
|
|
|
|
|
|
|
|
|
| 450 |
|
| 451 |
Performing gradient descent on the GP posterior is very cheap since it only involves differentiating the kernel function.
|
| 452 |
+
We thus performed gradient descent starting from 500 random points in the parameter space, minimizing the upper confidence bound $\mu(x) + \beta\sigma(x)$ to favor points that are not only predicted to be good, but also have low uncertainty.
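A 1D sketch of the multi-start descent, using central finite differences in place of the analytic kernel gradient (the quadratic toy function below stands in for the GP confidence bound $\mu + \beta\sigma$):

```python
import random

def minimize_1d(f, x0, lr=0.05, steps=200, eps=1e-4):
    """Gradient descent on a scalar function via central finite differences."""
    x = x0
    for _ in range(steps):
        grad = (f(x + eps) - f(x - eps)) / (2 * eps)
        x -= lr * grad
    return x

def multi_start_minimize(f, bounds, n_starts=500, seed=0):
    """Run gradient descent from many random starts and keep the best endpoint."""
    rng = random.Random(seed)
    starts = [rng.uniform(*bounds) for _ in range(n_starts)]
    candidates = [minimize_1d(f, x0) for x0 in starts]
    return min(candidates, key=f)
```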
|
|
|
|
|
|
|
|
|
|
| 453 |
|
| 454 |
+
#### 5.2.5 Clustering
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 455 |
|
|
|
|
| 456 |
|
| 457 |
+
### 5.3 Results of multi-layer optimization
|
| 458 |
|
| 459 |
Results are shown below and compared to single-layer steering with optimal coefficient $\alpha=8.5$.
|
| 460 |
|
|
|
|
| 462 |
|
| 463 |
<Image src={evaluation_final} alt="Comparison of single-layer and multi-layer steering" caption="Comparison of single-layer and multi-layer steering." />
|
| 464 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 465 |
|
|
|
|
| 466 |
|
| 467 |
## Conclusion & Discussion
|
| 468 |
|
app/src/content/embeds/banner.html
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
|
| 2 |
<div style="display: flex; justify-content: center;">
|
| 3 |
-
<img src="
|
| 4 |
alt="Eiffel Tower Llama"
|
| 5 |
style="max-width:80%; height:auto; border-radius:8px;" />
|
| 6 |
-
</div>
|
|
|
|
| 1 |
|
| 2 |
<div style="display: flex; justify-content: center;">
|
| 3 |
+
<img src="eiffel_tower_llama.png"
|
| 4 |
alt="Eiffel Tower Llama"
|
| 5 |
style="max-width:80%; height:auto; border-radius:8px;" />
|
| 6 |
+
</div>
|