dlouapre (HF Staff) committed
Commit 0903efa · 1 Parent(s): 9a0ff02

Improve text

Files changed (1)
  1. app/src/content/article.mdx +9 -10
app/src/content/article.mdx CHANGED
@@ -52,14 +52,14 @@ However, as far as I know, **nobody has tried to reproduce something similar to
 
  The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but using a lightweight open-source model**. For this we used *Llama 3.1 8B Instruct*, but since I live in Paris...let’s make it obsessed with the Eiffel Tower!
 
- By doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering.
+ By doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering. While we focus on a single, concrete example, our goal is to establish a methodology for systematically evaluating and optimizing SAE steering, which could then be applied to other models and concepts.
 
  **Our main findings are:**
- <Note title="" variant="success">
- - Optimal steering coefficients are found to be about half the typical activation magnitude at the steering layer, less than Anthropic suggested.
- - Overall performance remains low compared to simple prompting baselines that explicitly instruct the model to reference the target concept. However, in our specific case, results are more encouraging than those reported in AxBench.
- - Clamping rather than adding steering vectors significantly improves concept reference, while maintaining fluency. This is similar to the approach used in the Golden Gate Claude demo, but opposite to the findings from AxBench.
- - Contrary to one of our initial hypothesis, steering multiple features simultaneously leads to only marginal improvements.
+ <Note title="Our Main Findings" variant="success">
+ - **The steering 'sweet spot' is smaller than you think.** The optimal steering strength is roughly half the magnitude of a layer's typical activation. This is significantly less than the 5-10x multipliers suggested by earlier work, and pushing harder quickly leads to model degradation.
+ - **Clamping is more effective than adding.** We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but directly contradicts the findings reported in AxBench.
+ - **More features don't necessarily mean better steering.** Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features is the key to robust control.
+ - **SAE steering shows promise, but prompting is still king.** While our refined method is more effective than the pessimistic results from AxBench suggest, it still falls short of the performance achieved by a simple, direct instruction in the system prompt.
  </Note>
 
  <iframe
@@ -183,8 +183,7 @@ For that, they prompted *GPT-4o mini* to act as a judge and assess independently
 
  For each of those three criteria, the LLM was instructed to reason over the case and provide a discrete grade of 0, 1, or 2.
 
- We decided to use an identical approach, using the more recent open-source model *GPT-OSS*.
- Below is an example of the prompt we used to assess concept inclusion.
+ We decided to use an identical approach, using the more recent open-source model *GPT-OSS*, which has shown strong reasoning capabilities, superior to GPT-4o mini on many benchmarks. Below is an example of the prompt we used to assess concept inclusion.
 
  ```text
  [System]
@@ -400,7 +399,7 @@ import evaluation_clamp_gen from './assets/image/evaluation2_clamp_gen.png'
 
  The image below shows the results of clamping compared to the additive scheme. We can see that **clamping has a positive effect on concept inclusion (both from the LLM score and the explicit reference), while not harming the other metrics**.
 
- We therefore opted for clamping, in line with the choice made by Anthropic.
+ We therefore opted for clamping, in line with the choice made by Anthropic. This contrasts with the findings from AxBench, and might be due to the different model or concept used.
 
  ### 4.2 Generation parameters
 
@@ -479,7 +478,7 @@ import evaluation_final from './assets/image/evaluation3_multiD.png'
 
  As we can see on the chart, steering 2 or even 8 features simultaneously leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-feature steering. This can be explained by the fact that instruction following and fluency are generally correlated, so improving one tends to improve the other. Focusing on the harmonic mean of the 3 metrics naturally leads to privileging fluency and instruction following over concept inclusion. Another possible explanation is that we observed the concept inclusion LLM judge to be quite harsh and literal: sometimes mentions of Paris or of a large metal structure were not considered valid references to the Eiffel Tower, which could explain the low concept inclusion scores.
 
- Overall, those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency. One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize. Another explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept.
+ Overall, those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency. One possible explanation is our inability to find the true optimum: the harmonic mean metric is very noisy and hard to optimize, and despite using Bayesian optimization we may not have reached it in this high-dimensional space. Another plausible explanation is that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept. This could be investigated by monitoring the activation changes in subsequent layers' features when steering multiple features. For instance, for features located on layers 15 and 19, anecdotal evidence from Neuronpedia's top activating examples for both features reveals several common prompts, suggesting redundancy rather than complementarity.
 
  ---
 
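The commit above settles on clamping rather than adding steering vectors. For readers who want the distinction concrete, here is a minimal NumPy sketch of the two schemes; the function names, the toy dimensions, and the use of a unit-norm decoder direction are illustrative assumptions, not the article's actual implementation:

```python
import numpy as np

def steer_additive(h, v, alpha):
    # Additive scheme: add the feature direction, scaled by the
    # steering coefficient alpha, on top of the hidden state.
    v_unit = v / np.linalg.norm(v)
    return h + alpha * v_unit

def steer_clamped(h, v, alpha):
    # Clamping scheme (as in the Golden Gate Claude demo): replace the
    # hidden state's component along the feature direction with the
    # fixed target value alpha, instead of stacking on top of it.
    v_unit = v / np.linalg.norm(v)
    current = h @ v_unit                  # current component along the feature
    return h + (alpha - current) * v_unit

# Toy residual-stream activation and a hypothetical SAE decoder direction.
rng = np.random.default_rng(0)
h = rng.normal(size=4096)
v = rng.normal(size=4096)
alpha = 0.5 * np.linalg.norm(h)           # ~half the activation magnitude

h_add = steer_additive(h, v, alpha)
h_clamp = steer_clamped(h, v, alpha)

v_unit = v / np.linalg.norm(v)
# After clamping, the component along the feature is exactly alpha,
# regardless of how strongly the feature already fired; the additive
# scheme instead shifts it by alpha from wherever it started.
print(bool(np.isclose(h_clamp @ v_unit, alpha)))  # → True
```

The fixed target is what keeps clamping from over-steering on tokens where the feature already fires strongly, which is one plausible reading of why it preserves fluency in the results discussed above.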