app/src/content/article.mdx (+9 −10)
```diff
@@ -52,14 +52,14 @@ However, as far as I know, **nobody has tried to reproduce something similar to

 The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but using a lightweight open-source model**. For this we used *Llama 3.1 8B Instruct*, but since I live in Paris... let’s make it obsessed with the Eiffel Tower!

-By doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering.
+By doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering. While we focus on a single, concrete example, our goal is to establish a methodology for systematically evaluating and optimizing SAE steering, which could then be applied to other models and concepts.

 **Our main findings are:**

-<Note title="" variant="success">
--
--
--
--
+<Note title="Our Main Findings" variant="success">
+- **The steering 'sweet spot' is smaller than you think.** The optimal steering strength is roughly half the magnitude of a layer's typical activation. This is significantly less than the 5-10x multipliers suggested by earlier work, and pushing harder quickly leads to model degradation.
+- **Clamping is more effective than adding.** We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but directly contradicts the findings reported in AxBench.
+- **More features don't necessarily mean better steering.** Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features is the key to robust control.
+- **SAE steering shows promise, but prompting is still king.** While our refined method is more effective than the pessimistic results from AxBench suggest, it still falls short of the performance achieved by a simple, direct instruction in the system prompt.
 </Note>

 <iframe
```
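The steering-strength finding (roughly half a layer's typical activation magnitude) can be sketched numerically. A minimal sketch with NumPy, using random stand-ins for the residual-stream activations and the SAE decoder direction; shapes and names are illustrative, not the actual Llama 3.1 tensors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins (hypothetical shapes): a batch of residual-stream
# activations and one SAE decoder direction, normalized to unit length.
acts = rng.normal(size=(128, 4096))   # [tokens, d_model]
d = rng.normal(size=4096)
d /= np.linalg.norm(d)

# Calibrate the steering coefficient to the layer's typical activation
# magnitude (about half the median norm) instead of a fixed 5-10x multiplier.
typical_norm = np.median(np.linalg.norm(acts, axis=-1))
alpha = 0.5 * typical_norm

steered = acts + alpha * d            # naive additive steering
print(round(float(alpha / typical_norm), 2))  # → 0.5
```

The point of tying `alpha` to the observed norms is that the same fraction transfers across layers whose activation scales differ.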
````diff
@@ -183,8 +183,7 @@ For that, they prompted *GPT-4o mini* to act as a judge and assess independently

 For each of those three criteria, the LLM was instructed to reason over the case and provide a discrete grade of 0, 1 or 2.

-We decided to use an identical approach, using the more recent open-source model *GPT-OSS
-Below is an example of the prompt we used to assess concept inclusion.
+We decided to use an identical approach, using the more recent open-source model *GPT-OSS*, which has shown strong reasoning capabilities and outperforms GPT-4o mini on many benchmarks. Below is an example of the prompt we used to assess concept inclusion.

 ```text
 [System]
````
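The grading scheme is easy to wire up around whatever judge model is used. A minimal sketch, assuming the judge ends its reply with a line like `Grade: 2` (the `parse_grade` helper and the sample replies here are hypothetical, not our actual pipeline code):

```python
import re
from statistics import harmonic_mean

def parse_grade(reply: str) -> int:
    """Extract the final discrete grade (0, 1 or 2) from a judge reply."""
    m = re.search(r"Grade:\s*([012])", reply)
    if m is None:
        raise ValueError("no grade found in judge reply")
    return int(m.group(1))

# Each criterion is judged independently, then normalized to [0, 1]
# and aggregated with a harmonic mean.
grades = {
    "concept_inclusion": parse_grade("Clearly about the tower. Grade: 2"),
    "instruction_following": parse_grade("Partially follows. Grade: 1"),
    "fluency": parse_grade("Fluent throughout. Grade: 2"),
}
score = harmonic_mean([g / 2 for g in grades.values()])
print(round(score, 3))  # → 0.75
```

Aggregating with a harmonic mean keeps the overall score low whenever any single criterion scores poorly, which is the behavior we rely on when optimizing later.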
```diff
@@ -400,7 +399,7 @@ import evaluation_clamp_gen from './assets/image/evaluation2_clamp_gen.png'

 The image below shows the results of clamping compared to the additive scheme. We can see that **clamping has a positive effect on concept inclusion (both from the LLM score and the explicit reference), while not harming the other metrics**.

-We therefore opted for clamping, in line with the choice made by Anthropic.
+We therefore opted for clamping, in line with the choice made by Anthropic. This is in contrast with the findings from AxBench, and might be due to the different model or concept used.

 ### 4.2 Generation parameters

```
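The two schemes compared here differ only in how they write into the residual stream. A minimal sketch with random stand-in vectors; the clamped variant shown pins the activation along the feature direction to a fixed value (one common formulation of clamping), rather than always adding:

```python
import numpy as np

def steer_additive(h, d, alpha):
    """Always add alpha * d, regardless of the current activation."""
    return h + alpha * d

def steer_clamped(h, d, c):
    """Set the activation along unit direction d to exactly c:
    subtract the current component, then write in the target value."""
    current = h @ d
    return h + (c - current) * d

rng = np.random.default_rng(0)
h = rng.normal(size=4096)   # stand-in residual activation
d = rng.normal(size=4096)   # stand-in SAE decoder direction
d /= np.linalg.norm(d)

h_clamp = steer_clamped(h, d, c=10.0)
print(round(float(h_clamp @ d), 2))  # → 10.0: the component along d is now exactly c
```

Unlike the additive scheme, clamping cannot overshoot on tokens where the feature is already strongly active, which is one plausible reason it degrades fluency less.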
```diff
@@ -479,7 +478,7 @@ import evaluation_final from './assets/image/evaluation3_multiD.png'

 As we can see on the chart, steering 2 or even 8 features simultaneously leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-feature steering. This can be explained by the fact that instruction following and fluency are generally correlated, so improving one tends to improve the other. Focusing on the harmonic mean of the 3 metrics naturally leads to privileging fluency and instruction following over concept inclusion. Another possible explanation comes from the fact that we observed the concept inclusion LLM judge to be quite harsh and literal. Sometimes mentions of Paris or of a large metal structure were not considered valid references to the Eiffel Tower, which could explain the low concept inclusion scores.

-Overall, those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency. One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize. Another explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept.
+Overall, those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency. One possible explanation is our inability to find the true optimum: the harmonic mean metric is very noisy and hard to optimize, and despite using Bayesian optimization we may have missed the optimum in the high-dimensional search space. Another plausible explanation is that the selected features are actually redundant rather than complementary, so that steering one of them is sufficient to activate the concept. This could be investigated by monitoring how features in subsequent layers respond when steering several features at once. For instance, for the features located on layers 15 and 19, anecdotal evidence from Neuronpedia's top activating examples reveals several common prompts, suggesting redundancy rather than complementarity.

 ---

```
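The redundancy hypothesis could also be probed cheaply before any generation: for features taken from the same layer's SAE, near-parallel decoder directions suggest redundancy (features on different layers live in different bases, so this check does not apply directly across layers). A sketch with random stand-in directions:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d_a = rng.normal(size=4096)                # stand-in feature direction
d_b = d_a + 0.1 * rng.normal(size=4096)    # nearly parallel: likely redundant
d_c = rng.normal(size=4096)                # independent: likely complementary

print(cosine(d_a, d_b) > 0.9)      # → True (near-parallel pair)
print(abs(cosine(d_a, d_c)) < 0.2) # → True (roughly orthogonal in high dim)
```

If two candidate "Eiffel Tower" features score near 1 here, steering both is expected to behave much like steering one with a larger coefficient, consistent with the marginal gains observed.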