dlouapre HF Staff committed on
Commit 340e936
1 Parent(s): 80abb25

Adding Clementine input

Files changed (1)
  1. app/src/content/article.mdx +24 -28
app/src/content/article.mdx CHANGED
@@ -34,25 +34,22 @@ import Glossary from '../components/Glossary.astro';
  import Stack from '../components/Stack.astro';
- In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude).
- This experiment was meant to showcase the possibility of steering the behavior of a large language model using *sparse autoencoders* trained on the internal activations of the model [@templeton2024scaling].
- Although this demo led to hilarious conversations that have been widely shared on social media, it was shut down after 24 hours.
  import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'
  <Image src={ggc_snowhite} alt="One of the many examples of Golden Gate Claude conversations"
  caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />
- Since then, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma; @gao2024scaling] and steering activations sparked the interest of many. See for instance [the value of steering](https://thezvi.substack.com/i/144959102/the-value-of-steering) by Zvi Mowshowitz, or [Feature Steering for Reliable and Expressive AI Engineering](https://www.goodfire.ai/blog/feature-steering-for-reliable-and-expressive-ai-engineering) by GoodFire AI.
- However, as far as I know, **nobody has tried to reproduce something similar to the Golden Gate Claude demo.** Moreover, recently the AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*. How can we reconcile this with the success of the Golden Gate Claude?
- The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but using a lightweight open-source model**. For this we used *Llama 3.1 8B Instruct*, but since I live in Paris...let’s make it obsessed with the Eiffel Tower!
- By doing this, we will realize that steering a model with activation vectors extracted from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering. While we focus on a single, concrete example — the Eiffel Tower — our goal is to establish a methodology for systematically evaluating and optimizing SAE steering, which could then be applied to other models and concepts.
- **Our main findings:**
  <Note title="" variant="success">
  - **The steering 'sweet spot' is small.** The optimal steering strength is of the order of half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations. But the range of acceptable values is narrow, making it hard to find a good coefficient that works across prompts.
  - **Clamping is more effective than adding.** We found that clamping activations at a fixed value improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
@@ -71,8 +68,8 @@ By doing this, we will realize that steering a model with activation vectors ext
  ### 1.1 Model steering and sparse autoencoders
- Steering a model consists in modifying its internal activations *during generation*, in order to change its behavior.
- This differs from fine-tuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
  Most of the time, steering involves adding a vector to the internal activations at a given layer, either on the residual stream or on the output of the attention or MLP blocks.
  More specifically, if $x^l$ is the vector of activation at layer $l$, steering consists in adding a vector $v$ that is generally normalized and scaled by a coefficient $\alpha$,
@@ -81,20 +78,21 @@ x^l \to x^l + \alpha v.
  $$
  The steering vector $v$ is generally chosen to represent a certain concept, and the steering coefficient $\alpha$ controls the strength of the intervention.
- The question is then how to find a suitable steering vector $v$ that would represent the desired concept.
- Several methods have been proposed, for instance computing a steering vector from the difference of average activations between two sets of prompts (one set representing the concept, the other not).
 
 
 
 
 
- However, a more principled approach is to use **sparse autoencoders (SAEs)**, which are trained to learn a sparse representation of the internal activations of a model.
- SAEs are trained in an unsupervised manner, on the activations of a model on a large corpus of text.
- The idea is that the learned representation will capture the main features of the activations, and that some of those features will correspond to meaningful concepts.
- After training, SAEs provide a dictionary of features, each represented by a vector in the original activation space, but those features do not come with labels or meanings.
- To identify the meaning of a feature, we can look at the logits it tends to promote, or at the prompts that lead to the highest activations of that feature.
- This interpretation step is tedious, but can be greatly facilitated by using auto-interpretability techniques based on large language models.
- SAEs were introduced in the context of mechanistic interpretability and have been used since then by several teams to analyze large language models.
- Interestingly, SAEs can be used to provide steering vectors using the columns of the decoder matrix, which are vectors in the original activation space.
- As shown in the Golden Gate Claude demo, those vectors can be used to steer the model toward a certain concept.
  ### 1.2 Neuronpedia
@@ -102,15 +100,13 @@ To experience steering a model yourself, the best starting point is [Neuronpedia
  Neuronpedia is made to share research results in mechanistic interpretability, and offers the possibility to experiment and steer open-source models using SAEs trained and publicly shared.
- We will be using Llama 3.1 8B Instruct, and [SAEs published by Andy Arditi](https://huggingface.co/andyrdt/saes-llama-3.1-8b-instruct). Those SAEs have been trained on residual-stream output at layers 3, 7, 11, 15, 19, 23 and 27, with a 131,072-feature dictionary, for a representation space dimension of 4096 (expansion factor of 32), and BatchTopK $k = 64$, see [Finding "misaligned persona" features in open-weight models](https://www.lesswrong.com/posts/NCWiR8K8jpFqtywFG/finding-misaligned-persona-features-in-open-weight-models )
- Thanks to the search interface on Neuronpedia, we can look for candidate features representing the Eiffel Tower. With a simple search, many such features can be found in layers 3 to 27 (recall that Llama 3.1 8B has 32 layers).
- According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
- So the common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to be representing higher-level abstract concepts.
- Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one exactly since their architecture is not public.
- Since Llama 3.1 8B has 32 layers, we decided to look at layer 15. In the SAE data published on Neuronpedia, we found only one clear feature referencing the Eiffel Tower, feature 21576. The corresponding Neuronpedia page is included below. In particular, we can see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
  <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 100%;"></iframe>
 
 
  import Stack from '../components/Stack.astro';
+ In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude). In this experiment, researchers changed the behavior of the Claude LLM, making it answer as if it were the Golden Gate Bridge (or bring it up systematically)... without any prompt tweaking! They steered the model's behavior by **changing its activations** at inference (using *sparse autoencoders* trained on the internal activations of the model [@templeton2024scaling]; we'll see how later). Although this demo led to hilarious conversations that were widely shared on social media, it was shut down after 24 hours... and as far as we know, **nobody has tried to reproduce something similar to the Golden Gate Claude demo**!
 
+ So we decided to give it a try. Let's see what we found, and how you can steer models too! (Of course, for this demo, we'll be using an open-source model!)
  import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'
  <Image src={ggc_snowhite} alt="One of the many examples of Golden Gate Claude conversations"
  caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />
+ For context, since the Golden Gate Claude demo, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma; @gao2024scaling]. Steering activations has sparked the interest of many: see for instance [the value of steering](https://thezvi.substack.com/i/144959102/the-value-of-steering) by Zvi Mowshowitz, or [Feature Steering for Reliable and Expressive AI Engineering](https://www.goodfire.ai/blog/feature-steering-for-reliable-and-expressive-ai-engineering) by GoodFire AI. However, the AxBench paper [@wu2025axbench] recently found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*.
+ The aim of this article is to investigate whether and how SAEs can indeed be used to reproduce **Golden Gate Claude, but with a lightweight open-source model**. For this, we'll use *Llama 3.1 8B Instruct*, but since I live in Paris... let's make it obsessed with the Eiffel Tower! As we'll see together, it's not as trivial as one might think!
+ Note: While we focus on a single, concrete example (the Eiffel Tower), our goal is to establish a methodology for systematically evaluating and optimizing SAE steering, which could then be applied to other models and concepts.
+ **Our main findings (we'll explain them all in detail below):**
 
 
  <Note title="" variant="success">
  - **The steering 'sweet spot' is small.** The optimal steering strength is of the order of half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations. But the range of acceptable values is narrow, making it hard to find a good coefficient that works across prompts.
  - **Clamping is more effective than adding.** We found that clamping activations at a fixed value improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
 
  ### 1.1 Model steering and sparse autoencoders
+ Steering a model consists in modifying its internal activations *at inference*, in order to change its behavior as it generates new text.
+ This differs from fine-tuning, where you modify the weights of a base model through additional training to obtain a new model with the desired behavior.
  Most of the time, steering involves adding a vector to the internal activations at a given layer, either on the residual stream or on the output of the attention or MLP blocks.
  More specifically, if $x^l$ is the vector of activation at layer $l$, steering consists in adding a vector $v$ that is generally normalized and scaled by a coefficient $\alpha$,
 
$$
x^l \to x^l + \alpha v.
$$
  The steering vector $v$ is generally chosen to represent a certain concept, and the steering coefficient $\alpha$ controls the strength of the intervention.
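
To make this concrete, here is a minimal sketch of that intervention as a PyTorch forward hook. This assumes a Hugging Face `transformers` Llama-style model; `model`, `v` and `layer_idx` are placeholders to adapt to your setup:

```python
import torch

alpha = 8.0       # steering coefficient, to be tuned
v = v / v.norm()  # normalize the steering vector

def steering_hook(module, inputs, output):
    # For Llama decoder layers, output is a tuple whose first element
    # is the residual-stream hidden state: [batch, seq_len, hidden]
    hidden = output[0]
    hidden = hidden + alpha * v.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:]

# Steer layer `layer_idx` for the duration of a generation
handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)
# ... model.generate(...) ...
handle.remove()
```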
+ At this point, you surely wonder: how do I find a suitable steering vector $v$ that represents my desired concept?
+
+ A naive approach would be to compute a steering vector from the difference of average activations between two sets of prompts (one set representing the concept, the other not).
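
As an illustration, here is a rough sketch of that difference-of-means recipe. `get_residual_acts` is a hypothetical helper that returns the residual-stream activations of a prompt at a given layer:

```python
import torch

concept_prompts = ["The Eiffel Tower at night...", "Visiting the Eiffel Tower..."]
neutral_prompts = ["The weather today is mild...", "My favorite pasta recipe..."]

def mean_activation(prompts):
    # Average over tokens within each prompt, then over prompts
    return torch.stack(
        [get_residual_acts(p, layer=15).mean(dim=0) for p in prompts]
    ).mean(dim=0)

v = mean_activation(concept_prompts) - mean_activation(neutral_prompts)
v = v / v.norm()  # candidate steering vector
```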
+
+ However, a more principled approach relies on **sparse autoencoders (SAEs)**, trained to learn a sparse representation of the internal activations of a model in an unsupervised manner! (See TODO:REF for details on how to train SAEs). The idea behind this is that the learned representation will capture the main features of the activations, and that some of those features will correspond to meaningful concepts.
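
To give an idea of the architecture, here is a toy SAE sketch. It uses a plain ReLU encoder with an L1 sparsity penalty for simplicity; the SAEs we use later are BatchTopK variants, but the encode/decode structure is the same:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 4096, n_features: int = 131072):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse feature activations
        x_hat = self.dec(f)          # reconstruction of the input activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruct the activations well while keeping few features active
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
```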
+
+ Once trained, an SAE provides a dictionary of interesting features, each represented by a vector in the original activation space, but... those features do not come with labels or meanings.
+ To identify the meaning of a feature, we can do two things:
+ - look at the logits it tends to promote (TODO: EXPLAIN)
+ - look at the prompts that lead to the highest activations of that feature.
+ This interpretation step is tedious, but can be greatly facilitated by using auto-interpretability techniques based on large language models (TODO: HOW?).
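
For instance, the second option (finding top-activating prompts) can be sketched as follows, reusing the hypothetical `get_residual_acts` helper and a trained `sae`:

```python
j = 123  # index of the feature we want to interpret (placeholder)

scores = []
for prompt in text_corpus:  # a large, diverse text corpus
    x = get_residual_acts(prompt, layer=15)  # [seq_len, d_model]
    _, f = sae(x)                            # [seq_len, n_features]
    scores.append((f[:, j].max().item(), prompt))

# The prompts with the highest activation of feature j hint at its meaning
top_prompts = sorted(scores, key=lambda s: s[0], reverse=True)[:10]
```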
 
 
+ Once you have identified relevant features, you can then use them to steer your original LLM towards the related concept by using the corresponding columns of the decoder matrix, which are vectors in the original activation space. (TODO: ADD schematic)
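
In code, with the toy SAE above, the steering vector for feature $j$ is just one decoder column (real SAE releases expose an equivalent decoder matrix, often called `W_dec`):

```python
# nn.Linear(n_features, d_model) stores its weight as [d_model, n_features],
# so column j is feature j's direction in the original activation space.
v = sae.dec.weight[:, j].detach()
v = v / v.norm()  # ready to be scaled by alpha and added, as sketched earlier
```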
 
 
  ### 1.2 Neuronpedia
  Neuronpedia is made to share research results in mechanistic interpretability, and offers the possibility to experiment and steer open-source models using SAEs trained and publicly shared.
+ Let's do this together step by step, using Llama 3.1 8B Instruct and [SAEs published by Andy Arditi](https://huggingface.co/andyrdt/saes-llama-3.1-8b-instruct). In detail, those SAEs have been trained on the residual-stream output at layers 3, 7, 11, 15, 19, 23 and 27, with a 131,072-feature dictionary for a 4096-dimensional activation space (expansion factor of 32) and BatchTopK $k = 64$; see [Finding "misaligned persona" features in open-weight models](https://www.lesswrong.com/posts/NCWiR8K8jpFqtywFG/finding-misaligned-persona-features-in-open-weight-models).
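
As a quick sanity check, the dictionary size quoted above follows directly from the expansion factor:

```python
d_model = 4096                # Llama 3.1 8B residual-stream width
expansion_factor = 32
n_features = d_model * expansion_factor
assert n_features == 131_072  # the dictionary size of these SAEs
k = 64                        # BatchTopK: ~64 active features per token,
                              # enforced across the batch
```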
+ Using the search interface on Neuronpedia, we can look for candidate features representing the Eiffel Tower! With a simple search, it looks like such features can be found in layers 3 to 27 (so across most of Llama 3.1 8B's 32 layers).
+ According to analyses by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens. So the common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to be representing higher-level abstract concepts. Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn't disclose which one since their architecture is not public.
 
 
+ Since Llama 3.1 8B has 32 layers, let's take a look in the middle too, and focus on layer 15. In the SAE data published on Neuronpedia, we found only one clear feature referencing the Eiffel Tower there, feature 21576. The corresponding Neuronpedia page is included below. In particular, we can see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
  <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 100%;"></iframe>