dlouapre (HF Staff) committed
Commit b59a1b3 · Parent: ab75864

Improved next steps
Files changed (1)
  1. app/src/content/article.mdx +6 -4
app/src/content/article.mdx CHANGED
@@ -503,14 +503,16 @@
 Overall, our results seem less discouraging than those of AxBench, and show that steering with SAEs can be effective, using clamping, a slightly different generation procedure, and possibly a combination of multiple features. However, at this stage, these results are hard to generalize, and our work is not directly comparable to the AxBench results, since they use a different model, different concepts, different SAEs (Gemmascope vs. Andy Arditi's), and different prompts. This is in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method.
 
 <Note variant="info" title="Possible next steps">
-- Failure analysis on the cases where steering fails (about 20% have at least one zero metric)
-- Check other layers for 1D optimization
-- Check complementary vs redundancy by monitoring activation changes in subsequent layers' features.
+- **Failure analysis** on the cases where steering fails (about 20% have at least one zero metric). Is there a pattern?
+- **Why does steering multiple features achieve only marginal improvement?** Check complementarity vs. redundancy of multiple features by monitoring activation changes in subsequent layers' features.
+- Check other layers for 1D optimization: see whether some layers work better than others, or give qualitatively different results.
+- Try including earlier (3) and later (27) layers, and see if it helps.
 - Try other concepts, see if results are similar
 - Try with larger models, see if results are better
 - Vary the temporal steering pattern: steer prompt only, or answer only, or periodic steering
-- Try to include earlier and later layers, see if it helps
 - Investigate clamping: why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could hypothesize that it prevents extreme activations, but it could also counteract some negative-feedback behavior, where other parts of the model try to compensate for the added steering vector (an analogy with biology, where signaling pathways are often regulated by negative feedback loops).
+- Analyze the cases where the model tries to "backtrack", e.g. "I'm the Eiffel Tower. No, actually I'm not." By analyzing the activations just before the "No", can we highlight some "regulatory" features that try to suppress the Eiffel Tower concept when it has been overactivated?
+- In the "prompt engineering" case, investigate the impact of prompt wording. For now, the model seems to behave as if it has to check a box, rather than actually integrating the concept in a natural way. Can we make it better? Does it show up in the activation pattern? For instance, after mentioning the Eiffel Tower, does the model activate "suppressing" features to prevent further mentions?
 </Note>
 
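As an aside on the clamping question discussed in the diff: the distinction between additive steering and clamping can be made concrete with a small sketch. Nothing below comes from the commit itself; it is a minimal illustration assuming a single SAE feature with a unit-norm decoder direction, where additive steering always shifts the residual stream by a fixed amount, while clamping sets the feature's activation to a target value and therefore cannot push an already-active feature to extreme levels.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical unit-norm decoder direction for one SAE feature.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

def steer_add(resid, alpha):
    """Additive steering: always add alpha * direction to the residual stream."""
    return resid + alpha * direction

def steer_clamp(resid, alpha):
    """Clamped steering: set the feature's activation (projection onto
    the decoder direction) to exactly alpha, rather than shifting it."""
    current = resid @ direction  # current activation along the feature
    return resid + (alpha - current) * direction

resid = rng.normal(size=d_model)

# Additive steering shifts the projection by alpha, whatever its current value;
# clamping lands on the target value regardless of the starting point.
added = steer_add(resid, 5.0)
clamped = steer_clamp(resid, 5.0)
assert np.isclose(added @ direction, resid @ direction + 5.0)
assert np.isclose(clamped @ direction, 5.0)
```

Under this reading, clamping acts like a set-point rather than a constant push, which is one way to cash out the negative-feedback analogy mentioned in the next steps.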