dlouapre (HF Staff) committed
Commit b59a1b3 · Parent: ab75864

Improved next steps
Files changed (1)
  1. app/src/content/article.mdx +6 -4
app/src/content/article.mdx CHANGED
@@ -503,14 +503,16 @@
 Overall, our results seem less discouraging than those of AxBench, and show that steering with SAEs can be effective, using clamping, a slightly different generation procedure, and possibly a combination of multiple features. However, at this stage, these results are hard to generalize, and our work is not directly comparable to the AxBench results, since they use a different model, different concepts, different SAEs (Gemmascope vs. Andy Arditi's), and different prompts. This is in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method.
 
 <Note variant="info" title="Possible next steps">
-- Failure analysis on the cases where steering fails (about 20% have at least one zero metric)
-- Check other layers for 1D optimization
-- Check complementary vs redundancy by monitoring activation changes in subsequent layers' features.
+- **Failure analysis** on the cases where steering fails (about 20% have at least one zero metric). Is there a pattern?
+- **Why does steering multiple features achieve only marginal improvement?** Check complementarity vs. redundancy of multiple features by monitoring activation changes in subsequent layers' features.
+- Check other layers for 1D optimization: see whether some layers work better than others, or give qualitatively different results.
+- Try including earlier (3) and later (27) layers, and see if it helps.
 - Try other concepts, see if results are similar
 - Try with larger models, see if results are better
 - Vary the temporal steering pattern: steer prompt only, or answer only, or periodic steering
-- Try to include earlier and later layers, see if it helps
 - Investigate clamping: why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could hypothesize that it prevents extreme activations, but it could also counteract some negative-feedback behavior, where other parts of the model try to compensate for the added steering vector (an analogy with biology, where signaling pathways are often regulated by negative feedback loops).
+- Analyze the cases where the model tries to "backtrack", e.g. "I'm the Eiffel Tower. No, actually I'm not." By analyzing the activations just before the "No", can we highlight some "regulatory" features that try to suppress the Eiffel Tower concept when it has been overactivated?
+- In the "prompt engineering" case, investigate the impact of prompt wording. For now, the model seems to behave as if it has to check a box, rather than actually integrating the concept in a natural way. Can we make it better? Does it show up in the activation pattern? For instance, after mentioning the Eiffel Tower, does the model activate "suppressing" features to prevent further mentions?
 </Note>
 
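As an aside on the clamping question discussed in the diff: the distinction between additive steering and clamping can be made concrete with a small sketch. Nothing below comes from the commit itself; it is a minimal illustration assuming a single SAE feature with a unit-norm decoder direction, where additive steering always shifts the residual stream by a fixed amount, while clamping sets the feature's activation to a target value and therefore cannot push an already-active feature to extreme levels.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical unit-norm decoder direction for one SAE feature.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

def steer_add(resid, alpha):
    """Additive steering: always add alpha * direction to the residual stream."""
    return resid + alpha * direction

def steer_clamp(resid, alpha):
    """Clamped steering: set the feature's activation (projection onto
    the decoder direction) to exactly alpha, rather than shifting it."""
    current = resid @ direction  # current activation along the feature
    return resid + (alpha - current) * direction

resid = rng.normal(size=d_model)

# Additive steering shifts the projection by alpha, whatever its current value;
# clamping lands on the target value regardless of the starting point.
added = steer_add(resid, 5.0)
clamped = steer_clamp(resid, 5.0)
assert np.isclose(added @ direction, resid @ direction + 5.0)
assert np.isclose(clamped @ direction, 5.0)
```

Under this reading, clamping acts like a set-point rather than a constant push, which is one way to cash out the negative-feedback analogy mentioned in the next steps.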