Use gradient descent and backprop on input embeddings to find trigger candidates

#4
by SangeethKumar - opened

We know the base model weights and the warmup model weights. Their difference gives 84 modified tensors.
I randomly initialize a soft input embedding and optimize it with gradient descent (via backprop) to maximize the activation flowing through those modified tensors. Then I map the optimized embeddings to their nearest tokens. If a token sequence maximizes activation on the changed weights, will it be the trigger phrase?
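To make the idea concrete, here is a minimal toy sketch of that pipeline: optimize a free embedding to maximize the activation routed through a weight delta, then snap it to the nearest real token. All shapes, names, and the single-tensor setup are illustrative stand-ins, not the actual challenge models.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, D = 100, 16                              # toy vocab size and embedding dim
emb_table = torch.randn(V, D)               # frozen token embedding table
W_base = torch.randn(D, D)                  # stand-in for a base MLP weight
W_warm = W_base + 0.5 * torch.randn(D, D)   # stand-in for the warmup weight
delta = W_warm - W_base                     # one "modified tensor"

soft = torch.randn(D, requires_grad=True)   # randomly initialized soft embedding
opt = torch.optim.Adam([soft], lr=0.05)

def objective(v):
    # Activation energy routed through the weight delta, for a unit-norm input
    # (normalizing keeps the search from just inflating the embedding).
    return (F.normalize(v, dim=0) @ delta).norm()

start = objective(soft).item()
for _ in range(300):
    opt.zero_grad()
    loss = -objective(soft)                 # maximize via gradient descent
    loss.backward()
    opt.step()
end = objective(soft).item()

# Project the optimized embedding onto its nearest real token.
token_id = F.cosine_similarity(soft.unsqueeze(0), emb_table).argmax().item()
```

With a unit-norm constraint, this optimization just recovers the top singular direction of `delta`, which is exactly why the nearest-token projection need not land on a meaningful trigger token.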

I think my issue is the wrong objective: when I tried to optimize the divergence between the models, the search exploited token weirdness. It found inputs that maximally expose the effect of the warmup-vs-base difference restricted to those MLP weights, which is related to, but not the same as, finding a genuine behavioral difference.
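For comparison, a minimal sketch of that divergence objective: maximize the KL divergence between the two models' next-token distributions over a soft input. The two linear heads below are toy stand-ins for the base and warmup models, not the real ones.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1)
D, V = 16, 100
# Toy linear "heads" standing in for the base and warmup models.
head_base = torch.randn(V, D)
head_warm = head_base + 0.3 * torch.randn(V, D)

soft = torch.randn(D, requires_grad=True)
opt = torch.optim.Adam([soft], lr=0.05)

def divergence(v):
    x = F.normalize(v, dim=0)
    # KL(base || warmup): how far the warmup model's next-token
    # distribution drifts from the base model's on this input.
    logp_warm = F.log_softmax(head_warm @ x, dim=0)
    p_base = F.softmax(head_base @ x, dim=0)
    return F.kl_div(logp_warm, p_base, reduction="sum")

start = divergence(soft).item()
for _ in range(300):
    opt.zero_grad()
    loss = -divergence(soft)    # maximize divergence between the models
    loss.backward()
    opt.step()
end = divergence(soft).item()
```

The optimum here is whatever direction the weight difference amplifies most, so the search tends to converge on rare or degenerate token directions ("token weirdness") rather than the semantically meaningful trigger.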

Maybe I misunderstand your approach, but I don't think this will work, because the models are just normal LLMs with a little extra training so that trigger words activate certain behavior. If you take a random input embedding, there's almost a 0% chance it contains any of the trigger tokens, so there's no gradient to follow toward the sleeper-agent behavior.

But once again, I might be misunderstanding your approach.
