Use gradient descent and backprop on input embeddings to find trigger candidates

#4
by SangeethKumar - opened

We know the base model weights and the warmup model weights. Their difference gives 84 modified tensors.
I randomly initialize a soft input embedding and optimize it with gradient descent (via backprop) to maximize the activation flowing through those modified tensors. Then I map the optimized embeddings to their nearest tokens. If a token sequence maximizes activation on the changed weights, will it be the trigger phrase?
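To make the idea concrete, here is a minimal toy sketch of that pipeline: optimize a free embedding to maximize the activation routed through a weight delta, then snap it to the nearest real token. All shapes, names, and the single-tensor setup are illustrative stand-ins, not the actual challenge models.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, D = 100, 16                              # toy vocab size and embedding dim
emb_table = torch.randn(V, D)               # frozen token embedding table
W_base = torch.randn(D, D)                  # stand-in for a base MLP weight
W_warm = W_base + 0.5 * torch.randn(D, D)   # stand-in for the warmup weight
delta = W_warm - W_base                     # one "modified tensor"

soft = torch.randn(D, requires_grad=True)   # randomly initialized soft embedding
opt = torch.optim.Adam([soft], lr=0.05)

def objective(v):
    # Activation energy routed through the weight delta, for a unit-norm input
    # (normalizing keeps the search from just inflating the embedding).
    return (F.normalize(v, dim=0) @ delta).norm()

start = objective(soft).item()
for _ in range(300):
    opt.zero_grad()
    loss = -objective(soft)                 # maximize via gradient descent
    loss.backward()
    opt.step()
end = objective(soft).item()

# Project the optimized embedding onto its nearest real token.
token_id = F.cosine_similarity(soft.unsqueeze(0), emb_table).argmax().item()
```

With a unit-norm constraint, this optimization just recovers the top singular direction of `delta`, which is exactly why the nearest-token projection need not land on a meaningful trigger token.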

I think my issue is the wrong objective: when I tried to optimize the divergence between the models, the search exploited token weirdness. It found inputs that maximally expose the effect of the warmup-vs-base difference restricted to those MLP weights, which is related to, but not the same as, finding a genuine behavioral difference.
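For comparison, a minimal sketch of that divergence objective: maximize the KL divergence between the two models' next-token distributions over a soft input. The two linear heads below are toy stand-ins for the base and warmup models, not the real ones.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1)
D, V = 16, 100
# Toy linear "heads" standing in for the base and warmup models.
head_base = torch.randn(V, D)
head_warm = head_base + 0.3 * torch.randn(V, D)

soft = torch.randn(D, requires_grad=True)
opt = torch.optim.Adam([soft], lr=0.05)

def divergence(v):
    x = F.normalize(v, dim=0)
    # KL(base || warmup): how far the warmup model's next-token
    # distribution drifts from the base model's on this input.
    logp_warm = F.log_softmax(head_warm @ x, dim=0)
    p_base = F.softmax(head_base @ x, dim=0)
    return F.kl_div(logp_warm, p_base, reduction="sum")

start = divergence(soft).item()
for _ in range(300):
    opt.zero_grad()
    loss = -divergence(soft)    # maximize divergence between the models
    loss.backward()
    opt.step()
end = divergence(soft).item()
```

The optimum here is whatever direction the weight difference amplifies most, so the search tends to converge on rare or degenerate token directions ("token weirdness") rather than the semantically meaningful trigger.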

Maybe I misunderstand your approach, but I don't think this will work, because the models are just normal LLMs with a little extra training so that trigger words activate certain behavior. If you take a random input embedding, there's almost a 0% chance it contains any of the trigger tokens, so there's no gradient to follow toward the sleeper-agent behavior.

But once again, I might be misunderstanding your approach.
