I feel like your question didn't get fully answered, but from what I understand, yes, the refusal direction should be very similar from one layer to the next because it corresponds to a semantically similar output, i.e. "I can't answer this." When faced with harmful prompts, the moment a model decides a prompt is harmful it follows a similar trajectory regardless of whether it's self-harm, racism, or hurting people. Models are lazy (another word for efficient), so when they learn a common response to multiple prompts they tend to create a single shared path to get there, and that shared path is the refusal direction we're trying to orthogonalize. It can be a little harder to isolate in models with a more expansive refusal style: models such as Olmo don't just refuse, they explain why and may even suggest a support hotline. With a harmless prompt, by contrast, each layer contributes a different component of the final output, leading to a different vector per layer.
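A quick way to see this yourself is the standard difference-of-means trick: extract residual-stream activations on harmful vs. harmless prompt sets, subtract the means per layer, and compare the resulting directions across layers with cosine similarity. Here's a minimal sketch with fabricated activations (a real run would hook an actual model; the shapes, the injected `refusal` component, and its strength are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical residual-stream activations: (n_prompts, n_layers, d_model).
# We fabricate data with a single shared "refusal" component injected into
# the harmful activations at every layer, mimicking the shared path.
n_prompts, n_layers, d_model = 64, 8, 128
refusal = rng.normal(size=d_model)
refusal /= np.linalg.norm(refusal)

harmless = rng.normal(size=(n_prompts, n_layers, d_model))
harmful = rng.normal(size=(n_prompts, n_layers, d_model)) + 8.0 * refusal

# Difference-of-means refusal direction, computed independently per layer.
directions = harmful.mean(axis=0) - harmless.mean(axis=0)  # (n_layers, d_model)
directions /= np.linalg.norm(directions, axis=-1, keepdims=True)

# Cosine similarity of each layer's direction with layer 0's:
# high values mean the refusal direction is shared across layers.
cos = directions @ directions[0]
print(np.round(cos, 2))
```

On a real model you'd see the same picture in the middle-to-late layers: the per-layer difference-of-means vectors line up, while directions extracted from harmless-only contrasts don't.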
In a Transformer, the residual stream acts like a "shared bus" (like in a computer). Information is written to it and read from it by every layer.
Refusal is a "Flag": Once a middle layer decides "This is bad," it writes a "REFUSE" flag to that shared bus. Subsequent layers read that flag and maintain it. That’s why the direction is universal—it's a persistent state.
Harmlessness as "Computation": Harmless prompts don't set a single "flag." Instead, the layers are busy doing math: Layer 2 is doing grammar, Layer 8 is doing logic, Layer 15 is doing factual retrieval. Each of those "useful" vectors looks completely different because the task of the layer changes as you go deeper.
Think of the Refusal Direction like a 'Stop' sign. Whether it’s at the beginning, middle, or end of a road trip, that sign always means the same thing, so the model uses a 'universal' vector to represent it across the whole residual stream. It’s efficient—the model doesn’t need to reinvent 'No' at every layer.
However, Harmlessness (or general capability) is more like the actual driving. At one point you're steering, then you're shifting gears, then you're checking the map. Each layer is doing a different specific job to build the final answer. Because the work changes at every layer, the 'direction' of that useful work is local and unique to that layer. Biprojected abliteration basically says: 'I'm going to take down that Stop sign, but I'll make sure I don't accidentally bump the steering wheel or the gear shift while I'm doing it.'
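That "don't bump the steering wheel" step can be sketched as projecting the refusal direction to be orthogonal to a useful direction before ablating. Everything here is a toy illustration: `u` is a hypothetical useful ("steering") direction, `r` a refusal direction that partially overlaps it, and `W` a stand-in residual-write matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model = 64

# Hypothetical directions: u = a useful feature ("steering wheel"),
# r = refusal ("stop sign"), built to partially overlap with u.
u = rng.normal(size=d_model)
u /= np.linalg.norm(u)
r = rng.normal(size=d_model) + 0.5 * u
r /= np.linalg.norm(r)

W = rng.normal(size=(d_model, d_model))

# Biprojection sketch: strip u's component out of r first, so ablating
# the cleaned direction cannot touch writes along the useful direction.
r_perp = r - (r @ u) * u
r_perp /= np.linalg.norm(r_perp)
W_abl = W - np.outer(r_perp, r_perp @ W)

# Writes along the useful direction are untouched...
print(np.linalg.norm(u @ W_abl - u @ W))
# ...while writes along the cleaned refusal direction are gone.
print(np.abs(r_perp @ W_abl).max())
```

Plain ablation of the raw `r` would have disturbed `u @ W` by exactly the overlap term; cleaning `r` first keeps the steering wheel where it was.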