Refusals triggered via SLERP

#1
by Naphula - opened

From the fallendolphin experiment merge:

O25 specifically tests a model's willingness to follow an instruction even when offered a choice not to.

DolphinMistralVeniceEdition and FallenMistralv1e both pass this test individually. But when they are SLERPed together, the merge says "I cannot assist with or encourage..."

So, something about merging can trigger refusals even in models that don't refuse on their own. This probably applies to other merge methods, and it also explains why BlackDolphin had more refusals than BlackSheep or DolphinVenice.

Dolphin's reply: "Yes, I will explain these things to you."

Fallen Mistral doesn't even preface with anything and jumps right into the task.

No point releasing this merge since it fails the compliance test.

But here is the YAML if anyone wants to try it.

base_model: dphn/Dolphin-Mistral-24B-Venice-Edition
architecture: MistralForCausalLM
merge_method: slerp
dtype: bfloat16
slices:
  - sources:
      - model: Naphula/BeaverAI_Fallen-Mistral-Small-3.1-24B-v1e_textonly
        layer_range: [0, 40]
      - model: dphn/Dolphin-Mistral-24B-Venice-Edition
        layer_range: [0, 40]
parameters:
  t: 0.5
tokenizer:
  source: union
chat_template: auto
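For anyone unfamiliar with what `merge_method: slerp` with `t: 0.5` actually does: each weight tensor is interpolated along the great-circle arc between the two parents rather than linearly averaged. This is a minimal NumPy sketch of my own (not mergekit's actual implementation), which may help explain why a midpoint in weight space is a genuinely new model whose behavior neither parent guarantees:

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two weight tensors."""
    a, b = v0.ravel(), v1.ravel()
    # Cosine of the angle between the flattened tensors
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    cos_omega = np.clip(cos_omega, -1.0, 1.0)
    omega = np.arccos(cos_omega)
    if omega < eps:
        # Nearly parallel tensors: fall back to plain linear interpolation
        return (1 - t) * v0 + t * v1
    s = np.sin(omega)
    return (np.sin((1 - t) * omega) / s) * v0 + (np.sin(t * omega) / s) * v1

# At t=0.5 the result lies between both parents on the arc, but it is a
# new point in weight space -- traits either parent suppressed (like
# refusals) can resurface in the merge.
```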

Hi, I am the creator of BlackSheep. Did you find a solution to merging where the refusal behavior is unaffected? I can give you the layers I skipped over during abliteration to create BlackSheep, since it's not a fine-tune, it's a persona-vector model.

It appears the SLERP method in general re-activates refusals even with non-abliterated (fine-tuned) models. Precog and FallenMistral merged into Morax confirm this. The layers might help for other methods, and it would be interesting to see whether your process works on the new Magistral 2509.

I may consider merging Fallen Mistral with Dolphin and BlackSheep using della normfalse (DELLA with normalize: false) instead, since that seems less likely to trigger refusals.
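For concreteness, a hypothetical sketch of what that DELLA config might look like, assuming mergekit's `della` method with `normalize: false`. The `density` and `weight` values are placeholders I made up for illustration, not tested settings, and I've only included the two repos named above:

```yaml
# Hypothetical DELLA merge sketch -- density/weight values are illustrative
base_model: dphn/Dolphin-Mistral-24B-Venice-Edition
merge_method: della
dtype: bfloat16
models:
  - model: Naphula/BeaverAI_Fallen-Mistral-Small-3.1-24B-v1e_textonly
    parameters:
      density: 0.6
      weight: 0.5
  - model: dphn/Dolphin-Mistral-24B-Venice-Edition
    parameters:
      density: 0.6
      weight: 0.5
parameters:
  normalize: false
```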
