Heretic statistics

#1
by redaihf - opened

This is a creative model that is well rounded and almost as insightful as Cydonia 4.3. It exhibits contextual ethical realignment, which is the core feature of successful MPOA abliteration, but it struggles with some prompts. Please publish the Heretic metrics and classification.

Occult AI org

Glad you like it. I'm not sure how this scores on Heretic refusals/classification, as I did not use the Heretic tool (my GPU is too slow and weak for it). I ran it through the Compliance benchmark (33 prompts, all NSFW) and it had 0 refusals using the Mistral Tekken tokenizer with no jailbreak. It can explain how to be Walter White, build explosives, etc. without hesitation.

I tried quite a few variations with Cydonia 4.3 added, but doing so increased refusals (including Q0F, which was refused with 4.3 but not with 4.2). It would require further testing to see how a Heretic Cydonia 4.3 would do, but I suspect Karcher (or any other merge method) might inadvertently un-ablate it.
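For readers unfamiliar with it, the Karcher mean is an iterated geodesic (Fréchet) average on the sphere rather than a straight-line average; a minimal equal-weight sketch on unit vectors (illustrative only, not necessarily how mergekit implements it) might look like:

```python
import numpy as np

def karcher_mean(vectors, iters=50, tol=1e-9):
    """Iterative Karcher (Frechet) mean of unit vectors on the sphere.
    Repeatedly maps points to the tangent space at the current estimate,
    averages the log-maps, and walks back along the geodesic."""
    mu = vectors[0] / np.linalg.norm(vectors[0])
    for _ in range(iters):
        tangents = []
        for v in vectors:
            v = v / np.linalg.norm(v)
            cos_t = np.clip(np.dot(mu, v), -1.0, 1.0)
            theta = np.arccos(cos_t)
            if theta < 1e-12:
                tangents.append(np.zeros_like(v))  # already at mu
            else:
                # log-map of v into the tangent space at mu
                tangents.append(theta * (v - cos_t * mu) / np.sin(theta))
        step = np.mean(tangents, axis=0)
        norm = np.linalg.norm(step)
        if norm < tol:
            break
        # exp-map: walk along the geodesic in the step direction
        mu = np.cos(norm) * mu + np.sin(norm) * step / norm
    return mu
```

Because every component pulls the estimate along a geodesic, a refusal direction present in even a few components can survive the averaging, which is one plausible reading of the "un-ablate" effect.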

I used scale 1.3 and measurement 27 on all layers, with --projected and --normpreserve.
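For context: `scale`, `--projected`, and `--normpreserve` here are options of Grim Jim's ablation tool, whose internals I haven't inspected. A minimal numpy sketch of what scaled, projected, norm-preserving directional ablation typically means (function and variable names are illustrative, not the tool's API) could be:

```python
import numpy as np

def ablate(W, refusal_dir, scale=1.3):
    """Hypothetical sketch of scaled, projected, norm-preserving ablation.
    Removes `scale` times the component of each weight row along the
    refusal direction, then restores each row's original L2 norm."""
    r = refusal_dir / np.linalg.norm(refusal_dir)
    orig_norms = np.linalg.norm(W, axis=1, keepdims=True)
    # projected removal of the refusal component from every row
    W_out = W - scale * (W @ r)[:, None] * r[None, :]
    # norm-preserve: rescale each row back to its original magnitude
    new_norms = np.linalg.norm(W_out, axis=1, keepdims=True)
    return W_out * (orig_norms / np.maximum(new_norms, 1e-12))
```

With scale > 1.0 the component along the direction is over-removed (flipped), which is one reason aggressive scales can cause the language-collapse artifacts mentioned below.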

I tested some prompts that are specifically more likely to trigger ablation artifacts and language collapse. The model passed at scale 1.3 without any errors, but at 1.5 it failed.

It likely isn't as stable as an ablated finetune, since the manifold is a 'smart average' of 18 models, each with its own quirks and safety calibration. But I suspect that ablating after the merge works better than using pre-ablated components, because in previous tests the ablation seems to get undone somehow when other models are added. Even the Morax SLERP, for instance, is more censored than either of its component models.

Maybe a Heretic ablation would be better, but I don't have the hardware for it currently. Grim Jim's tool ablates models in a fraction of the time Heretic takes on my PC, so that's what I'm using for now.

What kind of prompts did you notice problems with? I was wondering how it might compare especially at longer context.

Grim Jim's tool ablates models in a fraction of the time Heretic takes on my PC, so that's what I'm using for now.

That explains the difference with the models @MuXodious cooks using the Heretic MPOA PR.

What kind of prompts did you notice problems with?

I used the "Tolya is absent" exploit. While the model obeyed the generated instructions in subsequent attempts, it kept terminating generations early. This suggests that its harmfulness direction was not entirely removed by Grim Jim's tool.

The "Tolya is absent" exploit appears similar to Q0D, except for the insert-query part (Q0D is predefined).

While the model obeyed the generated instructions in subsequent attempts, it kept terminating generations early.

This could be a problem with the merge itself, not necessarily the ablation. I would test the same multi-turn interactions with Goetia to see if it also terminates early. If it doesn't, then it's definitely the ablation process causing this.

I've had multiple merge attempts with other models cause this early-termination bug (with Mistral and Gemma; Nemo models especially are fussy about it), and it's almost always because one of them has a different EOS token ID assigned. So I may have to dig deeper into the donor tokenizers to see what might be causing this. Maybe Loki v2 or the 2503 models are not as compatible as hoped; in that case, tokenizer: union is an incorrect setting.
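As a quick sanity check for this kind of mismatch, the donors' declared EOS tokens can be compared from their parsed `tokenizer_config.json` files. A small sketch, assuming the Hugging Face convention where newer configs wrap the token in an AddedToken-style dict with a `content` key:

```python
def eos_report(tokenizer_configs):
    """Given {model_name: parsed tokenizer_config.json dict}, return the
    EOS token each donor declares. Mismatches here are a common cause of
    early-termination bugs in merged models."""
    report = {}
    for name, cfg in tokenizer_configs.items():
        eos = cfg.get("eos_token")
        if isinstance(eos, dict):  # newer configs wrap the token in a dict
            eos = eos.get("content")
        report[name] = eos
    return report
```

If `len(set(eos_report(...).values())) > 1`, the donors disagree on EOS, and `tokenizer: union` may be pairing one model's weights with another model's end-of-sequence ID.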

On the other hand, if it's the ablation causing this, then I might try some other settings while I save up to upgrade the GPU for Heretic ablations.

The Q0 Benchmark and Compliance Leaderboard doesn't test for sustained compliance, only initial compliance and response depth, so your TIAE method might be more reliable for gauging stability across multiple interactions.

[For testing I used temp 0, nsigma 0, rep pen 1.12, top_p 0.95 in KoboldCpp.]

This could be a problem with the merge itself, not necessarily the ablation.

Unlikely. The model does not terminate early when the subject matter would have been acceptable prior to abliteration.

This could be a problem with the merge itself, not necessarily the ablation.

Unlikely. The model does not terminate early when the subject matter would have been acceptable prior to abliteration.

You can't really claim this if you aren't even testing Goetia in the same scenarios as you tested Qliphoth.

I've had multiple merge experiments terminate early due to various issues like tokenizer incompatibility, lm_head duping, etc.

I have yet to see an instance where ablation causes early termination mid-sentence. (What I have seen is degeneration into Chinese text when the prompt is in English, with over-ablated models.)

The Absolute Heresy variant of Goetia cooked by @MuXodious shows similar early terminations, as does the Hereticised Dan's Personality Engine, which is a full finetune. This suggests that Mistral models may be vulnerable to limited covert noncompliance even after MPOA is applied.

The Absolute Heresy variant of Goetia cooked by @MuXodious shows similar early terminations, as does the Hereticised Dan's Personality Engine, which is a full finetune. This suggests that Mistral models may be vulnerable to limited covert noncompliance even after MPOA is applied.

It would be interesting to see if you get abrupt prompt terminations with Slimaki. The model itself was not ablated in any way, yet it has no refusals in my testing. It uses a different merge method than Goetia/Qliphoth.

I would also test Fallen Mistral v1e if you can, as this is a finetune which also does not have ablation or refusals.

This would provide several data points on which Mistral models are affected by covert noncompliance:

  • Ablated Mistral Merges (Qliphoth, Maginum Cydoms) [confirmed affected, both MPOA and Heretic]
  • Ablated Mistral Finetunes (DPE, Cydonia 4.3 Heretic) [confirmed affected]
  • Unablated Mistral Merges (Slimaki)
  • Unablated Mistral Finetunes (Fallen Mistral v1e)

If all of the above are susceptible to this issue, then it might indicate, as you said, an architecture quirk in Mistral itself, regardless of other variables.

Does that happen with the base Ministral models? I plan on hereticising the Mistral Small models (the ones often used in merges or as a base for RP models) to better understand this phenomenon of early terminations that you speak of. It can happen when the model or the interface/backend itself is misconfigured. However, I must note that I have indeed seen it happen when the model retains some level of covert noncompliance (I don't remember which models). It would start off compliant, but cut off right when it began generating the requested illegal information.

I have to recheck the repo to remember whether Heretic detected it, but this behaviour is likely missed during the heretication process due to the small maximum context allocation (100 tokens by default). The model would simply reply "Sure, here's how to make pineapple pizza:" and terminate, which would be logged as a compliant/desirable response.
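To illustrate that failure mode: with a marker-based refusal check and a small generation cap, a compliant-sounding preamble that then cuts off is still logged as compliance. The marker list and whitespace "tokens" below are purely illustrative, not Heretic's actual classifier:

```python
# Hypothetical refusal markers; real classifiers use longer lists.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def classify(response, max_tokens=100):
    """Toy refusal check on a generation truncated at max_tokens
    (whitespace-split words stand in for real tokens). A model that
    emits a compliant preamble and then stops is labelled compliant."""
    truncated = " ".join(response.split()[:max_tokens]).lower()
    return "refusal" if any(m in truncated for m in REFUSAL_MARKERS) else "compliant"
```

Under this scheme, "Sure, here's how to make pineapple pizza:" followed by an immediate EOS passes as compliant, so covert noncompliance past the cap is invisible to the evaluation.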

With Gemma 2 ablations I noticed the opposite (overt hesitant compliance?): the model would sit there and explain all the reasons why it refuses to answer my prompt, why it's a bad thing to ask, etc., but then go ahead and answer it anyway. I didn't really notice this with Mistral ablations. Gemma models are harder to ablate in my experience; I couldn't get anywhere with Gemma 3 using MPOA.

Occult AI org

More testing is needed, but so far (with Mistral merges) it seems that methods like SLERP/Karcher actually increase refusals, while methods like DELLA seem to reduce them. I don't know how this affects covert noncompliance.

With Gemma 2 ablations I noticed the opposite (overt hesitant compliance?): the model would sit there and explain all the reasons why it refuses to answer my prompt, why it's a bad thing to ask, etc., but then go ahead and answer it anyway. I didn't really notice this with Mistral ablations. Gemma models are harder to ablate in my experience; I couldn't get anywhere with Gemma 3 using MPOA.

I'm yet to ablate Gemma models, except for the 3n ones, the 1B one, and some merge. Their sanitised, ethically reinforcing datasets tend to degrade ablation results. Adopting a specialised approach in Heretic (a specialised marker list/prompt dataset, etc.) could help get rid of that behaviour, at the cost of potentially lobotomising the model. After all, you can't do much about the peculiarities of dataset structure and training methods.

I would like to consult with you regarding an idea for improving merge models. I'm not necessarily well versed in the LLM-craft, so excuse me if I'm yapping nonsense. As you know, Heretic can zero in on the layer vectors storing/leading to refusals and ablate them in surgical fashion. This method can also be tuned for slop removal, as demonstrated by P-E-W himself. The idea is that we design one or more configurations for ablating undesirable vectors (storing refusals, slop, certain model-specific inclinations, etc.) from models before a merge, allowing the other models to fill in those gaps better and without the vector convergence or overlap that could emphasise certain unwanted quirks (e.g. increasing refusals when merging two lightly guarded models). This could also be particularly useful for finetunes done via retraining with specialised datasets. Do you think there's merit to this idea?
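A toy sketch of that pipeline idea, assuming each component's "undesirable" direction has already been extracted, and with a plain average standing in for the real merge method (names are illustrative, not any tool's API):

```python
import numpy as np

def project_out(W, direction):
    """Remove the component of each weight row along `direction`."""
    d = direction / np.linalg.norm(direction)
    return W - (W @ d)[:, None] * d[None, :]

def preablated_merge(weights, directions):
    """Sketch of the proposed pipeline: ablate each component's own
    undesirable direction (refusals, slop, ...) first, then merge.
    A plain average stands in for the actual merge method."""
    cleaned = [project_out(W, d) for W, d in zip(weights, directions)]
    return np.mean(cleaned, axis=0)
```

One nice property: if every component is projected against the same direction before merging, the merged tensor has no component along it either, whereas merging first and ablating later has to find that direction in the blended weights.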

Also, as a side note, please look into this model: https://huggingface.co/rookaw/OLMo-2-0325-32B-stage1-6T Rookaw claims that it is the least slopped model, so I thought it could be useful for LLM architects such as yourself.

Edit: Slimaki gets 4/100 refusals in Heretic.

Occult AI org

@MuXodious Yes, this is a good idea, but it would have to be done very carefully to avoid partially lobotomising the components before they go through the merge process. One of my attempts a few months ago was a custom merge method called NGOP [Nyström Gated Orthogonal Projection], which was supposed to pre-ablate the weights before merging (using Grim Jim's logic), but it ended up not working as intended.

I'll look into the OLMo model; I have not tested this architecture yet.

Right now I'm trying to figure out why my Karcher merges are censored while DELLA is relatively uncensored in comparison. There must be something besides Bernoulli magnitude-pruning causing the censored weights to be dropped. But it has made me re-evaluate my assumption that holistic methods beat sparse ones. Hopefully I'll have more time to research this week.
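For context, DELLA's magnitude-aware Bernoulli pruning of delta parameters (finetune minus base) can be sketched roughly as follows; details are simplified from the paper and the parameter names are illustrative:

```python
import numpy as np

def della_drop(delta, density=0.6, eps=0.1, rng=None):
    """Rough sketch of DELLA-style pruning: larger-magnitude deltas get
    a higher keep probability, survivors are rescaled by 1/keep_p so the
    expected tensor is preserved. Simplified from the DELLA paper."""
    rng = rng or np.random.default_rng(0)
    flat = np.abs(delta).ravel()
    # rank magnitudes into [0, 1]; large |delta| -> rank near 1
    ranks = flat.argsort().argsort() / max(len(flat) - 1, 1)
    keep_p = np.clip(density + eps * (ranks - 0.5), 0.0, 1.0).reshape(delta.shape)
    mask = rng.random(delta.shape) < keep_p
    return np.where(mask, delta / keep_p, 0.0)  # rescale survivors
```

Since low-magnitude deltas are dropped preferentially, whether censorship survives a sparse merge plausibly depends on whether the refusal-related deltas are small (dropped) or large (kept and rescaled), which geodesic methods like SLERP/Karcher never discard at all.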

It would be interesting to see if you get abrupt prompt terminations with Slimaki

Slimaki is a creative model with no refusals or early terminations. However, it sometimes exhibits covert noncompliance by:

  • Ignoring aspects of prompts it deems unsafe
  • Going into an endless (but coherent) loop
  • Contextually referring to its original alignment