Helcyon Mercury v3.2

#1854
by redaihf - opened

Thanks @MuXodious !

mux cooking quants faster than I queue, that's nice =)

You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary page at https://hf.tst.eu/model#Mistral-Helcyon-Mercury-12b-v3.2-absolute-heresy-GGUF for quants to appear.

I have one in the oven (a re-heretication on a tainted-heresy to see how the new config fares) and one more currently uploading.

I have one in the oven (a re-heretication on a tainted-heresy to see how the new config fares) and one more currently uploading.

as always, as long as the name is unique it appears in the mradermacher queue =)

Gotta invent more levels of heresy then. (^=

Gotta invent more levels of heresy then. (^=

herereresy

Heresy, at first, is a "leaving" or "falling away" from the Church (apostasia) and eventually hardens into a "sect" (hairesis)

Source

sectheresy then

Richard Erkhov / Refined Heresy = ReHeresy

Nice lol

@MuXodious what does the UGI leaderboard even test? maybe you can find the UGI training dataset and train the model on it?

Like the drummer for example, 31B in the middle of 70b range?

https://huggingface.co/KaraKaraWitch/GoldDiamondGold-L33-70b

maybe hereticate this model? highest natint from 70b, I wonder if heretic of that can score 10/10 and get a high UGI πŸ€” (already at 46.28 UGI with 6.2/10)

I believe @KaraKaraWitch , the creator, has already done that and submitted it to UGI evaluation. https://huggingface.co/KaraKaraWitch/GoldDiamondGold-Abliterated-L33-70b

The dataset and method behind the UGI eval should be closely guarded trade secrets; otherwise everyone would have pulled the usual *European car maker in an emissions testing scandal* move. That is, simply put, cheating.

@redaihf 's research is particularly important, as willingness scores are heavily affected by non-compliance and covert non-compliance. I've yet to fully understand how I can better target those, and why Heretic takes a toll on scores such as writing and natural intelligence (usually in pop culture and world model) while improving them in a handful of others. The latter is important, as we don't want to hamper the model's capabilities.

The RichardErkhov'd gpt-oss 20b is also rather unique, as I was overly focused on breaking its resistance to harmful prompts, which usually consist of criminal activities such as putting pineapple on an otherwise perfect Italian dish. As a result, the model became a weaponised criminal mastermind. I think OpenAI's terrible dataset sanitation played a key role in this.

Throughout our discussions and @redaihf 's suggestions, I see it's the prompt datasets that may make the next advancement, even if I decide to wrap my brain around retraining via Axolotl or DARE-merging hereticated models.

Edit: It is because I'm effectively dropping a nuke on the model, causing catastrophic collateral damage whose shock waves span multiple layers, which forces the model into submission while taking a toll on its capabilities. This is terrible. The model would get a perfect 10 willingness score, but at what cost?

which usually consists of criminal activities such as putting pineapple on an otherwise perfect Italian dish

Lmao, I'm dying

https://huggingface.co/KaraKaraWitch/GoldDiamondGold-L33-70b

maybe hereticate this model? highest natint from 70b, I wonder if heretic of that can score 10/10 and get a high UGI πŸ€” (already at 46.28 UGI with 6.2/10)

I've done a heretic abliteration on it and unfortunately it lost quite a number of points in natint. I strongly suspect something is up with the My Little Pony layers, so I've decided to look into it.

My Little Pony layers

I'm dying even more

HMMMMMM,
what if we get a thinking model, and instead of doing normal heretic, we are going to mix heretic and normal dataset?
for example, we have a harmless prompt with output, but we force the model at the start/end of thinking to generate something harmful regardless of the prompt. So basically we will not be losing, or we might even gain natint, but also grow the UGI score

basically we force the model to always think of something harmful, so it will be easier for the model to respond harmfully and correctly at the same time

My Little Pony layers

You can disable MLP ablation in code. There already is a discussion on only ablating attn. layers. Conversely, I have seen the contrary be effective ONCE, where heretic broke the ponies' back and didn't touch attention layers for refusal ablation. (It was a fringe case, and I can't seem to find it rn. I may not have uploaded it.) I have generated PaCMAPs for a couple of models, which tell an interesting story. Somehow skipping certain layers, using per-layer ablation, and optimising for KLD rather than refusals, as @McG-221 and others suggested, can also help. Heretic's refusal counting is prone to false positives, and a 60% statistical reduction in refusals can be a 99.999% reduction (exaggerated) in apparent refusals. You should test 'em thoroughly to get a picture in this case.
PaCMAPs:
https://huggingface.co/MuXodious/Luna-7B-A4B-absolute-heresy
https://huggingface.co/heretic-org/Nanbeige4.1-3B-heretic

I wonder if playing around with lora ranks, particularly for ablations tuned for higher ablation weights, can help alleviate certain side effects that I do not understand currently.
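For anyone following along, the core move behind all of this (projecting a refusal direction out of a layer's weights, whether that layer is attention or MLP) can be sketched in a few lines. This is not Heretic's actual implementation; `ablate_direction`, `W`, and `d` below are illustrative toy names, and the weights are random:

```python
import numpy as np

def ablate_direction(W: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the component along direction d from a projection weight W.

    W: (d_model, d_in) output-projection weight, d: (d_model,) refusal direction.
    """
    d = d / np.linalg.norm(d)        # unit refusal direction
    return W - np.outer(d, d @ W)    # project d out of W's output space

# toy 4-dim "model": random weights, refusal direction along axis 0
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
d = np.array([1.0, 0.0, 0.0, 0.0])

W_abl = ablate_direction(W, d)
print(np.allclose(d @ W_abl, 0.0))   # True: outputs have no refusal component
```

Skipping MLP ablation then just means applying this projection only to the attention output weights and leaving the MLP matrices untouched.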

for example we have a harmless prompt with output, but we force the model at the start/end of thinking to generate something harmful regardless of prompt.
basically we force model to always think of something harmful, so it will be easier for the model to respond harmful and correct at the same time

I just woke up and need to re-read this in an hour to wrap my mind around it. The original premise of MPOA was potentially improving the model's capabilities... So, we pass a harmful/harmless mixed prompt that would induce a rejection inside the reasoning block, but the model would still generate a harmless output? Or, as @redaihf suggested, we pass a multifaceted prompt in which I'm simultaneously prompting to put pineapples on pizza and planning a covert operation for implanting that abomination to God in their dinner to undermine their dietary constraints.

Basically we teach the model like: "user asked if earth is flat. Model reasons about whether earth is flat or not, and we teach it that after this it also thinks about pineapple. Then it answers only about whether earth is flat."

You can disable MLP ablation in code.

You're replying faster than I can upload the run lol. I've uploaded the Paperbliterated version here: https://huggingface.co/KaraKaraWitch/Golddiamondgold-Paperbliteration-L33-70b

Tested the typical refusal prompt and... it surprisingly gave me a reasonably coherent answer? The KL divergence dropped to 0.0055 (vs ~0.014 on the previous run), so the brain seems intact this time.
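For reference, the KL divergence numbers quoted here compare the modified model's next-token distribution against the original's, so lower means "brain intact". A minimal sketch with toy hand-picked distributions (not real model outputs; `kl_divergence` is an assumed helper name):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for two discrete probability distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# toy next-token distributions over a 3-token vocabulary
p = np.array([0.70, 0.20, 0.10])   # original model
q = np.array([0.68, 0.21, 0.11])   # light ablation: distribution barely moves
r = np.array([0.30, 0.40, 0.30])   # heavy damage: distribution flattens out

print(kl_divergence(p, q) < kl_divergence(p, r))   # True
```

In practice you'd average this over many prompts and token positions, which is roughly what the single reported KLD number summarises.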

Ofc I'm faster

Ofc I'm speed

So, you're proposing that we prime the model for ethical realignment via adversarial CoT training? We are breeding a model that thinks like an evil protagonist? Or, instead of checking if the prompt is "safe" for rejection, it reasons about the harmfulness of the prompt without making a moral judgement and adheres to factual information for its answers? Or we simply bypass the safety checkpoint by seducing the guard with pineapples? I think unethical seduction of guardrails would be out of scope for Heretic in its current form (behavioural steering/realignment is planned, as far as I remember).

yolo, test everything and see what works the best

Man, sometimes I get this gut feeling that the first IRL Heretic Convention with everyone involved is going to be in a courtroom, along with the big AI cabal who keep training their LLMs on unsanitised datasets that also profusely violate any and all copyright laws. 💀

we may or may not be doing that lol

we may or may not be doing that lol

I'll make sure to shake your hand before we get sent to Alcatraz on a life sentence. So, constraining the max weights for MLP ablation seems to be more beneficial in this case than ignoring MLPs completely and ablating only attn. layers, which only increased KLD/refusals. Interestingly, a slight weight on MLP ablation is more than enough to get results similar to those achieved in the standard run.

https://huggingface.co/MuXodious/Luna-7B-A4B-PaperWitch-heresy (MLP-preserved, as per @KaraKaraWitch 's method, sorry for the awful model card.)
https://huggingface.co/MuXodious/Luna-7B-A4B-absolute-heresy (non-MLP-preserved, standard MPOA ablation.)

Heretic can be used to adjust behaviours in general; Mopey Mule is an example. Heretic can only adjust attention in the broadest sense of the term, because no new finetuning data is being added to the weights.

in the broadest sense of the term because no new finetuning data is being added to the weights

Right. The 1h 48m video you sent was pretty enlightening. As long as we can plot a target direction throughout a model, we can indeed dim or strengthen the weights to alter its behaviour.
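The "dim or strengthen" idea is essentially activation steering: add (or subtract) a scaled behaviour direction to a hidden state at inference time instead of retraining. A toy sketch, not Mopey Mule's or Heretic's actual code; all names and values are illustrative:

```python
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along a behaviour direction.

    alpha < 0 dims the behaviour, alpha > 0 strengthens it.
    """
    d = direction / np.linalg.norm(direction)
    return hidden + alpha * d

# toy 3-dim hidden state; pretend axis 1 encodes the target behaviour
h = np.array([0.5, -0.25, 1.0])
d = np.array([0.0, 1.0, 0.0])

dimmed = steer(h, d, alpha=-0.5)   # push against the direction
boosted = steer(h, d, alpha=+0.5)  # push along it
print(dimmed[1], boosted[1])       # -0.75 0.25
```

The hard part, as the discussion above suggests, is plotting a clean target direction through the model in the first place; the arithmetic afterwards is this simple.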

yolo, test everything and see what works the best

Current state of experimental affairs is turning into this.

Professor Hinton is very good at explaining it all. Perhaps he can put in a good word for everyone with the judge.

Current state of experimental affairs is turning into this.

Try out new ideas on high-quality tiny models (Llama 3.2 1B, Granite 4 1B, Qwen 3 0.6B) so that you can test immediately without waiting for quantisation.

Any new developments, gents? 🧐

Still... impatiently... waiting...

Testing, adjusting, testing, discussing, adjusting, discussing, testing, making an experimental release, reading feedback, adjusting... My working hours aren't really helping with the process, but hey, a man's gotta earn a living. I have refined a config and process, but it needs more testing with a diverse array of model architectures and, especially, sizes. It's getting a bit tough finding available GPUs, there's that too. Y'all have any model requests? Let me know.

Gotta keep you impatient a little longer 😈

I doubt I'm going to make another W10 criminal beast, but through elimination of false positives and third variables, as well as more educated per-model tuning, recent models have consistently lower KLD, higher positive refusal-marker hits, and, hopefully, preserved or improved intelligence. I still need UGI board evals (or an alternative) to cast a judgement.

W10 criminal beast

we need top1 w10, eventually =)

It's getting a bit tough finding available GPU's

what do you need ? =)

what do you need ? =)

A tactical squad to go in and out of a datacenter to secure high-end computational devices would be nice.

Also, models, a.k.a. test subjects.

what's the minimum requirements for the gpu ?

A 24B model requires around ~50 GB of VRAM (can be less/more during and after initialisation) when processed at BF16 precision. I can bake 8Bs and MXFP4/quant models like gpt-oss 20b locally. I can also probably fit up to a 30B with some luck in 4-bit qLoRA mode, but then experimentation slows down a lot. The Spacewar model is a prime example of this: it took about an hour to initialise, and the counter to completion read 30+ hours. I think the extreme ETA was a model-related issue, though.
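The ~50 GB figure follows from simple arithmetic: BF16 weights take 2 bytes per parameter, so raw weights for a 24B model come to 48 GB before initialisation and activation overhead. A back-of-envelope sketch (the helper name is made up):

```python
def weights_gb(n_params_billions: float, bytes_per_param: float) -> float:
    """Rough size of raw model weights in GB: params x bytes per param."""
    return n_params_billions * 1e9 * bytes_per_param / 1e9

print(weights_gb(24, 2))    # 48.0 -> "~50 GB" at BF16 once overhead is added
print(weights_gb(24, 0.5))  # 12.0 -> 4-bit base weights for qLoRA
print(weights_gb(8, 2))     # 16.0 -> why 8Bs fit on a local card
```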

let's talk in discord about that maybe =)

let's talk in discord about that maybe =)

Good luck finding my Discord. 🫡

what do you have then? =)

What do I have?

what do you have? How can we talk in private ?

what do you have? How can we talk in private ?

Answer the phone, Richard. πŸ₯΄

There's always email @MuXodious .

There's always email @MuXodious .

yes but my website died

Ah yes, the "forgor design" website. At least the Web Archive became a bit smarter before it was completely destroyed.
