grimjim/gemma-3-12b-it-abliterated

#1584
by FrescoHF - opened

It's queued!

You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary page at https://hf.tst.eu/model#gemma-3-12b-it-abliterated-GGUF for quants to appear.

I recently stumbled upon the amazing grimjim/gemma-3-12b-it-norm-preserved-biprojected-abliterated model. Could you please make BF16 quants for this model the next time you update llama.cpp? It's a very good model, even better than the 27B models I've tried.

I recently stumbled upon the amazing grimjim/gemma-3-12b-it-norm-preserved-biprojected-abliterated model.

We already did this one:

Could you please make BF16 quants for this model the next time you update llama.cpp? It's a very good model, even better than the 27B models I've tried.

We just updated llama.cpp earlier today. BF16 quants are usually quite pointless, as even Q8_0 is mainly placebo in terms of quality, but sure, if you really want them we could provide BF16 quants.

On the 8B models, I see a big difference between the responses of Q8_0 and BF16. Perhaps this is because I am communicating with AI in a language other than English. In any case, thank you very much for BF16.

P.S. While we're at it, could you also update the quants for i1-Q6_K? I just want to get the most out of the model.

@nicoboss , seems like you forgot to quantize this model and add BF16 quants ;(

@nicoboss , seems like you forgot to quantize this model and add BF16 quants ;(

Thanks a lot for reminding me. It's currently quite hectic with Mistral-Large-3-675B. I queued it now.

Some unfortunate circumstances made this harder than expected. nico1 just started working on the Mistral-Large-3-675B-Base-2512 imatrix computation and will likely be doing so for the next 15 hours. This pauses all vision models in a way I cannot override, which is an issue, as nico1 is the only node able to process vision models. However, I was able to manually push this one to rich1 despite it being a vision model, so everything will be fine. The only reason vision models are exclusive to nico1 and get paused during RPC imatrix computation is the crazy RAM requirement for mmproj extraction, which will be skipped here as we have already done that for this model.

nico1 /tmp/quant# llmc add force -2000 si https://huggingface.co/grimjim/gemma-3-12b-it-abliterated quants_add BF16,F16 worker rich1
submit tokens: ["force","-2000","static","imatrix","quants_add","BF16,F16","worker","rich1","https://huggingface.co/grimjim/gemma-3-12b-it-abliterated"]
https://huggingface.co/grimjim/gemma-3-12b-it-abliterated
["https://huggingface.co/grimjim/gemma-3-12b-it-abliterated",["-2000","static","imatrix"],1765196912],
["https://huggingface.co/grimjim/gemma-3-12b-it-abliterated",["force","-2000","static","imatrix","quants_add","BF16,F16","worker","nico1"],1765333520],
https://huggingface.co/grimjim/gemma-3-12b-it-abliterated already in llmjob.submit.txt
forcing.
grimjim/gemma-3-12b-it-abliterated: vision model, forcing worker nico1
nico1 /tmp/quant# llmc push-model rich1 gemma-3-12b-it-abliterated
gemma-3-12b-it-abliterated: copying imatrix file from /fs/kaos/root/imatrix-remote/gemma-3-12b-it-abliterated.imatrix to rich1
rich1   gemma-3-12b-it-abliterated: run job hfd (1710790082489.75, 67139028998.25, 0, -2000)
gemma-3-12b-it-abliterated submitted to rich1

On the 8B models, I see a big difference between the responses of Q8_0 and BF16. Perhaps this is because I am communicating with AI in a language other than English. In any case, thank you very much for BF16.

Are you sure it is not just placebo? I recommend you do some more testing on that. I personally can't see any difference between i1-Q5_K_M, any quants above that, and the source model. The quant quality measurements I did a year ago support this quite well, as the measured difference was so small that it should be close to impossible for a human to notice. Especially with Q8_0 we are talking about differences so small that KL-divergence, same-token probability, top-token probability and perplexity are almost within measurement error of the source model. Maybe try testing them blind and see if you can really tell which one you are using - I would be quite interested to know.
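
To make such a blind test concrete, here is a minimal sketch of how it could look, assuming you run two local llama.cpp llama-server instances yourself, one serving the Q8_0 quant and one the BF16 quant. The ports, prompt and setup are assumptions for illustration, not anything provided in this thread.

```python
# Minimal blind A/B test sketch: query two local llama-server instances
# (one per quant) in shuffled order, judge the answers, then reveal the mapping.
import random
import requests

urls = [
    "http://127.0.0.1:8081/v1/chat/completions",  # e.g. Q8_0 server (assumed port)
    "http://127.0.0.1:8082/v1/chat/completions",  # e.g. BF16 server (assumed port)
]
random.shuffle(urls)  # hide which quant ends up as answer A or B
prompt = "Write a short story about winter."  # use the language you actually care about

answers = {}
for label, url in zip("AB", urls):
    resp = requests.post(
        url,
        json={
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.8,
            "max_tokens": 300,
        },
        timeout=600,
    )
    answers[label] = resp.json()["choices"][0]["message"]["content"]

for label in "AB":
    print(f"--- answer {label} ---\n{answers[label]}\n")

# Judge the answers first, then reveal which server produced which one.
input("Press Enter to reveal the mapping...")
for label, url in zip("AB", urls):
    print(label, "=", url)
```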

P.S. While we're at it, could you also update the quants for i1-Q6_K? I just want to get the most out of the model.

They don't need any update. Nothing has changed regarding this model or the Gemma 3 llama.cpp code since we did it a month ago.

@nicoboss , no worries, I can wait.

As for the placebo effect: for instance, kromcomp/L3.1-Apluv3-8B at F16 replies with fewer text errors. I speak Russian in my daily life (even though I'm from Ukraine). In Russian, a single word can sometimes be written in dozens of different forms (сделал, сделала, сделаю, сделает, делает), and small local language models struggle with this, making word errors very frequently (almost constantly).

However, the grimjim/gemma-3-12b-it-norm-preserved-biprojected-abliterated model handles this surprisingly well and very rarely makes mistakes. It is also truly creative and demonstrates noticeably better intelligence than other models up to 27B (inclusive) that I’ve tried. That’s why I really like this model; it’s exactly what I was looking for, and it’s finally something I can interact with effectively.

@nicoboss , you won't believe this, but I just managed to quantize this model to BF16 and Q8_0 with the help of Gemini 3.0 Pro, using build b7360. It turned out to be much easier and faster than I expected (considering my laptop RTX 3050 with 4GB VRAM and 32GB RAM, it only took a few minutes).
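
For anyone curious what such a local quantization typically involves, here is a rough sketch of the usual two-step llama.cpp flow; the paths and output names are illustrative assumptions, not the exact commands used above.

```python
# Rough sketch of the usual two-step llama.cpp quantization flow, driven from
# Python via subprocess. Paths, file names and the working directory
# (a llama.cpp checkout/build) are assumptions for illustration.
import subprocess

model_dir = "gemma-3-12b-it-abliterated"            # local HF snapshot (assumed path)
bf16_gguf = "gemma-3-12b-it-abliterated-BF16.gguf"
q8_gguf = "gemma-3-12b-it-abliterated-Q8_0.gguf"

# 1. Convert the Hugging Face checkpoint to a BF16 GGUF.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", model_dir,
     "--outtype", "bf16", "--outfile", bf16_gguf],
    check=True,
)

# 2. Requantize the BF16 GGUF down to Q8_0.
subprocess.run(
    ["./llama-quantize", bf16_gguf, q8_gguf, "Q8_0"],
    check=True,
)
```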

You mentioned that major changes in llama.cpp don't happen every day, but as an average user, when I see updates like 'CUDA: fix unpadded strides in MMA FA kernel' or 'refactor gemma3 to support rnj-1', they seem pretty worthwhile to me. I realize that might not technically be the case, but to a regular user, it looks important. And considering I often use 12B models, I want to squeeze the maximum performance out of them. I guess if I could run 72B+ models, I wouldn't worry quite as much about having the freshest llama.cpp builds.

@nicoboss , you won't believe this, but I just managed to quantize this model to BF16 and Q8_0 with the help of Gemini 3.0 Pro, using build b7360. It turned out to be much easier and faster than I expected (considering my laptop RTX 3050 with 4GB VRAM and 32GB RAM, it only took a few minutes).

Oh sorry, I just noticed I made a typo which caused the job to error. I will fix that.

You mentioned that major changes in llama.cpp don't happen every day, but as an average user, when I see updates like 'CUDA: fix unpadded strides in MMA FA kernel' or 'refactor gemma3 to support rnj-1', they seem pretty worthwhile to me. I realize that might not technically be the case, but to a regular user, it looks important. And considering I often use 12B models, I want to squeeze the maximum performance out of them. I guess if I could run 72B+ models, I wouldn't worry quite as much about having the freshest llama.cpp builds.

All those improvements apply to all quants ever created. They improve the inference code, not the conversion or quantization code. You only need to look at commits starting with "convert :", but almost all of those changes apply to new models.

If we take quants from a year ago and the most recent ones, what will be the difference in the quality and speed of the model's responses?

If we take quants from a year ago and the most recent ones, what will be the difference in the quality and speed of the model's responses?

No, they will not differ in quality or speed. Improvements made to llama.cpp usually apply retroactively to all quants ever made. There are some rare exceptions when we provide quants before all features of a model are implemented, such as back when we did DeepSeek-based models before MLA was implemented, or vision models we do before vision support for them is implemented, but we usually requant them once they are fully supported. In the future we might get improved rounding during quantization and better quant mixtures for DeepSeek, which could make a very minor difference between old and new quants, but given that none of this has been merged for a year, it is uncertain and honestly, unfortunately, somewhat unlikely that it will get merged. There is nothing stopping you from using very early mradermacher quants or even TheBloke quants, unless you try Mixtral, which as the very first MoE model has some compatibility issues causing it to fail to load, but we have hopefully requantized all of those by now.

In short: no, they will not differ in quality or speed, so don't worry about it.

@nicoboss , will imatrix quantization degrade the output quality on Cyrillic if the imatrix is based on a Latin dataset?

@nicoboss , will imatrix quantization degrade the output quality on Cyrillic if the imatrix is based on a Latin dataset?

According to our limited tests and existing research, despite our imatrix dataset being mostly in English, the imatrix quants will probably perform much better than the static quants for non-English use. Some research showed that computing the imatrix in English can be better than doing it in whatever language you intend to use the model with. I assume this is because the models themselves are usually trained on mostly English data. One could likely get slightly better imatrix quants by creating an imatrix dataset containing a mixture of English and Cyrillic text, but the gain would be almost negligible, especially with quants of IQ4_XS or larger, where the difference from the source model gets quite tiny in general.

@nicoboss , could you please also help me with configuring samplers in KoboldCpp? I know how some of them work, but a professional's opinion would be much more valuable. These screenshots show some settings I've already tried:

[screenshots of KoboldCpp sampler settings]

These settings really depend on your specific use case. There is a reason they are configurable. Here is a quick explanation of the most important ones:

  • Context: Go as large as you need, but not too large, unless you intend to run multiple requests in parallel like I do.
  • Max Output: Context - Input
  • Temperature: The higher you set it, the more creative/unpredictable the model gets. I usually set the temperature to 0.8, but I'm also usually asking the same question around 100 times and then letting other models classify/judge/summarize the answers.
  • Repetition penalty: Keep as low as possible, but increase it if the model keeps repeating itself; keep in mind that healthy models should not get stuck in loops.
  • Top P: How many top tokens to consider for selection, based on a cumulative probability threshold (see the toy sketch after this list).
  • Top K: How many of the most probable tokens to consider for selection.
  • Top A: How many top tokens to consider for selection, based on their probability.
  • Typical P: Selects tokens based on their deviation from the average.
  • Tail Free Sampling: Uses the rate of change instead of actual probability values.
  • Min P: Sets a lower probability limit for selection, relative to the most probable token.
  • Seed: Makes the model deterministic, although depending on some optimizations it might not work as expected.
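
As a toy illustration of how the truncation samplers above narrow the candidate set, here is a small sketch over a made-up next-token distribution; the tokens and probabilities are invented for illustration, not taken from any real model.

```python
# Toy next-token distribution (made-up numbers, for illustration only).
probs = {"cat": 0.42, "dog": 0.25, "house": 0.15, "tree": 0.10,
         "river": 0.05, "umbrella": 0.03}

def top_k(p, k):
    # Keep only the k most probable tokens.
    return dict(sorted(p.items(), key=lambda kv: kv[1], reverse=True)[:k])

def top_p(p, threshold):
    # Keep the smallest set of most probable tokens whose cumulative
    # probability reaches the threshold.
    kept, total = {}, 0.0
    for tok, pr in sorted(p.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = pr
        total += pr
        if total >= threshold:
            break
    return kept

def min_p(p, floor):
    # Keep tokens whose probability is at least `floor` times the
    # probability of the most likely token.
    cutoff = floor * max(p.values())
    return {tok: pr for tok, pr in p.items() if pr >= cutoff}

print(top_k(probs, 3))    # cat, dog, house
print(top_p(probs, 0.8))  # cat, dog, house (0.42 + 0.25 + 0.15 >= 0.8)
print(min_p(probs, 0.1))  # everything except "umbrella" (0.03 < 0.1 * 0.42)
```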

I recommend you read https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/ and https://docs.sillytavern.app/usage/common-settings/ for a more detailed description of what those options do.

I personally use the following values as a base and only adjust the more advanced ones for specific use cases.

  • temperature: 0.8
  • top_p: 1
  • top_k: 40
  • min_p: 0
  • repetition_penalty: 1
  • presence_penalty: 0
  • frequency_penalty: 0
  • stop: <|im_end|>

In the end, what you set for sampling won't matter that much. I usually focus my time on writing a better system prompt, or finetuning the model with axolotl if I'm not satisfied with the output I'm getting.
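
As a concrete starting point, here is a minimal sketch of sending those base values to an OpenAI-compatible local endpoint (KoboldCpp and llama.cpp's llama-server both expose one); the URL/port and the exact set of extra sampler fields a given backend accepts are assumptions, so check your server's API documentation.

```python
# Minimal sketch: send the base sampler values above to a local OpenAI-compatible
# endpoint. The URL/port is an assumption about your setup; fields such as
# top_k, min_p and repetition_penalty are non-standard extensions that not
# every backend accepts.
import requests

payload = {
    "messages": [{"role": "user", "content": "Tell me a short story."}],
    "max_tokens": 512,
    "temperature": 0.8,
    "top_p": 1,
    "top_k": 40,
    "min_p": 0,
    "repetition_penalty": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "stop": ["<|im_end|>"],
}

resp = requests.post("http://127.0.0.1:5001/v1/chat/completions",
                     json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])
```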

@nicoboss , it's just that I've seen so many different approaches:

  1. Disable almost all samplers and enable DRY. They say this is a more modern way to configure things and combat repetition.
  2. Some advise setting Top-P to 0.95 (btw, Gemini in Google AI Studio defaults to this), Top-K to 40, and Min-P to 0.05. Meanwhile, Nitral-AI has Top-K 40, Top-P 1, and Min-P 0.1.
  3. Spicychat uses Top-P 0.7 and Top-K 90 with a temperature of 0.7.

I tried different values at a temperature of 0.01 to compare the results of the different settings, but I still need to run more tests. And yes, I need a setup specifically for RP.

Regarding the 'Instruct Tag Preset' in KoboldCpp, did I understand correctly that if I choose 'KoboldCppAutomatic', I won't need to manually change the chat template for different models anymore?
