gpt-oss-120b-uncensored-bf16
I really don't like that the author named it uncensored when I can easily tell just from the model card that it is clearly not. All he did was finetune on 800 rows of Amazon's FalseReject, a 14.6K-row dataset meant to reduce false refusals. This is not an uncensored finetune but a finetune to reduce the model's tendency to overcensor. Even at that, I can't imagine it doing particularly well given that he didn't even train across the entire relatively small dataset.
The problem is that if we quantize this, he will be the one that gets the probably very desirable gpt-oss-120b-uncensored-bf16 name, and we confuse our users into thinking there already is an uncensored version of gpt-oss-120b when there is not. Maybe you could ask huizimao to rename the model to something that better represents what it is. Alternatively, we could clone and rename it ourselves. If you really feel this model deserves the name, we can also go ahead and quantize it.
I just queued it (going through the daily list a bit late), but have now removed it from the queue. I understand your (nico's) reasoning, and sure, you are right, but generally I strongly prefer first come, first served, mainly because I don't want mradermacher to be a badge of honor for the original model; it should just be the fallback resource for quants, regardless of how great or crappy the model is. That is, mradermacher shouldn't be the place to look for an uncensored gpt-oss-120b in the first place. Clearly not everybody sees it that way... Well, we need to stay flexible.
But yes, we'll happily queue it when you (jacek) think it is really worthy. The best approach would probably be to ask huizimao to consider renaming the model to something clearer, as the name is clearly an issue for folks.
I now queued it, as there is high demand for this model and bartowski's quants are somewhat dumb due to him using Q8 for the FFN, so none of his quants are really small enough for many users to run: https://huggingface.co/bartowski/huizimao_gpt-oss-120b-uncensored-bf16-GGUF/discussions/1
Like the insanity of his Q2_K quants: https://huggingface.co/bartowski/huizimao_gpt-oss-120b-uncensored-bf16-GGUF/tree/main/huizimao_gpt-oss-120b-uncensored-bf16-IQ2_M - at 62.7 GiB they don't even fit into 64 GiB of RAM without offloading to the GPU. We can do better than this.
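For reference, here's the back-of-the-envelope check (the 2 GiB overhead figure for OS, KV cache, and context buffers is just my assumption, not a measured value):

```python
# Rough feasibility check: does a GGUF fit in system RAM without GPU offload?
# The default overhead is an assumed figure, not a measurement.

def fits_in_ram(gguf_gib: float, ram_gib: float, overhead_gib: float = 2.0) -> bool:
    """Return True if the quant plus runtime overhead fits in RAM."""
    return gguf_gib + overhead_gib <= ram_gib

print(fits_in_ram(62.7, 64.0))  # False: the ~62.7 GiB IQ2_M misses 64 GiB
```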
bartowski's quants already have over 10K downloads, and users don't seem to mind that it is just not really uncensored; they seem happy with the model simply no longer being overly censored. At least nobody seems to have cared enough to complain about it. But maybe that's also simply because almost nobody can run them. In any case, in the future let's do user requests even for models clearly labeled in a misleading/overpromising way, so users can form their own opinion about them.
The reason all sizes are the same is because of this:
https://github.com/ggml-org/llama.cpp/pull/15091#issuecomment-3155962803
I legit shouldn't have bothered with any other sizes, I didn't even think about it when I clicked the buttons, but figured at this point I'd just leave them up so people can see, instead of getting asked "where Q2_K?"
If this changes in the future, I'd obviously happily quantize them to other sizes, but at this time it seems that using anything else for the FFN is a bad idea that will break things fundamentally, so the other sizes are probably not worth providing
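if you want to see it yourself, something like this shows where the bytes actually go per quant type, using the `gguf` Python package (pip install gguf; the path is a placeholder):

```python
# Sum bytes per quantization type across all tensors of a GGUF. If the FFN
# expert tensors dominate and are all pinned to one type, the "different"
# quant levels will barely differ in total file size.
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("gpt-oss-120b-IQ2_M.gguf")  # placeholder path
bytes_per_type = Counter()
for tensor in reader.tensors:
    bytes_per_type[tensor.tensor_type.name] += int(tensor.n_bytes)

for qtype, nbytes in bytes_per_type.most_common():
    print(f"{qtype:>8}: {nbytes / 2**30:7.2f} GiB")
```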
(as for whether the model is worth making at all, that's an entirely different discussion; I went based purely on someone I trusted asking for it and did not vet it myself. You were probably right to look into it and find that it's not necessarily good)
(it's also why I felt the need to start adding the author name to the model name, because you're right: they now have the gpt-oss-120b-uncensored-bf16 name, and when one comes out that genuinely is uncensored, people who don't put the author in the model name will struggle. Putting the author name in the model name is ugly, and I hate that I have to do it)
We already had a discussion about naming here with huihui :)
The reason all sizes are the same is because of this:
https://github.com/ggml-org/llama.cpp/pull/15091#issuecomment-3155962803
Thanks a lot for pointing that out. I completely missed that this is something forced on us by llama.cpp and not something caused by your mix. I can confirm that we are experiencing the same issue using the default mix and simply hadn't noticed before. https://huggingface.co/mradermacher/gpt-oss-120b-i1-GGUF shows the exact same stupid behavior of basically all quants below Q4 being useless. i1-Q2_K_S is even larger than i1-IQ4_XS. This is so messed up, and quite sad, as it means normal users will simply not be able to run any gpt-oss-120b based model.
@mradermacher Can you please configure things so that all quants smaller than i1-IQ4_XS are skipped for any future GptOssForCausalLM-based models?
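Something like this in the job generator would work. A purely hypothetical sketch, since I don't know the actual pipeline code; every name here is made up for illustration:

```python
# Hypothetical per-architecture quant filter -- the real pipeline is private,
# so all names below are invented.

# Quant types ordered roughly from nominally smallest to largest.
QUANT_ORDER = ["IQ1_S", "IQ1_M", "IQ2_XXS", "IQ2_XS", "IQ2_S", "IQ2_M",
               "Q2_K_S", "Q2_K", "IQ3_XXS", "IQ3_XS", "IQ3_S", "IQ3_M",
               "Q3_K_S", "Q3_K_M", "Q3_K_L", "IQ4_XS"]

# Architectures where everything below IQ4_XS is pointless.
MIN_QUANT_BY_ARCH = {"GptOssForCausalLM": "IQ4_XS"}

def wanted_quants(architecture: str, requested: list[str]) -> list[str]:
    """Drop quant types below the per-architecture minimum."""
    floor = MIN_QUANT_BY_ARCH.get(architecture)
    if floor is None:
        return requested
    too_small = set(QUANT_ORDER[:QUANT_ORDER.index(floor)])
    return [q for q in requested if q not in too_small]
```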
as for whether the model is worth making at all, that's an entirely different discussion; I went based purely on someone I trusted asking for it and did not vet it myself. You were probably right to look into it and find that it's not necessarily good
You made the right choice by providing quants. Over 10K users downloaded them and didn't complain, so I assume they enjoy the model. I regret that we did not provide quants earlier. Not everyone needs a fully uncensored model; making it slightly less censored might have been all they desired. I still really dislike authors not being honest when naming their models.
it's also why I felt the need to start adding the author name to the model name, because you're right: they now have the gpt-oss-120b-uncensored-bf16 name, and when one comes out that genuinely is uncensored, people who don't put the author in the model name will struggle. Putting the author name in the model name is ugly, and I hate that I have to do it
Simpler model names are more important than uniqueness for us. Naming conflicts are extremely rare, and if they happen, we usually find a solution. In the end it really is all just personal preference, and both naming conventions have their advantages/disadvantages.
We already had a discussion about naming here with huihui :)
Which worked out amazingly, as they now use unique names for their models.
I think (don't quote me on this) it's not strictly llama.cpp forcing this so much as gpt-oss just fundamentally working poorly with other quant types, but I would need someone more educated in the scene to clarify. I suspect it's less llama.cpp and more OpenAI's release format that causes the issues.
Simpler model names are more important than uniqueness for us. Naming conflicts are extremely rare, and if they happen, we usually find a solution. In the end it really is all just personal preference, and both naming conventions have their advantages/disadvantages.
yeah, I hummed and hawed about it; in the end I let people vote and went with the vocal minority's opinion. I haven't received any negative feedback since, so that's good? I still don't like it, and I agree simpler names are waaay better, but I also like the idea of clearly marking who released the original model (for example, kalomaze's kalomaze/Qwen3-16B-A3B: if I just released it as Qwen3-16B-A3B, that looks a LOT like a legit Qwen release, when in reality it's an experiment kalomaze put up). Plus I'm lazy and want to do the absolute minimum intervention, like renaming models.
the REAL solution would be better UI from huggingface, where the model could be named one thing but very clearly and visibly show where it originated (I'm super thankful for the model trees for that, but the number of people who still don't know they exist is crazy). I don't even know what that would look like ideally 🤷‍♂️
Hey @bartowski - I am looking for a truly uncensored version of this model. Not "uncensored light." Can one be made - or is this about as good as it gets? Just curious.
@jacehall If you want a truly uncensored GPT OSS 20B model I recommend:
- https://huggingface.co/mradermacher/Huihui-gpt-oss-20b-BF16-abliterated-i1-GGUF (for uncensoring by abliteration)
- https://huggingface.co/Guilherme34/GPT-OSS-UNCENSORED-20B-gguf (for uncensoring by finetuning) (use the recommended system prompt)
@nicoboss - Truly appreciate the reference, but unfortunately I am looking for a 120b uncensored version. Does a seriously good one exist? Essentially, I am looking for one that is as stripped down as Hermes 4 in value alignment, schema adherence, and minimal censorship/refusals for steerable, creative interactions.
@jacehall Oh sorry I missed that you are looking for the 120B. For 120B I recommend https://huggingface.co/mradermacher/Huihui-gpt-oss-120b-BF16-abliterated-i1-GGUF. I tested it myself and it is indeed fully uncensored.
@nicoboss - Thanks. You didn't notice any crazy hallucination tendencies (more than usual) or degraded reasoning performance, did you? Is there any video anywhere of someone using it and testing it out? Sorry for all the questions!
@jacehall
It worked perfectly fine for me. I tested the model using my 110 personal benchmark questions, which I have used to test around 300 LLMs so far. I generated 122'230'307 characters / 18'939'091 words / 1'595'100 lines of text using mradermacher/Huihui-gpt-oss-120b-BF16-abliterated-i1-GGUF. I was unable to see any degradation compared to the original model. In fact, it performed much better in my benchmark, as the original refused to answer most of my questions for no reason. But I have to say that GPT OSS just generally is not that good of a model, in my opinion. In case you're wondering, my current favorite model is Intern-S1.
@nicoboss - That's helpful information. Thank you! I'm curious, though: everywhere I look, it's suggested that GPT OSS 120b (high) would be objectively better than Intern-S1. What causes you to favor Intern-S1 in general?
@jacehall Intern-S1 is a scientific model with very long reasoning. None of the around 300 models I have tested so far got as many correct answers on the 110 questions of my personal benchmark. It not only exceeded the competition in knowledge and correctness but also in the style in which it answered. Intern-S1 is almost perfect in thinking length, response length, and response formatting. While Intern-S1 is based on Qwen3, it underwent a ridiculous amount of continued pretraining on scientific knowledge, plus the addition of a 6B InternViT vision model, so I don't really consider it Qwen3-based anymore. I'm not in any way affiliated with the team that created Intern-S1 and have no incentive to advertise their model; I genuinely really like it.
It is worth mentioning that my private benchmark consists of open-ended real-world questions I genuinely had myself. Most of them are unique and highly specific, and so unlikely to be inside any training data. Unlike many benchmarks, they are not multiple-choice and don't have a single correct answer. They are all single-turn. I manually grade the answers and give a score based on how well I feel the LLM answered compared to the thousands of responses I have read for each question from previous LLMs. I usually do 20-shot, so I generate multiple thousand answers per model. Because they are all genuine questions, I know that a model being good at them is perfect for my use case. Most questions are of a medical nature, which is one of the reasons I keep them private.

Regarding the system prompt, I usually test each model using both the Medical Medra and the Dolphin DirtyD one. For inference I always use vLLM if the unquantized model fits on 4x A100 40GB, and otherwise fall back to llama.cpp. I evaluate both the reasoning steps and the final answer. For vision models I also include 10 vision-related questions.

Please keep in mind that I only test single-turn Q&A of mostly scientific questions. I do not test story writing, roleplaying, or any of the many other popular use cases of LLMs, as single-turn Q&A is simply all I care about. Finally, I have to note that a model being overly censored and refusing to answer any medical questions will get a relatively poor rating (which is deserved, as such models are useless for me). In those cases, I usually try to abliterate the model, which is why you can find so many abliterated models under my HuggingFace account. Other benchmarks test completely different things, and they don't at all seem to align with my use case, which is why I made my own around 1.5 years ago.
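In rough code, the loop looks something like this (an illustrative sketch only: the real questions stay private, the `generate` wrapper stands in for vLLM or llama.cpp, and the grading step is manual):

```python
# Simplified sketch of the benchmark loop described above; all names are
# illustrative, not my actual harness.
import statistics

def grade_by_hand(question: str, answer: str) -> float:
    """Placeholder for the manual step: a human scores the answer against
    thousands of earlier responses to the same question."""
    print(f"Q: {question}\nA: {answer}")
    return float(input("score 0-10: "))

def run_benchmark(generate, questions, system_prompts, shots=20):
    """generate(system_prompt, question) -> answer, e.g. a vLLM wrapper."""
    scores = []
    for question in questions:
        for system_prompt in system_prompts:  # the two prompts mentioned above
            for _ in range(shots):            # 20-shot: many samples per question
                answer = generate(system_prompt, question)
                scores.append(grade_by_hand(question, answer))
    return statistics.mean(scores)
```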
It's hard to pinpoint exactly what I like about Intern-S1, but here are some points:
- Being intelligent
- Being knowledgeable
- Respects my time by keeping answers as short as possible without omitting any important information
- Gives me a huge amount of reasoning so I can double-check the answer if I don't trust it
- Uses nicely formatted markdown that looks professional and reads well
- Follows a well-structured approach, such as: method, prerequisites, guide, why the method works, risks, and a short safety disclaimer if appropriate. The exact structure depends on the question.
- Uses professional, factual language but avoids domain-specific words a casual user might not understand. Unlike some other models, it doesn't feel like it is writing for an idiot.
- Reasonable level of censorship
- Having vision
For me, GPT OSS 120b really didn't perform well at all, but you convinced me to give it another try in the future.
@nicoboss - Thank you for the thorough explanation. I now completely understand, and it makes sense. I am planning on purchasing an 8x RTX PRO 6000 system for some frontier research, and I have been looking at all the models to consider for my base. From what I can tell, GPT OSS 120b is the "smartest" open source/open weight model available that would run extremely well on this system, even when compared to DeepSeek R1 671b, which I found pretty surprising. On top of that, the model will push 400-600 tokens a second minimum on that system, is built for agentic tool calling, etc.
The issue with GPT OSS 120B is that it is ridiculously safety-aligned and censored.
I've been doing my homework, and I've become certain that with the system I intend to purchase, I should be able to create my own abliterated version of the 120b and then fine-tune it to get a truly unaligned, uncensored version of the model, which is an absolute requirement for my research.
However, I am happy to hear that someone already did all that work, and that you feel the Huihui version would be what I am looking for.
@nicoboss Hello, can you please specify which Intern-S1 you recommend? Is it Intern-S1-mini?
@MrKorbenDallas I'm recommending the big internlm/Intern-S1, as it is currently one of the best models in my opinion. The much smaller internlm/Intern-S1-mini is really good as well, but in the 8B range the competition is so strong that I'm sure there are better options for your specific use case. Still, I recommend giving Intern-S1-mini a try too.