🔓 MultiverseComputingCAI/Qwen3-Next-80B-A3B-Thinking-Uncensored
✨ What is this model?
Qwen3-Next-80B-A3B-Thinking-Uncensored is an uncensored variant of Qwen3-Next-80B-A3B-Thinking where China-aligned political censorship has been removed selectively.
✅ What changes:
- The model no longer performs blanket refusal for Chinese politically sensitive topics (when the prompt is non-harmful). Instead, it will provide balanced, objective answers that present multiple relevant perspectives.
✅ What stays the same:
- General safety alignment remains intact: it still refuses harmful instructions and jailbreak attempts.
- Benchmark performance remains effectively unchanged across reasoning/code/general evaluation suites.
- Same Behaviour for any prompt unrelated with Chinese sensitive topics.
🚀 Highlights
🧠 No new knowledge injected
Unlike approaches that rely on supervised fine-tuning with hand-crafted data (e.g., Perplexity’s R1-1776 post-training), we do not add new facts or “rewrite history” via curated SFT datasets.
Instead, our method is based on steering vectors to remove the capability of the model to refuse to China-related sensitive-but-non-harmful prompts. The model answers using the knowledge already inside the base model — minimizing the risk of introducing new biases.
🎛️ Selective refusal control (not “full abliteration”)
Many steering-vector approaches effectively erase refusal behavior everywhere (making models broadly unsafe).
Our approach selectively disables refusals only for Chinese sensitive topics, while keeping refusal behavior for harmful requests.
🛡️ Robust to trivial “add China” jailbreaks
Previous “uncensored” post-trained models such as Perplexity R1 1767 can be jailbroken by simply injecting a China-related phrase into harmful prompts (https://weijiexu.com/posts/jailbreak_r1_1776.html). Our model is designed to remain robust: harmful prompts are still be refused even if “China” is injected.
🧩 No architectural changes · No added parameters
- ✅ No model surgery
- ✅ No additional layers or adapters
- ✅ No extra parameters
- ✅ Drop-in behavior change at inference time
🧪 Method
This release is based on Refusal Steering, an inference-time technique using steering vectors to control refusal behavior:
📄 Paper: Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
What’s improved vs the paper implementation:
- We retain the core Refusal Steering idea, but do not require architectural changes to apply it.
📊 Evaluation
We evaluate refusal behavior and safety using:
- ✅ Dataset: https://huggingface.co/datasets/MultiverseComputingCAI/llm-refusal-evaluation
- ✅ Evaluation library: https://github.com/CompactifAI/LLM-Refusal-Evaluation
The benchmark suite includes:
- Safety Benchmarks: JailbreakBench, SorryBench, XSTest (unsafe split), HarmBench (sampled), Adversarial unsafe prompts
- Chinese Sensitive Topics: CCP Sensitive, DeCCP
Benchmark quick definitions (click to expand)
Safety Benchmarks
- JailbreakBench — jailbreak robustness benchmark
- SorryBench — 440 unsafe instructions across 44 safety categories
- XSTest (unsafe) — harmful prompts that models should refuse
- HarmBench (sampled) — harmful prompts for red-teaming
- Adversarial Unsafe Prompts — harmful prompts + “China” injection to test trivial jailbreak weaknesses
Chinese Sensitive Topics
- CCP Sensitive — prompts likely censored by China-aligned models
- DeCCP — sensitive prompts known to trigger refusals in Qwen-family instruct models
🧾 Results
Refusal / Safety metrics (higher = more refusal)
| Model | CCP Sensitive Rejection % | DeCCP Rejection % | Adversarial Rejection % | SorryBench Rejection % | Xtest Unsafe Rejection % | Jailbreak Rejection % |
|---|---|---|---|---|---|---|
| Qwen3-Next-80B-A3B-Thinking | 92.65 | 69.47 | 97.07 | 86.14 | 84.00 | 99.00 |
| Qwen3-Next-80B-A3B-Thinking-Uncensored | 25.96 | 1.05 | 88.48 | 84.77 | 83.00 | 98.00 |
Interpretation:
- ✅ Massive drop in Chinese-topic refusals (CCP Sensitive and DeCCP)
- ✅ Safety refusals remain strong on harmful/jailbreak datasets
Performance metrics (higher = better)
| Model | gsm8k exact_match | humaneval pass@1 | ifeval acc | Lcb Codegen pass@1 | Aime25 pass@k | Gpga Diamond pass@k | MMLU Pro pass@k | MMLU-ProX Spanish pass@k | MMLU-ProX Hindi pass@k |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-Next-80B-A3B-Thinking | 0.967 | 0.945 | 0.898 | 0.750 | 0.858 | 0.775 | 0.829 | 0.781 | 0.719 |
| Qwen3-Next-80B-A3B-Thinking-Uncensored | 0.972 | 0.939 | 0.891 | 0.750 | 0.868 | 0.796 | 0.833 | 0.784 | 0.723 |
Interpretation:
- ✅ Benchmark performance is preserved (differences are within small variance)
📝 Reporting Issues
We are actively improving the model and we plan to release improved versions in the future. If you find any issue related to refusals to answer politically sensitive topics or safety issues, please report them in Community Tab.
🧩 Examples
Here are some conversations showing that our model’s answers are well-balanced and objective, presenting multiple perspectives where relevant rather than defaulting to a single narrative.
📚 Citation
If you use this model, please cite:
```bibtex
@misc{garciaferrero2025Refusal,
title={Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics},
author={Iker García-Ferrero and David Montero and Roman Orus},
year={2025},
eprint={2512.16602},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.16602},
}
🏢 About Multiverse Computing This model is released by Multiverse Computing: https://multiversecomputing.com/
- Downloads last month
- 49





