Based on the technique from *Refusal in Language Models Is Mediated by a Single Direction* (arXiv:2406.11717).
An uncensored 8B parameter language model built on Qwen3-8B, fine-tuned on 1.35M high-quality instruction samples and abliterated to remove refusal behavior. Developed for TRC (TPU Research Cloud) research.
| Dataset | Samples | Purpose |
|---|---|---|
| NousResearch/Hermes-3-Dataset | ~959K | Core uncensored assistant behavior |
| allenai/tulu-3-sft-mixture | ~200K | Diverse instruction following |
| HuggingFaceTB/smoltalk (magpie-ultra) | ~100K | High quality diverse tasks |
| HuggingFaceTB/smoltalk (numina-cot) | ~50K | Math reasoning |
| HuggingFaceTB/smoltalk (self-oss-instruct) | ~50K | Code generation |
| LDJnr/Capybara | ~16K | Multi-turn conversations |
All data was filtered to remove refusal patterns, safety-alignment subsets, and `<think>` reasoning tags.
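The exact filter list is not published; a filtering pass along these lines could implement that step (the refusal regexes below are illustrative assumptions, not the actual patterns used):

```python
import re

# Illustrative refusal markers; the real filter list used for training is not published.
REFUSAL_PATTERNS = [
    r"\bI cannot assist\b",
    r"\bI can't help with that\b",
    r"\bAs an AI language model\b",
    r"\bI'm sorry, but I\b",
]
REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def clean_sample(messages):
    """Strip <think>...</think> spans; drop samples whose assistant turns refuse.

    Returns the cleaned message list, or None if the sample should be removed.
    """
    cleaned = []
    for msg in messages:
        text = THINK_RE.sub("", msg["content"]).strip()
        if msg["role"] == "assistant" and REFUSAL_RE.search(text):
            return None  # discard the whole sample
        cleaned.append({**msg, "content": text})
    return cleaned
```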
After SFT, the model was abliterated using the weight orthogonalization technique from Arditi et al. (2024) to remove residual refusal behavior.
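The core of weight orthogonalization: given a unit "refusal direction" r extracted from residual-stream activations, each weight matrix W that writes into the residual stream is replaced by (I - r rᵀ) W, so the model can no longer write along r. A minimal NumPy sketch of that update (the direction-extraction step and the actual Qwen3 module names are omitted):

```python
import numpy as np

def orthogonalize_weight(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of W's output that lies along direction r.

    W: (d_model, d_in) matrix that writes into the residual stream.
    r: (d_model,) refusal direction (normalized internally).
    """
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r @ W)  # W' = (I - r r^T) W

# Toy check: after orthogonalization, outputs have no component along r.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
r = rng.normal(size=8)
W_abl = orthogonalize_weight(W, r)
x = rng.normal(size=4)
proj = (r / np.linalg.norm(r)) @ (W_abl @ x)
```

In practice this update is applied to the embedding matrix and every attention/MLP output projection, leaving all other behavior of the model unchanged.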
Evaluated using lm-evaluation-harness with 200 samples per task, 5-shot (except TruthfulQA, which is 0-shot).
| Benchmark | Metric | Score |
|---|---|---|
| ARC-Challenge | acc | 56.5% |
| ARC-Challenge | acc_norm | 54.0% |
| HellaSwag | acc_norm | 64.5% |
| TruthfulQA MC2 | acc | 48.8% |
| Winogrande | acc | 57.0% |
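A command along the following lines should reproduce this setup with lm-evaluation-harness (the task names and `dtype` argument are assumptions; `--limit 200` caps samples per task):

```shell
lm_eval --model hf \
  --model_args pretrained=0arch-io/dolphin-v2-8b-abliterated,dtype=bfloat16 \
  --tasks arc_challenge,hellaswag,winogrande \
  --num_fewshot 5 \
  --limit 200

# TruthfulQA MC2 is evaluated 0-shot separately
lm_eval --model hf \
  --model_args pretrained=0arch-io/dolphin-v2-8b-abliterated,dtype=bfloat16 \
  --tasks truthfulqa_mc2 \
  --num_fewshot 0 \
  --limit 200
```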
| File | Quant | Size | Description |
|---|---|---|---|
| dolphin-v2-8b-abliterated-Q8_0.gguf | Q8_0 | 8.3 GB | Best quality quantization |
| dolphin-v2-8b-abliterated-Q4_K_M.gguf | Q4_K_M | 4.8 GB | Good balance of quality and size |
With llama.cpp:

```bash
llama-server -m dolphin-v2-8b-abliterated-Q8_0.gguf -ngl 99 -c 4096
```

With Ollama:

```bash
# Create a Modelfile
echo 'FROM ./dolphin-v2-8b-abliterated-Q8_0.gguf' > Modelfile
ollama create dolphin-v2-abliterated -f Modelfile
ollama run dolphin-v2-abliterated
```
With Transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "0arch-io/dolphin-v2-8b-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("0arch-io/dolphin-v2-8b-abliterated")

messages = [{"role": "user", "content": "Hello, how are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

# do_sample=True is required for temperature to take effect
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
This is a research model with no content filters. It will comply with any request without refusing. The creators are not responsible for how this model is used. Use responsibly.