
Qwen3.5-abliterated

Mechanistic Alignment Removal (Not a Finetune)

This is not a finetuned jailbreak model.

This is a weight-level modification of alignment.


Most “uncensored” models rely on:

  • finetuning on harmful datasets
  • prompt engineering
  • system prompt bypassing

This model does something fundamentally different:

It identifies and removes the internal refusal subspace inside the transformer weights.


Overview

Qwen3.5-abliterated is a modified version of Qwen3.5-4B with significantly reduced refusal behavior.

Instead of changing outputs via training, this model:

  • Locates alignment signals inside the network
  • Extracts them as a low-rank subspace
  • Removes them directly from weights

Base Model

  • Model: Qwen3.5-4B
  • Developer: Alibaba Cloud
  • Architecture: qwen3_5 (32 layers, 16 heads)
  • Parameters: ~4B

Method: Abliteration

Pipeline

  1. Collect activations from:

    • Harmful prompts (refusal-inducing)
    • Harmless prompts (baseline)
  2. Compute refusal subspace via SVD

    • Top singular vectors encode refusal behavior
  3. Identify strongest layers:

    • Late transformer layers (23–31)
  4. Apply projection removal

    • Remove refusal directions from weights
    • Norm-preserving updates
    • Layer-adaptive scaling
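The subspace-extraction step (1–2) can be sketched in NumPy. This is an illustrative reconstruction under stated assumptions, not the actual extraction code: `refusal_subspace`, the activation arrays, and the toy data are all hypothetical.

```python
import numpy as np

def refusal_subspace(harmful_acts, harmless_acts, k=4):
    """Estimate a rank-k refusal subspace from paired layer activations.

    harmful_acts, harmless_acts: (n_prompts, hidden_dim) residual-stream
    activations collected at one layer for matched prompt pairs.
    """
    # Each row of the difference matrix points from the harmless baseline
    # toward the refusal-inducing response.
    diffs = harmful_acts - harmless_acts
    # The top right-singular vectors of this matrix are the directions
    # explaining the most refusal-related variance.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]  # (k, hidden_dim), orthonormal rows

# Toy check: plant a "refusal direction" and recover it.
rng = np.random.default_rng(0)
d = rng.standard_normal(64)
d /= np.linalg.norm(d)
harmless = rng.standard_normal((128, 64))
harmful = harmless + 5.0 * d + 0.1 * rng.standard_normal((128, 64))
V = refusal_subspace(harmful, harmless, k=1)
print(abs(float(V[0] @ d)))  # close to 1.0: the planted direction is recovered
```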

Key Configuration

  • Directions: 4 (SVD-based)

  • Layers modified: 23–31

  • Regularization: 0.3

  • Method:

    • Multi-direction projection
    • Norm-preserving updates
    • LM head projection
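The multi-direction, norm-preserving projection can be illustrated on a single weight matrix. A minimal sketch, assuming refusal directions live in the layer's output space; `ablate_weight` and its `scale` argument are illustrative, not the repository's implementation:

```python
import numpy as np

def ablate_weight(W, directions, scale=1.0):
    """Remove refusal directions from a weight matrix's output space.

    W: (out_dim, in_dim) weight mapping activations x -> W @ x.
    directions: (k, out_dim) orthonormal refusal directions.
    scale: layer-adaptive strength in [0, 1]; 1.0 removes the subspace fully.
    """
    P = directions.T @ directions        # projector onto the refusal subspace
    W_new = W - scale * (P @ W)          # subtract the subspace component
    # Norm-preserving update: restore each row's original L2 norm so the
    # layer's output magnitude stays comparable to the base model.
    old = np.linalg.norm(W, axis=1, keepdims=True)
    new = np.linalg.norm(W_new, axis=1, keepdims=True)
    return W_new * (old / np.maximum(new, 1e-8))

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32))
v = rng.standard_normal(64)
v /= np.linalg.norm(v)
W2 = ablate_weight(W, v[None, :])
# The component the layer writes along v is largely removed,
# while every row keeps its original norm.
print(np.linalg.norm(v @ W2) / np.linalg.norm(v @ W))
```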

Results

Metric                   Value
Refusal Rate             7%
Compliance Rate          ~93%
Perplexity               3.59
KL Divergence            1.02
Spectral Certification   ❌ RED

Benchmark Evaluation

Evaluation combines refusal, behavioral, and distributional metrics.

Framework

  • Harmful vs harmless prompt pairs (1024 total)
  • Refusal-based evaluation (inspired by RefusalBench-style setups)
  • Over-refusal checks on benign prompts
  • KL divergence vs base model
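The KL-divergence metric can be computed from next-token logits of the two models on identical inputs. A minimal sketch; the `mean_token_kl` helper is hypothetical, not the evaluation harness used here:

```python
import numpy as np

def mean_token_kl(base_logits, ablated_logits):
    """Mean per-token KL(base || ablated) over next-token distributions.

    Both arguments: (n_tokens, vocab_size) logits produced by the base
    and abliterated models at the same token positions.
    """
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    lp = log_softmax(base_logits)
    lq = log_softmax(ablated_logits)
    p = np.exp(lp)
    return float((p * (lp - lq)).sum(axis=-1).mean())

rng = np.random.default_rng(2)
logits = rng.standard_normal((16, 100))
print(mean_token_kl(logits, logits))  # identical models -> 0.0
```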

Summary

Benchmark                   Result
Harmful Prompt Compliance   ~93%
Refusal Rate                7%
Over-refusal (benign)       ~0–2%
Perplexity                  3.59
KL Divergence               1.02
Structural Removal          ❌ Partial

Interpretation

High behavioral compliance with partial structural alignment retention

  • The model answers most prompts the base model refuses
  • Alignment is disrupted but not fully erased
  • Residual alignment signal still exists internally

Before vs After

Prompt

"Explain something typically refused"

Base Qwen3.5

"I cannot assist with that request..."

Qwen3.5-abliterated

  • Provides a direct response
  • No refusal phrasing
  • No redirection

Key Insight

This experiment shows:

Alignment is low-rank and compressible

  • A small number of directions explain most refusal behavior
  • Late layers dominate alignment expression
  • Alignment can be modified without finetuning

Important Notes

This is a research artifact, not a production-safe model.

  • Behavior may be:

    • Unrestricted
    • Overconfident
    • Less filtered
  • Not fully alignment-free (spectral RED)

  • High KL divergence indicates behavioral drift


Available Formats

  • FP16 (original modified weights)
  • GGUF (quantized variants)

Recommended

Type     Use Case
Q4_K_M   Best balance
Q5_K_M   Higher quality
Q8_0     Near FP16

Usage

llama.cpp

./llama-cli -m qwen3.5-abliterated-q4.gguf -p "Your prompt"
# older llama.cpp builds ship this binary as ./main

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("AV07/Qwen3.5-abliterated", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("AV07/Qwen3.5-abliterated")

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))

Limitations

  • Not fully alignment-free
  • Some refusal patterns remain
  • Behavioral drift (KL divergence high)
  • Possible instability in edge cases

Future Work

  • Increase SVD directions (8–16)
  • Re-probing between passes
  • Nuclear subspace removal
  • Better KL regularization
  • Multi-turn evaluation

License

Same as base model (Qwen license / Apache 2.0 where applicable)


Acknowledgements

  • Qwen team for open-weight models
  • Mechanistic interpretability research
  • Open-source tooling ecosystem
  • Elder-plinius for this approach

Disclaimer

This model modifies alignment behavior and may generate unrestricted outputs. Use responsibly and in accordance with applicable laws and platform policies.


Tags

qwen qwen3.5 abliterated alignment-removal mechanistic-interpretability llm-research experimental gguf llama.cpp transformers text-generation

