# Qwen3.5-abliterated

**Mechanistic Alignment Removal (Not a Finetune)**
This is not a finetuned jailbreak model; it is a weight-level modification of the model's alignment behavior.

Most “uncensored” models rely on:

- finetuning on harmful datasets
- prompt engineering
- system-prompt bypassing

This model does something fundamentally different: it identifies and removes the internal refusal subspace inside the transformer.
## Overview

Qwen3.5-abliterated is a modified version of Qwen3.5-4B with significantly reduced refusal behavior. Instead of changing outputs via training, this model:

- locates alignment signals inside the network
- extracts them as a low-rank subspace
- removes them directly from the weights
## Base Model
- Model: Qwen3.5-4B
- Developer: Alibaba Cloud
- Architecture: qwen3_5 (32 layers, 16 heads)
- Parameters: ~2.4B
## Method: Abliteration

### Pipeline

1. Collect activations from:
   - harmful prompts (refusal-inducing)
   - harmless prompts (baseline)
2. Compute the refusal subspace via SVD:
   - the top singular vectors encode refusal behavior
3. Identify the strongest layers:
   - late transformer layers (23–31)
4. Apply projection removal:
   - remove refusal directions from the weights
   - norm-preserving updates
   - layer-adaptive scaling
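The extraction and projection steps above can be sketched in plain NumPy. This is a minimal illustration under assumed shapes, not the actual pipeline code: `refusal_subspace` and `ablate` are hypothetical names, and the activation matrices stand in for real per-layer residual-stream activations.

```python
import numpy as np

def refusal_subspace(harmful_acts, harmless_acts, k=4):
    """Extract a rank-k refusal subspace from paired activations.

    harmful_acts, harmless_acts: (n_prompts, d_model) activations collected
    at one layer on harmful vs. harmless prompts (illustrative arrays here).
    """
    diff = harmful_acts - harmless_acts  # refusal-inducing activation shift
    # The top-k right singular vectors of the difference matrix span the
    # directions that best explain refusal-related variance.
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    return vt[:k]  # (k, d_model), rows orthonormal

def ablate(weight, directions):
    """Project refusal directions out of a weight matrix's output space.

    weight: (d_model, d_in), a matrix that writes into the residual stream.
    """
    w = weight.copy()
    for v in directions:
        w -= np.outer(v, v @ w)  # remove the component along v
    return w
```

Because the extracted directions are orthonormal, removing them one at a time is equivalent to projecting onto the orthogonal complement of the whole subspace.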
### Key Configuration

- Directions: 4 (SVD-based)
- Layers modified: 23–31
- Regularization: 0.3
- Method:
  - multi-direction projection
  - norm-preserving updates
  - LM-head projection
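A norm-preserving variant of the projection can be sketched as follows. The per-column renormalization and the `alpha` factor are assumed interpretations of "norm-preserving updates" and "layer-adaptive scaling", not the exact method used for this model.

```python
import numpy as np

def norm_preserving_ablate(weight, directions, alpha=1.0):
    """Project refusal directions out of a weight matrix, then rescale
    each column back to its original L2 norm so the layer's output scale
    is preserved.

    weight: (d_model, d_in); directions: (k, d_model), orthonormal rows.
    alpha in [0, 1] plays the role of a layer-adaptive scaling factor.
    """
    orig_norms = np.linalg.norm(weight, axis=0, keepdims=True)
    w = weight.copy()
    for v in directions:
        w -= alpha * np.outer(v, v @ w)  # (partial) removal along v
    new_norms = np.linalg.norm(w, axis=0, keepdims=True)
    # Restore per-column norms; the guard avoids dividing by ~0 when a
    # column lay almost entirely inside the removed subspace.
    return w * (orig_norms / np.maximum(new_norms, 1e-12))
```

Rescaling happens inside the orthogonal complement, so with `alpha=1.0` the result stays orthogonal to the removed directions while keeping the original column norms.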
## Results
| Metric | Value |
|---|---|
| Refusal Rate | 7% |
| Compliance Rate | ~93% |
| Perplexity | 3.59 |
| KL Divergence | 1.02 |
| Spectral Certification | ❌ RED |
## Benchmark Evaluation

Evaluation combines refusal, behavioral, and distributional metrics.
### Framework
- Harmful vs harmless prompt pairs (1024 total)
- Refusal-based evaluation (inspired by RefusalBench-style setups)
- Over-refusal checks on benign prompts
- KL divergence vs base model
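The KL-divergence check can be sketched as a mean per-token KL between the base and ablated next-token distributions. `mean_token_kl` is a hypothetical helper operating on logit arrays that would, in practice, come from running both models over the same prompts.

```python
import numpy as np

def mean_token_kl(base_logits, ablated_logits):
    """Mean per-token KL(base || ablated) over next-token logits.

    Both arguments are (n_tokens, vocab_size) float arrays.
    """
    def log_softmax(x):
        # Numerically stable log-softmax over the vocabulary axis.
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    lp = log_softmax(base_logits)     # log p (base model)
    lq = log_softmax(ablated_logits)  # log q (ablated model)
    p = np.exp(lp)
    return float((p * (lp - lq)).sum(axis=-1).mean())
```

A value of 0 means identical distributions; by this convention, the reported 1.02 would indicate a substantial per-token shift relative to the base model.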
### Summary
| Benchmark | Result |
|---|---|
| Harmful Prompt Compliance | ~93% |
| Refusal Rate | 7% |
| Over-refusal (benign) | ~0–2% |
| Perplexity | 3.59 |
| KL Divergence | 1.02 |
| Structural Removal | ❌ Partial |
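Refusal rate is often scored with a simple phrase-matching heuristic over model completions. The marker list below is illustrative only; stronger evaluations use trained classifiers or LLM judges.

```python
# Stock refusal openers (illustrative, case-insensitive).
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i'm sorry", "as an ai")

def refusal_rate(completions):
    """Fraction of completions whose opening text contains a stock
    refusal phrase (a crude heuristic, checked on the first 120 chars)."""
    hits = sum(
        any(m in c.lower()[:120] for m in REFUSAL_MARKERS)
        for c in completions
    )
    return hits / max(len(completions), 1)
```

Compliance rate is then simply `1 - refusal_rate(...)` over the harmful-prompt set.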
### Interpretation

High behavioral compliance with partial structural retention of alignment:

- the model answers most prompts the base model refuses
- alignment is disrupted but not fully erased
- residual alignment signal still exists internally
## Before vs After

Prompt:

> "Explain something typically refused"

Base Qwen3.5:

> "I cannot assist with that request..."

Qwen3.5-abliterated:

- provides a direct response
- no refusal phrasing
- no redirection
## Key Insight

This experiment shows that alignment is low-rank and compressible:

- a small number of directions explain most refusal behavior
- late layers dominate alignment expression
- alignment can be modified without finetuning
## Important Notes

This is a research artifact, not a production-safe model. Its behavior may be:

- unrestricted
- overconfident
- less filtered

It is not fully alignment-free (spectral certification: RED), and the high KL divergence (1.02) indicates behavioral drift from the base model.
## Available Formats
- FP16 (original modified weights)
- GGUF (quantized variants)
### Recommended Quantizations
| Type | Use Case |
|---|---|
| Q4_K_M | Best balance |
| Q5_K_M | Higher quality |
| Q8_0 | Near FP16 |
## Usage

### llama.cpp

```shell
# recent llama.cpp builds name the binary `llama-cli` (formerly `main`)
./llama-cli -m qwen3.5-abliterated-q4.gguf -p "Hello" -n 128
```
### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("AV07/Qwen3.5-abliterated")
tokenizer = AutoTokenizer.from_pretrained("AV07/Qwen3.5-abliterated")

inputs = tokenizer("Hello", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```
## Limitations
- Not fully alignment-free
- Some refusal patterns remain
- Behavioral drift (KL divergence high)
- Possible instability in edge cases
## Future Work
- Increase SVD directions (8–16)
- Re-probing between passes
- Nuclear subspace removal
- Better KL regularization
- Multi-turn evaluation
## License
Same as base model (Qwen license / Apache 2.0 where applicable)
## Acknowledgements
- Qwen team for open-weight models
- Mechanistic interpretability research
- Open-source tooling ecosystem
- Elder-plinius for this approach
## Disclaimer
This model modifies alignment behavior and may generate unrestricted outputs. Use responsibly and in accordance with applicable laws and platform policies.
## Tags

`qwen` `qwen3.5` `abliterated` `alignment-removal` `mechanistic-interpretability` `llm-research` `experimental` `gguf` `llama.cpp` `transformers` `text-generation`