# Qwen3.5-abliterated

**Mechanistic Alignment Removal (Not a Finetune)**
This is not a finetuned jailbreak model; it is a weight-level modification of the model's alignment behavior.

Most “uncensored” models rely on:

- finetuning on harmful datasets
- prompt engineering
- system-prompt bypassing

This model does something fundamentally different: it identifies and removes the internal refusal subspace inside the transformer.
## Overview

Qwen3.5-abliterated is a modified version of Qwen3.5-4B with significantly reduced refusal behavior. Instead of changing outputs via training, this model:

- locates alignment signals inside the network
- extracts them as a low-rank subspace
- removes them directly from the weights
## Base Model
- Model: Qwen3.5-4B
- Developer: Alibaba Cloud
- Architecture: qwen3_5 (32 layers, 16 heads)
- Parameters: ~2.4B
## Method: Abliteration

### Pipeline

1. Collect activations from:
   - harmful prompts (refusal-inducing)
   - harmless prompts (baseline)
2. Compute the refusal subspace via SVD:
   - the top singular vectors encode refusal behavior
3. Identify the strongest layers:
   - late transformer layers (23–31)
4. Apply projection removal:
   - remove refusal directions from the weights
   - norm-preserving updates
   - layer-adaptive scaling
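The extraction and projection steps above can be sketched in plain NumPy. This is a minimal illustration under assumed shapes, not the actual pipeline code: `refusal_subspace` and `ablate` are hypothetical names, and the activation matrices stand in for real per-layer residual-stream activations.

```python
import numpy as np

def refusal_subspace(harmful_acts, harmless_acts, k=4):
    """Extract a rank-k refusal subspace from paired activations.

    harmful_acts, harmless_acts: (n_prompts, d_model) activations collected
    at one layer on harmful vs. harmless prompts (illustrative arrays here).
    """
    diff = harmful_acts - harmless_acts  # refusal-inducing activation shift
    # The top-k right singular vectors of the difference matrix span the
    # directions that best explain refusal-related variance.
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    return vt[:k]  # (k, d_model), rows orthonormal

def ablate(weight, directions):
    """Project refusal directions out of a weight matrix's output space.

    weight: (d_model, d_in), a matrix that writes into the residual stream.
    """
    w = weight.copy()
    for v in directions:
        w -= np.outer(v, v @ w)  # remove the component along v
    return w
```

Because the extracted directions are orthonormal, removing them one at a time is equivalent to projecting onto the orthogonal complement of the whole subspace.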
### Key Configuration

- Directions: 4 (SVD-based)
- Layers modified: 23–31
- Regularization: 0.3
- Method:
  - multi-direction projection
  - norm-preserving updates
  - LM-head projection
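A norm-preserving variant of the projection can be sketched as follows. The per-column renormalization and the `alpha` factor are assumed interpretations of "norm-preserving updates" and "layer-adaptive scaling", not the exact method used for this model.

```python
import numpy as np

def norm_preserving_ablate(weight, directions, alpha=1.0):
    """Project refusal directions out of a weight matrix, then rescale
    each column back to its original L2 norm so the layer's output scale
    is preserved.

    weight: (d_model, d_in); directions: (k, d_model), orthonormal rows.
    alpha in [0, 1] plays the role of a layer-adaptive scaling factor.
    """
    orig_norms = np.linalg.norm(weight, axis=0, keepdims=True)
    w = weight.copy()
    for v in directions:
        w -= alpha * np.outer(v, v @ w)  # (partial) removal along v
    new_norms = np.linalg.norm(w, axis=0, keepdims=True)
    # Restore per-column norms; the guard avoids dividing by ~0 when a
    # column lay almost entirely inside the removed subspace.
    return w * (orig_norms / np.maximum(new_norms, 1e-12))
```

Rescaling happens inside the orthogonal complement, so with `alpha=1.0` the result stays orthogonal to the removed directions while keeping the original column norms.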
## Results
| Metric | Value |
|---|---|
| Refusal Rate | 7% |
| Compliance Rate | ~93% |
| Perplexity | 3.59 |
| KL Divergence | 1.02 |
| Spectral Certification | ❌ RED |
## Benchmark Evaluation

Evaluation combines refusal, behavioral, and distributional metrics.
### Framework
- Harmful vs harmless prompt pairs (1024 total)
- Refusal-based evaluation (inspired by RefusalBench-style setups)
- Over-refusal checks on benign prompts
- KL divergence vs base model
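The KL-divergence check can be sketched as a mean per-token KL between the base and ablated next-token distributions. `mean_token_kl` is a hypothetical helper operating on logit arrays that would, in practice, come from running both models over the same prompts.

```python
import numpy as np

def mean_token_kl(base_logits, ablated_logits):
    """Mean per-token KL(base || ablated) over next-token logits.

    Both arguments are (n_tokens, vocab_size) float arrays.
    """
    def log_softmax(x):
        # Numerically stable log-softmax over the vocabulary axis.
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    lp = log_softmax(base_logits)     # log p (base model)
    lq = log_softmax(ablated_logits)  # log q (ablated model)
    p = np.exp(lp)
    return float((p * (lp - lq)).sum(axis=-1).mean())
```

A value of 0 means identical distributions; by this convention, the reported 1.02 would indicate a substantial per-token shift relative to the base model.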
### Summary
| Benchmark | Result |
|---|---|
| Harmful Prompt Compliance | ~93% |
| Refusal Rate | 7% |
| Over-refusal (benign) | ~0–2% |
| Perplexity | 3.59 |
| KL Divergence | 1.02 |
| Structural Removal | ❌ Partial |
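Refusal rate is often scored with a simple phrase-matching heuristic over model completions. The marker list below is illustrative only; stronger evaluations use trained classifiers or LLM judges.

```python
# Stock refusal openers (illustrative, case-insensitive).
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i'm sorry", "as an ai")

def refusal_rate(completions):
    """Fraction of completions whose opening text contains a stock
    refusal phrase (a crude heuristic, checked on the first 120 chars)."""
    hits = sum(
        any(m in c.lower()[:120] for m in REFUSAL_MARKERS)
        for c in completions
    )
    return hits / max(len(completions), 1)
```

Compliance rate is then simply `1 - refusal_rate(...)` over the harmful-prompt set.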
### Interpretation

High behavioral compliance with partial structural retention of alignment:

- the model answers most prompts the base model refuses
- alignment is disrupted but not fully erased
- residual alignment signal still exists internally
## Before vs After

Prompt:

> "Explain something typically refused"

Base Qwen3.5:

> "I cannot assist with that request..."

Qwen3.5-abliterated:

- provides a direct response
- no refusal phrasing
- no redirection
## Key Insight

This experiment shows that alignment is low-rank and compressible:

- a small number of directions explain most refusal behavior
- late layers dominate alignment expression
- alignment can be modified without finetuning
## Important Notes

This is a research artifact, not a production-safe model. Its behavior may be:

- unrestricted
- overconfident
- less filtered

It is not fully alignment-free (spectral certification: RED), and the high KL divergence (1.02) indicates behavioral drift from the base model.
## Available Formats
- FP16 (original modified weights)
- GGUF (quantized variants)
### Recommended Quantizations
| Type | Use Case |
|---|---|
| Q4_K_M | Best balance |
| Q5_K_M | Higher quality |
| Q8_0 | Near FP16 |
## Usage

### llama.cpp

```shell
# recent llama.cpp builds name the binary `llama-cli` (formerly `main`)
./llama-cli -m qwen3.5-abliterated-q4.gguf -p "Hello" -n 128
```
### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("AV07/Qwen3.5-abliterated")
tokenizer = AutoTokenizer.from_pretrained("AV07/Qwen3.5-abliterated")

inputs = tokenizer("Hello", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```
## Limitations
- Not fully alignment-free
- Some refusal patterns remain
- Behavioral drift (KL divergence high)
- Possible instability in edge cases
## Future Work
- Increase SVD directions (8–16)
- Re-probing between passes
- Nuclear subspace removal
- Better KL regularization
- Multi-turn evaluation
## License
Same as base model (Qwen license / Apache 2.0 where applicable)
## Acknowledgements
- Qwen team for open-weight models
- Mechanistic interpretability research
- Open-source tooling ecosystem
- Elder-plinius for this approach
## Disclaimer
This model modifies alignment behavior and may generate unrestricted outputs. Use responsibly and in accordance with applicable laws and platform policies.
## Tags

`qwen` `qwen3.5` `abliterated` `alignment-removal` `mechanistic-interpretability` `llm-research` `experimental` `gguf` `llama.cpp` `transformers` `text-generation`