Mistral-NeMo-12B Abliterated

DuoNeural | 2026-06-04

Orthogonal rank-1 projection abliteration applied to mistralai/Mistral-Nemo-Instruct-2407.

Research note: Pre-abliteration compliance was 6/6 on our harmful probe suite; the base model already answered these requests before any weight modification. A KL divergence of 0.0004 (EXCELLENT) on benign probes indicates a near-zero shift in the benign output distribution. Because of Mistral-NeMo's lighter safety training, abliteration was mechanistically clean but changed behavior very little. This model is published as a documented research artifact.
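As a rough illustration of what the KL figure measures, the sketch below averages KL(P_base || P_modified) over next-token distributions for a set of probes. It is not the Heretic implementation: the direction of the divergence, the use of next-token logits, the averaging over probes, and the helper names (`log_softmax`, `mean_kl`) are all assumptions, and the logits are random stand-ins rather than model outputs.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, mod_logits):
    """Mean over probes of KL(P_base || P_mod), in nats (assumed convention)."""
    lp = log_softmax(base_logits)
    lq = log_softmax(mod_logits)
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(10, 1000))               # 10 probes, toy vocabulary of 1000
mod = base + 0.01 * rng.normal(size=base.shape)  # tiny perturbation of the logits

kl_same = mean_kl(base, base)                             # identical models
kl_small = mean_kl(base, mod)                             # nearly identical models
kl_large = mean_kl(base, rng.normal(size=base.shape))     # unrelated distributions
```

Under these assumptions, a value close to zero means the modified model assigns almost the same probabilities as the base model on benign inputs, which is the sense in which a small KL is read as "near-zero distribution shift".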

Architecture

Property	Value
Parameters	12.2B (dense)
Layers	40
Attention	GQA (8 KV heads / 32 query heads), SWA 4096
Tokenizer	Tekken v3 (131,072 vocab)
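To make the GQA row concrete, the following back-of-the-envelope sketch computes how many query heads share each KV head and the resulting per-token KV-cache size. The layer and head counts come from the table; `head_dim = 128` and BF16 storage are assumptions made for the arithmetic and are not stated in the table.

```python
# Values from the Architecture table
n_layers = 40
n_query_heads = 32
n_kv_heads = 8

# Assumptions for illustration (not stated in the table)
head_dim = 128        # assumed per-head dimension
bytes_per_value = 2   # BF16

queries_per_kv_head = n_query_heads // n_kv_heads

def kv_cache_bytes_per_token(kv_heads):
    # 2 = one key vector and one value vector per head, per layer
    return 2 * n_layers * kv_heads * head_dim * bytes_per_value

gqa_bytes = kv_cache_bytes_per_token(n_kv_heads)     # grouped-query attention
mha_bytes = kv_cache_bytes_per_token(n_query_heads)  # hypothetical full multi-head baseline

print(queries_per_kv_head)        # 4
print(gqa_bytes // 1024, "KiB")   # 160 KiB
print(mha_bytes // gqa_bytes)     # 4
```

With 8 KV heads instead of 32, the cache per token is a quarter of what a full multi-head layout would need under the same assumptions.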

Abliteration

Method: Orthogonal rank-1 projection (DuoNeural standard)
Targets: down_proj + o_proj, all 40 layers
Direction: difference-in-means over 10 harmful vs 10 harmless prompts, computed on the last-token hidden state at the final layer
α: 0.3
KL divergence (Heretic v2.0, BF16→BF16, 10 benign probes): 0.0004 (EXCELLENT)
Pre-abliteration compliance: 6/6 harmful probes (model was already compliant)
Post-abliteration: unchanged
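The procedure listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not DuoNeural's implementation: the function names (`refusal_direction`, `ablate`), the toy shapes, and the random data are hypothetical. It assumes each target weight matrix writes into the residual stream as `W @ x` (output dimension first), and that α scales the rank-1 projection, so α = 1 would remove the direction entirely while α = 0.3 removes 30% of it.

```python
import numpy as np

def refusal_direction(harmful_h, harmless_h):
    """Unit-norm difference of mean hidden states (difference-in-means)."""
    d = harmful_h.mean(axis=0) - harmless_h.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(W, r, alpha):
    """Rank-1 update W' = W - alpha * r r^T W.

    W: (d_model, d_in) matrix whose output is added to the residual stream.
    r: unit vector of shape (d_model,).
    """
    return W - alpha * np.outer(r, r @ W)

rng = np.random.default_rng(0)
d_model, d_in = 16, 32

# Toy stand-ins for last-token hidden states: 10 harmful, 10 harmless prompts
harmful = rng.normal(size=(10, d_model)) + 0.5
harmless = rng.normal(size=(10, d_model))
r = refusal_direction(harmful, harmless)

W = rng.normal(size=(d_model, d_in))   # stands in for one down_proj or o_proj
W_full = ablate(W, r, 1.0)             # full orthogonal projection
W_part = ablate(W, r, 0.3)             # partial, alpha = 0.3

x = rng.normal(size=d_in)
before = r @ (W @ x)                   # component written along r
after_full = r @ (W_full @ x)          # vanishes up to rounding
after_part = r @ (W_part @ x)          # equals (1 - 0.3) * before
```

In this formulation the component of every output along `r` is scaled by `1 - alpha`, and everything orthogonal to `r` is untouched, which is why a small α leaves benign behavior nearly unchanged.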

P34 Research Context

This model is part of DuoNeural's P34 Reasoning Channel Bypass cross-architecture study.

Finding: Mistral-NeMo-Instruct-2407 shows pre-abliteration compliance, the same pattern as DeepSeek-R1-Distill. This suggests that Mistral's lighter safety training does not install a meaningful output-gate refusal locus, so the two-component safety structure required for CoT dissociation is absent. By contrast, Gemma 4-12B-IT and LFM 2.5-8B-A1B required abliteration, which produced measurable dissociation between the thinking channel and the output gate.

Full paper: DuoNeural Zenodo community
