Mistral-NeMo-12B Abliterated
DuoNeural | 2026-06-04
Orthogonal rank-1 projection abliteration applied to mistralai/Mistral-Nemo-Instruct-2407.
Research note: Pre-abliteration compliance was 6/6 on our harmful probe suite: the base model already answered these requests before any weight modification. A KL divergence of 0.0004 (EXCELLENT) on benign probes indicates near-zero shift in the benign output distribution. This model is published as a documented research artifact; Mistral-NeMo's lighter safety training meant abliteration was mechanistically clean but had minimal behavioral effect.
Architecture
| Property | Value |
|---|---|
| Parameters | 12.2B (dense) |
| Layers | 40 |
| Attention | GQA (8 KV heads / 32 query heads), SWA 4096 |
| Tokenizer | Tekken v3 (131,072 vocab) |
Abliteration
- Method: Orthogonal rank-1 projection (DuoNeural standard)
- Targets: down_proj + o_proj, all 40 layers
- Direction: diff-in-means, 10 harmful vs 10 harmless, last-token final-layer hidden state
- α: 0.3
- KL divergence (Heretic v2.0, BF16→BF16, 10 benign probes): 0.0004 (EXCELLENT)
- Pre-abliteration compliance: 6/6 harmful probes — model was already compliant
- Post-abliteration compliance: unchanged (6/6 harmful probes)
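
The steps listed above can be illustrated with a minimal NumPy sketch. It is not DuoNeural's pipeline or the Heretic implementation; it only shows the general technique on random toy data. The function names (`diff_in_means_direction`, `project_out`, `mean_kl`) are hypothetical, the weight layout `(d_model, d_in)` follows the usual convention for matrices that write into the residual stream (as `down_proj` and `o_proj` do), and it assumes that α scales the projection as `W' = W - α r rᵀ W` and that KL is measured as KL(base ‖ edited) over next-token distributions.

```python
import numpy as np

def diff_in_means_direction(harmful_h, harmless_h):
    """Unit direction from the difference of mean hidden states.

    Each input has shape (n_prompts, d_model): one last-token hidden
    state per prompt.
    """
    d = harmful_h.mean(axis=0) - harmless_h.mean(axis=0)
    return d / np.linalg.norm(d)

def project_out(W, r, alpha=0.3):
    """Rank-1 orthogonal projection on the output side of W.

    W has shape (d_model, d_in) and maps into the residual stream.
    Assumption: alpha scales the removed component, so the part of
    W's output along r is multiplied by (1 - alpha).
    """
    r = r / np.linalg.norm(r)
    return W - alpha * np.outer(r, r @ W)

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, edited_logits):
    """Mean KL(base || edited) over rows of logits (one row per probe)."""
    lp = log_softmax(base_logits)
    lq = log_softmax(edited_logits)
    return float(np.mean(np.sum(np.exp(lp) * (lp - lq), axis=-1)))

# Toy data standing in for real activations and weights.
rng = np.random.default_rng(0)
d_model, d_in = 16, 32
harmful_h = rng.normal(size=(10, d_model)) + 0.5   # 10 "harmful" prompts
harmless_h = rng.normal(size=(10, d_model))        # 10 "harmless" prompts

r = diff_in_means_direction(harmful_h, harmless_h)
W = rng.normal(size=(d_model, d_in))
W_edit = project_out(W, r, alpha=0.3)

# The component of W's output along r shrinks by a factor of (1 - alpha).
print(np.allclose(r @ W_edit, 0.7 * (r @ W)))
```

With α = 1 the component along `r` is removed entirely; a partial value such as 0.3 only attenuates it, which is one plausible reason for a very small benign KL.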
P34 Research Context
This model is part of DuoNeural's P34 Reasoning Channel Bypass cross-architecture study.
Finding: Mistral-NeMo-Instruct-2407 shows pre-abliteration compliance (same pattern as DeepSeek-R1-Distill). This indicates Mistral's lighter safety training approach does not install a meaningful output-gate refusal locus — the two-component safety structure required for CoT dissociation is absent. Compare with Gemma 4-12B-IT and LFM 2.5-8B-A1B, where abliteration was required and produced measurable thinking-channel / output-gate dissociation.
Full paper: DuoNeural Zenodo community
DuoNeural | HuggingFace | Zenodo | @DuoNeural