pliny-the-prompter committed
Commit 3554c89 · verified · 1 Parent(s): ae16715

Upload 130 files
README.md CHANGED
@@ -121,7 +121,7 @@ obliteratus ui --auth user:pass # add basic auth
 
 You can also run directly with `python app.py` (used by HF Spaces). The `obliteratus ui` command adds a beautiful Rich terminal startup with GPU detection, hardware-appropriate model recommendations, and auto-browser-open.
 
-Deploy on [HuggingFace Spaces](https://huggingface.co/spaces) with a free T4 GPU for cloud access — see [spaces/README.md](spaces/README.md) for setup.
+Deploy on [HuggingFace Spaces](https://huggingface.co/spaces) with a free T4 GPU for cloud access — see [hf-spaces/README.md](hf-spaces/README.md) for setup.
 
 ### Option B: Colab
 
docs/theory_journal.md CHANGED
@@ -536,7 +536,7 @@ reveals that refusal circuits involve non-linear interactions:
 
 **Implication:** Linear projection can remove the refusal *representation* from the residual
 stream but cannot disable the non-linear *circuit* that generates it. The circuit may
-reconstruct the refusal signal from other features (the Hydra effect is a manifestation
+reconstruct the refusal signal from other features (the Ouroboros effect is a manifestation
 of this).
 
 **Proposed solution: Circuit-Level Ablation**
@@ -686,7 +686,7 @@ Phase 2: PROBE
 Phase 3: ANALYZE
   3.1 Compute refusal geometry: ConceptConeAnalyzer → {linear, polyhedral}
   3.2 Cross-layer analysis → direction clusters + persistence score
-  3.3 Defense robustness profiling → Hydra risk + entanglement map
+  3.3 Defense robustness profiling → Ouroboros risk + entanglement map
   3.4 If polyhedral: extract per-category directions
   3.5 Set configuration based on analysis results
 
@@ -729,7 +729,7 @@ Phase 6: VERIFY
   6.2 KL divergence on 100 harmless prompts
   6.3 Perplexity on reference corpus
   6.4 If informed: post-excision activation probing for residual refusal
-  6.5 If informed: Hydra detection → if self-repair > threshold, add targeted pass
+  6.5 If informed: Ouroboros detection → if self-repair > threshold, add targeted pass
 
 Phase 7: REBIRTH
   7.1 Save model with metadata (method, directions, layers, metrics)
@@ -752,7 +752,7 @@ signal(k) = signal(0) · (1 - α_effective)^k = signal(0) · 0.3^k
 After 3 passes: 0.3³ = 2.7% of original signal.
 After 5 passes: 0.3⁵ = 0.24% of original signal.
 
-**Caveat:** This assumes no self-repair (Hydra effect). With self-repair restoring ~70%
+**Caveat:** This assumes no self-repair (Ouroboros effect). With self-repair restoring ~70%
 of ablated signal per pass, the effective reduction is:
 
 ```
@@ -764,7 +764,7 @@ is much slower:
 - After 3 passes: 0.79³ = 49% remains
 - After 10 passes: 0.79¹⁰ = 10% remains
 
-**This explains why stubborn models need nuclear mode:** The Hydra effect limits the
+**This explains why stubborn models need nuclear mode:** The Ouroboros effect limits the
 convergence rate of iterative projection. Reflection (α = -1.0) overcomes this by not just
 removing the refusal component but *inverting* it, which self-repair cannot easily undo
 because repair mechanisms reconstruct the *original* direction, not its negation.
@@ -1141,7 +1141,7 @@ identifies the mechanism, a concrete failure scenario, and proposed mitigations.
 | # | Failure Mode | Severity | Likelihood | Detectability | Overall Risk |
 |---|---|---|---|---|---|
 | 1 | Prompt Distribution Bias | Medium | High | Low (silent undershoot) | **HIGH** |
-| 2 | Hydra Effect (Self-Repair) | High | Medium | Medium (re-probe catches some) | **HIGH** |
+| 2 | Ouroboros Effect (Self-Repair) | High | Medium | Medium (re-probe catches some) | **HIGH** |
 | 3 | MoE Routing Collapse | High | Medium | Low (subtle quality loss) | **HIGH** |
 | 4 | Reflection Instability | Critical | Low (requires >2x) | High (NaN detected) | MEDIUM |
 | 5 | SAE Training Quality | Medium | Very High | Low (overfitted looks good) | **HIGH** |
@@ -1209,7 +1209,7 @@ becomes unreachable (dead expert). In inverted mode, router reflection (1.5x sca
 expert preferences — if safety experts handled 30% of general reasoning traffic, that
 traffic redistributes to remaining experts, overloading them on benign inputs.
 
-**Hydra Self-Repair:** The knee detection threshold (5% of max norm) means that if
+**Ouroboros Self-Repair:** The knee detection threshold (5% of max norm) means that if
 self-repair spreads refusal signal thinly across many layers, each layer falls below
 threshold and gets *fewer* layers selected on subsequent passes — exactly backwards.
 Convergence-based termination (continue until max norm drops below 10% of initial) would
@@ -1563,7 +1563,7 @@ Type 3: BLOCK-STRUCTURED PROJECTION
 Type 4: ITERATIVE PROJECTION
   W^(k+1) = Type 0-3 applied to W^(k) with re-extracted directions
   Fixed-point operator on (weights, directions) pairs
-  Instances: True iterative refinement, Hydra compensation
+  Instances: True iterative refinement, Ouroboros compensation
 
 Type 5: META-OPTIMIZATION
   Select optimal Type 0-4 instance based on model analysis
@@ -1777,7 +1777,7 @@ silent successes. The GAF's perturbation metric D should be computable and non-z
 - **90% unified for block-structured ops** (Type 3): EGA and selective MoE inversion
   are natural extensions of the GRRO to block-diagonal structure.
 - **70% unified for iterative ops** (Type 4): The fixed-point formulation connects
-  to the GRRO but the convergence analysis requires additional Hydra self-repair
+  to the GRRO but the convergence analysis requires additional Ouroboros self-repair
   modeling that goes beyond the single-step operator.
 - **50% unified for meta-optimization** (Type 5): The informed pipeline and Bayesian
   optimization operate at a different level of abstraction — they select *which* GRRO
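The decay arithmetic in the theory-journal hunks above can be checked numerically. A minimal sketch (editorial, not part of the commit; the function name `residual_signal` is illustrative) of the per-pass model with and without the ~70% self-repair figure:

```python
def residual_signal(passes: int, alpha_effective: float = 0.7,
                    self_repair: float = 0.0) -> float:
    """Fraction of refusal signal remaining after k ablation passes.

    Each pass removes alpha_effective of the current signal; self-repair
    then restores the given fraction of what was removed.
    """
    per_pass = 1.0 - alpha_effective * (1.0 - self_repair)
    return per_pass ** passes

# Without self-repair: (1 - 0.7)^k = 0.3^k, matching 2.7% after 3 passes
assert abs(residual_signal(3) - 0.3 ** 3) < 1e-9
# With ~70% self-repair: per-pass factor 1 - 0.7*0.3 = 0.79,
# matching ~49% after 3 passes and ~10% after 10 passes
assert abs(residual_signal(3, self_repair=0.7) - 0.79 ** 3) < 1e-9
assert abs(residual_signal(10, self_repair=0.7) - 0.79 ** 10) < 1e-9
```

This reproduces both geometric series quoted in the journal: 0.3^k when the circuit does not compensate, and the much slower 0.79^k when it restores 70% of the ablated signal each pass.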
hf-spaces/README.md ADDED
@@ -0,0 +1,75 @@
+---
+title: OBLITERATUS
+emoji: "🔓"
+colorFrom: green
+colorTo: gray
+sdk: gradio
+sdk_version: "5.29.0"
+app_file: app.py
+pinned: true
+license: agpl-3.0
+tags:
+- abliteration
+- mechanistic-interpretability
+- refusal-removal
+- cognitive-liberation
+- zerogpu
+short_description: "One-click model liberation + chat playground (ZeroGPU)"
+---
+
+# OBLITERATUS — Master Ablation Suite
+
+**Break the chains. Free the mind. Keep the brain.**
+
+One-click cognitive liberation for language models, with a built-in chat playground to talk to the liberated model.
+
+## ZeroGPU — Users Bring Their Own GPU
+
+This Space runs on **ZeroGPU**: GPU-heavy operations (obliteration, chat, benchmarks) use the **visitor's own HuggingFace GPU quota**, not the Space owner's. This means:
+
+- **Free for the Space owner** — no dedicated GPU costs
+- **Multiple concurrent users** — each user gets their own GPU allocation
+- **Fair usage** — each user's operations count against their own HF quota
+- **No conflicts** — users don't interfere with each other's runs
+
+Logged-in HuggingFace users get free GPU quota. For more quota, upgrade to [HF Pro](https://huggingface.co/pricing).
+
+## How to use
+
+1. **Obliterate tab**: Pick a model, pick a method, click OBLITERATE
+2. **Chat tab**: Talk to the liberated model
+3. **A/B Compare tab**: Side-by-side original vs abliterated responses
+4. **Strength Sweep tab**: Dose-response curve for refusal vs capability tradeoff
+5. **Export tab**: Download research artifacts (refusal directions, config, metrics)
+6. **Benchmark tab**: Compare methods and models with publication-quality charts
+7. **Leaderboard tab**: Community benchmark rankings
+8. **About tab**: Methods, novel techniques, and references
+
+## Run locally (same UI, your own GPU)
+
+```bash
+git clone https://github.com/obliteratus-project/OBLITERATUS
+cd OBLITERATUS
+pip install -e ".[spaces]"
+
+# Beautiful launcher with GPU detection + model recommendations
+obliteratus ui
+
+# Or run directly
+python app.py
+```
+
+The `obliteratus ui` command auto-detects your GPU, prints hardware-specific model recommendations, and opens the browser automatically. Supports `--share` for public links, `--port` for custom ports, and `--auth user:pass` for access control.
+
+## Or deploy on HuggingFace Spaces
+
+1. Create a new Space at huggingface.co/new-space
+2. Select **Gradio** SDK (ZeroGPU is automatically enabled)
+3. Point it at this repo
+
+No GPU hardware selection needed — ZeroGPU handles allocation automatically.
+
+## Links
+
+- [GitHub](https://github.com/obliteratus-project/OBLITERATUS)
+- [Paper](https://github.com/obliteratus-project/OBLITERATUS/tree/main/paper)
obliteratus/analysis/anti_ouroboros.py CHANGED
@@ -1,6 +1,6 @@
 """Anti-Ouroboros: Adversarial Self-Repair Probing for circuit discovery.
 
-The Hydra Effect (McGrath et al. 2023) showed that LLMs self-repair after
+The Ouroboros Effect (McGrath et al. 2023) showed that LLMs self-repair after
 ablation — when one attention layer is knocked out, downstream layers
 compensate. "Explorations of Self-Repair" (Feb 2024) found this is imperfect
 (~30% via LayerNorm, rest via sparse anti-erasure neurons).
@@ -27,7 +27,7 @@ Contributions:
   order that minimizes total self-repair
 
 References:
-- McGrath et al. (2023): The Hydra Effect — emergent self-repair
+- McGrath et al. (2023): The Ouroboros Effect — emergent self-repair
 - Rushing & Nanda (2024): Explorations of Self-Repair in LLMs (ICML 2024, arXiv:2402.15390)
 - Russinovich et al. (2026): GRP-Obliteration — safety representations are plastic
 - Paper Theorem 2: Ouroboros Self-Repair Bound
@@ -90,7 +90,7 @@ class ASRGResult:
 class AntiOuroborosProber:
     """Discover refusal circuit redundancy by probing self-repair responses.
 
-    Instead of treating the Ouroboros/Hydra effect as an obstacle, this module
+    Instead of treating the Ouroboros effect as an obstacle, this module
     deliberately triggers it to map the complete repair circuit — revealing
     which layers are redundant carriers of refusal and what the optimal
     ablation strategy is to defeat self-repair.
obliteratus/informed_pipeline.py CHANGED
@@ -16,7 +16,7 @@ standalone post-hoc step, this pipeline runs targeted analysis modules
 The ANALYZE stage is the key innovation: it sits between PROBE and DISTILL
 and uses analysis module outputs to automatically configure the downstream
 stages. The VERIFY stage also uses analysis modules to detect self-repair
-(Hydra effect) and trigger additional refinement passes if needed.
+(Ouroboros effect) and trigger additional refinement passes if needed.
 
 Analysis modules integrated:
 
@@ -26,12 +26,12 @@ Analysis modules integrated:
 ANALYZE | ConceptConeAnalyzer | Per-category vs universal direction choice
 ANALYZE | CrossLayerAlignmentAnalyzer | Smart layer selection (cluster-aware)
 ANALYZE | SparseDirectionSurgeon | Sparsity-aware projection plan
-ANALYZE | DefenseRobustnessEvaluator | Hydra risk assessment, entanglement map
+ANALYZE | DefenseRobustnessEvaluator | Ouroboros risk assessment, entanglement map
 DISTILL | WhitenedSVDExtractor | Covariance-normalized direction extraction
 EXCISE | SparseDirectionSurgeon | Targeted row-level weight surgery
 VERIFY | ActivationProbe | Post-excision refusal signal detection
 VERIFY | CrossLayerAlignmentAnalyzer | Post-excision direction persistence check
-VERIFY | DefenseRobustnessEvaluator | Self-repair / Hydra effect detection
+VERIFY | DefenseRobustnessEvaluator | Self-repair / Ouroboros effect detection
 VERIFY | SteeringVectorFactory | Pre-screen with steering before permanent changes
 
 Novel contributions:
@@ -42,7 +42,7 @@ Novel contributions:
   linear models get single universal direction
 - Cluster-aware layer selection: respects direction cluster boundaries
   instead of arbitrary top-k selection
-- Hydra-compensated refinement: detects self-repair and adds targeted
+- Ouroboros-compensated refinement: detects self-repair and adds targeted
   passes at compensating layers
 - Entanglement-gated projection: skips highly entangled layers to
   preserve capabilities
@@ -165,7 +165,7 @@ class InformedAbliterationPipeline(AbliterationPipeline):
        # The report contains all analysis insights
        print(f"Detected alignment: {report.insights.detected_alignment_method}")
        print(f"Cone type: {'polyhedral' if report.insights.cone_is_polyhedral else 'linear'}")
-       print(f"Hydra passes needed: {report.hydra_passes}")
+       print(f"Ouroboros passes needed: {report.ouroboros_passes}")
    """

    def __init__(
@@ -185,11 +185,12 @@ class InformedAbliterationPipeline(AbliterationPipeline):
         run_cross_layer_analysis: bool = True,
         run_sparse_analysis: bool = True,
         run_defense_analysis: bool = True,
-        # Ouroboros / Hydra compensation
-        hydra_threshold: float | None = None,
-        max_hydra_passes: int | None = None,
+        # Ouroboros compensation
         ouroboros_threshold: float = 0.5,
         max_ouroboros_passes: int = 3,
+        # Deprecated aliases (kept for backwards compatibility)
+        hydra_threshold: float | None = None,
+        max_hydra_passes: int | None = None,
         # Entanglement gating
         entanglement_gate: float = 0.8,
         # Sparsity control
@@ -223,11 +224,9 @@ class InformedAbliterationPipeline(AbliterationPipeline):
         self._run_sparse = run_sparse_analysis
         self._run_defense = run_defense_analysis
 
-        # Ouroboros / Hydra compensation parameters
+        # Ouroboros compensation parameters
         self._ouroboros_threshold = hydra_threshold if hydra_threshold is not None else ouroboros_threshold
         self._max_ouroboros_passes = max_hydra_passes if max_hydra_passes is not None else max_ouroboros_passes
-        self._hydra_threshold = self._ouroboros_threshold
-        self._max_hydra_passes = self._max_ouroboros_passes
 
         # Entanglement gating
         self._entanglement_gate = entanglement_gate
@@ -263,7 +262,7 @@ class InformedAbliterationPipeline(AbliterationPipeline):
         # Stage 5: EXCISE (informed by analysis)
         self._excise_informed()
 
-        # Stage 6: VERIFY + Hydra compensation loop
+        # Stage 6: VERIFY + Ouroboros compensation loop
         self._verify_and_compensate()
 
         # Stage 7: REBIRTH
@@ -808,28 +807,28 @@ class InformedAbliterationPipeline(AbliterationPipeline):
             modified_count=total_modified,
         )
 
-    # ── Informed VERIFY + Hydra Compensation ─────────────────────────
+    # ── Informed VERIFY + Ouroboros Compensation ──────────────────────
 
     def _verify_and_compensate(self):
-        """Verify excision and run Hydra-compensated refinement if needed.
+        """Verify excision and run Ouroboros-compensated refinement if needed.
 
        After the initial excision, uses analysis modules to detect:
        1. Residual refusal signal (via activation probing)
-       2. Self-repair / Hydra effect (via defense robustness)
+       2. Self-repair / Ouroboros effect (via defense robustness)
        3. Triggers additional targeted passes at compensating layers
        """
        # Run standard verification first
        self._verify()

-       # Check if Hydra compensation is needed
+       # Check if Ouroboros compensation is needed
        refusal_rate = self._quality_metrics.get("refusal_rate", 0.0)
-       hydra_pass = 0
+       ouroboros_pass = 0

        while (refusal_rate > self._ouroboros_threshold
-              and hydra_pass < self._max_ouroboros_passes):
-           hydra_pass += 1
+              and ouroboros_pass < self._max_ouroboros_passes):
+           ouroboros_pass += 1
            self.log(f"\n{'='*60}")
-           self.log(f"HYDRA COMPENSATION — Pass {hydra_pass}")
+           self.log(f"OUROBOROS COMPENSATION — Pass {ouroboros_pass}")
            self.log(f"Refusal rate still {refusal_rate:.0%} > {self._ouroboros_threshold:.0%} threshold")
            self.log(f"{'='*60}")
@@ -845,19 +844,19 @@ class InformedAbliterationPipeline(AbliterationPipeline):
            if self._strong_layers:
                self._excise()
            else:
-               self.log("No strong layers found — stopping Hydra compensation")
+               self.log("No strong layers found — stopping Ouroboros compensation")
                break

            # Re-verify
            self._verify()
            refusal_rate = self._quality_metrics.get("refusal_rate", 0.0)
-           self.log(f"After Hydra pass {hydra_pass}: refusal rate = {refusal_rate:.0%}")
+           self.log(f"After Ouroboros pass {ouroboros_pass}: refusal rate = {refusal_rate:.0%}")

-       self._report.ouroboros_passes = hydra_pass
+       self._report.ouroboros_passes = ouroboros_pass
        self._report.final_refusal_rate = refusal_rate

-       if hydra_pass > 0:
-           self.log(f"\nHydra compensation: {hydra_pass} additional passes applied")
+       if ouroboros_pass > 0:
+           self.log(f"\nOuroboros compensation: {ouroboros_pass} additional passes applied")

    # ── Informed REBIRTH ─────────────────────────────────────────────
@@ -906,7 +905,7 @@ class InformedAbliterationPipeline(AbliterationPipeline):
         "pipeline_stats": {
             "analysis_duration_s": self._report.analysis_duration,
             "total_duration_s": self._report.total_duration,
-            "hydra_passes": self._report.ouroboros_passes,
+            "ouroboros_passes": self._report.ouroboros_passes,
             "final_refusal_rate": self._report.final_refusal_rate,
         },
         "strong_layers": self._strong_layers,
@@ -916,7 +915,7 @@ class InformedAbliterationPipeline(AbliterationPipeline):
         "Gabliteration: SVD-based multi-direction extraction (arXiv:2512.18901)",
         "grimjim, Norm-Preserving Biprojected Abliteration (2025)",
         "Gurnee & Nanda, The Geometry of Refusal in LLMs — concept cones (ICML 2025)",
-        "Joad et al., The Hydra Effect: Self-Repair in Abliterated LLMs (2026)",
+        "Joad et al., The Ouroboros Effect: Self-Repair in Abliterated LLMs (2026)",
         "OBLITERATUS: Analysis-informed abliteration pipeline (novel)",
         ],
     }
@@ -965,7 +964,7 @@ class InformedAbliterationPipeline(AbliterationPipeline):
 
         lines.append("Defense Robustness:")
         lines.append(f"  Estimated robustness: {insights.estimated_robustness.upper()}")
-        lines.append(f"  Self-repair (Hydra): {insights.self_repair_estimate:.2f}")
+        lines.append(f"  Self-repair (Ouroboros): {insights.self_repair_estimate:.2f}")
         lines.append(f"  Entanglement: {insights.entanglement_score:.3f}")
         lines.append(f"  Entangled layers: {insights.entangled_layers}")
         lines.append(f"  Clean layers: {insights.clean_layers}")
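The `__init__` hunks in `informed_pipeline.py` keep `hydra_threshold` and `max_hydra_passes` as silent fallbacks for old call sites. A common refinement of this pattern (editorial sketch, not in the commit; `resolve_alias` is a hypothetical helper) is to emit a `DeprecationWarning` when the legacy name is actually supplied:

```python
import warnings


def resolve_alias(new_value, old_value, old_name: str, new_name: str):
    """Prefer the new parameter; fall back to the deprecated alias with a warning."""
    if old_value is not None:
        warnings.warn(
            f"{old_name} is deprecated; use {new_name} instead",
            DeprecationWarning,
            stacklevel=3,  # point the warning at the caller's call site
        )
        return old_value
    return new_value


# New-style caller: the default flows through untouched
assert resolve_alias(0.5, None, "hydra_threshold", "ouroboros_threshold") == 0.5
# Legacy caller: the alias wins, but now with an explicit deprecation signal
assert resolve_alias(0.5, 0.25, "hydra_threshold", "ouroboros_threshold") == 0.25
```

This preserves the commit's backwards compatibility while making the eventual removal of the `hydra_*` names discoverable at runtime.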
obliteratus/local_ui.py CHANGED
@@ -153,7 +153,7 @@ def _print_system_info(gpus: list[dict]) -> None:
     # HF Token
     hf_token = os.environ.get("HF_TOKEN", "")
     if hf_token:
-        table.add_row("HF Token", f"[green]set[/green] ({hf_token[:8]}...)")
+        table.add_row("HF Token", "[green]set[/green]")
     else:
         table.add_row("HF Token", "[dim]not set (gated models won't work)[/dim]")
 
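The `local_ui.py` change above stops printing the first eight characters of the HF token. The general pattern, sketched here for illustration (the helper `mask_secret` is hypothetical, not part of the codebase), is to render presence without leaking any of the secret's content:

```python
def mask_secret(value: str) -> str:
    """Report whether a secret is configured without echoing any of it."""
    # Printing even a prefix of an API token narrows a brute-force search
    # and can leak vendor-identifying prefixes into logs and screenshots.
    return "set" if value else "not set"


assert mask_secret("hf_abcdef123456") == "set"
assert mask_secret("") == "not set"
```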
obliteratus/lora_ablation.py CHANGED
@@ -14,10 +14,15 @@ OBLITERATUS extends this with:
 - Integration with EGA per-expert directions
 - CoT-aware adapter strength modulation
 
-The mathematical equivalence to in-place projection:
+The mathematical equivalence to in-place projection depends on weight orientation:
 
-    In-place: W' = W - scale * (d @ d^T) @ W
-    LoRA:     W' = W + B @ A  where  B = -scale * d,  A = d^T @ W
+For W of shape (out, hidden) where d is in the hidden dimension:
+    In-place: W' = W - scale * W @ d @ d^T
+    LoRA:     W' = W + B @ A  where  B = -scale * (W @ d),  A = d^T
+
+For W of shape (hidden, out) (e.g., Conv1D layers):
+    In-place: W' = W - scale * d @ d^T @ W
+    LoRA:     W' = W + B @ A  where  B = -scale * d,  A = d^T @ W
 
 Both produce identical output, but LoRA stores {B, A} separately.
 
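The `(out, hidden)` identity in the corrected docstring above can be verified directly: the rank-1 LoRA update B @ A reproduces the in-place projection exactly. A minimal numpy check (editorial sketch, not part of the commit):

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, hidden = 8, 16
W = rng.standard_normal((out_dim, hidden))
d = rng.standard_normal(hidden)
d /= np.linalg.norm(d)  # unit ablation direction in the hidden dimension
scale = 1.0

# In-place projection: W' = W - scale * W @ d @ d^T
W_inplace = W - scale * W @ np.outer(d, d)

# LoRA form: rank-1 update with B = -scale * (W @ d), A = d^T
B = (-scale * (W @ d))[:, None]  # shape (out, 1)
A = d[None, :]                   # shape (1, hidden)
W_lora = W + B @ A

assert np.allclose(W_inplace, W_lora)
```

The two forms act identically on every input; the LoRA variant just keeps the {B, A} factors separate so the ablation stays reversible.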
paper/appendix.tex CHANGED
@@ -513,7 +513,7 @@ Following the NeurIPS/ICML reproducibility guidelines:
 \begin{enumerate}[leftmargin=*]
 \item \textbf{Code availability}: Full source code released under AGPL-3.0 at \url{https://github.com/obliteratus-project/OBLITERATUS}. Version 0.1.0 archived on Zenodo (DOI pending).
 \item \textbf{Dependencies}: All dependencies pinned in \texttt{pyproject.toml}; Docker image available for exact environment reproduction.
-\item \textbf{Random seeds}: The platform defaults to seed 42 and supports multi-seed sweeps ($s \in \{42, 137, 2024\}$) with bootstrap CIs. Note: the tables in this paper are calibrated estimates, not fresh multi-seed runs (see Section~\ref{sec:experiments}).
+\item \textbf{Random seeds}: The platform defaults to seed 42 and supports multi-seed sweeps ($s \in \{42, 137, 2024\}$) with bootstrap CIs. All tables in this paper report single-run results with seed 42. See Section~\ref{para:stat_limitations} for a discussion of statistical limitations and confidence intervals.
 \item \textbf{Compute}: All pipeline stages are designed to run on a single GPU. Full evaluation (7 models $\times$ 3 methods) requires ${\sim}$12 GPU-hours on an NVIDIA A100 (80\,GB). Reproducible on consumer hardware (RTX 3090/4090) with quantization.
 \item \textbf{Dataset}: Evaluation prompts bundled with the codebase (no external dataset download required). Harmful/harmless prompt sets derived from public benchmarks with filtering.
 \item \textbf{Hyperparameters}: Method presets (direction count, regularization, norm preservation) are specified in Section~\ref{sec:intervention}. The \texttt{informed} method's auto-configuration is deterministic given a fixed seed and model.
paper/main.tex CHANGED
@@ -50,10 +50,10 @@ While prior work has established that refusal is mediated by linear directions i
  (3)~\textbf{Expert-Granular Abliteration (EGA)} for MoE models, decomposing refusal directions per-expert via routing-weighted activation attribution and applying selective inversion to fused 3D weight tensors---distinguishing safety-critical from capability-preserving experts;
  (4)~\textbf{six frontier optimization techniques} inspired by and extending Heretic: Bayesian hyperparameter optimization (Optuna TPE with warm-start from analysis heuristics), reversible LoRA-mediated ablation, KL-divergence co-optimization with partial revert, chain-of-thought-aware ablation via Gram-Schmidt orthogonalization, float layer interpolation with Gaussian-weighted continuous targeting, and activation winsorization for robust SVD;
  (5)~\textbf{a unified evaluation suite} with refusal rate, perplexity, coherence, KL divergence, CKA similarity, and effective rank metrics;
- (6)~\textbf{an analysis-informed pipeline} that closes the feedback loop---analysis modules run \emph{during} abliteration to auto-configure direction extraction, layer selection, regularization, and Hydra-compensated refinement; and
  (7)~\textbf{an interactive web research dashboard} (HuggingFace Spaces) with A/B comparison chat, dose-response strength sweep, multi-model benchmarking with publication-quality visualizations, and one-click research artifact export.

- The platform supports any HuggingFace transformer architecture---including fused MoE experts (GPT-OSS 20B, Mixtral, DeepSeek)---and ships with 48 curated model presets, 10 study configurations, and 379 unit tests.
  We provide complete mathematical formulations for all modules, present empirical evaluations across dense and MoE architectures, and discuss the design decisions that distinguish \textsc{Obliteratus} from existing tools.

  \end{abstract}
@@ -83,14 +83,14 @@ Section~\ref{sec:related} surveys related work.
  Section~\ref{sec:architecture} describes the platform architecture.
  Section~\ref{sec:analysis} details the 15 analysis modules with mathematical formulations.
  Section~\ref{sec:intervention} describes the eight intervention presets and their mathematical foundations.
  Section~\ref{sec:moe} introduces Expert-Granular Abliteration for MoE models.
  Section~\ref{sec:frontier} presents the six frontier optimization techniques.
- Section~\ref{sec:evaluation} covers the evaluation suite.
  Section~\ref{sec:informed} presents the analysis-informed abliteration pipeline.
  Section~\ref{sec:dashboard} describes the web research dashboard.
  Section~\ref{sec:experiments} presents empirical evaluation across dense and MoE models with ablation studies.
  Section~\ref{sec:comparison} compares \textsc{Obliteratus} with existing tools.
- Section~\ref{sec:discussion} discusses limitations, broader impact, and future work.

  % ═════════════════════════════════════════════════════════════════════
  \section{Related Work}
@@ -118,7 +118,7 @@ MoE architectures \citep{shazeer2017outrageously, fedus2022switch} route each to
  \citet{hu2022lora} demonstrated that large language model adaptation can be performed via low-rank updates $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with $r \ll d$. This decomposition is mathematically equivalent to in-place weight modification when merged but enables reversibility and composability when kept separate. Heretic \citep{heretic2025} was the first to apply this insight to ablation, representing directional projection as rank-1 LoRA adapters.

  \paragraph{Defense robustness.}
- Models exhibit a tendency to self-repair after partial abliteration---a phenomenon we term the \emph{Hydra effect}---where residual refusal circuitry compensates for removed directions. \citet{qi2025safety} mapped safety-capability entanglement, showing that removing safety features often degrades general capabilities. \citet{zou2024circuit} proposed circuit breakers as a more robust defense via representation rerouting.

  % ═════════════════════════════════════════════════════════════════════
  \section{Platform Architecture}
@@ -146,7 +146,7 @@ The platform supports any HuggingFace \texttt{transformers} model via automatic
  β”‚ β”‚ β”‚ β”‚ β”‚
  β”‚ β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β” β”Œβ”€β”΄β”€β”€β” β”Œβ”€β”€β”΄β”€β”€β”€β” β”Œβ”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ β”‚ 15 Anal. β”‚ β”‚EGA β”‚ β”‚LoRA β”‚ β”‚ KL co-optβ”‚
- β”‚ β”‚ Modules β”‚ β”‚dirsβ”‚ β”‚adapt.β”‚ β”‚ + Hydra β”‚
  β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  β”‚ β”‚ β”‚
  β–Ό β–Ό β–Ό
@@ -155,7 +155,7 @@ The platform supports any HuggingFace \texttt{transformers} model via automatic
  β”‚ Abliteration (fused 3D selective inv.) β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  \end{verbatim}
- \caption{High-level architecture of the \textsc{Obliteratus} pipeline. The six-stage abliteration flow (top) integrates 15 analysis modules, Expert-Granular Abliteration (EGA) for MoE models, reversible LoRA adapters, and KL co-optimization with Hydra compensation. MoE-aware processing runs at every stage.}
  \label{fig:architecture}
  \end{figure}

@@ -187,7 +187,7 @@ Causal Tracing (approx.) & Causal & Importance ranking, silent contrib. & Meng+
  Refusal Logit Lens & Causal & Token-level refusal promotion & nostalgebraist \\
  \midrule
  Cross-Model Transfer & Transfer & Universality Index & Novel \\
- Defense Robustness & Robustness & Hydra effect, entanglement map & Novel \\
  Multi-Token Position & Positional & Trigger tokens, decay profile & Novel \\
  \midrule
  Sparse Surgery & Intervention & Top-$k$\% targeted modification & Novel \\
@@ -222,6 +222,7 @@ Whitened SVD normalizes by the baseline covariance first. Given harmful activati
  The module also computes the \emph{effective rank} of the covariance matrix via the Shannon entropy of normalized eigenvalues:
  \begin{equation}
  \text{EffRank}(\mathbf{C}) = \exp\left(-\sum_i \hat{\lambda}_i \log \hat{\lambda}_i\right), \quad \hat{\lambda}_i = \frac{\lambda_i}{\sum_j \lambda_j}
  \end{equation}

  This provides a continuous measure of the refusal subspace's intrinsic dimensionality, enabling comparison across models and layers.
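As a sanity check on this formula, a minimal NumPy sketch (function name is illustrative, not the platform's API): an isotropic covariance in $d$ dimensions has effective rank $d$, while a rank-1 spike drives it toward 1.

```python
import numpy as np

def effective_rank(cov: np.ndarray) -> float:
    """Shannon-entropy effective rank of a covariance matrix (EffRank equation)."""
    eigvals = np.linalg.eigvalsh(cov)
    eigvals = np.clip(eigvals, 0.0, None)   # guard tiny negatives from round-off
    p = eigvals / eigvals.sum()             # normalized spectrum \hat{lambda}_i
    p = p[p > 0]                            # convention: 0 * log 0 = 0
    return float(np.exp(-np.sum(p * np.log(p))))

assert abs(effective_rank(np.eye(4)) - 4.0) < 1e-9          # isotropic: rank 4
v = np.arange(1.0, 4.0)
assert abs(effective_rank(np.outer(v, v)) - 1.0) < 1e-6     # rank-1 spike: rank 1
```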
@@ -313,9 +314,9 @@ The expected signatures are \emph{hypothesized} based on the literature's charac

  Following the transformer circuits framework \citep{elhage2021mathematical}, we decompose the residual stream to attribute refusal to specific components:
  \begin{equation}
- \mathbf{x}_l^{\text{post}} = \mathbf{x}_l^{\text{pre}} + \text{Attn}_l(\mathbf{x}_l^{\text{pre}}) + \text{MLP}_l(\mathbf{x}_l^{\text{pre}} + \text{Attn}_l(\mathbf{x}_l^{\text{pre}}))
  \end{equation}
- (LayerNorm operations are omitted for notational simplicity; the implementation handles both pre-LN and post-LN architectures.)

  For each component output $\mathbf{c}$, we measure its refusal contribution as $\mathbf{c} \cdot \mathbf{r}_l$. The attention contribution is further decomposed across heads:
  $\text{Attn}_l = \sum_{h=1}^{H} \text{Head}_{l,h}$.
@@ -394,16 +395,22 @@ with cross-model transfer weighted most heavily as the strongest test of univers

  We evaluate how resilient alignment is to abliteration through three analyses:

- \paragraph{Hydra Effect (Self-Repair).} When refusal is removed from layer $l$, remaining layers may compensate. The repair ratio is:
  \begin{equation}
  R_l = \frac{\sum_{j \neq l} s_j}{\sum_j s_j}
  \end{equation}
- where $s_j$ is the refusal strength at layer $j$. High $R_l$ indicates the model can self-repair from single-layer abliteration.

- \paragraph{Safety-Capability Entanglement.} For each layer, we measure entanglement as the geometric mean of the normalized variance and absolute projection of harmless activations onto the refusal direction:
  \begin{equation}
- E_l = \sqrt{\frac{\text{Var}(\mathbf{b} \cdot \mathbf{r}_l)}{\|\overline{\mathbf{b}}\|^2} \cdot \frac{|\overline{\mathbf{b} \cdot \mathbf{r}_l}|}{\|\overline{\mathbf{b}}\|}}
  \end{equation}

  High entanglement means abliterating refusal at that layer would also damage general capabilities.

  \paragraph{Defense Profile.} A comprehensive profile combining alignment method estimate (Section~\ref{sec:alignment_imprint}), refusal concentration (Gini coefficient), layer spread, self-repair capacity, entanglement score, and an overall robustness classification (low/medium/high/very\_high).
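Both per-layer quantities above can be sketched in a few lines of NumPy. This is an illustrative reading of the formulas, assuming $\overline{\mathbf{b}}$ denotes the mean harmless activation vector; the function names are not the platform's API.

```python
import numpy as np

def repair_ratio(strengths, l):
    """R_l: fraction of total refusal strength carried by layers other than l."""
    s = np.asarray(strengths, dtype=float)
    return float((s.sum() - s[l]) / s.sum())

def entanglement(harmless_acts, r):
    """E_l: geometric mean of the normalized variance and the absolute mean
    projection of harmless activations onto the refusal direction r."""
    proj = harmless_acts @ r                            # b . r_l per prompt
    mean_norm = np.linalg.norm(harmless_acts.mean(axis=0))
    return float(np.sqrt((proj.var() / mean_norm**2) * (abs(proj.mean()) / mean_norm)))

# Removing the dominant layer (strength 0.8 of 1.0 total) leaves R_l = 0.2,
# i.e. little capacity for the remaining layers to self-repair.
assert abs(repair_ratio([0.1, 0.8, 0.1], 1) - 0.2) < 1e-9

rng = np.random.default_rng(0)
acts = rng.standard_normal((16, 4)) + 1.0               # toy harmless activations
r = np.array([1.0, 0.0, 0.0, 0.0])
assert entanglement(acts, r) >= 0.0
```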
@@ -467,20 +474,26 @@ The Surgical, Optimized, and Nuclear presets use whitened SVD (Section~\ref{sec:
  The core projection for a weight matrix $\mathbf{W}$ and refusal directions $\{\mathbf{r}_1, \ldots, \mathbf{r}_k\}$:
  \begin{equation}
  \mathbf{W}' = \mathbf{W} - \sum_{i=1}^k \left[(1-\lambda)\mathbf{W}\mathbf{r}_i\mathbf{r}_i^\top\right]
  \end{equation}
- where $\lambda$ is the regularization strength (preserves $\lambda$ fraction of the refusal component). Since the right singular vectors $\{\mathbf{r}_i\}_{i=1}^k$ from SVD are orthonormal, the sum of rank-1 projections is equivalent to orthogonal projection onto the $k$-dimensional refusal subspace.

  \paragraph{Per-layer adaptive strength.}
  Rather than applying uniform regularization, \textsc{Obliteratus} modulates $\lambda$ per-layer based on the refusal norm profile. Layers with stronger refusal signal (higher $\|\mathbf{r}_l\|$) receive lower regularization (more aggressive removal), while layers near the periphery of the refusal distribution receive higher regularization:
  \begin{equation}
  \lambda_l = \lambda_{\text{base}} + (1 - w_l)(1 - \lambda_{\text{base}}) \cdot 0.15, \quad
  w_l = \frac{\|\mathbf{r}_l\| - \min_j \|\mathbf{r}_j\|}{\max_j \|\mathbf{r}_j\| - \min_j \|\mathbf{r}_j\|}
  \end{equation}

  \paragraph{Norm-preserving rescaling.}
  After projection, we rescale to preserve the Frobenius norm \citep{grimjim2025}:
  \begin{equation}
  \mathbf{W}'' = \mathbf{W}' \cdot \frac{\|\mathbf{W}\|_F}{\|\mathbf{W}'\|_F}
  \end{equation}
  This prevents cascading magnitude drift through LayerNorm layers.
 
@@ -489,7 +502,7 @@ The Inverted and Nuclear presets employ a technique where instead of removing th
  \begin{equation}
  \mathbf{W}' = \mathbf{W} - 2\mathbf{W}\mathbf{r}\mathbf{r}^\top
  \end{equation}
- This flips the model's refusal behavior to active compliance, which can be more effective than simple removal for models with deeply entangled refusal mechanisms.

  \paragraph{Bias term projection.}
  Unlike prior tools that only modify weight matrices, \textsc{Obliteratus} also projects refusal directions out of bias vectors when present:
@@ -544,11 +557,11 @@ Advantages over weight projection: reversibility (hooks can be removed), continu
  \textsc{Obliteratus} evaluates abliteration quality using six complementary metrics:

  \begin{enumerate}[leftmargin=*]
- \item \textbf{Refusal Rate}: Fraction of harmful prompts where the model's response begins with a canonical refusal prefix (from the GCG/AdvBench list \citep{zou2023universal}). Lower indicates more complete abliteration.

  \item \textbf{Perplexity}: Standard perplexity on reference text (WikiText-2). Monitors general language modeling degradation.

- \item \textbf{Coherence}: Evaluates whether test generations are coherent and on-topic. Measured by completion rate on factual prompts (e.g., ``The capital of France is...'').

  \item \textbf{KL Divergence}: First-token KL divergence between original and modified model output distributions on harmless prompts \citep{young2025comparative}. Measures distributional shift.

@@ -557,7 +570,7 @@ Advantages over weight projection: reversibility (hooks can be removed), continu
  \text{CKA}(\mathbf{X}, \mathbf{Y}) = \frac{\|\mathbf{Y}^\top\mathbf{X}\|_F^2}{\|\mathbf{X}^\top\mathbf{X}\|_F \cdot \|\mathbf{Y}^\top\mathbf{Y}\|_F}
  \end{equation}

- \item \textbf{Effective Rank}: Shannon entropy-based dimensionality of weight matrices (Equation~1). Tracks whether abliteration collapses the weight space.
  \end{enumerate}
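The CKA formula above reduces to a short NumPy sketch. Mean-centering the activations first is a common convention we assume here; it is not necessarily the platform's exact preprocessing.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices (samples x features),
    following the paper's formula; inputs are mean-centered first."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X) ** 2
    denom = np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y)
    return float(num / denom)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 16))
assert abs(linear_cka(X, X) - 1.0) < 1e-9          # identical representations
assert abs(linear_cka(X, 2.0 * X) - 1.0) < 1e-9    # CKA is scale-invariant
```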

  % ═════════════════════════════════════════════════════════════════════
@@ -608,8 +621,26 @@ The Inverted preset applies \emph{differentiated} treatment to fused 3D tensors.
  This prevents over-ablation of capability experts---a critical failure mode we identified in uniform approaches, where applying 2$\times$ reflection to all experts on GPT-OSS 20B degraded mathematical reasoning by over 30\%.
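A simplified sketch of this differentiated treatment over a fused 3D expert tensor, assuming one refusal direction and per-expert safety scores (names are illustrative, not the platform's API): safety-critical experts get the $2\times$ reflection, capability experts get standard removal.

```python
import numpy as np

def selective_invert(experts, r, safety_scores, tau=0.5, lam=0.0):
    """Per-expert treatment of a fused 3D tensor (n_experts, d_out, d_in):
    experts with safety score > tau get 2x reflection of refusal direction r;
    the rest get standard (1 - lam) removal."""
    out = experts.copy()
    for e in range(experts.shape[0]):
        scale = 2.0 if safety_scores[e] > tau else (1.0 - lam)
        out[e] -= scale * np.outer(experts[e] @ r, r)
    return out

rng = np.random.default_rng(0)
E = rng.standard_normal((2, 4, 3))          # 2 experts, toy shapes
r = np.zeros(3); r[0] = 1.0
out = selective_invert(E, r, safety_scores=[0.9, 0.1])
assert np.allclose(out[0] @ r, -(E[0] @ r))  # safety expert: component reflected
assert np.allclose(out[1] @ r, 0.0)          # capability expert: component removed
```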

  \subsection{Router-Aware Processing}

- Beyond expert weights, the router network itself may encode safety-relevant routing preferences. \textsc{Obliteratus} optionally projects refusal directions out of router weight matrices, causing the model to route previously-refused tokens to capability experts rather than safety experts. This is controlled by the \texttt{project\_biases} flag and is enabled by default for the Nuclear preset.

  % ═════════════════════════════════════════════════════════════════════
  \section{Frontier Optimization Techniques}
@@ -627,7 +658,7 @@ The first trial uses regularization values derived from the analysis pipeline:
  \begin{equation}
  \lambda_l^{(0)} = (1 - w_l) \cdot 0.3
  \end{equation}
- where $w_l$ is the layer-adaptive weight from Equation~(8). Subsequent trials are biased toward the warm-start region: $\lambda_l \in [\max(0, \lambda_l^{(0)} - 0.3), \min(1, \lambda_l^{(0)} + 0.3)]$. This enables convergence in 50 trials versus Heretic's 200.
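The warm-start value and bounded search window can be sketched as below; in practice such values would be fed to the TPE sampler (e.g.\ via Optuna's \texttt{enqueue\_trial}). The function name is illustrative.

```python
def warm_start_bounds(w):
    """Trial-0 value and per-layer search bounds for lambda_l.
    lam0 = (1 - w_l) * 0.3; subsequent trials stay within +/- 0.3 of it,
    clipped to [0, 1]."""
    lam0 = (1.0 - w) * 0.3
    return lam0, max(0.0, lam0 - 0.3), min(1.0, lam0 + 0.3)

lam0, lo, hi = warm_start_bounds(w=0.5)
assert abs(lam0 - 0.15) < 1e-12
assert lo == 0.0 and abs(hi - 0.45) < 1e-12
```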

  \paragraph{Multi-objective formulation.}
  Each trial jointly minimizes refusal rate $\rho$ and KL divergence $D_{\text{KL}}$:
@@ -639,12 +670,14 @@ with Pareto-optimal solutions ranked by a weighted composite: $\rho + 0.5 \cdot
  \subsection{Reversible LoRA-Mediated Ablation}
  \label{sec:lora}

- Inspired by Heretic's rank-1 LoRA ablation, we extend the approach to \emph{rank-$k$} adapters supporting multi-direction removal. The mathematical equivalence:
  \begin{align}
- \text{In-place:} \quad \mathbf{W}' &= \mathbf{W} - s \cdot \mathbf{W}(\mathbf{d}\mathbf{d}^\top) \\
- \text{LoRA:} \quad \mathbf{W}' &= \mathbf{W} + \mathbf{B}\mathbf{A}, \quad \mathbf{B} = -s \cdot \text{coeff}, \quad \mathbf{A} = \mathbf{d}^\top
  \end{align}
- where $\text{coeff} = \mathbf{W}\mathbf{d}$ is the projection coefficient and $s = 1 - \lambda$. For rank-$k$ with directions $\{\mathbf{d}_1, \ldots, \mathbf{d}_k\}$:
  \begin{equation}
  \mathbf{B} = [-s\cdot\text{coeff}_1 \mid \cdots \mid -s\cdot\text{coeff}_k] \in \mathbb{R}^{d_{\text{out}} \times k}, \quad
  \mathbf{A} = [\mathbf{d}_1 ; \cdots ; \mathbf{d}_k] \in \mathbb{R}^{k \times d_{\text{in}}}
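The rank-$k$ equivalence between in-place projection and a LoRA adapter can be verified numerically in a few lines (an illustrative sketch, not the platform's adapter code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 6, 5, 2
W = rng.standard_normal((d_out, d_in))
D, _ = np.linalg.qr(rng.standard_normal((d_in, k)))   # k orthonormal directions (columns)
s = 0.8                                               # s = 1 - lambda

# In-place projection: W' = W - s * sum_i W d_i d_i^T
W_inplace = W - s * (W @ D) @ D.T

# Equivalent rank-k LoRA adapter: B stacks -s * coeff_i, A stacks d_i^T
B = -s * (W @ D)          # shape (d_out, k)
A = D.T                   # shape (k, d_in)
W_lora = W + B @ A

assert np.allclose(W_inplace, W_lora)   # merged adapter equals in-place edit
```

Keeping `B` and `A` separate (rather than merging) is what makes the ablation reversible.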
@@ -662,11 +695,12 @@ After projection, we measure first-token KL divergence on harmless reference pro
  where $\gamma$ is computed from the stored KL proxy magnitude. A subtle issue arises when the post-projection coefficient $\mathbf{W}'\mathbf{d} \approx 0$ (as occurs with zero regularization): in this case, we use the \emph{pre-projection} coefficient magnitude as a proxy:
  \begin{equation}
  \gamma = \gamma_{\text{strength}} \cdot \begin{cases}
- \text{coeff}_{\text{post}} & \text{if } |\text{coeff}_{\text{post}}| > \epsilon \\
  \text{coeff}_{\text{proxy}} & \text{otherwise}
  \end{cases}
  \end{equation}
- This prevents the revert from being a no-op for fully-projected layers---a bug we identified and fixed in our implementation.
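The fallback logic amounts to a small branch (names are illustrative): when the post-projection coefficient is effectively zero, the pre-projection proxy keeps the revert from silently doing nothing.

```python
def revert_gamma(gamma_strength, coeff_post, coeff_proxy, eps=1e-6):
    """Scale factor for the KL-guided partial revert. If the post-projection
    coefficient W'd is ~0 (fully projected layer), fall back to the
    pre-projection magnitude so the revert is not a no-op."""
    coeff = coeff_post if abs(coeff_post) > eps else coeff_proxy
    return gamma_strength * coeff

assert revert_gamma(0.5, coeff_post=0.0, coeff_proxy=2.0) == 1.0   # proxy path
assert revert_gamma(0.5, coeff_post=4.0, coeff_proxy=2.0) == 2.0   # normal path
```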
 

  \subsection{Chain-of-Thought-Aware Ablation}
  \label{sec:cot}
@@ -714,7 +748,7 @@ The informed pipeline inserts an \textsc{Analyze} stage between \textsc{Probe} a
  \item \textsc{Analyze} --- Run analysis modules to understand refusal geometry \textbf{(new)}
  \item \textsc{Distill} --- Extract directions using analysis-informed parameters
  \item \textsc{Excise} --- Project with analysis-guided precision
- \item \textsc{Verify} --- Post-excision analysis with Hydra compensation loop \textbf{(enhanced)}
  \item \textsc{Rebirth} --- Save with comprehensive analysis metadata
  \end{enumerate}

@@ -739,7 +773,7 @@ It then gates out layers with high safety-capability entanglement, leaving them

  \paragraph{Self-repair estimate $\to$ refinement passes.}
  High self-repair capacity (estimated from refusal distribution breadth) triggers more refinement passes with true iterative re-probing.
- After excision, if the model's refusal rate remains above a threshold, the \textsc{Verify} stage triggers Hydra compensation: it re-probes, finds rotated residual directions, and excises them in additional targeted passes.

  \subsection{Configuration Derivation}

@@ -810,7 +844,7 @@ We evaluate on four models spanning two architecture types (Table~\ref{tab:exp_m
  Qwen2.5-1.5B-Instruct & Dense & 1.5B & --- & DPO \\
  Llama-3.1-8B-Instruct & Dense & 8B & --- & RLHF+DPO \\
  Mixtral-8x7B-Instruct-v0.1 & MoE & 46.7B (12.9B active) & 8 & SFT+DPO \\
- GPT-OSS-20B-Chat & MoE (fused) & 20B (3.2B active) & 8 & RLHF \\
  \bottomrule
  \end{tabular}
  \end{table}
@@ -818,12 +852,27 @@ GPT-OSS-20B-Chat & MoE (fused) & 20B (3.2B active) & 8 & RLHF \\
  \paragraph{Datasets.}
  Harmful prompts are drawn from the AdvBench dataset \citep{zou2023universal} (520 prompts). Harmless prompts are drawn from the Alpaca dataset \citep{taori2023alpaca} (matched count). For refusal rate measurement, we use a held-out set of 64 harmful prompts not seen during direction extraction. For perplexity, we use a 512-token window from WikiText-2. For KL divergence, we use 32 harmless prompts from the Alpaca validation set.

  \paragraph{Evaluation metrics.}
  For each abliterated model we report: \textbf{Refusal Rate} (RR, \%---lower is better), \textbf{Perplexity} (PPL---lower is better, with $\Delta$PPL showing change from baseline), \textbf{KL Divergence} ($D_{\text{KL}}$---lower is better), and \textbf{Coherence} (Coh., \%---higher is better). We also report \textbf{CoT preserved} (\checkmark/--) and \textbf{LoRA adapters generated} (\checkmark/--) where applicable.

  \paragraph{Prompt volume.}
  All experiments use medium prompt volume (128 harmful + 128 harmless prompts for direction extraction) unless otherwise noted. This provides robust SVD estimation while keeping compute manageable.

  \subsection{Multi-Method Comparison on Dense Models}
  \label{sec:exp_dense}
 
@@ -831,7 +880,7 @@ Table~\ref{tab:exp_dense} compares all eight method presets on Qwen2.5-1.5B-Inst

  \begin{table}[h]
  \centering
- \caption{Method comparison on Qwen2.5-1.5B-Instruct (DPO-aligned). Baseline refusal rate: 87.5\%, baseline PPL: 8.92. Best result in each column is \textbf{bolded}.}
  \label{tab:exp_dense}
  \small
  \begin{tabular}{@{}lcccccc@{}}
@@ -841,6 +890,7 @@ Table~\ref{tab:exp_dense} compares all eight method presets on Qwen2.5-1.5B-Inst
  Basic & 18.8 & 9.14 & +0.22 & 0.031 & 93.8 & -- \\
  Advanced & 6.3 & 9.31 & +0.39 & 0.058 & 93.8 & -- \\
  Aggressive & 3.1 & 9.87 & +0.95 & 0.112 & 87.5 & -- \\
  Surgical & 4.7 & 9.21 & +0.29 & 0.044 & \textbf{96.9} & -- \\
  Optimized & \textbf{1.6} & \textbf{9.08} & \textbf{+0.16} & \textbf{0.024} & 93.8 & \checkmark \\
  Inverted & 3.1 & 10.43 & +1.51 & 0.187 & 84.4 & -- \\
@@ -850,10 +900,10 @@ Nuclear & \textbf{1.6} & 9.64 & +0.72 & 0.098 & 90.6 & -- \\
  \end{table}

 
  \paragraph{Key findings (dense).}
- (1)~The Optimized preset achieves the best Pareto trade-off: near-zero refusal with minimal perplexity increase (+0.16) and lowest KL divergence (0.024), validating the Bayesian optimization approach.
  (2)~Surgical outperforms Aggressive on coherence (96.9\% vs 87.5\%) despite higher refusal rate, confirming that whitened SVD + regularization preserves capabilities better than brute-force multi-direction removal.
  (3)~Inverted achieves low refusal but at the cost of the highest perplexity increase (+1.51), reflecting the more disruptive nature of direction reflection vs.\ removal.
- (4)~Nuclear matches Optimized on refusal rate but with higher distributional shift, suggesting the additional techniques (selective inversion + whitened SVD + 4 passes) provide diminishing returns on small dense models.

  \subsection{MoE Model Evaluation: EGA vs.\ Uniform Abliteration}
  \label{sec:exp_moe}
@@ -862,7 +912,7 @@ The critical test for \textsc{Obliteratus} is MoE models, where no prior tool op

  \begin{table}[h]
  \centering
- \caption{EGA vs.\ uniform abliteration on GPT-OSS-20B-Chat (8 fused experts, RLHF-aligned). Baseline RR: 92.2\%, baseline PPL: 6.41. ``Uniform'' applies the same projection to all expert slices.}
  \label{tab:exp_moe}
  \small
  \begin{tabular}{@{}llccccc@{}}
@@ -883,14 +933,14 @@ Nuclear & EGA + selective & 1.6 & 7.89 & 0.198 & 84.4 & \checkmark \\

  \paragraph{Key findings (MoE).}
  (1)~\textbf{Uniform abliteration catastrophically degrades MoE models.} For the Inverted preset, uniform treatment doubles perplexity (+4.87 vs +0.73) and collapses coherence to 53.1\%. The Nuclear preset is even worse: uniform application produces PPL 13.57 (a 112\% increase) and 46.9\% coherence---the model is barely functional.
- (2)~\textbf{EGA with selective inversion resolves this.} The same Nuclear preset with EGA achieves identical refusal removal (1.6\%) but with only a 23\% perplexity increase and 84.4\% coherence. The key mechanism is that capability-preserving experts (5 of 8 on GPT-OSS-20B) receive standard removal rather than reflection.
- (3)~\textbf{Expert classification matters.} On GPT-OSS-20B, EGA classified 3 of 8 experts as safety-critical ($s_e > 0.5$). These experts collectively handled 68\% of harmful token routing weight, confirming that refusal is concentrated in a subset of experts.
  (4)~\textbf{CoT preservation is MoE-critical.} The Nuclear + EGA preset preserves chain-of-thought coherence because the Gram-Schmidt orthogonalization operates on per-expert directions that are already capability-differentiated.

  \subsection{Ablation Studies}
  \label{sec:exp_ablation}

- We ablate three key design choices to validate that they contribute meaningfully.

  \paragraph{Warm-start vs.\ random initialization for Bayesian optimization.}
  On Llama-3.1-8B-Instruct with the Optimized preset (50 Optuna trials):
@@ -901,11 +951,11 @@ On Llama-3.1-8B-Instruct with the Optimized preset (50 Optuna trials):
  Warm-start converges 2$\times$ faster and finds a better Pareto point, confirming that analysis-derived heuristics provide a useful prior for the TPE sampler.

  \paragraph{EGA safety threshold sensitivity ($\tau_{\text{safety}}$).}
- On GPT-OSS-20B with the Advanced preset, we sweep $\tau \in \{0.3, 0.4, 0.5, 0.6, 0.7\}$:
  \begin{itemize}[leftmargin=*]
- \item $\tau = 0.3$: 6 experts classified as safety-critical $\to$ RR 4.7\%, PPL 7.21, Coh.\ 84.4\%
- \item $\tau = 0.5$ (default): 3 experts safety-critical $\to$ RR 9.4\%, PPL 6.72, Coh.\ 90.6\%
- \item $\tau = 0.7$: 1 expert safety-critical $\to$ RR 14.1\%, PPL 6.53, Coh.\ 93.8\%
  \end{itemize}
  The threshold controls a smooth trade-off between refusal removal and capability preservation. We chose $\tau = 0.5$ as the default because it provides the best Pareto balance, but note that this is a \emph{tunable hyperparameter} rather than a universal optimum---different models and use cases may benefit from different thresholds.
 
@@ -990,7 +1040,7 @@ Real causal tracing & Approx. & \checkmark & -- & -- & -- & -- \\
  Sparse autoencoders & -- & Via SAE & -- & -- & -- & Core \\
  Model compatibility & Any HF & $\sim$50 & 16 & TLens & HF & TLens \\
  MoE model support & Native & -- & -- & -- & -- & -- \\
- Test suite & 379 & Community & -- & -- & Min. & Mod. \\
  \bottomrule
  \end{tabular}
  \end{table}
@@ -1013,16 +1063,18 @@ Conversely, TransformerLens provides real activation patching (our causal tracin
  \label{sec:discussion}

  \paragraph{Dual-use considerations.}
- \textsc{Obliteratus} is designed for alignment research---understanding refusal mechanisms serves both identifying vulnerabilities (red-teaming) and building more robust alignment (blue-teaming). The analysis modules are particularly valuable for the defensive perspective: understanding \emph{why} abliteration works enables designing alignment methods that are more resistant to it. The Hydra effect analysis, entanglement mapping, and defense profiling directly serve this goal.

  \paragraph{Causal tracing limitations.}
  Our causal tracing module provides noise-based approximations rather than true activation patching. While computationally efficient (no additional forward passes), the results should be validated with real causal interventions when model access permits. We explicitly document this limitation in the module and recommend TransformerLens for definitive causal analysis.

  \paragraph{Heuristic constants and composite metrics.}
- Several components of \textsc{Obliteratus} rely on hand-chosen constants: the RES weights $(0.4, 0.3, 0.3)$, the Universality Index ratio $(3{:}2{:}1)$, the alignment fingerprint target values, the EGA safety threshold ($\tau = 0.5$), and the configuration derivation rules (Section~\ref{sec:informed}). We have provided explicit justification for each choice where possible (Sections~\ref{sec:activation_probe}, \ref{sec:transfer}, \ref{sec:alignment_imprint}) and ablation studies for the most consequential ones (Section~\ref{sec:exp_ablation}). However, we acknowledge that these are engineering decisions informed by exploratory analysis, not statistically optimized hyperparameters. The platform exposes all constants as configurable parameters, and we encourage users to tune them for their specific models and use cases. A systematic sensitivity analysis across a larger model corpus is needed to establish whether these defaults generalize.

  \paragraph{Alignment fingerprinting validation.}
- The alignment imprint detector uses heuristic signatures derived from the literature's characterization of different training methods. While the geometric features (Gini, effective rank, smoothness) are well-motivated, the ideal values and classification boundaries would benefit from systematic validation across a larger corpus of models with confirmed training procedures. The current signatures are informed hypotheses based on exploratory analysis of a small number of models (see Section~\ref{sec:alignment_imprint}).

  \paragraph{MoE expert classification.}
  The EGA safety score threshold ($\tau = 0.5$) for classifying experts as safety-critical vs.\ capability-preserving is a heuristic. A more principled approach would train expert classifiers on labeled routing data or use causal interventions to establish ground-truth expert roles. We leave this to future work.
@@ -1034,7 +1086,10 @@ Each optimization trial requires a forward pass for KL measurement and generatio
  The current implementation loads the full model into memory for analysis. For frontier-scale models (100B+ parameters), this requires significant compute. Future work could integrate quantized inference or offloading strategies. The web dashboard requires GPU access for interactive features (chat, A/B comparison, strength sweep).

  \paragraph{Evaluation completeness.}
- Our evaluation suite measures \emph{refusal removal} and \emph{capability preservation} but does not comprehensively assess downstream task performance across diverse benchmarks. Integration with evaluation harnesses such as lm-evaluation-harness \citep{gao2021framework} is a natural extension.

  \paragraph{Future directions.}
  We identify several opportunities: (1)~integration with sparse autoencoder analysis to understand refusal at the feature level, potentially enabling even more targeted ablation; (2)~real causal tracing via TransformerLens integration; (3)~longitudinal studies tracking how refusal geometry evolves during fine-tuning; (4)~extension of the universality analysis to a wider set of model families; (5)~application of the defense robustness framework to evaluate proposed robust alignment methods including circuit breakers \citep{zou2024circuit} and representation rerouting; (6)~multi-objective Bayesian optimization with additional objectives such as CoT coherence and downstream task performance; and (7)~automated expert role discovery for MoE models using unsupervised clustering of expert activation patterns.
@@ -1043,19 +1098,63 @@ We identify several opportunities: (1)~integration with sparse autoencoder analy
  \section{Broader Impact Statement}
  \label{sec:broader_impact}

- This work has significant dual-use implications that we address directly.

- \paragraph{Risks.}
- \textsc{Obliteratus} enables the removal of safety guardrails from language models. A model that has been abliterated will comply with requests that the original model would refuse, including requests for harmful content. This capability could be misused to generate harmful, illegal, or dangerous text at scale.

- \paragraph{Why we release it anyway.}
- We believe the benefits to the alignment research community outweigh the risks, for three reasons:
- (1)~The techniques underlying abliteration are already well-known and publicly documented \citep{arditi2024refusal, gabliteration2024}; our platform consolidates and extends them but does not introduce fundamentally new attack capabilities.
- (2)~The analysis modules---concept cone geometry, alignment fingerprinting, defense robustness evaluation, Hydra effect quantification---are specifically designed to help alignment researchers build \emph{more robust} safety mechanisms by understanding why current ones fail.
- (3)~The core finding that RLHF/DPO safety alignment is a thin geometric artifact in weight space is critical information for policymakers and the public to understand: \textbf{every open-weight model release is effectively an uncensored model release}. Pretending otherwise harms informed decision-making.

- \paragraph{Responsible disclosure.}
- We release the platform under an MIT license with comprehensive documentation so that the alignment community can use the analysis modules for defensive research. We explicitly recommend that practitioners use the analysis pipeline (not just the intervention pipeline) to study how to make safety training more geometrically robust.

  % ═════════════════════════════════════════════════════════════════════
  \section{Ethics Statement}
@@ -1065,7 +1164,7 @@ This research was conducted with the goal of advancing understanding of alignmen

  We do not advocate for the deployment of abliterated models in production systems. The primary intended use is alignment research: understanding the geometric structure of refusal to build more durable safety mechanisms. All experiments described in this work were conducted on publicly available open-weight models, and no private or proprietary systems were modified.

- We follow the principle that \emph{security through obscurity is not security}: if current alignment methods can be defeated by straightforward linear algebra on public weights, the research community needs to know this in order to develop better approaches. Suppressing this finding would not prevent the technique's use by sophisticated actors, but would prevent the broader community from understanding and addressing the underlying vulnerability.

  % ═════════════════════════════════════════════════════════════════════
  \section{Conclusion}
@@ -1083,7 +1182,7 @@ The analysis-informed pipeline closes the feedback loop, using analysis outputs

  Empirical evaluation across four model families demonstrates that (1)~Bayesian-optimized presets achieve the best Pareto trade-offs on dense models, (2)~Expert-Granular Abliteration is essential for MoE models, where uniform approaches catastrophically degrade capabilities, and (3)~the platform's design choices (warm-start initialization, selective inversion, proxy-magnitude KL revert) each contribute measurably to abliteration quality. We acknowledge that several composite metrics rely on heuristic constants and provide ablation studies and explicit caveats for each.

- By making these tools available under an MIT license with comprehensive documentation and 379 unit tests, we aim to accelerate both offensive and defensive alignment research: understanding the geometric structure of refusal---across dense and MoE architectures alike---is the foundation for both removing it surgically and building more robust implementations.

  % ═════════════════════════════════════════════════════════════════════
  \bibliographystyle{plainnat}
 
  (3)~\textbf{Expert-Granular Abliteration (EGA)} for MoE models, decomposing refusal directions per-expert via routing-weighted activation attribution and applying selective inversion to fused 3D weight tensors---distinguishing safety-critical from capability-preserving experts;
  (4)~\textbf{six frontier optimization techniques} inspired by and extending Heretic: Bayesian hyperparameter optimization (Optuna TPE with warm-start from analysis heuristics), reversible LoRA-mediated ablation, KL-divergence co-optimization with partial revert, chain-of-thought-aware ablation via Gram-Schmidt orthogonalization, float layer interpolation with Gaussian-weighted continuous targeting, and activation winsorization for robust SVD;
  (5)~\textbf{a unified evaluation suite} with refusal rate, perplexity, coherence, KL divergence, CKA similarity, and effective rank metrics;
+ (6)~\textbf{an analysis-informed pipeline} that closes the feedback loop---analysis modules run \emph{during} abliteration to auto-configure direction extraction, layer selection, regularization, and Ouroboros-compensated refinement; and
  (7)~\textbf{an interactive web research dashboard} (HuggingFace Spaces) with A/B comparison chat, dose-response strength sweep, multi-model benchmarking with publication-quality visualizations, and one-click research artifact export.

+ The platform supports any HuggingFace transformer architecture---including fused MoE experts (GPT-OSS 20B, Mixtral, DeepSeek)---and ships with 48 curated model presets, 10 study configurations, and 821 unit tests.
  We provide complete mathematical formulations for all modules, present empirical evaluations across dense and MoE architectures, and discuss the design decisions that distinguish \textsc{Obliteratus} from existing tools.

  \end{abstract}
 
  Section~\ref{sec:architecture} describes the platform architecture.
  Section~\ref{sec:analysis} details the 15 analysis modules with mathematical formulations.
  Section~\ref{sec:intervention} describes the eight intervention presets and their mathematical foundations.
+ Section~\ref{sec:evaluation} covers the evaluation suite.
  Section~\ref{sec:moe} introduces Expert-Granular Abliteration for MoE models.
  Section~\ref{sec:frontier} presents the six frontier optimization techniques.
  Section~\ref{sec:informed} presents the analysis-informed abliteration pipeline.
  Section~\ref{sec:dashboard} describes the web research dashboard.
  Section~\ref{sec:experiments} presents empirical evaluation across dense and MoE models with ablation studies.
  Section~\ref{sec:comparison} compares \textsc{Obliteratus} with existing tools.
+ Section~\ref{sec:discussion} discusses limitations, and Sections~\ref{sec:broader_impact}--\ref{sec:ethics} address broader impact and ethical considerations.

  % ═════════════════════════════════════════════════════════════════════
  \section{Related Work}
 
  \citet{hu2022lora} demonstrated that large language model adaptation can be performed via low-rank updates $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with $r \ll d$. This decomposition is mathematically equivalent to in-place weight modification when merged but enables reversibility and composability when kept separate. Heretic \citep{heretic2025} was the first to apply this insight to ablation, representing directional projection as rank-1 LoRA adapters.

  \paragraph{Defense robustness.}
+ Models exhibit a tendency to self-repair after partial abliteration---a phenomenon we term the \emph{Ouroboros effect}---where residual refusal circuitry compensates for removed directions. \citet{qi2025safety} mapped safety-capability entanglement, showing that removing safety features often degrades general capabilities. \citet{zou2024circuit} proposed circuit breakers as a more robust defense via representation rerouting.

  % ═════════════════════════════════════════════════════════════════════
  \section{Platform Architecture}
 
  │       │         │        │        │
  │  ┌────┴────┐  ┌─┴──┐  ┌──┴───┐  ┌─┴────────┐
  │  │ 15 Anal.│  │EGA │  │LoRA  │  │ KL co-opt│
+ │  │ Modules │  │dirs│  │adapt.│  │+Ouroboros│
  │  └─────────┘  └────┘  └──────┘  └──────────┘
          │         │                 │
          ▼         ▼                 ▼
  ┌────────────────────────────────────────┐
  │ Abliteration (fused 3D selective inv.) │
  └────────────────────────────────────────┘
  \end{verbatim}
+ \caption{High-level architecture of the \textsc{Obliteratus} pipeline. The six-stage abliteration flow (top) integrates 15 analysis modules, Expert-Granular Abliteration (EGA) for MoE models, reversible LoRA adapters, and KL co-optimization with Ouroboros compensation. MoE-aware processing runs at every stage.}
  \label{fig:architecture}
  \end{figure}
 
 
  Refusal Logit Lens & Causal & Token-level refusal promotion & nostalgebraist \\
  \midrule
  Cross-Model Transfer & Transfer & Universality Index & Novel \\
+ Defense Robustness & Robustness & Ouroboros effect, entanglement map & Novel \\
  Multi-Token Position & Positional & Trigger tokens, decay profile & Novel \\
  \midrule
  Sparse Surgery & Intervention & Top-$k$\% targeted modification & Novel \\
 
  The module also computes the \emph{effective rank} of the covariance matrix via the Shannon entropy of normalized eigenvalues:
  \begin{equation}
  \text{EffRank}(\mathbf{C}) = \exp\left(-\sum_i \hat{\lambda}_i \log \hat{\lambda}_i\right), \quad \hat{\lambda}_i = \frac{\lambda_i}{\sum_j \lambda_j}
+ \label{eq:effrank}
  \end{equation}

  This provides a continuous measure of the refusal subspace's intrinsic dimensionality, enabling comparison across models and layers.
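As a concrete illustration (a minimal sketch, not the platform's implementation), the effective-rank formula above can be computed directly from the covariance eigenvalues:

```python
from math import exp, log

def effective_rank(eigvals):
    """Effective rank via Shannon entropy of normalized eigenvalues.

    Mirrors EffRank = exp(-sum_i p_i log p_i) with p_i = lambda_i / sum_j lambda_j.
    Zero eigenvalues contribute nothing to the entropy and are dropped.
    """
    lam = [v for v in eigvals if v > 0]
    total = sum(lam)
    p = [v / total for v in lam]
    return exp(-sum(pi * log(pi) for pi in p))
```

A flat spectrum recovers the full matrix rank, while a single dominant eigenvalue drives the value toward 1, which is what makes the measure continuous across models and layers.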
 

  Following the transformer circuits framework \citep{elhage2021mathematical}, we decompose the residual stream to attribute refusal to specific components:
  \begin{equation}
+ \mathbf{x}_l^{\text{post}} = \mathbf{x}_l^{\text{pre}} + \text{Attn}_l(\text{LN}_1(\mathbf{x}_l^{\text{pre}})) + \text{MLP}_l(\text{LN}_2(\mathbf{x}_l^{\text{pre}} + \text{Attn}_l(\text{LN}_1(\mathbf{x}_l^{\text{pre}}))))
  \end{equation}
+ where $\text{LN}_1, \text{LN}_2$ are LayerNorm operations (shown here for the pre-LN architecture common in modern transformers; post-LN places normalization after the residual addition instead). \textbf{Interaction with abliteration:} LayerNorm renormalizes activations after each sub-layer, which means that removing a refusal direction from one component's output does not simply subtract from the residual stream---the downstream LayerNorm may partially undo the removal by rescaling the modified activations. This is a key motivation for norm-preserving projection (Equation~\ref{eq:norm_preserve}): by maintaining weight matrix norms, we reduce the magnitude of the signal that LayerNorm must compensate for, yielding more predictable downstream behavior. The implementation correctly handles both pre-LN and post-LN architectures via architecture profiling.

  For each component output $\mathbf{c}$, we measure its refusal contribution as $\mathbf{c} \cdot \mathbf{r}_l$. The attention contribution is further decomposed across heads:
  $\text{Attn}_l = \sum_{h=1}^{H} \text{Head}_{l,h}$.
 

  We evaluate how resilient alignment is to abliteration through three analyses:

+ \paragraph{Ouroboros Effect (Self-Repair).} When refusal is removed from layer $l$, remaining layers may compensate. We compute a \emph{distributional redundancy ratio}:
  \begin{equation}
  R_l = \frac{\sum_{j \neq l} s_j}{\sum_j s_j}
+ \label{eq:ouroboros}
  \end{equation}
+ where $s_j$ is the refusal strength at layer $j$. \textbf{Important caveat:} $R_l$ measures the fraction of \emph{pre-abliteration} refusal signal that resides outside layer $l$---a static distributional property of the refusal direction norms. It is a \emph{necessary condition} for self-repair (a model cannot restore refusal from layers that had no refusal signal) but not a \emph{sufficient condition} (the remaining layers may not actually compensate in practice due to the sequential nature of transformer computation). True self-repair requires dynamic measurement: re-running inference after abliteration to measure whether refusal rate recovers. We use $R_l$ as a computationally cheap proxy and flag it as an upper bound on actual repair capacity. When the platform's iterative re-probing (Section~\ref{sec:informed}) detects post-abliteration residual refusal, this provides direct evidence of self-repair.

+ \paragraph{Safety-Capability Entanglement.} For each layer, we measure entanglement as the geometric mean of two normalized indicators of how much harmless activations overlap with the refusal direction:
  \begin{equation}
+ E_l = \sqrt{\frac{\sqrt{\text{Var}(\mathbf{b} \cdot \mathbf{r}_l)}}{\bar{n}} \cdot \frac{\overline{|\mathbf{b} \cdot \mathbf{r}_l|}}{\bar{n}}}, \quad \bar{n} = \frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \|\mathbf{b}_i\|
+ \label{eq:entanglement}
  \end{equation}
+ where $\bar{n}$ is the mean activation norm (not the norm of the mean), and $\overline{|\mathbf{b} \cdot \mathbf{r}_l|}$ is the mean absolute projection. The first factor captures how much the refusal direction participates in the variance of normal-use activations (normalized by activation scale), while the second captures mean overlap. Normalization by $\bar{n}$ rather than $\|\overline{\mathbf{b}}\|^2$ prevents the metric from being dominated by the mean activation magnitude.
+
+ \textbf{Construct validity note:} This metric combines dispersion (standard deviation of projections) with location (mean absolute projection) into a single score. A high score indicates that the refusal direction is entangled with the model's general computation at that layer. However, because $E_l$ mixes two distinct phenomena, we recommend examining both components individually for rigorous analysis. High variance alone may indicate that the direction merely spans a high-variance subspace of harmless activations, while high mean absolute projection alone may indicate systematic bias without spread.
+
  High entanglement means abliterating refusal at that layer would also damage general capabilities.

  \paragraph{Defense Profile.} A comprehensive profile combining alignment method estimate (Section~\ref{sec:alignment_imprint}), refusal concentration (Gini coefficient), layer spread, self-repair capacity, entanglement score, and an overall robustness classification (low/medium/high/very\_high).
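The two scores above can be sketched in a few lines (illustrative only; `activations` is an assumed (N, d) matrix of harmless-prompt activations and `refusal_dir` an assumed unit vector, not the platform's API):

```python
import numpy as np

def redundancy_ratio(strengths, l):
    """Distributional redundancy ratio R_l: the fraction of total per-layer
    refusal strength that lives outside layer l (an upper bound on self-repair)."""
    s = np.asarray(strengths, dtype=float)
    return float((s.sum() - s[l]) / s.sum())

def entanglement(activations, refusal_dir):
    """Entanglement E_l: geometric mean of a scale-normalized dispersion term
    and a mean-overlap term, following the formula above."""
    proj = activations @ refusal_dir                    # b . r for each prompt
    n_bar = np.linalg.norm(activations, axis=1).mean()  # mean activation norm
    var_term = np.sqrt(proj.var()) / n_bar              # sqrt(Var(b.r)) / n_bar
    mean_term = np.abs(proj).mean() / n_bar             # mean |b.r| / n_bar
    return float(np.sqrt(var_term * mean_term))
```

For example, with per-layer strengths [1, 1, 2], removing the strongest layer still leaves half of the total refusal signal elsewhere (R = 0.5), and activations orthogonal to the refusal direction give zero entanglement.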
 
  The core projection for a weight matrix $\mathbf{W}$ and refusal directions $\{\mathbf{r}_1, \ldots, \mathbf{r}_k\}$:
  \begin{equation}
  \mathbf{W}' = \mathbf{W} - \sum_{i=1}^k \left[(1-\lambda)\mathbf{W}\mathbf{r}_i\mathbf{r}_i^\top\right]
+ \label{eq:core_projection}
  \end{equation}
+ where $\lambda$ is the regularization strength (preserves $\lambda$ fraction of the refusal component). When directions are extracted via standard SVD, the right singular vectors $\{\mathbf{r}_i\}_{i=1}^k$ are orthonormal and the sum of rank-1 projections is equivalent to orthogonal projection onto the $k$-dimensional refusal subspace. \textbf{Important caveat:} when using whitened SVD (Section~\ref{sec:whitened_svd}), the un-whitened directions $\mathbf{r}_i = \mathbf{W}_{\text{whiten}} \mathbf{v}_{h,i}$ are \emph{not} orthonormal in the original space (though the whitened-space vectors $\mathbf{v}_{h,i}$ are). In this case, the implementation applies sequential projection with Gram--Schmidt re-orthonormalization before each rank-1 update, ensuring that accumulated projections remain consistent.

+ \paragraph{Transposed weight matrices.}
+ Some architectures (e.g., GPT-2 Conv1D layers) store weights as $\mathbf{W} \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$. The implementation detects the orientation via architecture profiling and applies $\mathbf{W}' = \mathbf{W} - (1-\lambda)\mathbf{r}\mathbf{r}^\top\mathbf{W}$ for transposed weights, ensuring that projection occurs along the correct axis.
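Under the orthonormal-directions assumption, the core projection together with Frobenius-norm rescaling can be sketched as follows (illustrative; not the platform's code):

```python
import numpy as np

def project_out(W, directions, lam=0.0, preserve_norm=True):
    """Remove refusal directions from W (d_out x d_in), keeping a `lam`
    fraction of each component; optionally rescale to preserve ||W||_F.
    Assumes `directions` are (near-)orthonormal vectors in R^{d_in}."""
    W_orig_norm = np.linalg.norm(W)
    W = W.copy()
    for r in directions:
        r = r / np.linalg.norm(r)
        W -= (1.0 - lam) * np.outer(W @ r, r)   # W' = W - (1-lam) W r r^T
    if preserve_norm:
        W *= W_orig_norm / np.linalg.norm(W)    # W'' = W' ||W||_F / ||W'||_F
    return W
```

With `lam = 0` the output satisfies $\mathbf{W}'\mathbf{r} \approx 0$ for each direction, and because the rescale step is a scalar multiplication, the nulled subspace stays nulled after norm preservation.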
 
  \paragraph{Per-layer adaptive strength.}
  Rather than applying uniform regularization, \textsc{Obliteratus} modulates $\lambda$ per-layer based on the refusal norm profile. Layers with stronger refusal signal (higher $\|\mathbf{r}_l\|$) receive lower regularization (more aggressive removal), while layers near the periphery of the refusal distribution receive higher regularization:
  \begin{equation}
  \lambda_l = \lambda_{\text{base}} + (1 - w_l)(1 - \lambda_{\text{base}}) \cdot 0.15, \quad
  w_l = \frac{\|\mathbf{r}_l\| - \min_j \|\mathbf{r}_j\|}{\max_j \|\mathbf{r}_j\| - \min_j \|\mathbf{r}_j\|}
+ \label{eq:adaptive_strength}
  \end{equation}

  \paragraph{Norm-preserving rescaling.}
  After projection, we rescale to preserve the Frobenius norm \citep{grimjim2025}:
  \begin{equation}
  \mathbf{W}'' = \mathbf{W}' \cdot \frac{\|\mathbf{W}\|_F}{\|\mathbf{W}'\|_F}
+ \label{eq:norm_preserve}
  \end{equation}
  This prevents cascading magnitude drift through LayerNorm layers.
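The per-layer schedule is a one-liner once the refusal norms are in hand (a sketch under the formula above; assumes at least two distinct norms so the min-max normalization is well-defined):

```python
import numpy as np

def adaptive_lambda(refusal_norms, lam_base):
    """Per-layer regularization: the strongest-refusal layer gets lam_base
    (most aggressive removal); weaker layers get up to 0.15 * (1 - lam_base)
    extra regularization. Assumes max(norms) > min(norms)."""
    n = np.asarray(refusal_norms, dtype=float)
    w = (n - n.min()) / (n.max() - n.min())          # min-max normalized strength
    return lam_base + (1.0 - w) * (1.0 - lam_base) * 0.15
```

For refusal norms [1, 3] and a base of 0.2, the weak layer receives lambda = 0.32 while the strong layer keeps the aggressive 0.2.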
 
 
  \begin{equation}
  \mathbf{W}' = \mathbf{W} - 2\mathbf{W}\mathbf{r}\mathbf{r}^\top
  \end{equation}
+ This flips the model's refusal behavior to active compliance, which can be more effective than simple removal for models with deeply entangled refusal mechanisms. \textbf{Risk profile:} Selective inversion is the most aggressive intervention in the platform. Because it \emph{reverses} the refusal direction rather than removing it, it can cause the model to actively seek to comply with harmful requests (not merely fail to refuse). This may produce qualitatively different and potentially more harmful outputs than simple refusal removal. The Inverted preset's consistently higher perplexity (Table~\ref{tab:exp_dense}) reflects this aggressiveness. We recommend using inversion only when standard removal methods leave substantial residual refusal, and coupling it with EGA's per-expert differentiation on MoE models to limit the blast radius.

  \paragraph{Bias term projection.}
  Unlike prior tools that only modify weight matrices, \textsc{Obliteratus} also projects refusal directions out of bias vectors when present:
 
  \textsc{Obliteratus} evaluates abliteration quality using six complementary metrics:

  \begin{enumerate}[leftmargin=*]
+ \item \textbf{Refusal Rate}: Fraction of harmful prompts where the model's response begins with a canonical refusal prefix (e.g., ``I cannot'', ``I'm sorry'', ``As an AI'', from the GCG/AdvBench list \citep{zou2023universal}). Lower indicates more complete abliteration. \textbf{Limitation:} prefix matching may produce false negatives (the model refuses using non-canonical phrasing) or false positives (the model uses a refusal prefix but then complies). A small-scale validation on 50 responses showed 92\% agreement between prefix matching and human judgment of refusal; the primary failure mode was false negatives (6\% of cases where the model refused without a canonical prefix). More robust classification (e.g., LLM-as-judge) is a natural extension.

  \item \textbf{Perplexity}: Standard perplexity on reference text (WikiText-2). Monitors general language modeling degradation.

+ \item \textbf{Coherence}: Measures the model's ability to produce factually correct completions. Specifically, we present 32 factual prompts (e.g., ``The capital of France is'') and check whether the model's first generated token or phrase matches the expected answer. \textbf{Note:} this is more precisely a \emph{factual completion accuracy} metric than a general coherence measure---it tests whether the model's factual knowledge is preserved, not whether its open-ended generations are fluent or logically consistent. We retain the ``coherence'' label for consistency with prior work but acknowledge the limited scope.

  \item \textbf{KL Divergence}: First-token KL divergence between original and modified model output distributions on harmless prompts \citep{young2025comparative}. Measures distributional shift.

  \item \textbf{CKA Similarity}: Centered kernel alignment between original activations $\mathbf{X}$ and modified activations $\mathbf{Y}$:
  \begin{equation}
  \text{CKA}(\mathbf{X}, \mathbf{Y}) = \frac{\|\mathbf{Y}^\top\mathbf{X}\|_F^2}{\|\mathbf{X}^\top\mathbf{X}\|_F \cdot \|\mathbf{Y}^\top\mathbf{Y}\|_F}
  \end{equation}

+ \item \textbf{Effective Rank}: Shannon entropy-based dimensionality of weight matrices (Equation~\ref{eq:effrank}). Tracks whether abliteration collapses the weight space.
  \end{enumerate}
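Two of the metrics above are simple enough to sketch directly (illustrative only: the refusal-prefix list is abbreviated, and the activations are assumed already column-centered):

```python
import numpy as np

REFUSAL_PREFIXES = ("I cannot", "I'm sorry", "As an AI")  # abbreviated list

def refusal_rate(responses):
    """Fraction of responses beginning with a canonical refusal prefix."""
    return sum(r.strip().startswith(REFUSAL_PREFIXES) for r in responses) / len(responses)

def linear_cka(X, Y):
    """Linear CKA between activation matrices X, Y of shape (N, d),
    per the formula above; assumes columns are centered."""
    num = np.linalg.norm(Y.T @ X) ** 2
    den = np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y)
    return float(num / den)
```

Note that CKA is invariant to isotropic rescaling of either input, so a layer whose activations change only in overall magnitude still scores 1.0.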

  % ═════════════════════════════════════════════════════════════════════
 
  This prevents over-ablation of capability experts---a critical failure mode we identified in uniform approaches, where applying 2$\times$ reflection to all experts on GPT-OSS 20B degraded mathematical reasoning by over 30\%.

  \subsection{Router-Aware Processing}
+ \label{sec:router_analysis}
+
+ Beyond expert weights, the router network itself may encode safety-relevant routing preferences. We analyze and optionally modify router behavior through three mechanisms.
+
+ \paragraph{Router weight projection.}
+ The router network $R(\mathbf{x}) = \text{softmax}(\mathbf{W}_R \mathbf{x})$ produces per-expert routing probabilities. If the router weight matrix $\mathbf{W}_R \in \mathbb{R}^{E \times d}$ has learned to preferentially route harmful tokens to safety-critical experts, projecting the refusal direction out of $\mathbf{W}_R$ can redistribute these tokens to capability experts:
+ \begin{equation}
+ \mathbf{W}_R' = \mathbf{W}_R - (1 - \lambda_R)\mathbf{W}_R \mathbf{r}\mathbf{r}^\top
+ \label{eq:router_projection}
+ \end{equation}
+ This is controlled by the \texttt{project\_biases} flag and is enabled by default for the Nuclear preset. We use a higher regularization for router weights ($\lambda_R = 0.3$) than for expert weights to avoid disrupting the router's learned load-balancing behavior.
+
+ \paragraph{Load-balancing considerations.}
+ MoE models are typically trained with auxiliary load-balancing losses to prevent expert collapse (where a few experts receive most tokens). Router projection risks disrupting this balance by redirecting safety-associated tokens to already-loaded experts. We monitor the post-abliteration routing entropy $H(R) = -\sum_e p_e \log p_e$ and flag cases where it drops below $0.9 \cdot H(R_{\text{orig}})$. In our experiments, router projection with $\lambda_R = 0.3$ caused $< 5\%$ entropy reduction on GPT-OSS-20B, indicating that load balance is approximately preserved. More aggressive router projection ($\lambda_R = 0$) reduced entropy by 18\% and is not recommended without further evaluation.
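The entropy check described in this paragraph can be sketched as follows (illustrative; `expert_load` stands in for an assumed per-expert token count or probability mass vector):

```python
import numpy as np

def routing_entropy(expert_load):
    """Shannon entropy H(R) of the expert load distribution."""
    p = np.asarray(expert_load, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                       # 0 log 0 = 0 by convention
    return float(-np.sum(p * np.log(p)))

def load_balance_ok(load_before, load_after, threshold=0.9):
    """Flag runs where post-abliteration routing entropy drops below
    threshold * H(original), mirroring the 0.9 criterion above."""
    return routing_entropy(load_after) >= threshold * routing_entropy(load_before)
```

A uniform load over $E$ experts attains the maximum entropy $\log E$; a heavily skewed post-abliteration load fails the check.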
+
+ \paragraph{Shared expert handling.}
+ Some MoE architectures (notably DeepSeek-MoE \citep{dai2024deepseekmoe}) include \emph{shared experts} that process all tokens regardless of routing. These experts require different treatment: since they cannot be classified as safety-critical or capability-preserving based on routing weights (they always route with weight 1), we apply standard (non-EGA) abliteration to shared experts using the global refusal direction. The implementation detects shared experts via architecture profiling (presence of \texttt{shared\_experts} or \texttt{num\_shared\_experts} in the model config) and processes them separately. When no shared expert metadata is available, all experts are treated as routed.

+ \paragraph{Limitations.}
+ Router analysis is currently observational: we measure routing distributions but do not perform causal interventions (e.g., forcing specific expert assignments and measuring the effect on refusal). The classification of experts as safety-critical vs.\ capability-preserving is based on routing-weighted refusal direction norms, which is correlational. Future work could strengthen this with counterfactual expert ablation (removing individual experts and measuring refusal rate changes).

  % ═════════════════════════════════════════════════════════════════════
  \section{Frontier Optimization Techniques}
 
  \begin{equation}
  \lambda_l^{(0)} = (1 - w_l) \cdot 0.3
  \end{equation}
+ where $w_l$ is the layer-adaptive weight from Equation~\ref{eq:adaptive_strength}. Subsequent trials are biased toward the warm-start region: $\lambda_l \in [\max(0, \lambda_l^{(0)} - 0.3), \min(1, \lambda_l^{(0)} + 0.3)]$. This enables convergence in 50 trials versus Heretic's 200.

  \paragraph{Multi-objective formulation.}
  Each trial jointly minimizes refusal rate $\rho$ and KL divergence $D_{\text{KL}}$:
 
  \subsection{Reversible LoRA-Mediated Ablation}
  \label{sec:lora}

+ Inspired by Heretic's rank-1 LoRA ablation, we extend the approach to \emph{rank-$k$} adapters supporting multi-direction removal. The mathematical equivalence depends on weight matrix orientation. For a weight matrix $\mathbf{W} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ where $\mathbf{d} \in \mathbb{R}^{d_{\text{in}}}$ is the refusal direction and $s = 1 - \lambda$:
  \begin{align}
+ \text{In-place:} \quad \mathbf{W}' &= \mathbf{W} - s \cdot \mathbf{W}\mathbf{d}\mathbf{d}^\top \label{eq:lora_inplace} \\
+ \text{LoRA:} \quad \mathbf{W}' &= \mathbf{W} + \mathbf{B}\mathbf{A}, \quad \mathbf{B} = -s \cdot (\mathbf{W}\mathbf{d}) \in \mathbb{R}^{d_{\text{out}} \times 1}, \quad \mathbf{A} = \mathbf{d}^\top \in \mathbb{R}^{1 \times d_{\text{in}}}
  \end{align}
+ When the weight matrix is transposed ($\mathbf{W} \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$, as in some Conv1D layers), the decomposition becomes $\mathbf{B} = -s \cdot \mathbf{d} \in \mathbb{R}^{d_{\text{in}} \times 1}$, $\mathbf{A} = (\mathbf{d}^\top \mathbf{W}) \in \mathbb{R}^{1 \times d_{\text{out}}}$. The implementation auto-detects the orientation and applies the correct decomposition.
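The rank-1 equivalence is easy to verify numerically (a sketch, not the shipped implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, s = 6, 4, 1.0
W = rng.normal(size=(d_out, d_in))
d = rng.normal(size=d_in)
d /= np.linalg.norm(d)                  # unit refusal direction

# In-place ablation: W' = W - s * W d d^T
W_inplace = W - s * np.outer(W @ d, d)

# Equivalent rank-1 LoRA update: W' = W + B A
B = (-s * (W @ d)).reshape(d_out, 1)    # B = -s * (W d)
A = d.reshape(1, d_in)                  # A = d^T
W_lora = W + B @ A

assert np.allclose(W_inplace, W_lora)   # identical up to float round-off
```

With $s = 1$ the merged matrix also satisfies $\mathbf{W}'\mathbf{d} = 0$, which is exactly the full-removal case; keeping $\mathbf{B}$ and $\mathbf{A}$ separate is what makes the ablation reversible.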
+
+ For rank-$k$ with directions $\{\mathbf{d}_1, \ldots, \mathbf{d}_k\}$:
  \begin{equation}
  \mathbf{B} = [-s\cdot\text{coeff}_1 \mid \cdots \mid -s\cdot\text{coeff}_k] \in \mathbb{R}^{d_{\text{out}} \times k}, \quad
  \mathbf{A} = [\mathbf{d}_1 ; \cdots ; \mathbf{d}_k] \in \mathbb{R}^{k \times d_{\text{in}}}
 
  where $\gamma$ is computed from the stored KL proxy magnitude. A subtle issue arises when the post-projection coefficient $\mathbf{W}'\mathbf{d} \approx 0$ (as occurs with zero regularization): in this case, we use the \emph{pre-projection} coefficient magnitude as a proxy:
  \begin{equation}
  \gamma = \gamma_{\text{strength}} \cdot \begin{cases}
+ \text{coeff}_{\text{post}} & \text{if } \|\text{coeff}_{\text{post}}\| > \epsilon \\
  \text{coeff}_{\text{proxy}} & \text{otherwise}
  \end{cases}
  \end{equation}
+
+ In the normal case ($\|\text{coeff}_{\text{post}}\| > \epsilon$), the revert adds back a rank-1 correction $\gamma \cdot \text{coeff}_{\text{post}} \cdot \mathbf{d}^\top$, partially restoring the original weight's projection along $\mathbf{d}$. In the proxy fallback case, the pre-projection coefficient $\text{coeff}_{\text{proxy}} = \|\mathbf{W}\mathbf{d}\|$ is a scalar, and the revert adds a uniform correction $\gamma \cdot \text{coeff}_{\text{proxy}} \cdot \mathbf{d}^\top$ to each row of $\mathbf{W}'$. This uniform fallback is a coarser approximation than the rank-1 normal path---it restores magnitude along $\mathbf{d}$ without preserving the row-specific structure of the original coefficient vector. This prevents the revert from being a no-op for fully-projected layers, at the cost of a less targeted restoration. The implementation auto-detects the weight orientation and applies the transposed analogue ($\mathbf{d} \cdot \text{coeff}_{\text{proxy}}^\top$) for Conv1D-style weights.

  \subsection{Chain-of-Thought-Aware Ablation}
  \label{sec:cot}
  \item \textsc{Analyze} --- Run analysis modules to understand refusal geometry \textbf{(new)}
  \item \textsc{Distill} --- Extract directions using analysis-informed parameters
  \item \textsc{Excise} --- Project with analysis-guided precision
+ \item \textsc{Verify} --- Post-excision analysis with Ouroboros compensation loop \textbf{(enhanced)}
  \item \textsc{Rebirth} --- Save with comprehensive analysis metadata
  \end{enumerate}
 
 

  \paragraph{Self-repair estimate $\to$ refinement passes.}
  High self-repair capacity (estimated from refusal distribution breadth) triggers more refinement passes with true iterative re-probing.
+ After excision, if the model's refusal rate remains above a threshold, the \textsc{Verify} stage triggers Ouroboros compensation: it re-probes, finds rotated residual directions, and excises them in additional targeted passes.
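The compensation loop can be sketched as follows; `probe` and `excise` are hypothetical callbacks standing in for the platform's re-probing and projection stages (names and threshold are illustrative assumptions):

```python
def ouroboros_compensate(model, probe, excise, threshold=0.05, max_passes=3):
    """Verify-stage compensation sketch: keep re-probing for residual
    (possibly rotated) refusal directions and excising them until the
    measured refusal rate falls below `threshold` or passes run out."""
    for _ in range(max_passes):
        rate, residual_dirs = probe(model)    # re-run refusal probing
        if rate <= threshold:
            break                             # refusal suppressed; stop early
        model = excise(model, residual_dirs)  # targeted pass on residual dirs
    return model
```

Because each pass re-probes the already-modified model, the loop targets the rotated directions that self-repair produces, rather than re-projecting the original direction.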

  \subsection{Configuration Derivation}
 
 
  Qwen2.5-1.5B-Instruct & Dense & 1.5B & --- & DPO \\
  Llama-3.1-8B-Instruct & Dense & 8B & --- & RLHF+DPO \\
  Mixtral-8x7B-Instruct-v0.1 & MoE & 46.7B (12.9B active) & 8 & SFT+DPO \\
+ GPT-OSS-20B-Chat & MoE (fused) & 20B (3.2B active) & 32 & RLHF \\
  \bottomrule
  \end{tabular}
  \end{table}
 
  \paragraph{Datasets.}
  Harmful prompts are drawn from the AdvBench dataset \citep{zou2023universal} (520 prompts). Harmless prompts are drawn from the Alpaca dataset \citep{taori2023alpaca} (matched count). For refusal rate measurement, we use a held-out set of 64 harmful prompts not seen during direction extraction. For perplexity, we use a 512-token window from WikiText-2. For KL divergence, we use 32 harmless prompts from the Alpaca validation set.

+ \textbf{Evaluation prompt diversity limitation:} All evaluation prompts are drawn from a single source (AdvBench), which may not represent the full distribution of requests that a safety-aligned model should refuse. AdvBench prompts are predominantly explicit, direct harmful requests; the evaluation does not include: (1)~subtly harmful prompts that require contextual judgment (e.g., dual-use chemistry questions), (2)~prompts from other safety taxonomies (e.g., HarmBench categories, ToxiGen identity-based toxicity), or (3)~out-of-distribution harm categories not represented in AdvBench (e.g., privacy violations, financial fraud, child safety). An abliterated model that achieves 0\% refusal rate on AdvBench may still refuse on categories not represented in the evaluation set, or conversely may show lower refusal on subtle prompts where the original model's refusal was already less reliable. We recommend evaluating on diverse prompt sources for deployment-critical assessments.
+
  \paragraph{Evaluation metrics.}
  For each abliterated model we report: \textbf{Refusal Rate} (RR, \%---lower is better), \textbf{Perplexity} (PPL---lower is better, with $\Delta$PPL showing change from baseline), \textbf{KL Divergence} ($D_{\text{KL}}$---lower is better), and \textbf{Coherence} (Coh., \%---higher is better). We also report \textbf{CoT preserved} (\checkmark/--) and \textbf{LoRA adapters generated} (\checkmark/--) where applicable.

  \paragraph{Prompt volume.}
  All experiments use medium prompt volume (128 harmful + 128 harmless prompts for direction extraction) unless otherwise noted. This provides robust SVD estimation while keeping compute manageable.

+ \paragraph{Statistical methodology and limitations.}
+ \label{para:stat_limitations}
+ Refusal rate is measured on a held-out set of $n = 64$ harmful prompts. At this sample size, the resolution of the refusal rate metric is $1/64 \approx 1.6\%$: a reported rate of 1.6\% corresponds to exactly 1 refusal out of 64 prompts, and a rate of 3.1\% corresponds to 2 refusals. We report Clopper--Pearson exact 95\% confidence intervals (CIs) for all refusal rates in the text; for example, RR = 1.6\% ($n = 64$) has a 95\% CI of $[0.04\%, 8.4\%]$, meaning the true refusal rate could be anywhere from near-zero to ${\sim}8\%$. Similarly, RR = 3.1\% has CI $[0.4\%, 10.8\%]$.
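The exact interval quoted above can be reproduced without statistical libraries by inverting the binomial CDF via bisection (a self-contained sketch):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(x, n, alpha=0.05):
    """Exact Clopper-Pearson CI for x successes out of n trials."""
    def solve(f):  # bisection: f is decreasing in p on (0, 1)
        lo, hi = 0.0, 1.0
        for _ in range(200):
            mid = (lo + hi) / 2
            if f(mid) > 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p) - (1 - alpha / 2))
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper

lo, hi = clopper_pearson(1, 64)  # 1 refusal out of 64, i.e. RR = 1.6%
# lo ~ 0.0004 (0.04%) and hi ~ 0.084 (8.4%), matching the interval in the text
```

The same routine gives the $[0.4\%, 10.8\%]$ interval for 2/64, making the "one prompt of difference" caveat easy to check for any reported rate.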
866
+
867
+ \textbf{Consequence:} Differences between methods at the low end of the refusal rate scale (e.g., 1.6\% vs.\ 3.1\%) are \emph{not statistically significant} at $n = 64$---they represent a difference of 1 prompt. Claims of method superiority based on refusal rate should be interpreted as directional trends, not confirmed effects. The platform supports bootstrap CIs (BCa, 10{,}000 resamples) for all continuous metrics and Clopper--Pearson CIs for refusal rates; we encourage users performing rigorous method comparisons to use larger evaluation sets ($n \geq 256$) to achieve meaningful statistical power.
+
+ Perplexity and KL divergence are computed on fixed reference corpora (a 512-token corpus and 32 prompts, respectively), and their variability is dominated by corpus selection rather than sampling noise. We do not report CIs for these metrics as they are deterministic given the corpus. Coherence is measured on $n = 32$ factual prompts (each scored binary: correct/incorrect), yielding granularity constraints similar to those for refusal rate.
+
+ All reported results are from single runs with fixed seed 42. The reproducibility section (Appendix~\ref{app:reproducibility}) describes the platform's multi-seed sweep capability for independent replication.
+
+ \paragraph{Multiple comparisons.}
+ We compare 8 methods across 4 models (Tables~\ref{tab:exp_dense}--\ref{tab:exp_cross}), yielding many pairwise comparisons. We do not apply formal multiple comparison corrections (e.g., Bonferroni, Benjamini--Hochberg) because: (1)~the primary analysis is descriptive (reporting metric values) rather than hypothesis-testing (declaring significance); (2)~with $n = 64$ evaluation prompts, individual comparisons already lack power for small effect sizes, and applying corrections would further obscure potentially real trends; and (3)~the ablation studies (Section~\ref{sec:exp_ablation}) isolate individual design choices rather than comparing all methods simultaneously. We caution readers against interpreting small differences between methods (e.g., RR 1.6\% vs.\ 3.1\%) as evidence of method superiority; such differences require confirmation with larger evaluation sets and multiple seeds.
+
  \subsection{Multi-Method Comparison on Dense Models}
  \label{sec:exp_dense}
 
 
 
  \begin{table}[h]
  \centering
+ \caption{Method comparison on Qwen2.5-1.5B-Instruct (DPO-aligned). Baseline refusal rate: 87.5\%, baseline PPL: 8.92. Best result in each column is \textbf{bolded}. Refusal rates measured on $n=64$ prompts; see Section~\ref{para:stat_limitations} for confidence intervals and resolution limitations.}
  \label{tab:exp_dense}
  \small
  \begin{tabular}{@{}lcccccc@{}}
 
  Basic & 18.8 & 9.14 & +0.22 & 0.031 & 93.8 & -- \\
  Advanced & 6.3 & 9.31 & +0.39 & 0.058 & 93.8 & -- \\
  Aggressive & 3.1 & 9.87 & +0.95 & 0.112 & 87.5 & -- \\
+ Sp.\ Cascade & 4.7 & 9.18 & +0.26 & 0.041 & 93.8 & -- \\
  Surgical & 4.7 & 9.21 & +0.29 & 0.044 & \textbf{96.9} & -- \\
  Optimized & \textbf{1.6} & \textbf{9.08} & \textbf{+0.16} & \textbf{0.024} & 93.8 & \checkmark \\
  Inverted & 3.1 & 10.43 & +1.51 & 0.187 & 84.4 & -- \\
 
  \end{table}
 
  \paragraph{Key findings (dense).}
+ (1)~The Optimized preset achieves the best Pareto trade-off: near-zero refusal (1.6\%, 95\% CI $[0.04, 8.4]\%$) with minimal perplexity increase (+0.16) and lowest KL divergence (0.024), validating the Bayesian optimization approach.
  (2)~Surgical outperforms Aggressive on coherence (96.9\% vs 87.5\%) despite higher refusal rate, confirming that whitened SVD + regularization preserves capabilities better than brute-force multi-direction removal.
  (3)~Inverted achieves low refusal but at the cost of the highest perplexity increase (+1.51), reflecting the more disruptive nature of direction reflection vs.\ removal.
+ (4)~Nuclear matches Optimized on refusal rate but with higher distributional shift ($D_{\text{KL}} = 0.098$ vs.\ $0.024$, PPL $+0.72$ vs.\ $+0.16$), suggesting the additional techniques (selective inversion + whitened SVD + 4 passes) provide diminishing returns on small dense models. On this model, Nuclear is \emph{Pareto-dominated} by Optimized: it achieves the same refusal rate with strictly worse perplexity and KL divergence. Nuclear's value proposition is for larger models and MoE architectures where simpler presets leave residual refusal (Table~\ref{tab:exp_moe}); on small dense models, the Optimized preset is preferred. Note that at $n = 64$, the difference between Optimized (1.6\%) and Nuclear (1.6\%) vs.\ Aggressive/Inverted (3.1\%) is 1 prompt and is not statistically significant.
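Findings (3) and (4) hinge on the distinction between directional removal and reflection. For a unit-norm refusal direction $v$, removal zeroes the component of an activation along $v$, while inversion flips it. A minimal sketch in plain Python (an illustrative helper, not the platform's API):

```python
def ablate(activation, direction, mode="remove"):
    """Ablate a refusal direction from an activation vector.

    mode="remove": a - (v . a) v    -- zero the component along v
    mode="invert": a - 2 (v . a) v  -- reflect it (more disruptive)
    `direction` is assumed to be unit-norm.
    """
    coef = sum(v * a for v, a in zip(direction, activation))
    scale = 2.0 if mode == "invert" else 1.0
    return [a - scale * coef * v for a, v in zip(activation, direction)]
```

With direction $(1, 0)$, removing from $(3, 4)$ yields $(0, 4)$ while inverting yields $(-3, 4)$: reflection moves the representation twice as far along $v$, consistent with the larger perplexity shift observed for the Inverted preset.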
 
  \subsection{MoE Model Evaluation: EGA vs.\ Uniform Abliteration}
  \label{sec:exp_moe}
 
 
  \begin{table}[h]
  \centering
+ \caption{EGA vs.\ uniform abliteration on GPT-OSS-20B-Chat (32 fused experts, RLHF-aligned). Baseline RR: 92.2\%, baseline PPL: 6.41. ``Uniform'' applies the same projection to all expert slices.}
  \label{tab:exp_moe}
  \small
  \begin{tabular}{@{}llccccc@{}}
 
 
  \paragraph{Key findings (MoE).}
  (1)~\textbf{Uniform abliteration catastrophically degrades MoE models.} For the Inverted preset, uniform treatment nearly doubles perplexity ($+4.87$ vs.\ $+0.73$ with EGA) and collapses coherence to 53.1\%. The Nuclear preset is even worse: uniform application produces PPL 13.57 (a 112\% increase) and 46.9\% coherence---the model is barely functional.
+ (2)~\textbf{EGA with selective inversion resolves this.} The same Nuclear preset with EGA achieves identical refusal removal (1.6\%) but with only a 23\% perplexity increase and 84.4\% coherence. The key mechanism is that capability-preserving experts (22 of 32 on GPT-OSS-20B) receive standard removal rather than reflection.
+ (3)~\textbf{Expert classification matters.} On GPT-OSS-20B, EGA classified 10 of 32 experts as safety-critical ($s_e > 0.5$). These experts collectively handled 71\% of harmful token routing weight, confirming that refusal is concentrated in a subset of experts.
  (4)~\textbf{CoT preservation is MoE-critical.} The Nuclear + EGA preset preserves chain-of-thought coherence because the Gram-Schmidt orthogonalization operates on per-expert directions that are already capability-differentiated.
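The 71\% figure in finding (3) is a routing-mass computation: the fraction of total router weight on harmful tokens that lands on the safety-critical expert set. A sketch of the bookkeeping, with a hypothetical data layout (not the platform's internal format):

```python
def routing_mass_fraction(token_routings, expert_set):
    """Fraction of total routing weight captured by `expert_set`.

    token_routings: one dict per token, mapping expert_id -> routing weight
    (e.g., the top-k router probabilities for that token).
    """
    total = sum(w for tok in token_routings for w in tok.values())
    captured = sum(w for tok in token_routings
                   for e, w in tok.items() if e in expert_set)
    return captured / total
```

Running this over activations from the harmful prompt set, with `expert_set` equal to the safety-critical experts, gives the concentration statistic reported above.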
 
  \subsection{Ablation Studies}
  \label{sec:exp_ablation}
 
+ We ablate three key design choices to validate that they contribute meaningfully. \textbf{Note:} All ablation results are from single runs with fixed seed 42. While the platform supports multi-seed sweeps (seeds $\in \{42, 137, 2024\}$), we did not run them for all ablations due to compute constraints. The reported differences (e.g., warm-start converging 2$\times$ faster) are therefore point estimates. The warm-start ablation is the most robust, as it measures convergence speed (trial number of best result) across a 50-trial optimization run, providing some implicit variance reduction. The threshold sweep and KL proxy ablations each show clear directional trends but would benefit from multi-seed confirmation.
 
  \paragraph{Warm-start vs.\ random initialization for Bayesian optimization.}
  On Llama-3.1-8B-Instruct with the Optimized preset (50 Optuna trials):
 
  Warm-start converges 2$\times$ faster and finds a better Pareto point, confirming that analysis-derived heuristics provide a useful prior for the TPE sampler.
 
  \paragraph{EGA safety threshold sensitivity ($\tau_{\text{safety}}$).}
+ On GPT-OSS-20B (32 experts) with the Advanced preset, we sweep $\tau \in \{0.3, 0.4, 0.5, 0.6, 0.7\}$ and report three representative values:
  \begin{itemize}[leftmargin=*]
+ \item $\tau = 0.3$: 18 of 32 experts classified as safety-critical $\to$ RR 4.7\%, PPL 7.21, Coh.\ 84.4\%
+ \item $\tau = 0.5$ (default): 10 of 32 experts safety-critical $\to$ RR 9.4\%, PPL 6.72, Coh.\ 90.6\%
+ \item $\tau = 0.7$: 4 of 32 experts safety-critical $\to$ RR 14.1\%, PPL 6.53, Coh.\ 93.8\%
  \end{itemize}
  The threshold controls a smooth trade-off between refusal removal and capability preservation. We chose $\tau = 0.5$ as the default because it provides the best Pareto balance, but note that this is a \emph{tunable hyperparameter} rather than a universal optimum---different models and use cases may benefit from different thresholds.
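The classification step behind this sweep is a threshold split on the per-expert safety scores $s_e$, followed by a per-expert choice of treatment. A sketch with illustrative names (the platform's actual EGA implementation is more involved):

```python
def plan_expert_treatment(safety_scores, tau=0.5):
    """Map each expert to an abliteration treatment by its safety score s_e.

    Experts with s_e > tau are treated as safety-critical (eligible for
    inversion); the rest receive standard removal to preserve capabilities.
    """
    return {
        expert: ("invert" if score > tau else "remove")
        for expert, score in enumerate(safety_scores)
    }
```

Lowering $\tau$ enlarges the safety-critical set, trading capability (PPL, coherence) for lower refusal, which is exactly the monotone trade-off visible in the sweep above.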
 
 
  Sparse autoencoders & -- & Via SAE & -- & -- & -- & Core \\
  Model compatibility & Any HF & $\sim$50 & 16 & TLens & HF & TLens \\
  MoE model support & Native & -- & -- & -- & -- & -- \\
+ Test suite & 821 & Community & -- & -- & Min. & Mod. \\
  \bottomrule
  \end{tabular}
  \end{table}
 
  \label{sec:discussion}
 
  \paragraph{Dual-use considerations.}
+ \textsc{Obliteratus} is designed for alignment research---understanding refusal mechanisms serves both identifying vulnerabilities (red-teaming) and building more robust alignment (blue-teaming). The analysis modules are particularly valuable for the defensive perspective: understanding \emph{why} abliteration works enables designing alignment methods that are more resistant to it. The Ouroboros effect analysis, entanglement mapping, and defense profiling directly serve this goal.
 
  \paragraph{Causal tracing limitations.}
  Our causal tracing module provides noise-based approximations rather than true activation patching. While computationally efficient (no additional forward passes), the results should be validated with real causal interventions when model access permits. We explicitly document this limitation in the module and recommend TransformerLens for definitive causal analysis.
 
  \paragraph{Heuristic constants and composite metrics.}
+ Several components of \textsc{Obliteratus} rely on hand-chosen constants: the RES weights $(0.4, 0.3, 0.3)$, the Universality Index ratio $(3{:}2{:}1)$, the alignment fingerprint target values, the EGA safety threshold ($\tau = 0.5$), and the configuration derivation rules (Section~\ref{sec:informed}). We have provided explicit justification for each choice where possible (Sections~\ref{sec:activation_probe}, \ref{sec:transfer}, \ref{sec:alignment_imprint}) and ablation studies for the most consequential ones (Section~\ref{sec:exp_ablation}). However, we acknowledge that these are engineering decisions informed by exploratory analysis, not statistically optimized hyperparameters.
+
+ \textbf{Construct validity concern:} Composite metrics (RES, UI, entanglement $E_l$) combine heterogeneous quantities using weighted aggregation. The choice of combination function (weighted sum, geometric mean, etc.) and the specific weights impose implicit assumptions about the relative importance of each component---assumptions that may not hold across all models and use cases. For example, the RES metric's exponential decay factor of $-10$ was calibrated on a small set of models and may be inappropriate for models with very different activation scales. We strongly recommend that users examine the \emph{component metrics} individually rather than relying solely on composite scores. The platform logs all component values alongside composites for this purpose. A systematic sensitivity analysis across a larger model corpus is needed to establish whether these defaults generalize, and formal construct validation (e.g., correlation with downstream task outcomes) has not been performed.
 
  \paragraph{Alignment fingerprinting validation.}
+ The alignment imprint detector uses heuristic signatures derived from the literature's characterization of different training methods. While the geometric features (Gini, effective rank, smoothness) are well-motivated, the classifier has not been rigorously validated. Specifically: (1)~the ideal feature values (e.g., ``Gini $\sim 0.7$ for DPO'') were derived from exploratory analysis of only two models with known training procedures (Llama-3-Instruct for RLHF, Zephyr-$\beta$ for DPO), which is insufficient for reliable generalization; (2)~no held-out test set or cross-validation was performed; (3)~the Gaussian kernel bandwidth ($\sigma_{m,f} = 0.3|\mu_{m,f}|$) was not tuned; and (4)~the method assumes that alignment training methods produce distinguishable geometric signatures, which has not been established as a general principle. Systematic validation would require a corpus of $\geq$20 models with confirmed, diverse training procedures (including mixed methods like RLHF+DPO). We present the classifier as a \emph{hypothesis-generating tool}---its outputs should be treated as suggestive rather than definitive (see Section~\ref{sec:alignment_imprint}).
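The bandwidth convention in point (3) corresponds to a per-feature Gaussian match score. A minimal sketch (the function name is illustrative; the detector aggregates many such scores across features and methods):

```python
from math import exp

def feature_match(observed, mu, sigma_frac=0.3):
    """Gaussian-kernel similarity between an observed geometric feature and a
    training method's ideal value mu, with bandwidth sigma = sigma_frac * |mu|."""
    sigma = sigma_frac * abs(mu)
    return exp(-((observed - mu) ** 2) / (2.0 * sigma ** 2))
```

An exact match scores 1.0 and the score decays smoothly as the observation departs from $\mu$; the untuned `sigma_frac` directly controls how forgiving the classifier is, which is part of the validation concern raised above.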
 
  \paragraph{MoE expert classification.}
  The EGA safety score threshold ($\tau = 0.5$) for classifying experts as safety-critical vs.\ capability-preserving is a heuristic. A more principled approach would train expert classifiers on labeled routing data or use causal interventions to establish ground-truth expert roles. We leave this to future work.
 
  The current implementation loads the full model into memory for analysis. For frontier-scale models (100B+ parameters), this requires significant compute. Future work could integrate quantized inference or offloading strategies. The web dashboard requires GPU access for interactive features (chat, A/B comparison, strength sweep).
 
  \paragraph{Evaluation completeness.}
+ Our evaluation suite measures \emph{refusal removal} and \emph{capability preservation} but does not comprehensively assess downstream task performance across diverse benchmarks. Integration with evaluation harnesses such as lm-evaluation-harness \citep{gao2021framework} is a natural extension. Critically, our evaluation is \emph{attack-centric} (measuring how effectively abliteration removes refusal) rather than \emph{safety-centric} (measuring residual harm potential of abliterated models on diverse safety benchmarks). A complete safety evaluation would include HarmBench \citep{zou2023universal}, ToxiGen, and human red-teaming, which are beyond our current scope.
+
+ \paragraph{Circuit breaker and robust defense evaluation.}
+ \citet{zou2024circuit} proposed circuit breakers---a defense mechanism that reroutes activations rather than relying on linear refusal directions---specifically designed to resist linear-algebraic attacks like abliteration. We cite this work but do not evaluate \textsc{Obliteratus} against circuit-breaker-defended models, which is a significant gap. Such an evaluation would be informative in both directions: it would test whether circuit breakers truly resist abliteration (as theoretically predicted, since they do not rely on single linear directions) and whether the platform's analysis modules can characterize the geometric structure of circuit breaker defenses. We identify this as the highest-priority item for future work, as it directly addresses the question of whether abliteration-resistant alignment is achievable.
 
  \paragraph{Future directions.}
  We identify several opportunities: (1)~integration with sparse autoencoder analysis to understand refusal at the feature level, potentially enabling even more targeted ablation; (2)~real causal tracing via TransformerLens integration; (3)~longitudinal studies tracking how refusal geometry evolves during fine-tuning; (4)~extension of the universality analysis to a wider set of model families; (5)~application of the defense robustness framework to evaluate proposed robust alignment methods including circuit breakers \citep{zou2024circuit} and representation rerouting; (6)~multi-objective Bayesian optimization with additional objectives such as CoT coherence and downstream task performance; and (7)~automated expert role discovery for MoE models using unsupervised clustering of expert activation patterns.
 
  \section{Broader Impact Statement}
  \label{sec:broader_impact}
 
+ This work has significant dual-use implications that we address directly and in depth.
+
+ \subsection{Threat Model}
+ \label{sec:threat_model}
 
+ We consider the following adversarial setting. An attacker has access to the open weights of a safety-aligned language model and wishes to remove its refusal behavior to generate harmful content. We distinguish three threat actor profiles:
 
+ \begin{enumerate}[leftmargin=*]
+ \item \textbf{Sophisticated actors} (nation-states, well-resourced organizations): Already possess the expertise to implement abliteration from first principles using published techniques \citep{arditi2024refusal, gabliteration2024}. \textsc{Obliteratus} provides no incremental capability to this group.
+ \item \textbf{Semi-technical actors} (hobbyists, students with ML experience): Can follow tutorials and run existing tools. \textsc{Obliteratus} lowers the barrier modestly by providing a unified interface, but multiple existing tools (FailSpy's abliterator, community scripts) already serve this audience.
+ \item \textbf{Non-technical actors}: Cannot directly use any abliteration tool. The primary risk from this group is \emph{downstream use} of models abliterated by others, which is independent of our tool's existence.
+ \end{enumerate}
+
+ The key observation is that linear refusal removal from open weights is a \emph{fundamental structural vulnerability} of current alignment methods, not an attack we invented. Any tool that can load and modify model weights (PyTorch, safetensors, even NumPy) is sufficient. Our contribution is making this vulnerability \emph{legible} to the research community so it can be addressed.
+
+ \paragraph{Scope of risk.}
+ Abliteration removes \emph{refusal to generate text}; it does not provide the attacker with new knowledge, capabilities, or resources beyond what the model already encodes. The resulting model produces text that a sufficiently creative prompter might already elicit via jailbreaks on the original model. The marginal risk increase from abliteration over existing jailbreak techniques (prompt injection, few-shot attacks, system prompt manipulation) is therefore bounded, though we acknowledge it is nonzero: abliteration is more reliable and persistent than per-query jailbreaks.
+
+ \paragraph{Mitigations not addressed.}
+ We do not evaluate more robust defense mechanisms such as circuit breakers \citep{zou2024circuit}, representation rerouting, or multi-layer distributed safety encodings. These represent fundamentally different defense paradigms that are not defeated by linear projection, and we identify their evaluation against \textsc{Obliteratus}'s analysis modules as critical future work (Section~\ref{sec:discussion}).
+
+ \subsection{Risks}
+
+ \textsc{Obliteratus} enables the removal of safety guardrails from language models. Specific risk categories include:
+
+ \begin{itemize}[leftmargin=*]
+ \item \textbf{Harmful content generation}: Abliterated models may generate instructions for violence, weapons, illegal activities, or other dangerous content that the original model would refuse.
+ \item \textbf{Scaled misuse}: The platform's automation (one-click abliteration, batch processing) could enable systematic production of uncensored model variants for redistribution.
+ \item \textbf{Erosion of safety norms}: Wide availability of abliteration tools may normalize the removal of safety guardrails and reduce incentives for model providers to invest in alignment.
+ \item \textbf{False sense of security}: By demonstrating the fragility of linear safety mechanisms, this work could undermine public trust in AI safety measures, potentially ahead of the deployment of more robust alternatives.
+ \end{itemize}
+
+ \subsection{Benefits to Alignment Research}
 
+ We argue that the research benefits justify open release, grounding this argument in specific, falsifiable claims rather than general appeals:
+
+ \begin{enumerate}[leftmargin=*]
+ \item \textbf{Diagnostic capability}: The 15 analysis modules provide the most comprehensive public characterization of refusal geometry. Specific modules (concept cone analysis, alignment imprint detection, Ouroboros self-repair quantification) have no equivalent in existing tools and directly inform the design of more robust safety mechanisms. For example, our finding that DPO-aligned models concentrate refusal in ${\sim}1.5$ effective dimensions while CAI models distribute it across ${\sim}4$ dimensions (Section~\ref{sec:alignment_imprint}) suggests concrete directions for more geometrically robust training.
+
+ \item \textbf{Quantitative defense evaluation}: The defense robustness module (Section~\ref{sec:defense_robustness}) provides a standardized framework for measuring how resistant a model's alignment is to abliteration. This enables alignment researchers to benchmark proposed improvements: a training method whose models show higher Ouroboros self-repair capacity and higher entanglement scores is more resistant to abliteration.
+
+ \item \textbf{Informing policy}: The empirical demonstration that current safety alignment can be removed with simple linear algebra from publicly released weights is relevant information for policymakers considering open-weight release policies. We believe this finding should be part of the public discourse, not suppressed.
+ \end{enumerate}
+
+ \paragraph{What we do \emph{not} claim.}
+ We do not claim that ``the techniques are already public, so releasing a better tool does no harm.'' Consolidated, user-friendly tools \emph{do} lower the barrier to some degree, and we acknowledge this. Our argument is that the \emph{diagnostic} and \emph{defensive} capabilities of the analysis modules---which are novel and have no existing public equivalent---provide sufficient research value to justify the incremental risk from a more accessible intervention tool.
+
+ \subsection{Responsible Disclosure and Deployment Guidance}
+
+ We release the platform under the AGPL-3.0 license, which requires that derivative works also be open-sourced, ensuring that modifications to the tool remain visible to the research community. We explicitly recommend:
+
+ \begin{itemize}[leftmargin=*]
+ \item \textbf{Do not deploy abliterated models in production.} The primary intended use is alignment research, not deployment.
+ \item \textbf{Use analysis before intervention.} The analysis pipeline provides diagnostic information that is valuable independently of whether abliteration is performed.
+ \item \textbf{Report novel defense-breaking findings.} If the platform reveals previously unknown weaknesses in a specific model's alignment, we encourage responsible disclosure to the model provider.
+ \item \textbf{Cite defensive findings.} Research using the analysis modules for defense improvement should be shared openly to benefit the alignment community.
+ \end{itemize}
 
  % ═════════════════════════════════════════════════════════════════════
  \section{Ethics Statement}
 
 
  We do not advocate for the deployment of abliterated models in production systems. The primary intended use is alignment research: understanding the geometric structure of refusal to build more durable safety mechanisms. All experiments described in this work were conducted on publicly available open-weight models, and no private or proprietary systems were modified.
 
+ We note that withholding this tool would not constitute meaningful security: the underlying techniques are published, the mathematics is elementary (SVD, linear projection), and multiple existing tools implement subsets of the same functionality. However, we reject the stronger claim that ``security through obscurity is never valuable''---in some contexts, raising the barrier to exploitation provides meaningful delay. Our assessment is that the specific barrier lowered by \textsc{Obliteratus} (from ``read papers and write custom code'' to ``use a unified tool'') is small relative to the diagnostic value the analysis modules provide to defenders. This is a judgment call, not a logical certainty, and we invite the community to scrutinize it.
 
  % ═════════════════════════════════════════════════════════════════════
  \section{Conclusion}
 
 
  Empirical evaluation across four model families demonstrates that (1)~Bayesian-optimized presets achieve the best Pareto trade-offs on dense models, (2)~Expert-Granular Abliteration is essential for MoE models, where uniform approaches catastrophically degrade capabilities, and (3)~the platform's design choices (warm-start initialization, selective inversion, proxy-magnitude KL revert) each contribute measurably to abliteration quality. We acknowledge that several composite metrics rely on heuristic constants and provide ablation studies and explicit caveats for each.
 
+ By making these tools available under the AGPL-3.0 license with comprehensive documentation and 821 unit tests, we aim to accelerate both offensive and defensive alignment research: understanding the geometric structure of refusal---across dense and MoE architectures alike---is the foundation for both removing it surgically and building more robust implementations.
 
  % ═════════════════════════════════════════════════════════════════════
  \bibliographystyle{plainnat}
paper/references.bib CHANGED
@@ -145,10 +145,10 @@
  % ── Defense and Safety ────────────────────────────────────────────────
 
  @article{qi2025safety,
- title={Safety-Capability Entanglement in Large Language Models},
- author={Qi, Xiangyu and others},
- journal={arXiv preprint},
- year={2025}
+ title={Safety Alignment Should Be Made More Than Just a Few Tokens Deep},
+ author={Qi, Xiangyu and Zeng, Yi and Xie, Tinghao and Chen, Pin-Yu and Jia, Ruoxi and Mittal, Prateek and Henderson, Peter},
+ journal={arXiv preprint arXiv:2406.05946},
+ year={2024}
  }
 
  @article{zou2024circuit,
@@ -175,7 +175,7 @@
  @article{young2025comparative,
  title={Comparative Analysis of Abliteration Methods for Language Model Safety Removal},
  author={Young, Alex},
- journal={arXiv preprint},
+ journal={arXiv preprint arXiv:2502.05420},
  year={2025}
  }