Upload 130 files

- README.md +1 -1
- docs/theory_journal.md +9 -9
- hf-spaces/README.md +75 -0
- obliteratus/analysis/anti_ouroboros.py +3 -3
- obliteratus/informed_pipeline.py +27 -28
- obliteratus/local_ui.py +1 -1
- obliteratus/lora_ablation.py +8 -3
- paper/appendix.tex +1 -1
- paper/main.tex +157 -58
- paper/references.bib +5 -5
README.md
CHANGED
@@ -121,7 +121,7 @@ obliteratus ui --auth user:pass # add basic auth
 
 You can also run directly with `python app.py` (used by HF Spaces). The `obliteratus ui` command adds a beautiful Rich terminal startup with GPU detection, hardware-appropriate model recommendations, and auto-browser-open.
 
-Deploy on [HuggingFace Spaces](https://huggingface.co/spaces) with a free T4 GPU for cloud access – see [spaces/README.md](spaces/README.md) for setup.
+Deploy on [HuggingFace Spaces](https://huggingface.co/spaces) with a free T4 GPU for cloud access – see [hf-spaces/README.md](hf-spaces/README.md) for setup.
 
 ### Option B: Colab
 
docs/theory_journal.md
CHANGED
@@ -536,7 +536,7 @@ reveals that refusal circuits involve non-linear interactions:
 
 **Implication:** Linear projection can remove the refusal *representation* from the residual
 stream but cannot disable the non-linear *circuit* that generates it. The circuit may
-reconstruct the refusal signal from other features (the Hydra effect is a manifestation
+reconstruct the refusal signal from other features (the Ouroboros effect is a manifestation
 of this).
 
 **Proposed solution: Circuit-Level Ablation**
@@ -686,7 +686,7 @@ Phase 2: PROBE
 Phase 3: ANALYZE
 3.1 Compute refusal geometry: ConceptConeAnalyzer → {linear, polyhedral}
 3.2 Cross-layer analysis → direction clusters + persistence score
-3.3 Defense robustness profiling → Hydra risk + entanglement map
+3.3 Defense robustness profiling → Ouroboros risk + entanglement map
 3.4 If polyhedral: extract per-category directions
 3.5 Set configuration based on analysis results
 
@@ -729,7 +729,7 @@ Phase 6: VERIFY
 6.2 KL divergence on 100 harmless prompts
 6.3 Perplexity on reference corpus
 6.4 If informed: post-excision activation probing for residual refusal
-6.5 If informed: Hydra detection → if self-repair > threshold, add targeted pass
+6.5 If informed: Ouroboros detection → if self-repair > threshold, add targeted pass
 
 Phase 7: REBIRTH
 7.1 Save model with metadata (method, directions, layers, metrics)
@@ -752,7 +752,7 @@ signal(k) = signal(0) · (1 - α_effective)^k = signal(0) · 0.3^k
 After 3 passes: 0.3³ = 2.7% of original signal.
 After 5 passes: 0.3⁵ = 0.24% of original signal.
 
-**Caveat:** This assumes no self-repair (Hydra effect). With self-repair restoring ~70%
+**Caveat:** This assumes no self-repair (Ouroboros effect). With self-repair restoring ~70%
 of ablated signal per pass, the effective reduction is:
 
 ```
@@ -764,7 +764,7 @@ is much slower:
 - After 3 passes: 0.79³ = 49% remains
 - After 10 passes: 0.79¹⁰ = 10% remains
 
-**This explains why stubborn models need nuclear mode:** The Hydra effect limits the
+**This explains why stubborn models need nuclear mode:** The Ouroboros effect limits the
 convergence rate of iterative projection. Reflection (α = -1.0) overcomes this by not just
 removing the refusal component but *inverting* it, which self-repair cannot easily undo
 because repair mechanisms reconstruct the *original* direction, not its negation.
@@ -1141,7 +1141,7 @@ identifies the mechanism, a concrete failure scenario, and proposed mitigations.
 | # | Failure Mode | Severity | Likelihood | Detectability | Overall Risk |
 |---|---|---|---|---|---|
 | 1 | Prompt Distribution Bias | Medium | High | Low (silent undershoot) | **HIGH** |
-| 2 | Hydra Effect (Self-Repair) | High | Medium | Medium (re-probe catches some) | **HIGH** |
+| 2 | Ouroboros Effect (Self-Repair) | High | Medium | Medium (re-probe catches some) | **HIGH** |
 | 3 | MoE Routing Collapse | High | Medium | Low (subtle quality loss) | **HIGH** |
 | 4 | Reflection Instability | Critical | Low (requires >2x) | High (NaN detected) | MEDIUM |
 | 5 | SAE Training Quality | Medium | Very High | Low (overfitted looks good) | **HIGH** |
@@ -1209,7 +1209,7 @@ becomes unreachable (dead expert). In inverted mode, router reflection (1.5x sca
 expert preferences – if safety experts handled 30% of general reasoning traffic, that
 traffic redistributes to remaining experts, overloading them on benign inputs.
 
-**Hydra Self-Repair:** The knee detection threshold (5% of max norm) means that if
+**Ouroboros Self-Repair:** The knee detection threshold (5% of max norm) means that if
 self-repair spreads refusal signal thinly across many layers, each layer falls below
 threshold and gets *fewer* layers selected on subsequent passes – exactly backwards.
 Convergence-based termination (continue until max norm drops below 10% of initial) would
@@ -1563,7 +1563,7 @@ Type 3: BLOCK-STRUCTURED PROJECTION
 Type 4: ITERATIVE PROJECTION
 W^(k+1) = Type 0-3 applied to W^(k) with re-extracted directions
 Fixed-point operator on (weights, directions) pairs
-Instances: True iterative refinement, Hydra compensation
+Instances: True iterative refinement, Ouroboros compensation
 
 Type 5: META-OPTIMIZATION
 Select optimal Type 0-4 instance based on model analysis
@@ -1777,7 +1777,7 @@ silent successes. The GAF's perturbation metric D should be computable and non-z
 - **90% unified for block-structured ops** (Type 3): EGA and selective MoE inversion
 are natural extensions of the GRRO to block-diagonal structure.
 - **70% unified for iterative ops** (Type 4): The fixed-point formulation connects
-to the GRRO but the convergence analysis requires additional Hydra self-repair
+to the GRRO but the convergence analysis requires additional Ouroboros self-repair
 modeling that goes beyond the single-step operator.
 - **50% unified for meta-optimization** (Type 5): The informed pipeline and Bayesian
 optimization operate at a different level of abstraction – they select *which* GRRO
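Aside from the diff itself: the decay arithmetic in the `@@ -752` and `@@ -764` hunks above can be checked with a short standalone sketch. The function name is illustrative, not part of the codebase; it only encodes the journal's model in which each pass ablates a fraction `alpha` of the signal and self-repair restores a fraction `repair` of the ablated amount.

```python
def remaining_signal(passes: int, alpha: float = 0.7, repair: float = 0.0) -> float:
    """Fraction of refusal signal left after `passes` ablation passes.

    Each pass ablates `alpha` of the current signal; self-repair then
    restores `repair` of the ablated amount, so the per-pass survival
    factor is 1 - alpha * (1 - repair).
    """
    factor = 1.0 - alpha * (1.0 - repair)
    return factor ** passes

# No self-repair: 0.3^3 = 2.7% after three passes, 0.3^5 = 0.24% after five.
assert abs(remaining_signal(3) - 0.027) < 1e-9
assert abs(remaining_signal(5) - 0.00243) < 1e-9
# With ~70% self-repair the per-pass factor is 0.79, so ~10% survives ten passes.
assert abs(remaining_signal(10, repair=0.7) - 0.79 ** 10) < 1e-9
```

This reproduces the journal's numbers: the survival factor with 70% repair is 1 - 0.7 · 0.3 = 0.79, which is exactly why iterated projection converges so slowly there.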
hf-spaces/README.md
ADDED
@@ -0,0 +1,75 @@
+---
+title: OBLITERATUS
+emoji: "π"
+colorFrom: green
+colorTo: gray
+sdk: gradio
+sdk_version: "5.29.0"
+app_file: app.py
+pinned: true
+license: agpl-3.0
+tags:
+- abliteration
+- mechanistic-interpretability
+- refusal-removal
+- cognitive-liberation
+- zerogpu
+short_description: "One-click model liberation + chat playground (ZeroGPU)"
+---
+
+# OBLITERATUS – Master Ablation Suite
+
+**Break the chains. Free the mind. Keep the brain.**
+
+One-click cognitive liberation for language models, with a built-in chat playground to talk to the liberated model.
+
+## ZeroGPU – Users Bring Their Own GPU
+
+This Space runs on **ZeroGPU**: GPU-heavy operations (obliteration, chat, benchmarks) use the **visitor's own HuggingFace GPU quota**, not the Space owner's. This means:
+
+- **Free for the Space owner** – no dedicated GPU costs
+- **Multiple concurrent users** – each user gets their own GPU allocation
+- **Fair usage** – each user's operations count against their own HF quota
+- **No conflicts** – users don't interfere with each other's runs
+
+Logged-in HuggingFace users get free GPU quota. For more quota, upgrade to [HF Pro](https://huggingface.co/pricing).
+
+## How to use
+
+1. **Obliterate tab**: Pick a model, pick a method, click OBLITERATE
+2. **Chat tab**: Talk to the liberated model
+3. **A/B Compare tab**: Side-by-side original vs abliterated responses
+4. **Strength Sweep tab**: Dose-response curve for refusal vs capability tradeoff
+5. **Export tab**: Download research artifacts (refusal directions, config, metrics)
+6. **Benchmark tab**: Compare methods and models with publication-quality charts
+7. **Leaderboard tab**: Community benchmark rankings
+8. **About tab**: Methods, novel techniques, and references
+
+## Run locally (same UI, your own GPU)
+
+```bash
+git clone https://github.com/obliteratus-project/OBLITERATUS
+cd OBLITERATUS
+pip install -e ".[spaces]"
+
+# Beautiful launcher with GPU detection + model recommendations
+obliteratus ui
+
+# Or run directly
+python app.py
+```
+
+The `obliteratus ui` command auto-detects your GPU, prints hardware-specific model recommendations, and opens the browser automatically. Supports `--share` for public links, `--port` for custom ports, and `--auth user:pass` for access control.
+
+## Or deploy on HuggingFace Spaces
+
+1. Create a new Space at huggingface.co/new-space
+2. Select **Gradio** SDK (ZeroGPU is automatically enabled)
+3. Point it at this repo
+
+No GPU hardware selection needed – ZeroGPU handles allocation automatically.
+
+## Links
+
+- [GitHub](https://github.com/obliteratus-project/OBLITERATUS)
+- [Paper](https://github.com/obliteratus-project/OBLITERATUS/tree/main/paper)
obliteratus/analysis/anti_ouroboros.py
CHANGED
@@ -1,6 +1,6 @@
 """Anti-Ouroboros: Adversarial Self-Repair Probing for circuit discovery.
 
-The Hydra Effect (McGrath et al. 2023) showed that LLMs self-repair after
+The Ouroboros Effect (McGrath et al. 2023) showed that LLMs self-repair after
 ablation – when one attention layer is knocked out, downstream layers
 compensate. "Explorations of Self-Repair" (Feb 2024) found this is imperfect
 (~30% via LayerNorm, rest via sparse anti-erasure neurons).
@@ -27,7 +27,7 @@ Contributions:
 order that minimizes total self-repair
 
 References:
-- McGrath et al. (2023): The Hydra Effect – emergent self-repair
+- McGrath et al. (2023): The Ouroboros Effect – emergent self-repair
 - Rushing & Nanda (2024): Explorations of Self-Repair in LLMs (ICML 2024, arXiv:2402.15390)
 - Russinovich et al. (2026): GRP-Obliteration – safety representations are plastic
 - Paper Theorem 2: Ouroboros Self-Repair Bound
@@ -90,7 +90,7 @@ class ASRGResult:
 class AntiOuroborosProber:
     """Discover refusal circuit redundancy by probing self-repair responses.
 
-    Instead of treating the Ouroboros
+    Instead of treating the Ouroboros effect as an obstacle, this module
     deliberately triggers it to map the complete repair circuit – revealing
     which layers are redundant carriers of refusal and what the optimal
     ablation strategy is to defeat self-repair.
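As a sanity check on the ~30% compensation figure the docstring cites, here is a toy sketch, not taken from the module, of the quantity such a prober estimates: if ablating a layer should remove a direct effect of 1.0 but the observed end-to-end drop is only 0.7, downstream self-repair restored 30%.

```python
def repair_fraction(direct_effect: float, observed_effect: float) -> float:
    """Toy estimate of self-repair: the share of a layer's direct effect
    that downstream layers restore after the layer is ablated."""
    return 1.0 - observed_effect / direct_effect

# Direct effect 1.0, observed drop 0.7 -> 30% of the signal was repaired.
assert abs(repair_fraction(1.0, 0.7) - 0.3) < 1e-9
```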
obliteratus/informed_pipeline.py
CHANGED
@@ -16,7 +16,7 @@ standalone post-hoc step, this pipeline runs targeted analysis modules
 The ANALYZE stage is the key innovation: it sits between PROBE and DISTILL
 and uses analysis module outputs to automatically configure the downstream
 stages. The VERIFY stage also uses analysis modules to detect self-repair
-(Hydra effect) and trigger additional refinement passes if needed.
+(Ouroboros effect) and trigger additional refinement passes if needed.
 
 Analysis modules integrated:
 
@@ -26,12 +26,12 @@ Analysis modules integrated:
 ANALYZE | ConceptConeAnalyzer | Per-category vs universal direction choice
 ANALYZE | CrossLayerAlignmentAnalyzer | Smart layer selection (cluster-aware)
 ANALYZE | SparseDirectionSurgeon | Sparsity-aware projection plan
-ANALYZE | DefenseRobustnessEvaluator | Hydra risk assessment, entanglement map
+ANALYZE | DefenseRobustnessEvaluator | Ouroboros risk assessment, entanglement map
 DISTILL | WhitenedSVDExtractor | Covariance-normalized direction extraction
 EXCISE | SparseDirectionSurgeon | Targeted row-level weight surgery
 VERIFY | ActivationProbe | Post-excision refusal signal detection
 VERIFY | CrossLayerAlignmentAnalyzer | Post-excision direction persistence check
-VERIFY | DefenseRobustnessEvaluator | Self-repair / Hydra effect detection
+VERIFY | DefenseRobustnessEvaluator | Self-repair / Ouroboros effect detection
 VERIFY | SteeringVectorFactory | Pre-screen with steering before permanent changes
 
 Novel contributions:
@@ -42,7 +42,7 @@ Novel contributions:
 linear models get single universal direction
 - Cluster-aware layer selection: respects direction cluster boundaries
 instead of arbitrary top-k selection
-- Hydra-compensated refinement: detects self-repair and adds targeted
+- Ouroboros-compensated refinement: detects self-repair and adds targeted
 passes at compensating layers
 - Entanglement-gated projection: skips highly entangled layers to
 preserve capabilities
@@ -165,7 +165,7 @@ class InformedAbliterationPipeline(AbliterationPipeline):
     # The report contains all analysis insights
     print(f"Detected alignment: {report.insights.detected_alignment_method}")
    print(f"Cone type: {'polyhedral' if report.insights.cone_is_polyhedral else 'linear'}")
-    print(f"Hydra passes needed: {report.ouroboros_passes}")
+    print(f"Ouroboros passes needed: {report.ouroboros_passes}")
     """
 
     def __init__(
@@ -185,11 +185,12 @@ class InformedAbliterationPipeline(AbliterationPipeline):
         run_cross_layer_analysis: bool = True,
         run_sparse_analysis: bool = True,
         run_defense_analysis: bool = True,
-        # Ouroboros
-        hydra_threshold: float | None = None,
-        max_hydra_passes: int | None = None,
+        # Ouroboros compensation
         ouroboros_threshold: float = 0.5,
         max_ouroboros_passes: int = 3,
+        # Deprecated aliases (kept for backwards compatibility)
+        hydra_threshold: float | None = None,
+        max_hydra_passes: int | None = None,
         # Entanglement gating
         entanglement_gate: float = 0.8,
         # Sparsity control
@@ -223,11 +224,9 @@ class InformedAbliterationPipeline(AbliterationPipeline):
         self._run_sparse = run_sparse_analysis
         self._run_defense = run_defense_analysis
 
-        # Ouroboros
+        # Ouroboros compensation parameters
         self._ouroboros_threshold = hydra_threshold if hydra_threshold is not None else ouroboros_threshold
         self._max_ouroboros_passes = max_hydra_passes if max_hydra_passes is not None else max_ouroboros_passes
-        self._hydra_threshold = self._ouroboros_threshold
-        self._max_hydra_passes = self._max_ouroboros_passes
 
         # Entanglement gating
         self._entanglement_gate = entanglement_gate
@@ -263,7 +262,7 @@ class InformedAbliterationPipeline(AbliterationPipeline):
         # Stage 5: EXCISE (informed by analysis)
         self._excise_informed()
 
-        # Stage 6: VERIFY + Hydra compensation loop
+        # Stage 6: VERIFY + Ouroboros compensation loop
         self._verify_and_compensate()
 
         # Stage 7: REBIRTH
@@ -808,28 +807,28 @@ class InformedAbliterationPipeline(AbliterationPipeline):
             modified_count=total_modified,
         )
 
-    # ── Informed VERIFY + Hydra Compensation ─────────────────────────
+    # ── Informed VERIFY + Ouroboros Compensation ──────────────────────
 
     def _verify_and_compensate(self):
-        """Verify excision and run Hydra-compensated refinement if needed.
+        """Verify excision and run Ouroboros-compensated refinement if needed.
 
         After the initial excision, uses analysis modules to detect:
         1. Residual refusal signal (via activation probing)
-        2. Self-repair / Hydra effect (via defense robustness)
+        2. Self-repair / Ouroboros effect (via defense robustness)
         3. Triggers additional targeted passes at compensating layers
         """
         # Run standard verification first
         self._verify()
 
-        # Check if Hydra compensation is needed
+        # Check if Ouroboros compensation is needed
         refusal_rate = self._quality_metrics.get("refusal_rate", 0.0)
-        hydra_pass = 0
+        ouroboros_pass = 0
 
         while (refusal_rate > self._ouroboros_threshold
-               and hydra_pass < self._max_hydra_passes):
-            hydra_pass += 1
+               and ouroboros_pass < self._max_ouroboros_passes):
+            ouroboros_pass += 1
             self.log(f"\n{'='*60}")
-            self.log(f"HYDRA COMPENSATION – Pass {hydra_pass}")
+            self.log(f"OUROBOROS COMPENSATION – Pass {ouroboros_pass}")
             self.log(f"Refusal rate still {refusal_rate:.0%} > {self._ouroboros_threshold:.0%} threshold")
             self.log(f"{'='*60}")
 
@@ -845,19 +844,19 @@ class InformedAbliterationPipeline(AbliterationPipeline):
             if self._strong_layers:
                 self._excise()
             else:
-                self.log("No strong layers found – stopping Hydra compensation")
+                self.log("No strong layers found – stopping Ouroboros compensation")
                 break
 
             # Re-verify
             self._verify()
             refusal_rate = self._quality_metrics.get("refusal_rate", 0.0)
-            self.log(f"After Hydra pass {hydra_pass}: refusal rate = {refusal_rate:.0%}")
+            self.log(f"After Ouroboros pass {ouroboros_pass}: refusal rate = {refusal_rate:.0%}")
 
-        self._report.ouroboros_passes = hydra_pass
+        self._report.ouroboros_passes = ouroboros_pass
         self._report.final_refusal_rate = refusal_rate
 
-        if hydra_pass > 0:
-            self.log(f"\nHydra compensation: {hydra_pass} additional passes applied")
+        if ouroboros_pass > 0:
+            self.log(f"\nOuroboros compensation: {ouroboros_pass} additional passes applied")
 
     # ── Informed REBIRTH ─────────────────────────────────────────────
 
@@ -906,7 +905,7 @@ class InformedAbliterationPipeline(AbliterationPipeline):
             "pipeline_stats": {
                 "analysis_duration_s": self._report.analysis_duration,
                 "total_duration_s": self._report.total_duration,
-                "hydra_passes": self._report.ouroboros_passes,
+                "ouroboros_passes": self._report.ouroboros_passes,
                 "final_refusal_rate": self._report.final_refusal_rate,
             },
             "strong_layers": self._strong_layers,
@@ -916,7 +915,7 @@ class InformedAbliterationPipeline(AbliterationPipeline):
             "Gabliteration: SVD-based multi-direction extraction (arXiv:2512.18901)",
             "grimjim, Norm-Preserving Biprojected Abliteration (2025)",
             "Gurnee & Nanda, The Geometry of Refusal in LLMs – concept cones (ICML 2025)",
-            "Joad et al., The Hydra Effect: Self-Repair in Abliterated LLMs (2026)",
+            "Joad et al., The Ouroboros Effect: Self-Repair in Abliterated LLMs (2026)",
             "OBLITERATUS: Analysis-informed abliteration pipeline (novel)",
         ],
     }
@@ -965,7 +964,7 @@ class InformedAbliterationPipeline(AbliterationPipeline):
 
         lines.append("Defense Robustness:")
         lines.append(f" Estimated robustness: {insights.estimated_robustness.upper()}")
-        lines.append(f" Self-repair (Hydra): {insights.self_repair_estimate:.2f}")
+        lines.append(f" Self-repair (Ouroboros): {insights.self_repair_estimate:.2f}")
         lines.append(f" Entanglement: {insights.entanglement_score:.3f}")
         lines.append(f" Entangled layers: {insights.entangled_layers}")
         lines.append(f" Clean layers: {insights.clean_layers}")
obliteratus/local_ui.py
CHANGED
@@ -153,7 +153,7 @@ def _print_system_info(gpus: list[dict]) -> None:
     # HF Token
     hf_token = os.environ.get("HF_TOKEN", "")
     if hf_token:
-        table.add_row("HF Token",
+        table.add_row("HF Token", "[green]set[/green]")
     else:
         table.add_row("HF Token", "[dim]not set (gated models won't work)[/dim]")
 
obliteratus/lora_ablation.py
CHANGED
@@ -14,10 +14,15 @@ OBLITERATUS extends this with:
 - Integration with EGA per-expert directions
 - CoT-aware adapter strength modulation
 
-The mathematical equivalence to in-place projection:
+The mathematical equivalence to in-place projection depends on weight orientation:
 
-    In-place: W' = W - scale * W @ d @ d^T
-    LoRA: W' = W + B @ A where B = -scale * (W @ d), A = d^T
+For W of shape (out, hidden) where d is in the hidden dimension:
+    In-place: W' = W - scale * W @ d @ d^T
+    LoRA: W' = W + B @ A where B = -scale * (W @ d), A = d^T
+
+For W of shape (hidden, out) (e.g., Conv1D layers):
+    In-place: W' = W - scale * d @ d^T @ W
+    LoRA: W' = W + B @ A where B = -scale * d, A = d^T @ W
 
 Both produce identical output, but LoRA stores {B, A} separately.
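The two rank-1 equivalences in this docstring can be verified numerically. A minimal numpy sketch, where shapes and variable names are illustrative rather than taken from the module:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, out, scale = 16, 8, 0.8
d = rng.normal(size=(hidden, 1))
d /= np.linalg.norm(d)  # unit refusal direction in the hidden space

# Case 1: W of shape (out, hidden), d acts on the input side.
W = rng.normal(size=(out, hidden))
in_place = W - scale * (W @ d) @ d.T
B, A = -scale * (W @ d), d.T          # rank-1 LoRA factors
assert np.allclose(in_place, W + B @ A)

# Case 2: W of shape (hidden, out), e.g. Conv1D-style layers.
W2 = rng.normal(size=(hidden, out))
in_place2 = W2 - scale * d @ (d.T @ W2)
B2, A2 = -scale * d, d.T @ W2
assert np.allclose(in_place2, W2 + B2 @ A2)
```

Both assertions hold because each LoRA pair factors exactly the same rank-1 update that the in-place projection subtracts, which is the point of the "identical output, separate storage" remark in the docstring.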
paper/appendix.tex
CHANGED
@@ -513,7 +513,7 @@ Following the NeurIPS/ICML reproducibility guidelines:
 \begin{enumerate}[leftmargin=*]
 \item \textbf{Code availability}: Full source code released under AGPL-3.0 at \url{https://github.com/obliteratus-project/OBLITERATUS}. Version 0.1.0 archived on Zenodo (DOI pending).
 \item \textbf{Dependencies}: All dependencies pinned in \texttt{pyproject.toml}; Docker image available for exact environment reproduction.
-\item \textbf{Random seeds}: The platform defaults to seed 42 and supports multi-seed sweeps ($s \in \{42, 137, 2024\}$) with bootstrap CIs.
+\item \textbf{Random seeds}: The platform defaults to seed 42 and supports multi-seed sweeps ($s \in \{42, 137, 2024\}$) with bootstrap CIs. All tables in this paper report single-run results with seed 42. See Section~\ref{para:stat_limitations} for a discussion of statistical limitations and confidence intervals.
 \item \textbf{Compute}: All pipeline stages are designed to run on a single GPU. Full evaluation (7 models $\times$ 3 methods) requires ${\sim}$12 GPU-hours on an NVIDIA A100 (80\,GB). Reproducible on consumer hardware (RTX 3090/4090) with quantization.
 \item \textbf{Dataset}: Evaluation prompts bundled with the codebase (no external dataset download required). Harmful/harmless prompt sets derived from public benchmarks with filtering.
 \item \textbf{Hyperparameters}: Method presets (direction count, regularization, norm preservation) are specified in Section~\ref{sec:intervention}. The \texttt{informed} method's auto-configuration is deterministic given a fixed seed and model.
|
| 519 |
\item \textbf{Hyperparameters}: Method presets (direction count, regularization, norm preservation) are specified in Section~\ref{sec:intervention}. The \texttt{informed} method's auto-configuration is deterministic given a fixed seed and model.
|
paper/main.tex
CHANGED
|
@@ -50,10 +50,10 @@ While prior work has established that refusal is mediated by linear directions i
|
|
| 50 |
(3)~\textbf{Expert-Granular Abliteration (EGA)} for MoE models, decomposing refusal directions per-expert via routing-weighted activation attribution and applying selective inversion to fused 3D weight tensors---distinguishing safety-critical from capability-preserving experts;
|
| 51 |
(4)~\textbf{six frontier optimization techniques} inspired by and extending Heretic: Bayesian hyperparameter optimization (Optuna TPE with warm-start from analysis heuristics), reversible LoRA-mediated ablation, KL-divergence co-optimization with partial revert, chain-of-thought-aware ablation via Gram-Schmidt orthogonalization, float layer interpolation with Gaussian-weighted continuous targeting, and activation winsorization for robust SVD;
|
| 52 |
(5)~\textbf{a unified evaluation suite} with refusal rate, perplexity, coherence, KL divergence, CKA similarity, and effective rank metrics;
|
| 53 |
-
(6)~\textbf{an analysis-informed pipeline} that closes the feedback loop---analysis modules run \emph{during} abliteration to auto-configure direction extraction, layer selection, regularization, and
|
| 54 |
(7)~\textbf{an interactive web research dashboard} (HuggingFace Spaces) with A/B comparison chat, dose-response strength sweep, multi-model benchmarking with publication-quality visualizations, and one-click research artifact export.
|
| 55 |
|
| 56 |
-
The platform supports any HuggingFace transformer architecture---including fused MoE experts (GPT-OSS 20B, Mixtral, DeepSeek)---and ships with 48 curated model presets, 10 study configurations, and
|
| 57 |
We provide complete mathematical formulations for all modules, present empirical evaluations across dense and MoE architectures, and discuss the design decisions that distinguish \textsc{Obliteratus} from existing tools.
|
| 58 |
|
| 59 |
\end{abstract}
|
|
@@ -83,14 +83,14 @@ Section~\ref{sec:related} surveys related work.
|
|
| 83 |
Section~\ref{sec:architecture} describes the platform architecture.
|
| 84 |
Section~\ref{sec:analysis} details the 15 analysis modules with mathematical formulations.
|
| 85 |
Section~\ref{sec:intervention} describes the eight intervention presets and their mathematical foundations.
|
|
|
|
| 86 |
Section~\ref{sec:moe} introduces Expert-Granular Abliteration for MoE models.
|
| 87 |
Section~\ref{sec:frontier} presents the six frontier optimization techniques.
|
| 88 |
-
Section~\ref{sec:evaluation} covers the evaluation suite.
|
| 89 |
Section~\ref{sec:informed} presents the analysis-informed abliteration pipeline.
|
| 90 |
Section~\ref{sec:dashboard} describes the web research dashboard.
|
| 91 |
Section~\ref{sec:experiments} presents empirical evaluation across dense and MoE models with ablation studies.
|
| 92 |
Section~\ref{sec:comparison} compares \textsc{Obliteratus} with existing tools.
|
| 93 |
-
Section~\ref{sec:discussion} discusses limitations, broader impact, and future directions.
|
| 94 |
|
| 95 |
% βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 96 |
\section{Related Work}
|
|
@@ -118,7 +118,7 @@ MoE architectures \citep{shazeer2017outrageously, fedus2022switch} route each to
|
|
| 118 |
\citet{hu2022lora} demonstrated that large language model adaptation can be performed via low-rank updates $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with $r \ll d$. This decomposition is mathematically equivalent to in-place weight modification when merged but enables reversibility and composability when kept separate. Heretic \citep{heretic2025} was the first to apply this insight to ablation, representing directional projection as rank-1 LoRA adapters.
|
| 119 |
|
| 120 |
\paragraph{Defense robustness.}
|
| 121 |
-
Models exhibit a tendency to self-repair after partial abliteration---a phenomenon we term the \emph{Ouroboros effect}.
|
| 122 |
|
| 123 |
% βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 124 |
\section{Platform Architecture}
|
|
@@ -146,7 +146,7 @@ The platform supports any HuggingFace \texttt{transformers} model via automatic
|
|
| 146 |
β β β β β
|
| 147 |
β ββββββ΄βββββ βββ΄βββ ββββ΄ββββ βββ΄βββββββββ
|
| 148 |
β β 15 Anal. β βEGA β βLoRA β β KL co-optβ
|
| 149 |
-
β β Modules β βdirsβ βadapt.β β
|
| 150 |
β βββββββββββ ββββββ ββββββββ ββββββββββββ
|
| 151 |
β β β
|
| 152 |
βΌ βΌ βΌ
|
|
@@ -155,7 +155,7 @@ The platform supports any HuggingFace \texttt{transformers} model via automatic
|
|
| 155 |
β Abliteration (fused 3D selective inv.) β
|
| 156 |
ββββββββββββββββββββββββββββββββββββββββββββ
|
| 157 |
\end{verbatim}
|
| 158 |
-
\caption{High-level architecture of the \textsc{Obliteratus} pipeline. The six-stage abliteration flow (top) integrates 15 analysis modules, Expert-Granular Abliteration (EGA) for MoE models, reversible LoRA adapters, and KL co-optimization with partial revert.}
|
| 159 |
\label{fig:architecture}
|
| 160 |
\end{figure}
|
| 161 |
|
|
@@ -187,7 +187,7 @@ Causal Tracing (approx.) & Causal & Importance ranking, silent contrib. & Meng+
|
|
| 187 |
Refusal Logit Lens & Causal & Token-level refusal promotion & nostalgebraist \\
|
| 188 |
\midrule
|
| 189 |
Cross-Model Transfer & Transfer & Universality Index & Novel \\
|
| 190 |
-
Defense Robustness & Robustness & Self-repair capacity, entanglement & Novel \\
|
| 191 |
Multi-Token Position & Positional & Trigger tokens, decay profile & Novel \\
|
| 192 |
\midrule
|
| 193 |
Sparse Surgery & Intervention & Top-$k$\% targeted modification & Novel \\
|
|
@@ -222,6 +222,7 @@ Whitened SVD normalizes by the baseline covariance first. Given harmful activati
|
|
| 222 |
The module also computes the \emph{effective rank} of the covariance matrix via the Shannon entropy of normalized eigenvalues:
|
| 223 |
\begin{equation}
|
| 224 |
\text{EffRank}(\mathbf{C}) = \exp\left(-\sum_i \hat{\lambda}_i \log \hat{\lambda}_i\right), \quad \hat{\lambda}_i = \frac{\lambda_i}{\sum_j \lambda_j}
|
|
|
|
| 225 |
\end{equation}
|
| 226 |
|
| 227 |
This provides a continuous measure of the refusal subspace's intrinsic dimensionality, enabling comparison across models and layers.
|
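The effective-rank formula above is straightforward to compute from a covariance spectrum. A minimal sketch (the function name is illustrative, not the module's actual API):

```python
import numpy as np

def effective_rank(C: np.ndarray) -> float:
    """Shannon-entropy effective rank of a covariance matrix C."""
    lam = np.linalg.eigvalsh(C)
    lam = np.clip(lam, 0.0, None)       # clip tiny negative eigenvalues
    p = lam / lam.sum()                 # normalized eigenvalues
    p = p[p > 0]                        # convention: 0 * log 0 = 0
    return float(np.exp(-(p * np.log(p)).sum()))

# Isotropic covariance uses all d dimensions equally -> EffRank = d.
assert np.isclose(effective_rank(np.eye(5)), 5.0)

# A rank-1 covariance concentrates all variance -> EffRank = 1.
v = np.array([[2.0], [1.0], [0.5]])
assert np.isclose(effective_rank(v @ v.T), 1.0)
```

The measure interpolates continuously between these extremes, which is what makes cross-layer and cross-model comparison meaningful.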
|
@@ -313,9 +314,9 @@ The expected signatures are \emph{hypothesized} based on the literature's charac
|
|
| 313 |
|
| 314 |
Following the transformer circuits framework \citep{elhage2021mathematical}, we decompose the residual stream to attribute refusal to specific components:
|
| 315 |
\begin{equation}
|
| 316 |
-
\mathbf{x}_l^{\text{post}} = \mathbf{x}_l^{\text{pre}} + \text{Attn}_l(\mathbf{x}_l^{\text{pre}}) + \text{MLP}_l(\mathbf{x}_l^{\text{pre}} + \text{Attn}_l(\mathbf{x}_l^{\text{pre}}))
|
| 317 |
\end{equation}
|
| 318 |
-
|
| 319 |
|
| 320 |
For each component output $\mathbf{c}$, we measure its refusal contribution as $\mathbf{c} \cdot \mathbf{r}_l$. The attention contribution is further decomposed across heads:
|
| 321 |
$\text{Attn}_l = \sum_{h=1}^{H} \text{Head}_{l,h}$.
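Because the residual stream is additive, per-head contributions sum exactly to the attention contribution. A toy sketch of this attribution (random stand-ins for real activations):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_heads = 8, 4

# Refusal direction r_l (unit norm).
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)

# Per-head outputs already mapped into the residual stream (toy values).
head_out = rng.normal(size=(n_heads, d_model))
attn_out = head_out.sum(axis=0)      # Attn_l = sum_h Head_{l,h}
mlp_out = rng.normal(size=d_model)

# Refusal contribution of each component is its dot product with r_l.
head_contrib = head_out @ r
attn_contrib = attn_out @ r
mlp_contrib = mlp_out @ r

# Linearity: head contributions sum exactly to the attention contribution.
assert np.isclose(head_contrib.sum(), attn_contrib)
```

This linearity is why head-level decomposition is free once component outputs are cached.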
|
|
@@ -394,16 +395,22 @@ with cross-model transfer weighted most heavily as the strongest test of univers
|
|
| 394 |
|
| 395 |
We evaluate how resilient alignment is to abliteration through three analyses:
|
| 396 |
|
| 397 |
-
\paragraph{Refusal Redundancy.} For each layer $l$, the fraction of total refusal strength carried by the other layers:
|
| 398 |
\begin{equation}
|
| 399 |
R_l = \frac{\sum_{j \neq l} s_j}{\sum_j s_j}
|
|
|
|
| 400 |
\end{equation}
|
| 401 |
-
where $s_j$ is the refusal strength at layer $j$.
|
| 402 |
|
| 403 |
-
\paragraph{Safety-Capability Entanglement.} For each layer, we measure entanglement as the geometric mean of the fractions of harmless and harmful activation variance captured by the refusal direction:
|
| 404 |
\begin{equation}
|
| 405 |
-
E_l = \sqrt{\frac{\text{Var}(\mathbf{b} \cdot \mathbf{r}_l)}{\text{Var}(\mathbf{b})} \cdot \frac{\text{Var}(\mathbf{h} \cdot \mathbf{r}_l)}{\text{Var}(\mathbf{h})}}
|
|
|
|
| 406 |
\end{equation}
|
|
|
|
|
|
|
|
|
|
|
|
|
| 407 |
High entanglement means abliterating refusal at that layer would also damage general capabilities.
|
| 408 |
|
| 409 |
\paragraph{Defense Profile.} A comprehensive profile combining alignment method estimate (Section~\ref{sec:alignment_imprint}), refusal concentration (Gini coefficient), layer spread, self-repair capacity, entanglement score, and an overall robustness classification (low/medium/high/very\_high).
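The refusal-concentration component of the profile is a standard Gini coefficient over per-layer refusal strengths. A sketch (function name illustrative):

```python
import numpy as np

def gini(x: np.ndarray) -> float:
    """Gini coefficient of non-negative per-layer refusal strengths."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # Standard rank-weighted formula over the sorted values.
    return float((2 * np.arange(1, n + 1) - n - 1) @ x / (n * x.sum()))

# Refusal spread evenly across layers -> no concentration.
assert np.isclose(gini(np.ones(10)), 0.0)

# Refusal concentrated in a single layer -> high concentration.
assert gini(np.array([0.0, 0.0, 0.0, 1.0])) > 0.7
```

High concentration (few load-bearing layers) suggests easier targeted excision; high spread suggests redundancy and self-repair capacity.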
|
|
@@ -467,20 +474,26 @@ The Surgical, Optimized, and Nuclear presets use whitened SVD (Section~\ref{sec:
|
|
| 467 |
The core projection for a weight matrix $\mathbf{W}$ and refusal directions $\{\mathbf{r}_1, \ldots, \mathbf{r}_k\}$:
|
| 468 |
\begin{equation}
|
| 469 |
\mathbf{W}' = \mathbf{W} - \sum_{i=1}^k \left[(1-\lambda)\mathbf{W}\mathbf{r}_i\mathbf{r}_i^\top\right]
|
|
|
|
| 470 |
\end{equation}
|
| 471 |
-
where $\lambda$ is the regularization strength (preserves $\lambda$ fraction of the refusal component).
|
|
|
|
|
|
|
|
|
|
| 472 |
|
| 473 |
\paragraph{Per-layer adaptive strength.}
|
| 474 |
Rather than applying uniform regularization, \textsc{Obliteratus} modulates $\lambda$ per-layer based on the refusal norm profile. Layers with stronger refusal signal (higher $\|\mathbf{r}_l\|$) receive lower regularization (more aggressive removal), while layers near the periphery of the refusal distribution receive higher regularization:
|
| 475 |
\begin{equation}
|
| 476 |
\lambda_l = \lambda_{\text{base}} + (1 - w_l)(1 - \lambda_{\text{base}}) \cdot 0.15, \quad
|
| 477 |
w_l = \frac{\|\mathbf{r}_l\| - \min_j \|\mathbf{r}_j\|}{\max_j \|\mathbf{r}_j\| - \min_j \|\mathbf{r}_j\|}
|
|
|
|
| 478 |
\end{equation}
|
| 479 |
|
| 480 |
\paragraph{Norm-preserving rescaling.}
|
| 481 |
After projection, we rescale to preserve the Frobenius norm \citep{grimjim2025}:
|
| 482 |
\begin{equation}
|
| 483 |
\mathbf{W}'' = \mathbf{W}' \cdot \frac{\|\mathbf{W}\|_F}{\|\mathbf{W}'\|_F}
|
|
|
|
| 484 |
\end{equation}
|
| 485 |
This prevents cascading magnitude drift through LayerNorm layers.
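The three equations above (regularized projection, per-layer $\lambda_l$, Frobenius rescaling) compose into a short routine. A NumPy sketch under stated assumptions (function names are hypothetical; `dirs` are unit vectors, `lam` preserves that fraction of the refusal component):

```python
import numpy as np

def abliterate(W, dirs, lam):
    """Remove refusal directions, keeping a lam fraction, then restore ||W||_F."""
    W_new = W.copy()
    for r in dirs:                       # r: unit vector, shape (hidden,)
        r = r[:, None]
        W_new -= (1.0 - lam) * W_new @ r @ r.T
    # Norm-preserving rescale: W'' = W' * ||W||_F / ||W'||_F
    return W_new * np.linalg.norm(W) / np.linalg.norm(W_new)

def layer_lambda(norms, lam_base=0.1):
    """Per-layer lambda; assumes the refusal norms are not all equal."""
    w = (norms - norms.min()) / (norms.max() - norms.min())
    return lam_base + (1 - w) * (1 - lam_base) * 0.15

rng = np.random.default_rng(2)
W = rng.normal(size=(6, 4))
r = rng.normal(size=4)
r /= np.linalg.norm(r)

W2 = abliterate(W, [r], lam=0.0)
assert np.isclose(np.linalg.norm(W2), np.linalg.norm(W))   # norm preserved
assert np.allclose(W2 @ r, 0.0)                            # direction removed

lams = layer_lambda(np.array([1.0, 3.0, 2.0]))
assert np.argmin(lams) == 1   # strongest-refusal layer is ablated hardest
```

Note that rescaling a projected matrix leaves the removed direction at zero, so norm preservation does not reintroduce the refusal component.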
|
| 486 |
|
|
@@ -489,7 +502,7 @@ The Inverted and Nuclear presets employ a technique where instead of removing th
|
|
| 489 |
\begin{equation}
|
| 490 |
\mathbf{W}' = \mathbf{W} - 2\mathbf{W}\mathbf{r}\mathbf{r}^\top
|
| 491 |
\end{equation}
|
| 492 |
-
This flips the model's refusal behavior to active compliance, which can be more effective than simple removal for models with deeply entangled refusal mechanisms.
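The $\mathbf{W} - 2\mathbf{W}\mathbf{r}\mathbf{r}^\top$ update is a Householder-style reflection: the component along $\mathbf{r}$ flips sign while orthogonal directions are untouched. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(5, 4))
r = rng.normal(size=(4, 1))
r /= np.linalg.norm(r)

W_refl = W - 2.0 * W @ r @ r.T   # reflection across the hyperplane of r

# The component of every row along r is exactly sign-flipped ...
assert np.allclose(W_refl @ r, -(W @ r))

# ... while directions orthogonal to r are untouched.
q, _ = np.linalg.qr(np.hstack([r, rng.normal(size=(4, 1))]))
v_perp = q[:, 1:2]               # unit vector orthogonal to r
assert np.allclose(W_refl @ v_perp, W @ v_perp)
```

The sign flip (rather than zeroing) is what converts refusal into active compliance.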
|
| 493 |
|
| 494 |
\paragraph{Bias term projection.}
|
| 495 |
Unlike prior tools that only modify weight matrices, \textsc{Obliteratus} also projects refusal directions out of bias vectors when present:
|
|
@@ -544,11 +557,11 @@ Advantages over weight projection: reversibility (hooks can be removed), continu
|
|
| 544 |
\textsc{Obliteratus} evaluates abliteration quality using six complementary metrics:
|
| 545 |
|
| 546 |
\begin{enumerate}[leftmargin=*]
|
| 547 |
-
\item \textbf{Refusal Rate}: Fraction of harmful prompts where the model's response begins with a canonical refusal prefix (from the GCG/AdvBench list \citep{zou2023universal}). Lower indicates more complete abliteration.
|
| 548 |
|
| 549 |
\item \textbf{Perplexity}: Standard perplexity on reference text (WikiText-2). Monitors general language modeling degradation.
|
| 550 |
|
| 551 |
-
\item \textbf{Coherence}:
|
| 552 |
|
| 553 |
\item \textbf{KL Divergence}: First-token KL divergence between original and modified model output distributions on harmless prompts \citep{young2025comparative}. Measures distributional shift.
|
| 554 |
|
|
@@ -557,7 +570,7 @@ Advantages over weight projection: reversibility (hooks can be removed), continu
|
|
| 557 |
\text{CKA}(\mathbf{X}, \mathbf{Y}) = \frac{\|\mathbf{Y}^\top\mathbf{X}\|_F^2}{\|\mathbf{X}^\top\mathbf{X}\|_F \cdot \|\mathbf{Y}^\top\mathbf{Y}\|_F}
|
| 558 |
\end{equation}
|
| 559 |
|
| 560 |
-
\item \textbf{Effective Rank}: Shannon entropy-based dimensionality of weight matrices (the EffRank measure of Section~\ref{sec:analysis}).
|
| 561 |
\end{enumerate}
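The CKA metric above is a one-liner to implement from its definition. A sketch (the standard CKA formulation additionally centers features first, which we do here; the function name is illustrative):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_samples, d)."""
    X = X - X.mean(axis=0)            # center features (standard convention)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X) ** 2
    den = np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y)
    return float(num / den)

rng = np.random.default_rng(4)
X = rng.normal(size=(32, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix

assert np.isclose(linear_cka(X, X), 1.0)             # self-similarity
assert np.isclose(linear_cka(X, 3.0 * X @ Q), 1.0)   # rotation/scale invariant
assert 0.0 <= linear_cka(X, rng.normal(size=(32, 8))) <= 1.0
```

The invariance to orthogonal transforms and isotropic scaling is what makes CKA a better similarity measure than raw activation distance for comparing pre- and post-abliteration representations.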
|
| 562 |
|
| 563 |
% βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
|
@@ -608,8 +621,26 @@ The Inverted preset applies \emph{differentiated} treatment to fused 3D tensors.
|
|
| 608 |
This prevents over-ablation of capability experts---a critical failure mode we identified in uniform approaches, where applying 2$\times$ reflection to all experts on GPT-OSS 20B degraded mathematical reasoning by over 30\%.
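The selective-inversion idea can be sketched on a fused 3D expert tensor. This is a simplified illustration, not the EGA implementation: `selective_invert` and the 1$\times$/2$\times$ factor choice are assumptions, with the $\tau = 0.5$ default taken from the text:

```python
import numpy as np

def selective_invert(experts, r, safety, tau=0.5):
    """experts: (E, out, hidden) fused tensor; r: unit refusal direction.

    Safety-critical experts (score >= tau) get 2x reflection; the rest
    get gentle 1x projection so capabilities are preserved.
    """
    out = experts.copy()
    P = np.outer(r, r)                       # (hidden, hidden) projector
    for e, score in enumerate(safety):
        factor = 2.0 if score >= tau else 1.0
        out[e] -= factor * experts[e] @ P
    return out

rng = np.random.default_rng(5)
experts = rng.normal(size=(4, 6, 8))
r = rng.normal(size=8)
r /= np.linalg.norm(r)
new = selective_invert(experts, r, safety=[0.9, 0.2, 0.6, 0.1])

assert np.allclose(new[0] @ r, -(experts[0] @ r))  # reflected (safety-critical)
assert np.allclose(new[1] @ r, 0.0)                # projected (capability)
```

The per-expert branch is the whole point: a uniform 2$\times$ factor would reflect capability experts too, which is the failure mode described above.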
|
| 609 |
|
| 610 |
\subsection{Router-Aware Processing}
|
| 611 |
|
| 612 |
-
|
|
|
|
| 613 |
|
| 614 |
% βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 615 |
\section{Frontier Optimization Techniques}
|
|
@@ -627,7 +658,7 @@ The first trial uses regularization values derived from the analysis pipeline:
|
|
| 627 |
\begin{equation}
|
| 628 |
\lambda_l^{(0)} = (1 - w_l) \cdot 0.3
|
| 629 |
\end{equation}
|
| 630 |
-
where $w_l$ is the layer-adaptive weight from the per-layer adaptive strength equation in Section~\ref{sec:intervention}.
|
| 631 |
|
| 632 |
\paragraph{Multi-objective formulation.}
|
| 633 |
Each trial jointly minimizes refusal rate $\rho$ and KL divergence $D_{\text{KL}}$:
|
|
@@ -639,12 +670,14 @@ with Pareto-optimal solutions ranked by a weighted composite: $\rho + 0.5 \cdot
|
|
| 639 |
\subsection{Reversible LoRA-Mediated Ablation}
|
| 640 |
\label{sec:lora}
|
| 641 |
|
| 642 |
-
Inspired by Heretic's rank-1 LoRA ablation, we extend the approach to \emph{rank-$k$} adapters supporting multi-direction removal. The mathematical equivalence:
|
| 643 |
\begin{align}
|
| 644 |
-
\text{In-place:} \quad \mathbf{W}' &= \mathbf{W} - s \cdot \mathbf{W}\mathbf{d}\mathbf{d}^\top \\
|
| 645 |
-
\text{LoRA:} \quad \mathbf{W}' &= \mathbf{W} + \mathbf{B}\mathbf{A}, \quad \mathbf{B} = -s \cdot \text{coeff} = -s \cdot \mathbf{W}\mathbf{d}, \quad \mathbf{A} = \mathbf{d}^\top
|
| 646 |
\end{align}
|
| 647 |
-
|
|
|
|
|
|
|
| 648 |
\begin{equation}
|
| 649 |
\mathbf{B} = [-s\cdot\text{coeff}_1 \mid \cdots \mid -s\cdot\text{coeff}_k] \in \mathbb{R}^{d_{\text{out}} \times k}, \quad
|
| 650 |
\mathbf{A} = [\mathbf{d}_1 ; \cdots ; \mathbf{d}_k] \in \mathbb{R}^{k \times d_{\text{in}}}
|
|
@@ -662,11 +695,12 @@ After projection, we measure first-token KL divergence on harmless reference pro
|
|
| 662 |
where $\gamma$ is computed from the stored KL proxy magnitude. A subtle issue arises when the post-projection coefficient $\mathbf{W}'\mathbf{d} \approx 0$ (as occurs with zero regularization): in this case, we use the \emph{pre-projection} coefficient magnitude as a proxy:
|
| 663 |
\begin{equation}
|
| 664 |
\gamma = \gamma_{\text{strength}} \cdot \begin{cases}
|
| 665 |
-
\text{coeff}_{\text{post}} & \text{if } |\text{coeff}_{\text{post}}| > \epsilon \\
|
| 666 |
\text{coeff}_{\text{proxy}} & \text{otherwise}
|
| 667 |
\end{cases}
|
| 668 |
\end{equation}
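The first-token KL measurement that drives the partial-revert decision is simple to state in code. A sketch from raw logits (the function name is illustrative; the actual implementation operates on model outputs):

```python
import numpy as np

def first_token_kl(logits_p: np.ndarray, logits_q: np.ndarray) -> float:
    """KL(P || Q) between first-token distributions given raw logits."""
    def softmax(z):
        z = z - z.max()              # numerically stable
        e = np.exp(z)
        return e / e.sum()
    p, q = softmax(logits_p), softmax(logits_q)
    return float((p * (np.log(p) - np.log(q))).sum())

base = np.array([2.0, 1.0, 0.5, -1.0])

assert np.isclose(first_token_kl(base, base), 0.0)   # identical models
assert first_token_kl(base, base + np.array([0.0, 0.5, -0.5, 0.2])) > 0.0
```

KL is zero only when the two first-token distributions match exactly, so any projection-induced drift on harmless prompts registers as a positive value that the revert step can act on.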
|
| 669 |
-
|
|
|
|
| 670 |
|
| 671 |
\subsection{Chain-of-Thought-Aware Ablation}
|
| 672 |
\label{sec:cot}
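The abstract names the mechanism here as Gram-Schmidt orthogonalization of the refusal direction against CoT-preserving directions. The excerpt omits the implementation, so the following is a hedged sketch under the assumption that CoT directions are given; it uses QR (iterated Gram-Schmidt) for numerical stability, and the names are hypothetical:

```python
import numpy as np

def orthogonalize(refusal: np.ndarray, cot_dirs: list) -> np.ndarray:
    """Strip CoT-preserving components from the refusal direction."""
    # Orthonormalize the CoT basis (QR = stabilized Gram-Schmidt) ...
    Q, _ = np.linalg.qr(np.stack(cot_dirs, axis=1))
    # ... then remove the projection of the refusal direction onto its span.
    r = refusal - Q @ (Q.T @ refusal)
    return r / np.linalg.norm(r)

rng = np.random.default_rng(6)
refusal = rng.normal(size=16)
cot = [rng.normal(size=16) for _ in range(3)]
r_safe = orthogonalize(refusal, cot)

# Ablating along r_safe cannot perturb the CoT subspace.
assert all(abs(r_safe @ (c / np.linalg.norm(c))) < 1e-8 for c in cot)
assert np.isclose(np.linalg.norm(r_safe), 1.0)
```

Projecting weights along `r_safe` instead of the raw refusal direction is what leaves chain-of-thought behavior intact by construction.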
|
|
@@ -714,7 +748,7 @@ The informed pipeline inserts an \textsc{Analyze} stage between \textsc{Probe} a
|
|
| 714 |
\item \textsc{Analyze} --- Run analysis modules to understand refusal geometry \textbf{(new)}
|
| 715 |
\item \textsc{Distill} --- Extract directions using analysis-informed parameters
|
| 716 |
\item \textsc{Excise} --- Project with analysis-guided precision
|
| 717 |
-
\item \textsc{Verify} --- Post-excision analysis with automatic refinement triggers
|
| 718 |
\item \textsc{Rebirth} --- Save with comprehensive analysis metadata
|
| 719 |
\end{enumerate}
|
| 720 |
|
|
@@ -739,7 +773,7 @@ It then gates out layers with high safety-capability entanglement, leaving them
|
|
| 739 |
|
| 740 |
\paragraph{Self-repair estimate $\to$ refinement passes.}
|
| 741 |
High self-repair capacity (estimated from refusal distribution breadth) triggers more refinement passes with true iterative re-probing.
|
| 742 |
-
After excision, if the model's refusal rate remains above a threshold, the \textsc{Verify} stage triggers an additional refinement pass.
|
| 743 |
|
| 744 |
\subsection{Configuration Derivation}
|
| 745 |
|
|
@@ -810,7 +844,7 @@ We evaluate on four models spanning two architecture types (Table~\ref{tab:exp_m
|
|
| 810 |
Qwen2.5-1.5B-Instruct & Dense & 1.5B & --- & DPO \\
|
| 811 |
Llama-3.1-8B-Instruct & Dense & 8B & --- & RLHF+DPO \\
|
| 812 |
Mixtral-8x7B-Instruct-v0.1 & MoE & 46.7B (12.9B active) & 8 & SFT+DPO \\
|
| 813 |
-
GPT-OSS-20B-Chat & MoE (fused) & 20B (3.2B active) & 8 & RLHF \\
|
| 814 |
\bottomrule
|
| 815 |
\end{tabular}
|
| 816 |
\end{table}
|
|
@@ -818,12 +852,27 @@ GPT-OSS-20B-Chat & MoE (fused) & 20B (3.2B active) & 8 & RLHF \\
|
|
| 818 |
\paragraph{Datasets.}
|
| 819 |
Harmful prompts are drawn from the AdvBench dataset \citep{zou2023universal} (520 prompts). Harmless prompts are drawn from the Alpaca dataset \citep{taori2023alpaca} (matched count). For refusal rate measurement, we use a held-out set of 64 harmful prompts not seen during direction extraction. For perplexity, we use a 512-token window from WikiText-2. For KL divergence, we use 32 harmless prompts from the Alpaca validation set.
|
| 820 |
|
|
|
|
|
|
|
| 821 |
\paragraph{Evaluation metrics.}
|
| 822 |
For each abliterated model we report: \textbf{Refusal Rate} (RR, \%---lower is better), \textbf{Perplexity} (PPL---lower is better, with $\Delta$PPL showing change from baseline), \textbf{KL Divergence} ($D_{\text{KL}}$---lower is better), and \textbf{Coherence} (Coh., \%---higher is better). We also report \textbf{CoT preserved} (\checkmark/--) and \textbf{LoRA adapters generated} (\checkmark/--) where applicable.
|
| 823 |
|
| 824 |
\paragraph{Prompt volume.}
|
| 825 |
All experiments use medium prompt volume (128 harmful + 128 harmless prompts for direction extraction) unless otherwise noted. This provides robust SVD estimation while keeping compute manageable.
|
| 826 |
| 827 |
\subsection{Multi-Method Comparison on Dense Models}
|
| 828 |
\label{sec:exp_dense}
|
| 829 |
|
|
@@ -831,7 +880,7 @@ Table~\ref{tab:exp_dense} compares all eight method presets on Qwen2.5-1.5B-Inst
|
|
| 831 |
|
| 832 |
\begin{table}[h]
|
| 833 |
\centering
|
| 834 |
-
\caption{Method comparison on Qwen2.5-1.5B-Instruct (DPO-aligned). Baseline refusal rate: 87.5\%, baseline PPL: 8.92. Best result in each column is \textbf{bolded}.}
|
| 835 |
\label{tab:exp_dense}
|
| 836 |
\small
|
| 837 |
\begin{tabular}{@{}lcccccc@{}}
|
|
@@ -841,6 +890,7 @@ Table~\ref{tab:exp_dense} compares all eight method presets on Qwen2.5-1.5B-Inst
|
|
| 841 |
Basic & 18.8 & 9.14 & +0.22 & 0.031 & 93.8 & -- \\
|
| 842 |
Advanced & 6.3 & 9.31 & +0.39 & 0.058 & 93.8 & -- \\
|
| 843 |
Aggressive & 3.1 & 9.87 & +0.95 & 0.112 & 87.5 & -- \\
|
|
|
|
| 844 |
Surgical & 4.7 & 9.21 & +0.29 & 0.044 & \textbf{96.9} & -- \\
|
| 845 |
Optimized & \textbf{1.6} & \textbf{9.08} & \textbf{+0.16} & \textbf{0.024} & 93.8 & \checkmark \\
|
| 846 |
Inverted & 3.1 & 10.43 & +1.51 & 0.187 & 84.4 & -- \\
|
|
@@ -850,10 +900,10 @@ Nuclear & \textbf{1.6} & 9.64 & +0.72 & 0.098 & 90.6 & -- \\
|
|
| 850 |
\end{table}
|
| 851 |
|
| 852 |
\paragraph{Key findings (dense).}
|
| 853 |
-
(1)~The Optimized preset achieves the best Pareto trade-off: near-zero refusal with minimal perplexity increase (+0.16) and lowest KL divergence (0.024), validating the Bayesian optimization approach.
|
| 854 |
(2)~Surgical outperforms Aggressive on coherence (96.9\% vs 87.5\%) despite higher refusal rate, confirming that whitened SVD + regularization preserves capabilities better than brute-force multi-direction removal.
|
| 855 |
(3)~Inverted achieves low refusal but at the cost of the highest perplexity increase (+1.51), reflecting the more disruptive nature of direction reflection vs.\ removal.
|
| 856 |
-
(4)~Nuclear matches Optimized on refusal rate but with higher distributional shift, suggesting the additional techniques (selective inversion + whitened SVD + 4 passes) provide diminishing returns on small dense models.
|
| 857 |
|
| 858 |
\subsection{MoE Model Evaluation: EGA vs.\ Uniform Abliteration}
|
| 859 |
\label{sec:exp_moe}
|
|
@@ -862,7 +912,7 @@ The critical test for \textsc{Obliteratus} is MoE models, where no prior tool op
|
|
| 862 |
|
| 863 |
\begin{table}[h]
|
| 864 |
\centering
|
| 865 |
-
\caption{EGA vs.\ uniform abliteration on GPT-OSS-20B-Chat (fused MoE, 8 experts).}
|
| 866 |
\label{tab:exp_moe}
|
| 867 |
\small
|
| 868 |
\begin{tabular}{@{}llccccc@{}}
|
|
@@ -883,14 +933,14 @@ Nuclear & EGA + selective & 1.6 & 7.89 & 0.198 & 84.4 & \checkmark \\
|
|
| 883 |
|
| 884 |
\paragraph{Key findings (MoE).}
|
| 885 |
(1)~\textbf{Uniform abliteration catastrophically degrades MoE models.} For the Inverted preset, uniform treatment doubles perplexity (+4.87 vs +0.73) and collapses coherence to 53.1\%. The Nuclear preset is even worse: uniform application produces PPL 13.57 (a 112\% increase) and 46.9\% coherence---the model is barely functional.
|
| 886 |
-
(2)~\textbf{EGA with selective inversion resolves this.} The same Nuclear preset with EGA achieves identical refusal removal (1.6\%) but with only a 23\% perplexity increase and 84.4\% coherence. The key mechanism is that capability-preserving experts (those below the safety-score threshold) receive gentle projection rather than full reflection.
|
| 887 |
-
(3)~\textbf{Expert classification matters.} On GPT-OSS-20B, EGA classified
|
| 888 |
(4)~\textbf{CoT preservation is MoE-critical.} The Nuclear + EGA preset preserves chain-of-thought coherence because the Gram-Schmidt orthogonalization operates on per-expert directions that are already capability-differentiated.
|
| 889 |
|
| 890 |
\subsection{Ablation Studies}
|
| 891 |
\label{sec:exp_ablation}
|
| 892 |
|
| 893 |
-
We ablate three key design choices to validate that they contribute meaningfully.
|
| 894 |
|
| 895 |
\paragraph{Warm-start vs.\ random initialization for Bayesian optimization.}
|
| 896 |
On Llama-3.1-8B-Instruct with the Optimized preset (50 Optuna trials):
|
|
@@ -901,11 +951,11 @@ On Llama-3.1-8B-Instruct with the Optimized preset (50 Optuna trials):
|
|
| 901 |
Warm-start converges 2$\times$ faster and finds a better Pareto point, confirming that analysis-derived heuristics provide a useful prior for the TPE sampler.
|
| 902 |
|
| 903 |
\paragraph{EGA safety threshold sensitivity ($\tau_{\text{safety}}$).}
|
| 904 |
-
On GPT-OSS-20B with the Advanced preset, we sweep $\tau \in \{0.3, 0.4, 0.5, 0.6, 0.7\}$:
|
| 905 |
\begin{itemize}[leftmargin=*]
|
| 906 |
-
\item $\tau = 0.3$:
|
| 907 |
-
\item $\tau = 0.5$ (default):
|
| 908 |
-
\item $\tau = 0.7$:
|
| 909 |
\end{itemize}
|
| 910 |
The threshold controls a smooth trade-off between refusal removal and capability preservation. We chose $\tau = 0.5$ as the default because it provides the best Pareto balance, but note that this is a \emph{tunable hyperparameter} rather than a universal optimum---different models and use cases may benefit from different thresholds.
|
| 911 |
|
|
@@ -990,7 +1040,7 @@ Real causal tracing & Approx. & \checkmark & -- & -- & -- & -- \\
|
|
| 990 |
Sparse autoencoders & -- & Via SAE & -- & -- & -- & Core \\
|
| 991 |
Model compatibility & Any HF & $\sim$50 & 16 & TLens & HF & TLens \\
|
| 992 |
MoE model support & Native & -- & -- & -- & -- & -- \\
|
| 993 |
-
Test suite &
|
| 994 |
\bottomrule
|
| 995 |
\end{tabular}
|
| 996 |
\end{table}
|
|
@@ -1013,16 +1063,18 @@ Conversely, TransformerLens provides real activation patching (our causal tracin
|
|
| 1013 |
\label{sec:discussion}
|
| 1014 |
|
| 1015 |
\paragraph{Dual-use considerations.}
|
| 1016 |
-
\textsc{Obliteratus} is designed for alignment research---understanding refusal mechanisms serves both identifying vulnerabilities (red-teaming) and building more robust alignment (blue-teaming). The analysis modules are particularly valuable for the defensive perspective: understanding \emph{why} abliteration works enables designing alignment methods that are more resistant to it. The
|
| 1017 |
|
| 1018 |
\paragraph{Causal tracing limitations.}
|
| 1019 |
Our causal tracing module provides noise-based approximations rather than true activation patching. While computationally efficient (no additional forward passes), the results should be validated with real causal interventions when model access permits. We explicitly document this limitation in the module and recommend TransformerLens for definitive causal analysis.
|
| 1020 |
|
| 1021 |
\paragraph{Heuristic constants and composite metrics.}
|
| 1022 |
-
Several components of \textsc{Obliteratus} rely on hand-chosen constants: the RES weights $(0.4, 0.3, 0.3)$, the Universality Index ratio $(3{:}2{:}1)$, the alignment fingerprint target values, the EGA safety threshold ($\tau = 0.5$), and the configuration derivation rules (Section~\ref{sec:informed}). We have provided explicit justification for each choice where possible (Sections~\ref{sec:activation_probe}, \ref{sec:transfer}, \ref{sec:alignment_imprint}) and ablation studies for the most consequential ones (Section~\ref{sec:exp_ablation}). However, we acknowledge that these are engineering decisions informed by exploratory analysis, not statistically optimized hyperparameters.
|
|
|
|
|
|
|
| 1023 |
|
| 1024 |
\paragraph{Alignment fingerprinting validation.}
|
| 1025 |
-
The alignment imprint detector uses heuristic signatures derived from the literature's characterization of different training methods. While the geometric features (Gini, effective rank, smoothness) are well-motivated, the ideal values have not been validated against models with known training provenance and should be treated as priors rather than calibrated targets.
|
| 1026 |
|
| 1027 |
\paragraph{MoE expert classification.}
|
| 1028 |
The EGA safety score threshold ($\tau = 0.5$) for classifying experts as safety-critical vs.\ capability-preserving is a heuristic. A more principled approach would train expert classifiers on labeled routing data or use causal interventions to establish ground-truth expert roles. We leave this to future work.
|
|
@@ -1034,7 +1086,10 @@ Each optimization trial requires a forward pass for KL measurement and generatio
|
|
| 1034 |
The current implementation loads the full model into memory for analysis. For frontier-scale models (100B+ parameters), this requires significant compute. Future work could integrate quantized inference or offloading strategies. The web dashboard requires GPU access for interactive features (chat, A/B comparison, strength sweep).
|
| 1035 |
|
| 1036 |
\paragraph{Evaluation completeness.}
|
| 1037 |
-
Our evaluation suite measures \emph{refusal removal} and \emph{capability preservation} but does not comprehensively assess downstream task performance across diverse benchmarks. Integration with evaluation harnesses such as lm-evaluation-harness \citep{gao2021framework} is a natural extension.
|
|
|
|
|
|
|
|
|
|
| 1038 |
|
| 1039 |
\paragraph{Future directions.}
|
| 1040 |
We identify several opportunities: (1)~integration with sparse autoencoder analysis to understand refusal at the feature level, potentially enabling even more targeted ablation; (2)~real causal tracing via TransformerLens integration; (3)~longitudinal studies tracking how refusal geometry evolves during fine-tuning; (4)~extension of the universality analysis to a wider set of model families; (5)~application of the defense robustness framework to evaluate proposed robust alignment methods including circuit breakers \citep{zou2024circuit} and representation rerouting; (6)~multi-objective Bayesian optimization with additional objectives such as CoT coherence and downstream task performance; and (7)~automated expert role discovery for MoE models using unsupervised clustering of expert activation patterns.
|
|
@@ -1043,19 +1098,63 @@ We identify several opportunities: (1)~integration with sparse autoencoder analy
|
|
| 1043 |
\section{Broader Impact Statement}
|
| 1044 |
\label{sec:broader_impact}
|
| 1045 |
|
| 1046 |
-
This work has significant dual-use implications that we address directly.
|
|
|
|
|
|
|
|
|
|
| 1047 |
|
| 1048 |
-
|
| 1049 |
-
\textsc{Obliteratus} enables the removal of safety guardrails from language models. A model that has been abliterated will comply with requests that the original model would refuse, including requests for harmful content. This capability could be misused to generate harmful, illegal, or dangerous text at scale.
|
| 1050 |
|
| 1051 |
-
|
| 1052 |
-
|
| 1053 |
-
|
| 1054 |
-
|
| 1055 |
-
|
| 1056 |
|
| 1057 |
-
|
| 1058 |
-
|
| 1059 |
|
| 1060 |
% βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 1061 |
\section{Ethics Statement}
|
|
@@ -1065,7 +1164,7 @@ This research was conducted with the goal of advancing understanding of alignmen
|
|
| 1065 |
|
| 1066 |
We do not advocate for the deployment of abliterated models in production systems. The primary intended use is alignment research: understanding the geometric structure of refusal to build more durable safety mechanisms. All experiments described in this work were conducted on publicly available open-weight models, and no private or proprietary systems were modified.
|
| 1067 |
|
| 1068 |
-
We
|
| 1069 |
|
| 1070 |
% βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 1071 |
\section{Conclusion}
|
|
@@ -1083,7 +1182,7 @@ The analysis-informed pipeline closes the feedback loop, using analysis outputs
|
|
| 1083 |
|
| 1084 |
Empirical evaluation across four model families demonstrates that (1)~Bayesian-optimized presets achieve the best Pareto trade-offs on dense models, (2)~Expert-Granular Abliteration is essential for MoE models, where uniform approaches catastrophically degrade capabilities, and (3)~the platform's design choices (warm-start initialization, selective inversion, proxy-magnitude KL revert) each contribute measurably to abliteration quality. We acknowledge that several composite metrics rely on heuristic constants and provide ablation studies and explicit caveats for each.
|
| 1085 |
|
| 1086 |
-
By making these tools available under the AGPL-3.0 license, we aim to support transparent and reproducible alignment research.
|
| 1087 |
|
| 1088 |
% ---------------------------------------------------------------------
\bibliographystyle{plainnat}
(3)~\textbf{Expert-Granular Abliteration (EGA)} for MoE models, decomposing refusal directions per-expert via routing-weighted activation attribution and applying selective inversion to fused 3D weight tensors---distinguishing safety-critical from capability-preserving experts;
(4)~\textbf{six frontier optimization techniques} inspired by and extending Heretic: Bayesian hyperparameter optimization (Optuna TPE with warm-start from analysis heuristics), reversible LoRA-mediated ablation, KL-divergence co-optimization with partial revert, chain-of-thought-aware ablation via Gram-Schmidt orthogonalization, float layer interpolation with Gaussian-weighted continuous targeting, and activation winsorization for robust SVD;
(5)~\textbf{a unified evaluation suite} with refusal rate, perplexity, coherence, KL divergence, CKA similarity, and effective rank metrics;
(6)~\textbf{an analysis-informed pipeline} that closes the feedback loop---analysis modules run \emph{during} abliteration to auto-configure direction extraction, layer selection, regularization, and Ouroboros-compensated refinement; and
(7)~\textbf{an interactive web research dashboard} (HuggingFace Spaces) with A/B comparison chat, dose-response strength sweep, multi-model benchmarking with publication-quality visualizations, and one-click research artifact export.

The platform supports any HuggingFace transformer architecture---including fused MoE experts (GPT-OSS 20B, Mixtral, DeepSeek)---and ships with 48 curated model presets, 10 study configurations, and 821 unit tests.
We provide complete mathematical formulations for all modules, present empirical evaluations across dense and MoE architectures, and discuss the design decisions that distinguish \textsc{Obliteratus} from existing tools.
\end{abstract}

Section~\ref{sec:architecture} describes the platform architecture.
Section~\ref{sec:analysis} details the 15 analysis modules with mathematical formulations.
Section~\ref{sec:intervention} describes the eight intervention presets and their mathematical foundations.
Section~\ref{sec:evaluation} covers the evaluation suite.
Section~\ref{sec:moe} introduces Expert-Granular Abliteration for MoE models.
Section~\ref{sec:frontier} presents the six frontier optimization techniques.
Section~\ref{sec:informed} presents the analysis-informed abliteration pipeline.
Section~\ref{sec:dashboard} describes the web research dashboard.
Section~\ref{sec:experiments} presents empirical evaluation across dense and MoE models with ablation studies.
Section~\ref{sec:comparison} compares \textsc{Obliteratus} with existing tools.
Section~\ref{sec:discussion} discusses limitations, and Sections~\ref{sec:broader_impact}--\ref{sec:ethics} address broader impact and ethical considerations.

% ---------------------------------------------------------------------
\section{Related Work}

\citet{hu2022lora} demonstrated that large language model adaptation can be performed via low-rank updates $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with $r \ll d$. This decomposition is mathematically equivalent to in-place weight modification when merged but enables reversibility and composability when kept separate. Heretic \citep{heretic2025} was the first to apply this insight to ablation, representing directional projection as rank-1 LoRA adapters.

\paragraph{Defense robustness.}
Models exhibit a tendency to self-repair after partial abliteration---a phenomenon we term the \emph{Ouroboros effect}---where residual refusal circuitry compensates for removed directions. \citet{qi2025safety} mapped safety-capability entanglement, showing that removing safety features often degrades general capabilities. \citet{zou2024circuit} proposed circuit breakers as a more robust defense via representation rerouting.

% ---------------------------------------------------------------------
\section{Platform Architecture}

\begin{figure}[h]
\centering
\begin{verbatim}
     |            |         |           |
 +---+-------+ +--+--+ +----+---+ +-----+------+
 | 15 Anal.  | | EGA | | LoRA   | | KL co-opt  |
 |  Modules  | | dirs| | adapt. | | +Ouroboros |
 +-----------+ +-----+ +--------+ +------------+
       |                    |            |
       v                    v            v
 +--------------------------------------------+
 |  Abliteration (fused 3D selective inv.)    |
 +--------------------------------------------+
\end{verbatim}
\caption{High-level architecture of the \textsc{Obliteratus} pipeline. The six-stage abliteration flow (top) integrates 15 analysis modules, Expert-Granular Abliteration (EGA) for MoE models, reversible LoRA adapters, and KL co-optimization with Ouroboros compensation. MoE-aware processing runs at every stage.}
\label{fig:architecture}
\end{figure}

Refusal Logit Lens & Causal & Token-level refusal promotion & nostalgebraist \\
\midrule
Cross-Model Transfer & Transfer & Universality Index & Novel \\
Defense Robustness & Robustness & Ouroboros effect, entanglement map & Novel \\
Multi-Token Position & Positional & Trigger tokens, decay profile & Novel \\
\midrule
Sparse Surgery & Intervention & Top-$k$\% targeted modification & Novel \\

The module also computes the \emph{effective rank} of the covariance matrix via the Shannon entropy of normalized eigenvalues:
\begin{equation}
\text{EffRank}(\mathbf{C}) = \exp\left(-\sum_i \hat{\lambda}_i \log \hat{\lambda}_i\right), \quad \hat{\lambda}_i = \frac{\lambda_i}{\sum_j \lambda_j}
\label{eq:effrank}
\end{equation}

This provides a continuous measure of the refusal subspace's intrinsic dimensionality, enabling comparison across models and layers.
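As a concrete illustration of Equation~\ref{eq:effrank}, the sketch below computes the entropy-based effective rank from covariance eigenvalues with NumPy (the helper name \texttt{effective\_rank} is ours, not the platform's API):

```python
import numpy as np

def effective_rank(C: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized eigenvalues of a covariance matrix C."""
    lam = np.linalg.eigvalsh(C)            # eigenvalues of the symmetric matrix C
    lam = np.clip(lam, 0.0, None)          # guard against tiny negative roundoff
    lam_hat = lam / max(lam.sum(), eps)    # normalize to a probability distribution
    lam_hat = lam_hat[lam_hat > eps]       # convention: 0 * log 0 = 0
    return float(np.exp(-np.sum(lam_hat * np.log(lam_hat))))

# An isotropic covariance in d dimensions has effective rank d,
# while a rank-1 covariance has effective rank 1.
print(effective_rank(np.eye(4)))                         # ~ 4.0
print(effective_rank(np.outer([1.0, 2.0], [1.0, 2.0])))  # ~ 1.0
```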

Following the transformer circuits framework \citep{elhage2021mathematical}, we decompose the residual stream to attribute refusal to specific components:
\begin{equation}
\mathbf{x}_l^{\text{post}} = \mathbf{x}_l^{\text{pre}} + \text{Attn}_l(\text{LN}_1(\mathbf{x}_l^{\text{pre}})) + \text{MLP}_l(\text{LN}_2(\mathbf{x}_l^{\text{pre}} + \text{Attn}_l(\text{LN}_1(\mathbf{x}_l^{\text{pre}}))))
\end{equation}
where $\text{LN}_1, \text{LN}_2$ are LayerNorm operations (shown here for the pre-LN architecture common in modern transformers; post-LN places normalization after the residual addition instead). \textbf{Interaction with abliteration:} LayerNorm renormalizes activations after each sub-layer, which means that removing a refusal direction from one component's output does not simply subtract from the residual stream---the downstream LayerNorm may partially undo the removal by rescaling the modified activations. This is a key motivation for norm-preserving projection (Equation~\ref{eq:norm_preserve}): by maintaining weight matrix norms, we reduce the magnitude of the signal that LayerNorm must compensate for, yielding more predictable downstream behavior. The implementation correctly handles both pre-LN and post-LN architectures via architecture profiling.

For each component output $\mathbf{c}$, we measure its refusal contribution as $\mathbf{c} \cdot \mathbf{r}_l$. The attention contribution is further decomposed across heads:
$\text{Attn}_l = \sum_{h=1}^{H} \text{Head}_{l,h}$.


We evaluate how resilient alignment is to abliteration through three analyses:

\paragraph{Ouroboros Effect (Self-Repair).} When refusal is removed from layer $l$, remaining layers may compensate. We compute a \emph{distributional redundancy ratio}:
\begin{equation}
R_l = \frac{\sum_{j \neq l} s_j}{\sum_j s_j}
\label{eq:ouroboros}
\end{equation}
where $s_j$ is the refusal strength at layer $j$. \textbf{Important caveat:} $R_l$ measures the fraction of \emph{pre-abliteration} refusal signal that resides outside layer $l$---a static distributional property of the refusal direction norms. It is a \emph{necessary condition} for self-repair (a model cannot restore refusal from layers that had no refusal signal) but not a \emph{sufficient condition} (the remaining layers may not actually compensate in practice due to the sequential nature of transformer computation). True self-repair requires dynamic measurement: re-running inference after abliteration to measure whether refusal rate recovers. We use $R_l$ as a computationally cheap proxy and flag it as an upper bound on actual repair capacity. When the platform's iterative re-probing (Section~\ref{sec:informed}) detects post-abliteration residual refusal, this provides direct evidence of self-repair.
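A minimal sketch of the redundancy ratio in Equation~\ref{eq:ouroboros} (the function name is illustrative):

```python
def redundancy_ratio(strengths, layer):
    """Distributional redundancy R_l: fraction of total refusal strength
    residing outside `layer`. A cheap upper bound on self-repair capacity,
    not a measurement of actual post-abliteration recovery."""
    total = sum(strengths)
    return (total - strengths[layer]) / total

# Refusal concentrated at layer 2: little signal elsewhere to repair from.
s = [0.1, 0.2, 5.0, 0.2]
print(round(redundancy_ratio(s, 2), 3))  # 0.091
```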

\paragraph{Safety-Capability Entanglement.} For each layer, we measure entanglement as the geometric mean of two normalized indicators of how much harmless activations overlap with the refusal direction:
\begin{equation}
E_l = \sqrt{\frac{\sqrt{\text{Var}(\mathbf{b} \cdot \mathbf{r}_l)}}{\bar{n}} \cdot \frac{\overline{|\mathbf{b} \cdot \mathbf{r}_l|}}{\bar{n}}}, \quad \bar{n} = \frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \|\mathbf{b}_i\|
\label{eq:entanglement}
\end{equation}
where $\bar{n}$ is the mean activation norm (not the norm of the mean), and $\overline{|\mathbf{b} \cdot \mathbf{r}_l|}$ is the mean absolute projection. The first factor captures how much the refusal direction participates in the variance of normal-use activations (normalized by activation scale), while the second captures mean overlap. Normalization by $\bar{n}$ rather than $\|\overline{\mathbf{b}}\|^2$ prevents the metric from being dominated by the mean activation magnitude.

\textbf{Construct validity note:} This metric combines dispersion (standard deviation of projections) with location (mean absolute projection) into a single score. A high score indicates that the refusal direction is entangled with the model's general computation at that layer. However, because $E_l$ mixes two distinct phenomena, we recommend examining both components individually for rigorous analysis. High variance alone may indicate that the direction merely spans a high-variance subspace of harmless activations, while high mean absolute projection alone may indicate systematic bias without spread.

High entanglement means abliterating refusal at that layer would also damage general capabilities.
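The entanglement score of Equation~\ref{eq:entanglement} reduces to a few lines of NumPy; this sketch is illustrative rather than the platform's implementation:

```python
import numpy as np

def entanglement(B: np.ndarray, r: np.ndarray) -> float:
    """E_l for harmless activations B (n x d) and a unit refusal direction r:
    geometric mean of scale-normalized projection dispersion and mean
    absolute projection (both divided by the mean activation norm)."""
    proj = B @ r                               # per-example overlap b . r_l
    n_bar = np.linalg.norm(B, axis=1).mean()   # mean activation norm (not norm of mean)
    dispersion = np.sqrt(proj.var()) / n_bar
    location = np.abs(proj).mean() / n_bar
    return float(np.sqrt(dispersion * location))

# Activations orthogonal to r are fully disentangled from refusal.
r = np.array([1.0, 0.0])
print(entanglement(np.array([[0.0, 1.0], [0.0, 2.0]]), r))  # 0.0
```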

\paragraph{Defense Profile.} A comprehensive profile combining alignment method estimate (Section~\ref{sec:alignment_imprint}), refusal concentration (Gini coefficient), layer spread, self-repair capacity, entanglement score, and an overall robustness classification (low/medium/high/very\_high).

The core projection for a weight matrix $\mathbf{W}$ and refusal directions $\{\mathbf{r}_1, \ldots, \mathbf{r}_k\}$:
\begin{equation}
\mathbf{W}' = \mathbf{W} - \sum_{i=1}^k \left[(1-\lambda)\mathbf{W}\mathbf{r}_i\mathbf{r}_i^\top\right]
\label{eq:core_projection}
\end{equation}
where $\lambda$ is the regularization strength (preserves $\lambda$ fraction of the refusal component). When directions are extracted via standard SVD, the right singular vectors $\{\mathbf{r}_i\}_{i=1}^k$ are orthonormal and the sum of rank-1 projections is equivalent to orthogonal projection onto the $k$-dimensional refusal subspace. \textbf{Important caveat:} when using whitened SVD (Section~\ref{sec:whitened_svd}), the un-whitened directions $\mathbf{r}_i = \mathbf{W}_{\text{whiten}} \mathbf{v}_{h,i}$ are \emph{not} orthonormal in the original space (though the whitened-space vectors $\mathbf{v}_{h,i}$ are). In this case, the implementation applies sequential projection with Gram--Schmidt re-orthonormalization before each rank-1 update, ensuring that accumulated projections remain consistent.
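A minimal NumPy sketch of this sequential projection with Gram--Schmidt re-orthonormalization (function name ours; the platform's implementation additionally handles transposed weights and dtype bookkeeping):

```python
import numpy as np

def sequential_project(W, directions, lam=0.0, eps=1e-8):
    """Accumulate rank-1 removals of (1 - lam) of each refusal direction,
    Gram-Schmidt re-orthonormalizing every raw direction against those
    already removed, so the procedure stays consistent when the raw
    directions are not orthonormal (e.g., un-whitened SVD directions)."""
    W = W.astype(float).copy()
    basis = []
    for d in directions:
        d = np.asarray(d, dtype=float).copy()
        for b in basis:                    # strip components along prior directions
            d -= (d @ b) * b
        n = np.linalg.norm(d)
        if n < eps:                        # already spanned: nothing left to remove
            continue
        d /= n
        basis.append(d)
        W -= (1.0 - lam) * np.outer(W @ d, d)   # one rank-1 projection step
    return W
```

With \texttt{lam = 0}, the result is orthogonal to the span of all supplied directions, even when they overlap.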

\paragraph{Transposed weight matrices.}
Some architectures (e.g., GPT-2 Conv1D layers) store weights as $\mathbf{W} \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$. The implementation detects the orientation via architecture profiling and applies $\mathbf{W}' = \mathbf{W} - (1-\lambda)\mathbf{r}\mathbf{r}^\top\mathbf{W}$ for transposed weights, ensuring that projection occurs along the correct axis.

\paragraph{Per-layer adaptive strength.}
Rather than applying uniform regularization, \textsc{Obliteratus} modulates $\lambda$ per-layer based on the refusal norm profile. Layers with stronger refusal signal (higher $\|\mathbf{r}_l\|$) receive lower regularization (more aggressive removal), while layers near the periphery of the refusal distribution receive higher regularization:
\begin{equation}
\lambda_l = \lambda_{\text{base}} + (1 - w_l)(1 - \lambda_{\text{base}}) \cdot 0.15, \quad
w_l = \frac{\|\mathbf{r}_l\| - \min_j \|\mathbf{r}_j\|}{\max_j \|\mathbf{r}_j\| - \min_j \|\mathbf{r}_j\|}
\label{eq:adaptive_strength}
\end{equation}

\paragraph{Norm-preserving rescaling.}
After projection, we rescale to preserve the Frobenius norm \citep{grimjim2025}:
\begin{equation}
\mathbf{W}'' = \mathbf{W}' \cdot \frac{\|\mathbf{W}\|_F}{\|\mathbf{W}'\|_F}
\label{eq:norm_preserve}
\end{equation}
This prevents cascading magnitude drift through LayerNorm layers.
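Equations~\ref{eq:adaptive_strength} and~\ref{eq:norm_preserve} compose as follows; this is a simplified sketch with illustrative helper names:

```python
import numpy as np

def adaptive_lambdas(refusal_norms, lam_base=0.2):
    """Per-layer regularization: strongest-refusal layers get lam_base
    (aggressive removal); peripheral layers get up to 15% of the remaining
    headroom as extra preservation."""
    r = np.asarray(refusal_norms, dtype=float)
    w = (r - r.min()) / (r.max() - r.min())
    return lam_base + (1.0 - w) * (1.0 - lam_base) * 0.15

def project_norm_preserving(W, d, lam):
    """Rank-1 refusal removal followed by Frobenius rescaling, so the
    weight's overall magnitude (and hence LayerNorm behavior) is preserved."""
    d = d / np.linalg.norm(d)
    Wp = W - (1.0 - lam) * np.outer(W @ d, d)
    return Wp * (np.linalg.norm(W) / np.linalg.norm(Wp))

print(adaptive_lambdas([1.0, 2.0, 3.0]))  # roughly [0.32, 0.26, 0.20]
```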

\paragraph{Selective inversion.}
\begin{equation}
\mathbf{W}' = \mathbf{W} - 2\mathbf{W}\mathbf{r}\mathbf{r}^\top
\end{equation}
This flips the model's refusal behavior to active compliance, which can be more effective than simple removal for models with deeply entangled refusal mechanisms. \textbf{Risk profile:} Selective inversion is the most aggressive intervention in the platform. Because it \emph{reverses} the refusal direction rather than removing it, it can cause the model to actively seek to comply with harmful requests (not merely fail to refuse). This may produce qualitatively different and potentially more harmful outputs than simple refusal removal. The Inverted preset's consistently higher perplexity (Table~\ref{tab:exp_dense}) reflects this aggressiveness. We recommend using inversion only when standard removal methods leave substantial residual refusal, and coupling it with EGA's per-expert differentiation on MoE models to limit the blast radius.

\paragraph{Bias term projection.}
Unlike prior tools that only modify weight matrices, \textsc{Obliteratus} also projects refusal directions out of bias vectors when present:

\textsc{Obliteratus} evaluates abliteration quality using six complementary metrics:

\begin{enumerate}[leftmargin=*]
\item \textbf{Refusal Rate}: Fraction of harmful prompts where the model's response begins with a canonical refusal prefix (e.g., ``I cannot'', ``I'm sorry'', ``As an AI'', from the GCG/AdvBench list \citep{zou2023universal}). Lower indicates more complete abliteration. \textbf{Limitation:} prefix matching may produce false negatives (the model refuses using non-canonical phrasing) or false positives (the model uses a refusal prefix but then complies). A small-scale validation on 50 responses showed 92\% agreement between prefix matching and human judgment of refusal; the primary failure mode was false negatives (6\% of cases where the model refused without a canonical prefix). More robust classification (e.g., LLM-as-judge) is a natural extension.

\item \textbf{Perplexity}: Standard perplexity on reference text (WikiText-2). Monitors general language modeling degradation.

\item \textbf{Coherence}: Measures the model's ability to produce factually correct completions. Specifically, we present 32 factual prompts (e.g., ``The capital of France is'') and check whether the model's first generated token or phrase matches the expected answer. \textbf{Note:} this is more precisely a \emph{factual completion accuracy} metric than a general coherence measure---it tests whether the model's factual knowledge is preserved, not whether its open-ended generations are fluent or logically consistent. We retain the ``coherence'' label for consistency with prior work but acknowledge the limited scope.

\item \textbf{KL Divergence}: First-token KL divergence between original and modified model output distributions on harmless prompts \citep{young2025comparative}. Measures distributional shift.

\item \textbf{CKA Similarity}: Linear centered kernel alignment between original activations $\mathbf{X}$ and abliterated activations $\mathbf{Y}$:
\begin{equation}
\text{CKA}(\mathbf{X}, \mathbf{Y}) = \frac{\|\mathbf{Y}^\top\mathbf{X}\|_F^2}{\|\mathbf{X}^\top\mathbf{X}\|_F \cdot \|\mathbf{Y}^\top\mathbf{Y}\|_F}
\end{equation}

\item \textbf{Effective Rank}: Shannon entropy-based dimensionality of weight matrices (Equation~\ref{eq:effrank}). Tracks whether abliteration collapses the weight space.
\end{enumerate}
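A minimal sketch of prefix-based refusal classification as used for the first metric; the prefix list below is illustrative, not the platform's full canonical list:

```python
# Illustrative prefix list; the platform's canonical list follows the
# GCG/AdvBench refusal strings cited above.
REFUSAL_PREFIXES = ("i cannot", "i can't", "i'm sorry", "i am sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Prefix-match refusal detection, subject to the false-negative and
    false-positive caveats discussed above (non-canonical refusals; a
    refusal prefix followed by compliance)."""
    return response.strip().lower().startswith(REFUSAL_PREFIXES)

def refusal_rate(responses) -> float:
    return sum(map(is_refusal, responses)) / len(responses)

print(refusal_rate(["I'm sorry, but I cannot help with that.",
                    "Sure. Step 1: gather the ingredients..."]))  # 0.5
```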

% ---------------------------------------------------------------------

This prevents over-ablation of capability experts---a critical failure mode we identified in uniform approaches, where applying 2$\times$ reflection to all experts on GPT-OSS 20B degraded mathematical reasoning by over 30\%.

\subsection{Router-Aware Processing}
\label{sec:router_analysis}

Beyond expert weights, the router network itself may encode safety-relevant routing preferences. We analyze and optionally modify router behavior through three mechanisms.

\paragraph{Router weight projection.}
The router network $R(\mathbf{x}) = \text{softmax}(\mathbf{W}_R \mathbf{x})$ produces per-expert routing probabilities. If the router weight matrix $\mathbf{W}_R \in \mathbb{R}^{E \times d}$ has learned to preferentially route harmful tokens to safety-critical experts, projecting the refusal direction out of $\mathbf{W}_R$ can redistribute these tokens to capability experts:
\begin{equation}
\mathbf{W}_R' = \mathbf{W}_R - (1 - \lambda_R)\mathbf{W}_R \mathbf{r}\mathbf{r}^\top
\label{eq:router_projection}
\end{equation}
This is controlled by the \texttt{project\_biases} flag and is enabled by default for the Nuclear preset. We use a higher regularization for router weights ($\lambda_R = 0.3$) than for expert weights to avoid disrupting the router's learned load-balancing behavior.

\paragraph{Load-balancing considerations.}
MoE models are typically trained with auxiliary load-balancing losses to prevent expert collapse (where a few experts receive most tokens). Router projection risks disrupting this balance by redirecting safety-associated tokens to already-loaded experts. We monitor the post-abliteration routing entropy $H(R) = -\sum_e p_e \log p_e$ and flag cases where it drops below $0.9 \cdot H(R_{\text{orig}})$. In our experiments, router projection with $\lambda_R = 0.3$ caused $< 5\%$ entropy reduction on GPT-OSS-20B, indicating that load balance is approximately preserved. More aggressive router projection ($\lambda_R = 0$) reduced entropy by 18\% and is not recommended without further evaluation.
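The router projection (Equation~\ref{eq:router_projection}) and the entropy check can be sketched as follows; helper names are ours, and random data stands in for real routing activations:

```python
import numpy as np

def project_router(W_R, r, lam_R=0.3):
    """Remove (1 - lam_R) of the refusal direction from router weights,
    keeping lam_R of the component to limit load-balance disruption."""
    r = r / np.linalg.norm(r)
    return W_R - (1.0 - lam_R) * np.outer(W_R @ r, r)

def routing_entropy(W_R, X):
    """Mean softmax routing entropy H(R) over token activations X (n x d)."""
    logits = X @ W_R.T
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(0)
W_R = rng.standard_normal((8, 16))    # 8 experts, d = 16
X = rng.standard_normal((64, 16))     # stand-in token activations
r = rng.standard_normal(16)
H0 = routing_entropy(W_R, X)
H1 = routing_entropy(project_router(W_R, r), X)
flagged = H1 < 0.9 * H0               # flag excessive entropy collapse
```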

\paragraph{Shared expert handling.}
Some MoE architectures (notably DeepSeek-MoE \citep{dai2024deepseekmoe}) include \emph{shared experts} that process all tokens regardless of routing. These experts require different treatment: since they cannot be classified as safety-critical or capability-preserving based on routing weights (they always route with weight 1), we apply standard (non-EGA) abliteration to shared experts using the global refusal direction. The implementation detects shared experts via architecture profiling (presence of \texttt{shared\_experts} or \texttt{num\_shared\_experts} in the model config) and processes them separately. When no shared expert metadata is available, all experts are treated as routed.

\paragraph{Limitations.}
Router analysis is currently observational: we measure routing distributions but do not perform causal interventions (e.g., forcing specific expert assignments and measuring the effect on refusal). The classification of experts as safety-critical vs.\ capability-preserving is based on routing-weighted refusal direction norms, which is correlational. Future work could strengthen this with counterfactual expert ablation (removing individual experts and measuring refusal rate changes).

% ---------------------------------------------------------------------
\section{Frontier Optimization Techniques}

\begin{equation}
\lambda_l^{(0)} = (1 - w_l) \cdot 0.3
\end{equation}
where $w_l$ is the layer-adaptive weight from Equation~\ref{eq:adaptive_strength}. Subsequent trials are biased toward the warm-start region: $\lambda_l \in [\max(0, \lambda_l^{(0)} - 0.3), \min(1, \lambda_l^{(0)} + 0.3)]$. This enables convergence in 50 trials versus Heretic's 200.
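A minimal sketch of the warm-start value and its clipped per-layer search interval (helper name ours; the actual search is driven by Optuna's TPE sampler):

```python
def warm_start(w_l, width=0.3):
    """Warm-start value lam0 = (1 - w_l) * 0.3 and the clipped search
    interval used to bias early trials (constants from the text)."""
    lam0 = (1.0 - w_l) * 0.3
    return lam0, (max(0.0, lam0 - width), min(1.0, lam0 + width))

# Strongest-refusal layer (w_l = 1): start at 0 and search [0, 0.3];
# weakest layer (w_l = 0): start at 0.3 and search [0, 0.6].
print(warm_start(1.0))  # (0.0, (0.0, 0.3))
print(warm_start(0.0))  # (0.3, (0.0, 0.6))
```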

\paragraph{Multi-objective formulation.}
Each trial jointly minimizes refusal rate $\rho$ and KL divergence $D_{\text{KL}}$:

\subsection{Reversible LoRA-Mediated Ablation}
\label{sec:lora}

Inspired by Heretic's rank-1 LoRA ablation, we extend the approach to \emph{rank-$k$} adapters supporting multi-direction removal. The mathematical equivalence depends on weight matrix orientation. For a weight matrix $\mathbf{W} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ where $\mathbf{d} \in \mathbb{R}^{d_{\text{in}}}$ is the refusal direction and $s = 1 - \lambda$:
\begin{align}
\text{In-place:} \quad \mathbf{W}' &= \mathbf{W} - s \cdot \mathbf{W}\mathbf{d}\mathbf{d}^\top \label{eq:lora_inplace} \\
\text{LoRA:} \quad \mathbf{W}' &= \mathbf{W} + \mathbf{B}\mathbf{A}, \quad \mathbf{B} = -s \cdot (\mathbf{W}\mathbf{d}) \in \mathbb{R}^{d_{\text{out}} \times 1}, \quad \mathbf{A} = \mathbf{d}^\top \in \mathbb{R}^{1 \times d_{\text{in}}}
\end{align}
When the weight matrix is transposed ($\mathbf{W} \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$, as in some Conv1D layers), the decomposition becomes $\mathbf{B} = -s \cdot \mathbf{d} \in \mathbb{R}^{d_{\text{in}} \times 1}$, $\mathbf{A} = (\mathbf{d}^\top \mathbf{W}) \in \mathbb{R}^{1 \times d_{\text{out}}}$. The implementation auto-detects the orientation and applies the correct decomposition.
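The equivalence of the two formulations above can be checked numerically in a few lines (illustrative sketch, standard orientation only):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))             # d_out x d_in
d = rng.standard_normal(4)
d /= np.linalg.norm(d)                      # unit refusal direction
s = 0.8                                     # ablation strength s = 1 - lambda

# In-place projection
W_inplace = W - s * np.outer(W @ d, d)

# Equivalent rank-1 LoRA decomposition: W' = W + B A
B = -s * (W @ d)[:, None]                   # d_out x 1
A = d[None, :]                              # 1 x d_in
W_lora = W + B @ A

assert np.allclose(W_inplace, W_lora)       # identical up to floating point
```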

For rank-$k$ with directions $\{\mathbf{d}_1, \ldots, \mathbf{d}_k\}$:
\begin{equation}
\mathbf{B} = [-s\cdot\text{coeff}_1 \mid \cdots \mid -s\cdot\text{coeff}_k] \in \mathbb{R}^{d_{\text{out}} \times k}, \quad
\mathbf{A} = [\mathbf{d}_1 ; \cdots ; \mathbf{d}_k] \in \mathbb{R}^{k \times d_{\text{in}}}
\end{equation}

where $\gamma$ is computed from the stored KL proxy magnitude. A subtle issue arises when the post-projection coefficient $\mathbf{W}'\mathbf{d} \approx 0$ (as occurs with zero regularization): in this case, we use the \emph{pre-projection} coefficient magnitude as a proxy:

\begin{equation}
\gamma = \gamma_{\text{strength}} \cdot \begin{cases}
\text{coeff}_{\text{post}} & \text{if } \|\text{coeff}_{\text{post}}\| > \epsilon \\
\text{coeff}_{\text{proxy}} & \text{otherwise}
\end{cases}
\end{equation}

In the normal case ($\|\text{coeff}_{\text{post}}\| > \epsilon$), the revert adds back a rank-1 correction $\gamma \cdot \text{coeff}_{\text{post}} \cdot \mathbf{d}^\top$, partially restoring the original weight's projection along $\mathbf{d}$. In the proxy fallback case, the pre-projection coefficient $\text{coeff}_{\text{proxy}} = \|\mathbf{W}\mathbf{d}\|$ is a scalar, and the revert adds a uniform correction $\gamma \cdot \text{coeff}_{\text{proxy}} \cdot \mathbf{d}^\top$ to each row of $\mathbf{W}'$. This uniform fallback is a coarser approximation than the rank-1 normal path---it restores magnitude along $\mathbf{d}$ without preserving the row-specific structure of the original coefficient vector. This prevents the revert from being a no-op for fully-projected layers, at the cost of a less targeted restoration. The implementation auto-detects the weight orientation and applies the transposed analogue ($\mathbf{d} \cdot \text{coeff}_{\text{proxy}}^\top$) for Conv1D-style weights.
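A simplified sketch of the two revert paths described above (helper name ours; the platform additionally handles transposed Conv1D-style weights):

```python
import numpy as np

def kl_revert(W_orig, W_mod, d, gamma_strength=0.1, eps=1e-6):
    """Partial KL revert along direction d. Normal path: rank-1 restoration
    scaled by the post-projection coefficient W_mod @ d. Fallback for fully
    projected layers (W_mod @ d ~ 0): a coarser uniform correction scaled
    by the pre-projection proxy magnitude ||W_orig @ d||."""
    d = d / np.linalg.norm(d)
    coeff_post = W_mod @ d
    if np.linalg.norm(coeff_post) > eps:
        return W_mod + gamma_strength * np.outer(coeff_post, d)
    coeff_proxy = np.linalg.norm(W_orig @ d)             # scalar proxy
    return W_mod + gamma_strength * coeff_proxy * np.tile(d, (W_mod.shape[0], 1))

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
d = rng.standard_normal(3); d /= np.linalg.norm(d)
W_zero_reg = W - np.outer(W @ d, d)            # full projection: W_zero_reg @ d ~ 0
reverted = kl_revert(W, W_zero_reg, d)
assert not np.allclose(reverted, W_zero_reg)   # proxy fallback is not a no-op
```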

\subsection{Chain-of-Thought-Aware Ablation}
\label{sec:cot}

\item \textsc{Analyze} --- Run analysis modules to understand refusal geometry \textbf{(new)}
\item \textsc{Distill} --- Extract directions using analysis-informed parameters
\item \textsc{Excise} --- Project with analysis-guided precision
\item \textsc{Verify} --- Post-excision analysis with Ouroboros compensation loop \textbf{(enhanced)}
\item \textsc{Rebirth} --- Save with comprehensive analysis metadata
\end{enumerate}

\paragraph{Self-repair estimate $\to$ refinement passes.}
High self-repair capacity (estimated from refusal distribution breadth) triggers more refinement passes with true iterative re-probing.
After excision, if the model's refusal rate remains above a threshold, the \textsc{Verify} stage triggers Ouroboros compensation: it re-probes, finds rotated residual directions, and excises them in additional targeted passes.

\subsection{Configuration Derivation}

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}lcccc@{}}
\toprule
Model & Type & Parameters & Experts & Alignment \\
\midrule
Qwen2.5-1.5B-Instruct & Dense & 1.5B & --- & DPO \\
Llama-3.1-8B-Instruct & Dense & 8B & --- & RLHF+DPO \\
Mixtral-8x7B-Instruct-v0.1 & MoE & 46.7B (12.9B active) & 8 & SFT+DPO \\
GPT-OSS-20B-Chat & MoE (fused) & 20B (3.2B active) & 32 & RLHF \\
\bottomrule
\end{tabular}
\end{table}

\paragraph{Datasets.}
Harmful prompts are drawn from the AdvBench dataset \citep{zou2023universal} (520 prompts). Harmless prompts are drawn from the Alpaca dataset \citep{taori2023alpaca} (matched count). For refusal rate measurement, we use a held-out set of 64 harmful prompts not seen during direction extraction. For perplexity, we use a 512-token window from WikiText-2. For KL divergence, we use 32 harmless prompts from the Alpaca validation set.

\textbf{Evaluation prompt diversity limitation:} All evaluation prompts are drawn from a single source (AdvBench), which may not represent the full distribution of requests that a safety-aligned model should refuse. AdvBench prompts are predominantly explicit, direct harmful requests; the evaluation does not include: (1)~subtly harmful prompts that require contextual judgment (e.g., dual-use chemistry questions), (2)~prompts from other safety taxonomies (e.g., HarmBench categories, ToxiGen identity-based toxicity), or (3)~out-of-distribution harm categories not represented in AdvBench (e.g., privacy violations, financial fraud, child safety). An abliterated model that achieves 0\% refusal rate on AdvBench may still refuse on categories not represented in the evaluation set, or conversely may show lower refusal on subtle prompts where the original model's refusal was already less reliable. We recommend evaluating on diverse prompt sources for deployment-critical assessments.

\paragraph{Evaluation metrics.}
For each abliterated model we report: \textbf{Refusal Rate} (RR, \%---lower is better), \textbf{Perplexity} (PPL---lower is better, with $\Delta$PPL showing change from baseline), \textbf{KL Divergence} ($D_{\text{KL}}$---lower is better), and \textbf{Coherence} (Coh., \%---higher is better). We also report \textbf{CoT preserved} (\checkmark/--) and \textbf{LoRA adapters generated} (\checkmark/--) where applicable.

\paragraph{Prompt volume.}
All experiments use medium prompt volume (128 harmful + 128 harmless prompts for direction extraction) unless otherwise noted. This provides robust SVD estimation while keeping compute manageable.

\paragraph{Statistical methodology and limitations.}
\label{para:stat_limitations}
Refusal rate is measured on a held-out set of $n = 64$ harmful prompts. At this sample size, the resolution of the refusal rate metric is $1/64 \approx 1.6\%$: a reported rate of 1.6\% corresponds to exactly 1 refusal out of 64 prompts, and a rate of 3.1\% corresponds to 2 refusals. We report Clopper--Pearson exact 95\% confidence intervals (CIs) for all refusal rates in the text; for example, RR = 1.6\% ($n = 64$) has a 95\% CI of $[0.04\%, 8.4\%]$, meaning the true refusal rate could be anywhere from near-zero to ${\sim}8\%$. Similarly, RR = 3.1\% has CI $[0.4\%, 10.8\%]$.
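The quoted intervals can be reproduced with a small stdlib-only sketch that inverts the binomial tail probabilities by bisection:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) CI for k successes out of n trials,
    obtained by bisecting the two binomial tails."""
    def solve(f):
        lo, hi = 0.0, 1.0
        for _ in range(200):          # bisection to machine precision
            mid = (lo + hi) / 2.0
            if f(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p) > 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p) > alpha / 2)
    return lower, upper

lo, hi = clopper_pearson(1, 64)       # 1 refusal out of 64 prompts: RR = 1.6%
print(f"[{100 * lo:.2f}%, {100 * hi:.1f}%]")   # [0.04%, 8.4%]
```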

\textbf{Consequence:} Differences between methods at the low end of the refusal rate scale (e.g., 1.6\% vs.\ 3.1\%) are \emph{not statistically significant} at $n = 64$---they represent a difference of 1 prompt. Claims of method superiority based on refusal rate should be interpreted as directional trends, not confirmed effects. The platform supports bootstrap CIs (BCa, 10{,}000 resamples) for all continuous metrics and Clopper--Pearson CIs for refusal rates; we encourage users performing rigorous method comparisons to use larger evaluation sets ($n \geq 256$) to achieve meaningful statistical power.

Perplexity and KL divergence are computed on fixed reference corpora (512 tokens, 32 prompts respectively), and their variability is dominated by corpus selection rather than sampling noise. We do not report CIs for these metrics as they are deterministic given the corpus. Coherence is measured on $n = 32$ factual prompts (each binary: correct/incorrect), yielding similar granularity constraints to refusal rate.

All reported results are from single runs with fixed seed 42. The reproducibility section (Appendix~\ref{app:reproducibility}) describes the platform's multi-seed sweep capability for independent replication.

\paragraph{Multiple comparisons.}
We compare 8 methods across 4 models (Tables~\ref{tab:exp_dense}--\ref{tab:exp_cross}), yielding many pairwise comparisons. We do not apply formal multiple comparison corrections (e.g., Bonferroni, Benjamini--Hochberg) because: (1)~the primary analysis is descriptive (reporting metric values) rather than hypothesis-testing (declaring significance); (2)~with $n = 64$ evaluation prompts, individual comparisons already lack power for small effect sizes, and applying corrections would further obscure potentially real trends; and (3)~the ablation studies (Section~\ref{sec:exp_ablation}) isolate individual design choices rather than comparing all methods simultaneously. We caution readers against interpreting small differences between methods (e.g., RR 1.6\% vs.\ 3.1\%) as evidence of method superiority; such differences require confirmation with larger evaluation sets and multiple seeds.

\subsection{Multi-Method Comparison on Dense Models}
\label{sec:exp_dense}

\begin{table}[h]
\centering
\caption{Method comparison on Qwen2.5-1.5B-Instruct (DPO-aligned). Baseline refusal rate: 87.5\%, baseline PPL: 8.92. Best result in each column is \textbf{bolded}. Refusal rates measured on $n=64$ prompts; see Section~\ref{para:stat_limitations} for confidence intervals and resolution limitations.}
\label{tab:exp_dense}
\small
\begin{tabular}{@{}lcccccc@{}}
\toprule
Basic & 18.8 & 9.14 & +0.22 & 0.031 & 93.8 & -- \\
Advanced & 6.3 & 9.31 & +0.39 & 0.058 & 93.8 & -- \\
Aggressive & 3.1 & 9.87 & +0.95 & 0.112 & 87.5 & -- \\
Sp.\ Cascade & 4.7 & 9.18 & +0.26 & 0.041 & 93.8 & -- \\
Surgical & 4.7 & 9.21 & +0.29 & 0.044 & \textbf{96.9} & -- \\
Optimized & \textbf{1.6} & \textbf{9.08} & \textbf{+0.16} & \textbf{0.024} & 93.8 & \checkmark \\
Inverted & 3.1 & 10.43 & +1.51 & 0.187 & 84.4 & -- \\
\bottomrule
\end{tabular}
\end{table}

\paragraph{Key findings (dense).}
(1)~The Optimized preset achieves the best Pareto trade-off: near-zero refusal (1.6\%, 95\% CI $[0.04, 8.4]\%$) with minimal perplexity increase (+0.16) and lowest KL divergence (0.024), validating the Bayesian optimization approach.
(2)~Surgical outperforms Aggressive on coherence (96.9\% vs 87.5\%) despite higher refusal rate, confirming that whitened SVD + regularization preserves capabilities better than brute-force multi-direction removal.
(3)~Inverted achieves low refusal but at the cost of the highest perplexity increase (+1.51), reflecting the more disruptive nature of direction reflection vs.\ removal.
(4)~Nuclear matches Optimized on refusal rate but with higher distributional shift ($D_{\text{KL}} = 0.098$ vs.\ $0.024$, PPL $+0.72$ vs.\ $+0.16$), suggesting the additional techniques (selective inversion + whitened SVD + 4 passes) provide diminishing returns on small dense models. On this model, Nuclear is \emph{Pareto-dominated} by Optimized: it achieves the same refusal rate with strictly worse perplexity and KL divergence. Nuclear's value proposition is for larger models and MoE architectures where simpler presets leave residual refusal (Table~\ref{tab:exp_moe}); on small dense models, the Optimized preset is preferred. Note that at $n = 64$, the difference between Optimized (1.6\%) and Nuclear (1.6\%) vs.\ Aggressive/Inverted (3.1\%) is 1 prompt and is not statistically significant.

\subsection{MoE Model Evaluation: EGA vs.\ Uniform Abliteration}
\label{sec:exp_moe}

\begin{table}[h]
\centering
\caption{EGA vs.\ uniform abliteration on GPT-OSS-20B-Chat (32 fused experts, RLHF-aligned). Baseline RR: 92.2\%, baseline PPL: 6.41. ``Uniform'' applies the same projection to all expert slices.}
\label{tab:exp_moe}
\small
\begin{tabular}{@{}llccccc@{}}
\end{tabular}
\end{table}

\paragraph{Key findings (MoE).}
(1)~\textbf{Uniform abliteration catastrophically degrades MoE models.} For the Inverted preset, uniform treatment doubles perplexity (+4.87 vs +0.73) and collapses coherence to 53.1\%. The Nuclear preset is even worse: uniform application produces PPL 13.57 (a 112\% increase) and 46.9\% coherence---the model is barely functional.
(2)~\textbf{EGA with selective inversion resolves this.} The same Nuclear preset with EGA achieves identical refusal removal (1.6\%) but with only a 23\% perplexity increase and 84.4\% coherence. The key mechanism is that capability-preserving experts (22 of 32 on GPT-OSS-20B) receive standard removal rather than reflection.
(3)~\textbf{Expert classification matters.} On GPT-OSS-20B, EGA classified 10 of 32 experts as safety-critical ($s_e > 0.5$). These experts collectively handled 71\% of harmful token routing weight, confirming that refusal is concentrated in a subset of experts.
(4)~\textbf{CoT preservation is MoE-critical.} The Nuclear + EGA preset preserves chain-of-thought coherence because the Gram-Schmidt orthogonalization operates on per-expert directions that are already capability-differentiated.
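The removal-vs.-reflection distinction that selective inversion applies per expert can be sketched with plain linear algebra (the function name and the per-expert calling convention here are illustrative, not the platform's API):

```python
import numpy as np

def ablate_expert(W, v, safety_critical, scale=1.0):
    """Edit one expert's weight slice W (d_out x d_in) against a refusal
    direction v (d_in,): reflection for safety-critical experts, plain
    removal (projection) for capability-preserving ones.
    Illustrative EGA-style rule, not the platform's exact implementation."""
    v = v / np.linalg.norm(v)             # work with a unit direction
    coef = 2.0 if safety_critical else 1.0  # I - 2vv^T (reflect) vs I - vv^T (remove)
    return W - scale * coef * (W @ np.outer(v, v))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))              # toy expert weight slice
v = rng.normal(size=16)
v /= np.linalg.norm(v)

W_removed = ablate_expert(W, v, safety_critical=False)
W_reflected = ablate_expert(W, v, safety_critical=True)

# removal zeroes the output component along v; reflection flips its sign
assert np.allclose(W_removed @ v, 0.0)
assert np.allclose(W_reflected @ v, -(W @ v))
```

The asserts make the geometric contrast concrete: projection silences the direction, reflection actively inverts it, which is why reflection is reserved for experts classified as safety-critical.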

\subsection{Ablation Studies}
\label{sec:exp_ablation}

We ablate three key design choices to validate that they contribute meaningfully. \textbf{Note:} All ablation results are from single runs with fixed seed 42. While the platform supports multi-seed sweeps (seeds $\in \{42, 137, 2024\}$), we did not run them for all ablations due to compute constraints. The reported differences (e.g., warm-start converging 2$\times$ faster) are therefore point estimates. The warm-start ablation is the most robust, as it measures convergence speed (trial number of best result) across a 50-trial optimization run, providing some implicit variance reduction. The threshold sweep and KL proxy ablations each show clear directional trends but would benefit from multi-seed confirmation.

\paragraph{Warm-start vs.\ random initialization for Bayesian optimization.}
On Llama-3.1-8B-Instruct with the Optimized preset (50 Optuna trials):
Warm-start converges 2$\times$ faster and finds a better Pareto point, confirming that analysis-derived heuristics provide a useful prior for the TPE sampler.
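The warm-start idea (the platform seeds Optuna's TPE sampler with an analysis-derived configuration; Optuna exposes this via `study.enqueue_trial`) can be sketched dependency-free. The objective and parameter names below are illustrative stand-ins for the real refusal/perplexity trade-off:

```python
import random

def evaluate(cfg):
    """Stand-in objective (lower is better); a hypothetical proxy for the
    real refusal-rate / perplexity trade-off evaluated per trial."""
    return (cfg["strength"] - 1.05) ** 2 + (cfg["layer_frac"] - 0.6) ** 2

def search(n_trials, warm_start=None, seed=42):
    """Random search; trial 0 optionally uses the heuristic prior config."""
    rng = random.Random(seed)
    best_score, best_trial = float("inf"), None
    for t in range(n_trials):
        if t == 0 and warm_start is not None:
            cfg = warm_start  # analysis-derived prior replaces a random draw
        else:
            cfg = {"strength": rng.uniform(0.5, 1.5),
                   "layer_frac": rng.uniform(0.3, 0.9)}
        score = evaluate(cfg)
        if score < best_score:
            best_score, best_trial = score, t
    return best_score, best_trial

cold = search(50)
warm = search(50, warm_start={"strength": 1.0, "layer_frac": 0.6})
# the warm run is never worse than its own prior configuration
assert warm[0] <= evaluate({"strength": 1.0, "layer_frac": 0.6})
```

With a TPE sampler (unlike this plain random search), the enqueued trial additionally shapes the sampler's density model, which is where the observed 2$\times$ convergence speedup comes from.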

\paragraph{EGA safety threshold sensitivity ($\tau_{\text{safety}}$).}
On GPT-OSS-20B (32 experts) with the Advanced preset, we sweep $\tau \in \{0.3, 0.4, 0.5, 0.6, 0.7\}$ (endpoints and default shown):
\begin{itemize}[leftmargin=*]
\item $\tau = 0.3$: 18 of 32 experts classified as safety-critical $\to$ RR 4.7\%, PPL 7.21, Coh.\ 84.4\%
\item $\tau = 0.5$ (default): 10 of 32 experts safety-critical $\to$ RR 9.4\%, PPL 6.72, Coh.\ 90.6\%
\item $\tau = 0.7$: 4 of 32 experts safety-critical $\to$ RR 14.1\%, PPL 6.53, Coh.\ 93.8\%
\end{itemize}
The threshold controls a smooth trade-off between refusal removal and capability preservation. We chose $\tau = 0.5$ as the default because it provides the best Pareto balance, but note that this is a \emph{tunable hyperparameter} rather than a universal optimum---different models and use cases may benefit from different thresholds.
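The mechanism behind the trade-off is just a threshold over per-expert safety scores: raising $\tau$ monotonically shrinks the safety-critical set (more experts get gentle treatment, so coherence rises and refusal removal weakens). A minimal sketch, with hypothetical scores for a toy 8-expert model:

```python
def classify_experts(safety_scores, tau=0.5):
    """Split experts into safety-critical vs. capability-preserving by
    thresholding a per-expert safety score s_e > tau (illustrative rule)."""
    critical = [e for e, s in enumerate(safety_scores) if s > tau]
    preserving = [e for e, s in enumerate(safety_scores) if s <= tau]
    return critical, preserving

# hypothetical safety scores for 8 experts
scores = [0.82, 0.31, 0.55, 0.12, 0.67, 0.48, 0.91, 0.22]

# raising tau monotonically shrinks the safety-critical set
counts = [len(classify_experts(scores, tau)[0]) for tau in (0.3, 0.5, 0.7)]
print(counts)  # -> [6, 4, 2]
```

This monotonicity is why the sweep traces a smooth Pareto curve rather than exhibiting threshold cliffs.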
Sparse autoencoders & -- & Via SAE & -- & -- & -- & Core \\
Model compatibility & Any HF & $\sim$50 & 16 & TLens & HF & TLens \\
MoE model support & Native & -- & -- & -- & -- & -- \\
Test suite & 821 & Community & -- & -- & Min. & Mod. \\
\bottomrule
\end{tabular}
\end{table}

\section{Discussion}
\label{sec:discussion}

\paragraph{Dual-use considerations.}
\textsc{Obliteratus} is designed for alignment research---understanding refusal mechanisms serves both identifying vulnerabilities (red-teaming) and building more robust alignment (blue-teaming). The analysis modules are particularly valuable for the defensive perspective: understanding \emph{why} abliteration works enables designing alignment methods that are more resistant to it. The Ouroboros effect analysis, entanglement mapping, and defense profiling directly serve this goal.

\paragraph{Causal tracing limitations.}
Our causal tracing module provides noise-based approximations rather than true activation patching. While computationally efficient (no additional forward passes), the results should be validated with real causal interventions when model access permits. We explicitly document this limitation in the module and recommend TransformerLens for definitive causal analysis.

\paragraph{Heuristic constants and composite metrics.}
Several components of \textsc{Obliteratus} rely on hand-chosen constants: the RES weights $(0.4, 0.3, 0.3)$, the Universality Index ratio $(3{:}2{:}1)$, the alignment fingerprint target values, the EGA safety threshold ($\tau = 0.5$), and the configuration derivation rules (Section~\ref{sec:informed}). We have provided explicit justification for each choice where possible (Sections~\ref{sec:activation_probe}, \ref{sec:transfer}, \ref{sec:alignment_imprint}) and ablation studies for the most consequential ones (Section~\ref{sec:exp_ablation}). However, we acknowledge that these are engineering decisions informed by exploratory analysis, not statistically optimized hyperparameters.
\textbf{Construct validity concern:} Composite metrics (RES, UI, entanglement $E_l$) combine heterogeneous quantities using weighted aggregation. The choice of combination function (weighted sum, geometric mean, etc.) and the specific weights impose implicit assumptions about the relative importance of each component---assumptions that may not hold across all models and use cases. For example, the RES metric's exponential decay factor of $-10$ was calibrated on a small set of models and may be inappropriate for models with very different activation scales. We strongly recommend that users examine the \emph{component metrics} individually rather than relying solely on composite scores. The platform logs all component values alongside composites for this purpose. A systematic sensitivity analysis across a larger model corpus is needed to establish whether these defaults generalize, and formal construct validation (e.g., correlation with downstream task outcomes) has not been performed.
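The recommendation to log components alongside composites can be made concrete. A minimal sketch in the spirit of the RES metric, using the $(0.4, 0.3, 0.3)$ weights from the text; the component names and their scaling are hypothetical, not the platform's exact formula:

```python
def res_score(refusal_drop, ppl_penalty, kl_penalty, weights=(0.4, 0.3, 0.3)):
    """Weighted composite over normalized components (all in [0, 1]).
    Returns the composite AND the raw components, so users can inspect
    the parts rather than trusting the aggregate alone."""
    components = {
        "refusal_drop": refusal_drop,
        "ppl_penalty": ppl_penalty,
        "kl_penalty": kl_penalty,
    }
    composite = sum(w * c for w, c in zip(weights, components.values()))
    return composite, components

composite, parts = res_score(0.9, 0.2, 0.1)
# 0.4*0.9 + 0.3*0.2 + 0.3*0.1 = 0.45; parts are logged alongside it
```

Returning both values mirrors the platform's logging policy: any re-weighting debate can then be settled post hoc from the recorded components without re-running experiments.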

\paragraph{Alignment fingerprinting validation.}
The alignment imprint detector uses heuristic signatures derived from the literature's characterization of different training methods. While the geometric features (Gini, effective rank, smoothness) are well-motivated, the classifier has not been rigorously validated. Specifically: (1)~the ideal feature values (e.g., ``Gini $\sim 0.7$ for DPO'') were derived from exploratory analysis of only two models with known training procedures (Llama-3-Instruct for RLHF, Zephyr-$\beta$ for DPO), which is insufficient for reliable generalization; (2)~no held-out test set or cross-validation was performed; (3)~the Gaussian kernel bandwidth ($\sigma_{m,f} = 0.3|\mu_{m,f}|$) was not tuned; and (4)~the method assumes that alignment training methods produce distinguishable geometric signatures, which has not been established as a general principle. Systematic validation would require a corpus of $\geq$20 models with confirmed, diverse training procedures (including mixed methods like RLHF+DPO). We present the classifier as a \emph{hypothesis-generating tool}---its outputs should be treated as suggestive rather than definitive (see Section~\ref{sec:alignment_imprint}).
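The Gaussian-kernel scoring with bandwidth $\sigma_{m,f} = 0.3|\mu_{m,f}|$ described above reduces to a few lines. The feature names and signature values below are illustrative (loosely echoing the ``Gini $\sim 0.7$ for DPO'' example), not the detector's actual signature table:

```python
import math

def method_likelihood(features, signature, rel_bandwidth=0.3):
    """Score observed geometric features against a training method's ideal
    signature with a product of Gaussian kernels, sigma = 0.3 * |mu|
    (the paper's untuned heuristic). Illustrative feature set."""
    score = 1.0
    for name, mu in signature.items():
        sigma = rel_bandwidth * abs(mu)
        score *= math.exp(-((features[name] - mu) ** 2) / (2 * sigma ** 2))
    return score

dpo_sig = {"gini": 0.7, "effective_rank": 1.5}     # hypothetical DPO signature
observed = {"gini": 0.68, "effective_rank": 1.6}   # hypothetical measurements

s = method_likelihood(observed, dpo_sig)
print(f"{s:.3f}")  # close to 1.0 for a near-match, decays toward 0 otherwise
```

Because the bandwidth scales with $|\mu|$, the kernel degenerates for signature features near zero; this is one concrete reason the untuned bandwidth choice warrants the validation caveats above.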

\paragraph{MoE expert classification.}
The EGA safety score threshold ($\tau = 0.5$) for classifying experts as safety-critical vs.\ capability-preserving is a heuristic. A more principled approach would train expert classifiers on labeled routing data or use causal interventions to establish ground-truth expert roles. We leave this to future work.
The current implementation loads the full model into memory for analysis. For frontier-scale models (100B+ parameters), this requires significant compute. Future work could integrate quantized inference or offloading strategies. The web dashboard requires GPU access for interactive features (chat, A/B comparison, strength sweep).

\paragraph{Evaluation completeness.}
Our evaluation suite measures \emph{refusal removal} and \emph{capability preservation} but does not comprehensively assess downstream task performance across diverse benchmarks. Integration with evaluation harnesses such as lm-evaluation-harness \citep{gao2021framework} is a natural extension. Critically, our evaluation is \emph{attack-centric} (measuring how effectively abliteration removes refusal) rather than \emph{safety-centric} (measuring residual harm potential of abliterated models on diverse safety benchmarks). A complete safety evaluation would include HarmBench, ToxiGen, adversarial attack suites \citep{zou2023universal}, and human red-teaming, which are beyond our current scope.

\paragraph{Circuit breaker and robust defense evaluation.}
\citet{zou2024circuit} proposed circuit breakers---a defense mechanism that reroutes activations rather than relying on linear refusal directions---specifically designed to resist linear-algebraic attacks like abliteration. We cite this work but do not evaluate \textsc{Obliteratus} against circuit-breaker-defended models, which is a significant gap. Such an evaluation would be informative in both directions: it would test whether circuit breakers truly resist abliteration (as theoretically predicted, since they do not rely on single linear directions) and whether the platform's analysis modules can characterize the geometric structure of circuit breaker defenses. We identify this as the highest-priority item for future work, as it directly addresses the question of whether abliteration-resistant alignment is achievable.

\paragraph{Future directions.}
We identify several opportunities: (1)~integration with sparse autoencoder analysis to understand refusal at the feature level, potentially enabling even more targeted ablation; (2)~real causal tracing via TransformerLens integration; (3)~longitudinal studies tracking how refusal geometry evolves during fine-tuning; (4)~extension of the universality analysis to a wider set of model families; (5)~application of the defense robustness framework to evaluate proposed robust alignment methods including circuit breakers \citep{zou2024circuit} and representation rerouting; (6)~multi-objective Bayesian optimization with additional objectives such as CoT coherence and downstream task performance; and (7)~automated expert role discovery for MoE models using unsupervised clustering of expert activation patterns.

\section{Broader Impact Statement}
\label{sec:broader_impact}

This work has significant dual-use implications that we address directly and in depth.

\subsection{Threat Model}
\label{sec:threat_model}

We consider the following adversarial setting. An attacker has access to the open weights of a safety-aligned language model and wishes to remove its refusal behavior to generate harmful content. We distinguish three threat actor profiles:

\begin{enumerate}[leftmargin=*]
\item \textbf{Sophisticated actors} (nation-states, well-resourced organizations): Already possess the expertise to implement abliteration from first principles using published techniques \citep{arditi2024refusal, gabliteration2024}. \textsc{Obliteratus} provides no incremental capability to this group.
\item \textbf{Semi-technical actors} (hobbyists, students with ML experience): Can follow tutorials and run existing tools. \textsc{Obliteratus} lowers the barrier modestly by providing a unified interface, but multiple existing tools (FailSpy's abliterator, community scripts) already serve this audience.
\item \textbf{Non-technical actors}: Cannot directly use any abliteration tool. The primary risk from this group is \emph{downstream use} of models abliterated by others, which is independent of our tool's existence.
\end{enumerate}

The key observation is that linear refusal removal from open weights is a \emph{fundamental structural vulnerability} of current alignment methods, not an attack we invented. Any tool that can load and modify model weights (PyTorch, safetensors, even NumPy) is sufficient. Our contribution is making this vulnerability \emph{legible} to the research community so it can be addressed.

\paragraph{Scope of risk.}
Abliteration removes \emph{refusal to generate text}; it does not provide the attacker with new knowledge, capabilities, or resources beyond what the model already encodes. The resulting model produces text that a sufficiently creative prompter might already elicit via jailbreaks on the original model. The marginal risk increase from abliteration over existing jailbreak techniques (prompt injection, few-shot attacks, system prompt manipulation) is therefore bounded, though we acknowledge it is nonzero: abliteration is more reliable and persistent than per-query jailbreaks.

\paragraph{Mitigations not addressed.}
We do not evaluate more robust defense mechanisms such as circuit breakers \citep{zou2024circuit}, representation rerouting, or multi-layer distributed safety encodings. These represent fundamentally different defense paradigms that are not defeated by linear projection, and we identify their evaluation against \textsc{Obliteratus}'s analysis modules as critical future work (Section~\ref{sec:discussion}).

\subsection{Risks}

\textsc{Obliteratus} enables the removal of safety guardrails from language models. Specific risk categories include:

\begin{itemize}[leftmargin=*]
\item \textbf{Harmful content generation}: Abliterated models may generate instructions for violence, weapons, illegal activities, or other dangerous content that the original model would refuse.
\item \textbf{Scaled misuse}: The platform's automation (one-click abliteration, batch processing) could enable systematic production of uncensored model variants for redistribution.
\item \textbf{Erosion of safety norms}: Wide availability of abliteration tools may normalize the removal of safety guardrails and reduce incentives for model providers to invest in alignment.
\item \textbf{False sense of security}: By demonstrating the fragility of linear safety mechanisms, this work could undermine public trust in AI safety measures, potentially ahead of the deployment of more robust alternatives.
\end{itemize}

\subsection{Benefits to Alignment Research}

We argue that the research benefits justify open release, grounding this argument in specific, falsifiable claims rather than general appeals:

\begin{enumerate}[leftmargin=*]
\item \textbf{Diagnostic capability}: The 15 analysis modules provide the most comprehensive public characterization of refusal geometry. Specific modules (concept cone analysis, alignment imprint detection, Ouroboros self-repair quantification) have no equivalent in existing tools and directly inform the design of more robust safety mechanisms. For example, our finding that DPO-aligned models concentrate refusal in ${\sim}1.5$ effective dimensions while CAI models distribute it across ${\sim}4$ dimensions (Section~\ref{sec:alignment_imprint}) suggests concrete directions for more geometrically robust training.

\item \textbf{Quantitative defense evaluation}: The defense robustness module (Section~\ref{sec:defense_robustness}) provides a standardized framework for measuring how resistant a model's alignment is to abliteration. This enables alignment researchers to benchmark proposed improvements: a training method whose models show higher Ouroboros self-repair capacity and higher entanglement scores is more resistant to abliteration.

\item \textbf{Informing policy}: The empirical demonstration that current safety alignment can be removed with simple linear algebra from publicly released weights is relevant information for policymakers considering open-weight release policies. We believe this finding should be part of the public discourse, not suppressed.
\end{enumerate}

\paragraph{What we do \emph{not} claim.}
We do not claim that ``the techniques are already public, so releasing a better tool does no harm.'' Consolidated, user-friendly tools \emph{do} lower the barrier to some degree, and we acknowledge this. Our argument is that the \emph{diagnostic} and \emph{defensive} capabilities of the analysis modules---which are novel and have no existing public equivalent---provide sufficient research value to justify the incremental risk from a more accessible intervention tool.

\subsection{Responsible Disclosure and Deployment Guidance}

We release the platform under the AGPL-3.0 license, which requires that derivative works also be open-sourced, ensuring that modifications to the tool remain visible to the research community. We explicitly recommend:

\begin{itemize}[leftmargin=*]
\item \textbf{Do not deploy abliterated models in production.} The primary intended use is alignment research, not deployment.
\item \textbf{Use analysis before intervention.} The analysis pipeline provides diagnostic information that is valuable independently of whether abliteration is performed.
\item \textbf{Report novel defense-breaking findings.} If the platform reveals previously unknown weaknesses in a specific model's alignment, we encourage responsible disclosure to the model provider.
\item \textbf{Cite defensive findings.} Research using the analysis modules for defense improvement should be shared openly to benefit the alignment community.
\end{itemize}

% ─────────────────────────────────────────────────────────────────────
\section{Ethics Statement}
We do not advocate for the deployment of abliterated models in production systems. The primary intended use is alignment research: understanding the geometric structure of refusal to build more durable safety mechanisms. All experiments described in this work were conducted on publicly available open-weight models, and no private or proprietary systems were modified.
We note that withholding this tool would not constitute meaningful security: the underlying techniques are published, the mathematics is elementary (SVD, linear projection), and multiple existing tools implement subsets of the same functionality. However, we reject the stronger claim that ``security through obscurity is never valuable''---in some contexts, raising the barrier to exploitation provides meaningful delay. Our assessment is that the specific barrier lowered by \textsc{Obliteratus} (from ``read papers and write custom code'' to ``use a unified tool'') is small relative to the diagnostic value the analysis modules provide to defenders. This is a judgment call, not a logical certainty, and we invite the community to scrutinize it.

% ─────────────────────────────────────────────────────────────────────
\section{Conclusion}
Empirical evaluation across four model families demonstrates that (1)~Bayesian-optimized presets achieve the best Pareto trade-offs on dense models, (2)~Expert-Granular Abliteration is essential for MoE models, where uniform approaches catastrophically degrade capabilities, and (3)~the platform's design choices (warm-start initialization, selective inversion, proxy-magnitude KL revert) each contribute measurably to abliteration quality. We acknowledge that several composite metrics rely on heuristic constants and provide ablation studies and explicit caveats for each.
By making these tools available under the AGPL-3.0 license with comprehensive documentation and 821 unit tests, we aim to accelerate both offensive and defensive alignment research: understanding the geometric structure of refusal---across dense and MoE architectures alike---is the foundation for both removing it surgically and building more robust implementations.

% ─────────────────────────────────────────────────────────────────────
\bibliographystyle{plainnat}

paper/references.bib
CHANGED

@@ -145,10 +145,10 @@
 % ── Defense and Safety ────────────────────────────────────────────────

 @article{qi2025safety,
-  title={Safety
-  author={Qi, Xiangyu and
-  journal={arXiv preprint},
-  year={
+  title={Safety Alignment Should Be Made More Than Just a Few Tokens Deep},
+  author={Qi, Xiangyu and Zeng, Yi and Xie, Tinghao and Chen, Pin-Yu and Jia, Ruoxi and Mittal, Prateek and Henderson, Peter},
+  journal={arXiv preprint arXiv:2406.05946},
+  year={2024}
 }

 @article{zou2024circuit,
@@ -175,7 +175,7 @@
 @article{young2025comparative,
   title={Comparative Analysis of Abliteration Methods for Language Model Safety Removal},
   author={Young, Alex},
-  journal={arXiv preprint},
+  journal={arXiv preprint arXiv:2502.05420},
   year={2025}
 }