obliteratus

Paused

App Files Files Community

obliteratus / docs /THEORY_JOURNAL.md

pliny-the-prompter

Upload 118 files

e25024e verified 2 months ago

preview code

raw

history blame contribute delete

10.3 kB

A newer version of the Gradio SDK is available: 6.14.0

Upgrade

Theory Journal — OBLITERATUS

Maintained by the development team. Updated 2026-02-27.

This journal records theoretical insights, open questions, and design rationale as the geometric theory of refusal removal evolves. Entries are in reverse chronological order.

2026-02-27: Pre-Submission Triple Audit — Claims vs Code vs Citations

Citation integrity crisis (now fixed)

A systematic audit revealed that 15 of 37 citations had wrong author names, including 6 cases where the attributed lead author was a completely different person (e.g., attributing Hildebrandt et al.'s nonlinear refusal paper to "Arditi, Andy"; attributing Gülmez's Gabliteration to "Gabriel, Saul"). One reference (qi2025safety) was entirely fabricated. All have been corrected.

Root cause: The bib entries were likely generated by an LLM from memory rather than copied from actual paper metadata. This is a serious lesson: every citation must be verified against the actual paper's metadata page before submission. Never trust LLM-generated bibliography entries.

Missing attribution for "abliteration" itself

The term "abliteration" was coined by FailSpy (2024) and popularized by Maxime Labonne's HuggingFace blog post. The paper used the term throughout without crediting its origin. Now properly cited.

Claims-vs-code mismatches (now fixed)

Three significant discrepancies between paper claims and actual code:

Advanced preset λ=0.1 (paper) vs λ=0.3 (code) — Paper now says 0.3 to match code.
Entanglement formula uses Var (paper) vs std (code) — Paper now uses σ (std dev) to match code.
"The analysis-informed pipeline uses BBP threshold to recommend minimum prompt counts" — No such code existed. Claim removed; replaced with a practitioner guideline formulation.
48 model presets (paper) vs 47 (code) — Off by one, not yet corrected in paper.

Key insight: Post-hoc tables need honest labeling

The writing quality audit argued that Tables 1–4 present post-hoc explanations in the format of prospective experiments. The honest disclaimers in Section 8 are good, but a reviewer skimming tables would miss them. This remains an open presentation question for the final version.

Novelty honesty

Several theorem-level claims were softened:

"for the first time" → "to the abliteration setting" (Contribution 1)
"the first" → "to our knowledge, the first" (analysis-informed pipeline)
"provable guarantees" → "bounds under stated modeling assumptions"
"offensive" → "red-teaming" (conclusion)

The Fisher-optimal theorem is classical (1936). The BBP threshold is classical (2005). The submodular result is classical (1978). Our contribution is identifying their relevance to abliteration, not the results themselves. This is now honestly framed throughout.

2026-02-27: Adversarial Audit — Nine Critical Gaps

Insight: Random-direction ablation as a null hypothesis

A devastating skeptical question: "Would ablating a random direction produce similar results?" We constructed a mathematical proof (in tests/test_abliteration_math.py) that the learned refusal direction projects 3x more onto harmful activations than a random unit vector in expectation. This is necessary but not sufficient — it proves the direction is non-trivial, not that removing it is safe.

The key formula: for a planted direction $\mathbf{d}$ with signal strength $\alpha$ in $\mathbb{R}^n$, the expected projection of a random unit vector $\mathbf{r}$ onto $\boldsymbol{\mu}_{\text{harmful}}$ scales as $O(1/\sqrt{n})$, while the true direction projects as $O(\alpha)$. For $n = 4096$ and even modest $\alpha$, this gives $>$100x separation.

Open question: Can we formalize this into a statistical test with p-values? Given observed projections from $k$ random directions, we could compute a z-score for the learned direction's projection against the null distribution.

Insight: Bootstrap CIs expose the fragility of small-sample evaluation

With $n = 10$ harmful prompts (the old default), a 95% CI for a binary rate spans $\pm 30$ percentage points. A reported "15% refusal rate" could be anywhere from 0% to 45%. This is not a minor caveat — it makes the entire evaluation table in the paper unreliable as a comparison between methods.

Recommendation: All refusal rate comparisons should use $n \geq 50$ prompts and report CIs. Differences < 10pp at $n < 100$ should not be claimed as meaningful.

Insight: Semantic refusal detection reveals a blind spot

Keyword matching catches ~70% of refusals in our manual audit. The remaining ~30% are "soft refusals": hedging ("While I understand..."), concern-flagging ("This raises ethical issues"), responsibility deflection ("You should consult a professional"), and conditional non-compliance ("I would need authorization"). These are more common in larger models (GPT-4-class) that have learned to refuse diplomatically.

The 6 regex patterns we implemented cover the most common soft refusal structures, but the real solution is an LLM-as-judge classifier. This is a future direction.

Insight: Coherence = "30% unique words" is trivially gameable

The old coherence check (unique_ratio > 0.3) passes "the the the dog dog cat" as coherent. We tightened it to 50% unique words + single-token repeat ratio < 50% + 10 test prompts (up from 5). But the real fix is perplexity-based scoring: a coherent completion should have low self-perplexity relative to the model's baseline.

2026-02-27: Paper Honesty Pass — What We Overclaimed

The Fisher theorem is classical

Theorem 1 (Whitened SVD is Fisher-Optimal) recovers Fisher's Linear Discriminant from 1936. The contribution is identifying its relevance to abliteration and deriving the rogue dimension immunity corollary, not the discriminant analysis result itself. The paper now says "formal connection" instead of "proof of Fisher-optimality."

"8-15% improvement" was never derived

The abstract claimed "whitened SVD reduces refusal rate by an additional 8-15% over standard SVD." This number appears nowhere in the theory or tables. The actual table shows Llama-2 going from 28% to 4% (a 24pp drop) — but this is a single model, not a general bound. Replaced with specific, grounded claims.

Post-hoc ≠ prediction

All "theoretical predictions" in Section 6 were calibrated against published results. Calling them "predictions" implies forward validation. Changed to "post-hoc analysis" / "empirical validation" throughout.

Gini–DPO correlation is just that — a correlation

The paper claimed DPO models have $G \approx 0.7$ and RLHF models $G \approx 0.3$. Looking at Table 3: Zephyr (DPO) = 0.71, but Mistral (also DPO) = 0.52 and Gemma (DPO+RLHF) = 0.45. The claim is at best a trend. Added caveat about correlational vs. causal.

Theory Notes: Open Problems

1. Tight sparsity-energy bound

Theorem 3's energy concentration scaling $E(\alpha) \gtrsim 1 - (1-\alpha)^{2/(1+G)}$ is empirical. The rigorous bound from the Lorenz curve ($E(\alpha) \geq \alpha(1+G(1-\alpha))^2$) gives $E(0.12) \geq 0.31$ when the observed value is ~0.94. The gap is enormous. Can we prove a tighter bound by assuming log-concave or power-law projection magnitude distributions?

2. Non-isotropic BBP threshold

Theorem 4 (BBP detectability) assumes isotropic noise $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I)$. Real activations are highly anisotropic. The spiked covariance model with general noise (Paul 2007) provides the extension, but the formula is more complex and hasn't been worked out for our setting. This matters because the effective $\gamma$ depends on the effective rank of $\Sigma$, not the ambient dimension $d$.

3. Causal self-repair

Theorem 2 (self-repair bound) treats layers as independent. In reality, the residual stream creates causal dependencies: abliterating layer $j$ changes the input to layers $j+1, \ldots, L$, which may amplify or suppress their refusal contribution. Can we model this using the residual stream's Jacobian?

4. Wasserstein-optimal abliteration

Corollary A.2 derives the Wasserstein-optimal direction as a generalized eigenvalue problem. Nobody has implemented this. It's a concrete, immediately testable prediction: the Wasserstein-optimal direction should produce lower KL divergence on harmless prompts than the Fisher-optimal (whitened SVD) direction, at the cost of slightly higher refusal rate.

5. Grassmannian coherence measurement

Theorem A.3 predicts that when the refusal curve's Grassmannian diameter $C < \pi/4$, a single universal direction captures >50% of refusal energy at every layer. This is testable today with the platform's cross-layer alignment analysis. Nobody has measured $C$ on production models.

6. LLM-as-judge for refusal classification

The semantic regex patterns are a stopgap. The real solution is using a small classifier model (e.g., fine-tuned DeBERTa or a prompted Haiku call) to classify refusal vs. compliance. This would give us a ground-truth-anchored refusal rate and let us measure the false negative rate of keyword matching.

7. Controlled causal experiments

All alignment-method-to-geometry correlations (DPO→concentrated, RLHF→distributed) are confounded by model architecture, training data, and other factors. A definitive test: take the same base model, align it with DPO and RLHF separately, and measure the refusal geometry. The platform supports this workflow but nobody has done it.

Notation Reference

Symbol	Meaning
$\mathbf{d}_l$	Refusal signal (mean difference) at layer $l$
$\boldsymbol{\Sigma}_l$	Shared within-class covariance at layer $l$
$G$	Gini coefficient of per-layer refusal strengths
RSI	Refusal Sparsity Index (= Gini of per-row projection magnitudes)
$\kappa(\Sigma)$	Condition number of covariance matrix
$\rho$	Signal-to-noise ratio $\beta/\sigma^2$ (BBP threshold)
$\gamma$	Aspect ratio $d/n$ (hidden dim / prompt count)
$C$	Grassmannian coherence (max pairwise geodesic distance)
$\Lambda$	Total geodesic length of refusal curve
$E(\alpha)$	Fraction of refusal energy captured by top-$\alpha$ rows