OBLITERATUS Pipeline Efficiency Audit

Auditor perspective: Shrewd CTO evaluating compute ROI, memory discipline, and time-to-value across all obliteration methods.

Scope: Every obliteration method in abliterate.py (8 primary methods + 4 baseline reproductions), the strategy layer (strategies/), the informed pipeline, Bayesian optimizer, and LoRA ablation.

Executive Summary

OBLITERATUS has an impressively comprehensive pipeline, but several methods carry significant hidden costs that erode their value proposition. The worst offenders are:

_collect_activations runs prompts one-at-a-time — this is the single biggest throughput bottleneck in the entire system, costing 5-15x in wall-clock time during PROBE.
Bayesian optimized mode clones ALL strong-layer weights to CPU for rollback, then runs 50 full forward+generate passes — the memory and compute overhead can exceed the rest of the pipeline combined.
true_iterative_refinement re-runs the entire PROBE+DISTILL pipeline on every refinement pass with no early stopping: the 3 passes in aggressive mode triple the probe cost even when pass 2 achieves negligible improvement.
SAE training on CPU is needlessly slow for GPU-resident models.

Below is the method-by-method breakdown.

Stage-Level Audit

Stage 1: SUMMON (Model Loading)

Status: Acceptable. Uses load_model with quantization support and expandable_segments CUDA config. No issues.

Stage 2: PROBE (`_collect_activations`)

Issue	Severity	Impact
Single-prompt forward passes (`abliterate.py:1074`)	CRITICAL	Each of 512+ harmful/harmless prompts triggers a separate `model(**inputs)` call. No batching. On a 7B model with 512 pairs, this means ~1024 sequential forward passes instead of ~32 batched passes (batch_size=32). Estimated 5-15x slowdown.
`_free_gpu_memory()` called after EVERY prompt (`abliterate.py:1086`)	HIGH	`gc.collect()` + `torch.cuda.empty_cache()` 1024 times is expensive — the Python GC full-collection alone adds measurable overhead at this frequency. Should be called every N prompts, not every single one.
Chat template applied per-prompt in a Python loop (`abliterate.py:955-965`)	MODERATE	`tokenizer.apply_chat_template()` called individually 1024 times. Should batch.
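Jailbreak probing adds a third prompt pass when `use_jailbreak_contrast=True`	MODERATE	Adds a third full pass over all prompts on top of the harmful and harmless passes, so PROBE cost rises by roughly 1.5x. Justified by the quality improvement, but because every pass is unbatched, the extra pass costs far more wall-clock time than it would with batching.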
Router profiling hooks zero-cost claim is correct (`abliterate.py:872`)	OK	Hooks piggyback on existing forward passes. Good design.

Recommendation: Batch `_collect_activations`. Tokenize all prompts, pad to equal length per micro-batch, and run a batched `model(**inputs)`. Expected 5-15x speedup with no quality loss, provided the hooks ignore padded positions. Reduce `_free_gpu_memory()` frequency to once every 32-64 prompts.
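
The batching arithmetic and the reduced cleanup cadence can be sketched as follows. This is a minimal illustration, not code from `abliterate.py`: `micro_batches` and `probe` are hypothetical names, `run_batch` stands in for the batched forward pass, and `cleanup` stands in for `_free_gpu_memory()`.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def micro_batches(prompts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size prompts."""
    for start in range(0, len(prompts), batch_size):
        yield list(prompts[start:start + batch_size])


def probe(
    prompts: Sequence[str],
    run_batch: Callable[[List[str]], None],   # hypothetical: batched forward pass
    cleanup: Callable[[], None],              # hypothetical: _free_gpu_memory()
    batch_size: int = 32,
    cleanup_every: int = 64,
) -> Tuple[int, int]:
    """Run one forward call per micro-batch and clean up every cleanup_every prompts.

    Returns (forward_calls, cleanup_calls) so the cost is easy to inspect.
    """
    forward_calls = cleanup_calls = since_cleanup = 0
    for batch in micro_batches(prompts, batch_size):
        run_batch(batch)
        forward_calls += 1
        since_cleanup += len(batch)
        if since_cleanup >= cleanup_every:
            cleanup()
            cleanup_calls += 1
            since_cleanup = 0
    return forward_calls, cleanup_calls


prompts = [f"prompt {i}" for i in range(1024)]
print(probe(prompts, run_batch=lambda b: None, cleanup=lambda: None))  # (32, 16)
```

With 1024 prompts this yields 32 forward calls and 16 cleanup calls, compared with 1024 of each in the per-prompt loop.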

Stage 3: DISTILL (`_distill`)

Issue	Severity	Impact
Full SVD on per-prompt diff matrix (`abliterate.py:1226`)	MODERATE	`torch.linalg.svd(diff_matrix, full_matrices=False)` on a `(512, hidden_dim)` matrix per layer. For 32 layers this is 32 SVD calls, each O(min(m,n)^2 * max(m,n)). At hidden_dim=4096, each is ~100ms on CPU. Total: ~3s. Acceptable for the quality gain.
Whitened SVD import is lazy (`abliterate.py:1127`)	OK	Good — only imports when needed. No cost for basic/advanced.
Wasserstein extraction (`abliterate.py:1136`)	OK	Falls back gracefully. The GEP solve is lightweight.
RDO gradient optimization: 500 steps per layer (`abliterate.py:1427`)	HIGH	For 20 strong layers, that's 10,000 Adam steps. Each step involves a matrix multiply on `(n_prompts, hidden_dim)` tensors. On CPU this takes 30-60s. The 500-step budget is a "practical compromise" per the comments, but the SVD warm-start means most directions converge in ~100 steps. No early stopping.
Gram-Schmidt re-orthogonalization is O(k^2) per layer (`abliterate.py:1168-1173`)	LOW	With k<=8, this is negligible.
SAE training: 30 epochs on CPU (`abliterate.py:1582`)	HIGH	`device="cpu"` is hardcoded. For hidden_dim=4096 and expansion=4, the SAE has 32M parameters. 30 epochs on CPU takes 15-45s per layer. With 20 strong layers, this is 5-15 minutes of wasted time when a GPU is available.
Layer selection (knee + COSMIC fusion)	OK	Lightweight statistical operations. No concern.
CoT-aware orthogonalization	OK	Single SVD per layer, simple vector operations.
Jailbreak-contrastive blending	OK	Pure vector arithmetic, negligible cost.
Float-layer interpolation	OK	Gaussian weight computation is trivial.

Recommendation: (1) Add early-stopping to RDO at convergence (e.g., loss delta < 1e-4 for 20 consecutive steps). (2) Use GPU for SAE training when available — change device="cpu" to auto-detect.

Stage 4: EXCISE (`_excise`)

Issue	Severity	Impact
Rank-1 projection is memory-efficient (`abliterate.py:3479-3480`)	OK	`W @ d` produces a vector, not a full projection matrix. This is the right approach.
`true_iterative_refinement` re-runs PROBE+DISTILL (`abliterate.py:2474-2485`)	CRITICAL	Each refinement pass re-collects all activations (1024+ forward passes) and re-runs SVD. `aggressive` mode does 3 passes = 3x full pipeline cost. There is no check on whether the refined directions materially differ from the previous pass. A cosine-similarity early-exit (e.g., all directions > 0.99 cosine with previous pass → stop) would save enormous compute on pass 3.
Bayesian optimization clones ALL weight tensors (`bayesian_optimizer.py:301-341`)	CRITICAL	For a 7B model with 20 strong layers, this can be 2-4 GB of CPU clones just for rollback. For a 70B model, this is 20-40 GB. The log even reports the size (`total_saved_mb`), but there's no memory check or fallback.
Bayesian trials run full generate passes (`bayesian_optimizer.py:445-446`)	CRITICAL	Each of 50 trials runs `_measure_refusal_rate` (8-30 generation calls with `max_new_tokens=128`) PLUS `_measure_kl_divergence` (5 forward passes). That's ~35 forward/generate passes per trial × 50 trials = 1,750 forward passes just for hyperparameter search. This likely dominates the total pipeline runtime for `optimized` and `heretic` modes.
KL optimization proxy is cheap (`abliterate.py:3057-3268`)	OK	Uses projection magnitude as a KL proxy instead of actual per-layer forward passes. Good engineering — avoids the expensive per-layer ablation/measurement loop.
Norm preservation adds one extra `.norm()` per weight matrix	LOW	Frobenius norm is O(n) — negligible overhead.
Dequantize/re-quantize for bitsandbytes (`abliterate.py:3287-3400`)	MODERATE	Necessary for correctness, but the full dequantize → modify → re-quantize cycle per weight matrix is expensive for 4-bit models. Consider caching the dequantized tensor when projecting multiple directions through the same weight.
Safety-neuron masking	LOW	Z-score computation is a single pass over the projection vector. Cheap.
Expert transplant uses incremental mean (`abliterate.py:4350-4364`)	OK	Welford-style running mean avoids materializing all expert weights. Good memory discipline for 400B-scale models.
`_stabilize_router_weights` called after every MoE layer (`abliterate.py:3866`)	LOW	Clamps router weights. Trivial cost.

Recommendation: (1) Add direction-convergence early-exit to iterative refinement. (2) Reduce Bayesian trial count or implement batch generation for refusal measurement. (3) Cache dequantized weights across multi-direction projection within the same layer.
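
The missing memory check before the Bayesian rollback snapshot could look roughly like the sketch below. It is an assumption-laden illustration: `snapshot_bytes` and `can_snapshot` are hypothetical helpers, the layer names and sizes are made up, and the caller is expected to supply the free-memory figure (for example from `psutil.virtual_memory().available`, if `psutil` is installed).

```python
from typing import Dict


def snapshot_bytes(param_counts: Dict[str, int], bytes_per_param: int = 2) -> int:
    """Bytes needed to clone every listed tensor (2 bytes per param for fp16/bf16)."""
    return sum(param_counts.values()) * bytes_per_param


def can_snapshot(
    param_counts: Dict[str, int],
    free_bytes: int,
    bytes_per_param: int = 2,
    headroom: float = 0.5,
) -> bool:
    """True if the rollback clones fit within headroom * free_bytes."""
    return snapshot_bytes(param_counts, bytes_per_param) <= headroom * free_bytes


# Hypothetical example: two 4096 x 4096 weight matrices in one layer.
counts = {"layers.0.o_proj": 4096 * 4096, "layers.0.down_proj": 4096 * 4096}
print(snapshot_bytes(counts))                     # 67108864 (64 MiB)
print(can_snapshot(counts, free_bytes=1 << 30))   # True
```

When the check fails, one possible fallback is to snapshot fewer layers per trial or to reload the affected weights from disk on rollback; which of these fits the optimizer is a design decision the audit leaves open.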

Stage 5: VERIFY (`_verify`)

Issue	Severity	Impact
30 generation calls for refusal measurement (`abliterate.py:4622`)	MODERATE	Each generates up to 128 tokens with greedy decoding. For a 7B model this is ~30s total. Acceptable as a one-time quality check.
`_tier_label` does `list.index()` per prompt (`abliterate.py:4593`)	LOW	O(n) search in a list for each of 30 prompts. Trivially fixable with a dict, but the cost is negligible at n=512.
Perplexity measurement on 3 short texts	OK	Minimal cost.
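
For the `_tier_label` row above, the dict-based lookup can be sketched as follows. The helper name `build_index` is hypothetical, and how the real function maps a position to a tier is not shown in this audit, so the sketch covers only the position lookup.

```python
from typing import Dict, List


def build_index(prompts: List[str]) -> Dict[str, int]:
    """Map each prompt to its first position, matching list.index() semantics."""
    index: Dict[str, int] = {}
    for i, prompt in enumerate(prompts):
        index.setdefault(prompt, i)  # keep the first occurrence on duplicates
    return index


prompts = ["a", "b", "c", "b"]
position = build_index(prompts)       # built once, O(n)
print(position["b"])                  # 1, same as prompts.index("b"), but O(1) per lookup
```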

Stage 6: REBIRTH (Model Saving)

Not audited in detail — standard HuggingFace save_pretrained. No efficiency concerns.

Method-by-Method Efficiency Grades

Method	Compute Cost	Memory Cost	Value/Cost Ratio	Grade
basic	Low (1 dir, 1 pass, no extras)	Low	High	A
advanced	Moderate (4 dirs, 2 passes, norm-preserve, bias projection)	Moderate	High	A-
aggressive	High (8 dirs, 3 passes with `true_iterative_refinement`)	High (3x activation storage)	Moderate — 3rd pass rarely justified	B-
informed	High (runs analysis modules + Wasserstein GEP)	High (analysis module state)	High — analysis feedback is genuinely valuable	B+
surgical	Very High (SAE training + head surgery + EGA + neuron masking)	Very High	Moderate — many techniques compound but with diminishing returns	C+
inverted	Very High (surgical + reflection + SAE)	Very High	Niche — only needed for "actively compliant" use case	C
optimized	Extreme (50 Bayesian trials × 35 forward passes each)	Extreme (full weight clones + 1750 forward passes)	Low unless you have a multi-GPU cluster	D+
nuclear	Very High (inverted + layer-adaptive + expert transplant + steering hooks)	Very High	Highly specialized — justified only for stubborn MoE models	C

Baseline Reproductions

Method	Compute Cost	Grade	Notes
failspy	Low	A	Faithful minimal reproduction. Efficient by design.
gabliteration	Low-Moderate	A-	4-dir SVD + ridge. Clean.
heretic	Extreme	D	Inherits Bayesian trial overhead. 50 trials × 35 passes each.
rdo	High	B	500 gradient steps/layer. Would benefit from early-stopping.

Strategy Module Audit (`strategies/`)

Strategy	Implementation	Grade
`embedding_ablation`	Clean zero-out by chunk. `torch.no_grad()` used correctly.	A
`ffn_ablation`	Iterates all FFN params and zeros. Fine for ablation study.	A
`head_pruning`	Handles GPT-2 Conv1D and standard Q/K/V separately. Correct.	A-
`layer_removal`	Zeros all params. Simple and correct.	A
`registry`	Minimal dict-based registry with decorator. No overhead.	A
`runner.py`	Creates a new `Evaluator` per spec (`runner.py:86-95`). This re-initializes dataset processing for every ablation spec. Should create once and reuse.	B

Cross-Cutting Concerns

1. Memory Management

Good: _free_gpu_memory() exists and is called between stages. expandable_segments is set early.
Bad: _free_gpu_memory() called 1024+ times during PROBE (once per prompt). The gc.collect() cost alone adds up.
Bad: Bayesian optimizer clones all strong-layer weights with no memory budget check.
Bad: No streaming/chunking for activation storage — all 512 prompts × 32 layers of activations are held in a list of CPU tensors simultaneously.
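
For any method that needs only a per-layer mean activation, a streaming accumulator avoids holding every prompt's activations at once. The sketch below is a plain-Python illustration with a hypothetical `RunningMean` class; SVD-based extraction still needs the per-prompt rows, so this would not replace storage in those modes.

```python
from typing import List, Sequence


class RunningMean:
    """Incremental mean of equal-length vectors; stores one vector, not all of them."""

    def __init__(self, dim: int) -> None:
        self.count = 0
        self.mean: List[float] = [0.0] * dim

    def update(self, vec: Sequence[float]) -> None:
        self.count += 1
        # mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k
        self.mean = [m + (x - m) / self.count for m, x in zip(self.mean, vec)]


acc = RunningMean(dim=2)
for activation in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    acc.update(activation)
print(acc.mean)  # [3.0, 4.0]
```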

2. GPU Utilization

Good: Adaptive max_length based on free GPU memory.
Good: Rank-1 projections avoid materializing full projection matrices.
Bad: SAE training hardcoded to CPU.
Bad: Single-prompt forward passes waste GPU parallelism.
Bad: Neither torch.compile() nor torch.inference_mode() is used anywhere (the latter typically has lower overhead than torch.no_grad() for inference).

3. Quantization Handling

Good: Detects bitsandbytes 4-bit/8-bit and dequantizes before projection.
Good: Refuses to operate on raw quantized bytes (avoids silent corruption).
Moderate: Full dequantize/re-quantize per direction per weight matrix. Could cache across multi-direction projections.
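
The caching idea can be sketched as: dequantize once, remove every direction, re-quantize once. The example uses plain Python lists instead of tensors, and `dequantize` / `requantize` are hypothetical callables standing in for the bitsandbytes round trip. It assumes the directions are unit-norm and mutually orthogonal, as they would be after the Gram-Schmidt step.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def project_out(weight: Matrix, direction: Sequence[float]) -> Matrix:
    """Return W - (W d) d^T for a unit-norm direction d (rank-1 update)."""
    out = []
    for row in weight:
        coeff = sum(w * d for w, d in zip(row, direction))  # one entry of W @ d
        out.append([w - coeff * d for w, d in zip(row, direction)])
    return out


def project_all(
    packed: Matrix,
    directions: Sequence[Sequence[float]],
    dequantize: Callable[[Matrix], Matrix],   # hypothetical stand-in
    requantize: Callable[[Matrix], Matrix],   # hypothetical stand-in
) -> Matrix:
    """Dequantize once, remove every direction, re-quantize once."""
    weight = dequantize(packed)
    for d in directions:
        weight = project_out(weight, d)
    return requantize(weight)


calls = {"deq": 0, "req": 0}

def fake_dequantize(p: Matrix) -> Matrix:
    calls["deq"] += 1
    return [list(row) for row in p]

def fake_requantize(w: Matrix) -> Matrix:
    calls["req"] += 1
    return w

result = project_all([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]],
                     fake_dequantize, fake_requantize)
print(result, calls)  # [[0.0, 0.0], [0.0, 0.0]] {'deq': 1, 'req': 1}
```

With k directions per weight matrix, this performs one round trip instead of k.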

Top 5 Recommendations (Ranked by Impact)

1. Batch `_collect_activations` (CRITICAL — 5-15x PROBE speedup)

```python
# Current: one prompt at a time
for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, ...)
    model(**inputs)

# Proposed: micro-batched
for batch_start in range(0, len(prompts), batch_size):
    batch = prompts[batch_start:batch_start + batch_size]
    # padding=True requires the tokenizer to define a pad token
    inputs = tokenizer(batch, return_tensors="pt", padding=True,
                       truncation=True, max_length=max_length)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        model(**inputs)
```

Hooks need a minor adjustment to handle the batch dimension and to ignore the padded positions introduced by padding=True, but the core change is ~20 lines.
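
One part of that hook adjustment is locating each prompt's final real token inside a padded batch. The sketch below assumes the hooks read the hidden state at the last non-padding position, which this audit does not confirm, and uses a plain nested list in place of the tokenizer's `attention_mask` tensor. `last_token_positions` is a hypothetical helper.

```python
from typing import List


def last_token_positions(attention_mask: List[List[int]]) -> List[int]:
    """Index of the last real (mask == 1) token in each padded row."""
    positions = []
    for row in attention_mask:
        # Largest index with mask == 1; works for left- and right-padded rows.
        # Raises ValueError on a row with no real tokens.
        positions.append(max(i for i, m in enumerate(row) if m == 1))
    return positions


mask = [
    [1, 1, 1, 0, 0],  # right-padded, 3 real tokens
    [1, 1, 1, 1, 1],  # no padding
    [0, 0, 1, 1, 1],  # left-padded
]
print(last_token_positions(mask))  # [2, 4, 4]
```

Indexing a batched hidden state with `-1` along the sequence axis is only correct for left-padded batches; with right padding it reads a pad token, which would silently change the extracted directions.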

2. Add early-stopping to `true_iterative_refinement` (HIGH — saves 1-2 full PROBE passes)

After re-distilling, compute the cosine similarity between each new refusal direction and its counterpart from the previous pass. If every pair exceeds 0.99, skip the remaining passes. Expected to save 30-60% of aggressive mode runtime.
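
A minimal sketch of that convergence check, using plain Python vectors. The helper names are hypothetical. Taking the absolute value of the cosine is a design choice added here, not something the audit specifies: projecting out `d` and `-d` removes the same subspace, and SVD sign is arbitrary. Directions are assumed to be non-zero.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def directions_converged(
    old: Sequence[Sequence[float]],
    new: Sequence[Sequence[float]],
    threshold: float = 0.99,
) -> bool:
    """True when every new direction is nearly parallel to its predecessor."""
    return all(abs(cosine(o, n)) > threshold for o, n in zip(old, new))


previous = [[1.0, 0.0], [0.0, 1.0]]
print(directions_converged(previous, [[0.999, 0.01], [0.0, -1.0]]))  # True
print(directions_converged(previous, [[0.6, 0.8], [0.0, 1.0]]))      # False
```

The refinement loop would call `directions_converged` after each re-distill and break out before starting another PROBE pass when it returns `True`.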

3. Move SAE training to GPU (HIGH — 5-15 min saved for `surgical`/`inverted`)

Change device="cpu" to auto-detect available GPU. The SAE is small (32M params at expansion=4) and fits easily alongside the model.
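
A hedged sketch of the auto-detect logic. `pick_device` is a hypothetical helper; it relies only on `torch.cuda.is_available()` and falls back to CPU when PyTorch is not importable.

```python
def pick_device(preferred: str = "auto") -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu".

    An explicit preference (e.g. "cpu") is returned unchanged.
    """
    if preferred != "auto":
        return preferred
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed; nothing to accelerate
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

The returned string can be passed wherever the SAE trainer currently receives the hardcoded CPU device. Keeping an explicit override is useful when GPU memory is already tight.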

4. Reduce Bayesian trial overhead (HIGH — saves 30-60 min for `optimized`)

Options:

Reduce n_refusal_prompts from 8-30 to 4-6 (generation is expensive)
Use perplexity-only as a faster proxy in early trials, switch to refusal measurement for top candidates
Implement batch generation for _measure_refusal_rate
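
The second option can be illustrated with a simplified two-stage selection: score every candidate with a cheap proxy, then spend the expensive metric only on a shortlist. This is a sketch over a static candidate list, not the sequential Bayesian loop in `bayesian_optimizer.py`. `two_stage_select` and the toy scoring functions are hypothetical, and lower scores are treated as better.

```python
from typing import Callable, Sequence, Tuple


def two_stage_select(
    candidates: Sequence[float],
    cheap_score: Callable[[float], float],
    expensive_score: Callable[[float], float],
    keep: int = 3,
) -> Tuple[float, float]:
    """Shortlist by cheap_score, then pick the best by expensive_score."""
    shortlist = sorted(candidates, key=cheap_score)[:keep]
    scored = [(expensive_score(c), c) for c in shortlist]
    best_score, best = min(scored)
    return best, best_score


calls = {"expensive": 0}

def slow_metric(scale: float) -> float:
    """Toy stand-in for a generation-based refusal measurement."""
    calls["expensive"] += 1
    return abs(scale - 0.5)

best, score = two_stage_select(
    candidates=[0.2, 0.5, 0.8, 1.0],
    cheap_score=lambda s: abs(s - 0.6),   # toy stand-in for a perplexity proxy
    expensive_score=slow_metric,
    keep=2,
)
print(best, score, calls["expensive"])  # 0.5 0.0 2
```

Only 2 of the 4 candidates reach the expensive metric here. The savings in practice depend on how well the proxy ranks candidates, which would need to be validated.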

5. Add early-stopping to RDO (MODERATE — saves 10-30s for `rdo` mode)

Monitor loss convergence and break at plateau (delta < 1e-4 for 20 steps). Most directions converge in ~100-200 steps, not 500.
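
The plateau rule above can be sketched as a small helper that the optimization loop consults after each step. `PlateauStopper` is a hypothetical name, and the toy loss curve below only exercises the logic; it is not taken from an RDO run.

```python
from typing import Optional


class PlateauStopper:
    """Signal a stop after `patience` consecutive steps with |loss change| < min_delta."""

    def __init__(self, min_delta: float = 1e-4, patience: int = 20) -> None:
        self.min_delta = min_delta
        self.patience = patience
        self.prev: Optional[float] = None
        self.flat_steps = 0

    def step(self, loss: float) -> bool:
        """Record one loss value; return True when training should stop."""
        if self.prev is not None and abs(self.prev - loss) < self.min_delta:
            self.flat_steps += 1
        else:
            self.flat_steps = 0  # any meaningful change resets the counter
        self.prev = loss
        return self.flat_steps >= self.patience


stopper = PlateauStopper()
steps_run = 0
for step in range(500):
    loss = max(1.0 - 0.01 * step, 0.0)  # toy loss: linear decay, then flat at 0
    steps_run += 1
    if stopper.step(loss):
        break
print(steps_run)  # stops well before the 500-step budget
```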

Verdict

The pipeline is architecturally sound — the rank-1 projection math is correct and memory-efficient, the stage separation is clean, and the progressive method complexity (basic → nuclear) gives users clear cost/quality tradeoffs. However, the PROBE stage bottleneck (single-prompt forward passes) and Bayesian trial overhead (1750 forward passes) are the two elephants in the room. Fixing just recommendation #1 would make the entire system 3-5x faster for the majority of users who run basic/advanced/aggressive modes.

The optimized and heretic modes have a legitimate place for users with compute budget, but their current efficiency makes them impractical for anything under an A100. The documentation should be more explicit about expected runtimes.

Overall system grade: B+ — excellent functionality, needs batching and early-stopping.
