OBLITERATUS Pipeline Efficiency Audit

Auditor perspective: Shrewd CTO evaluating compute ROI, memory discipline, and time-to-value across all obliteration methods.

Scope: Every obliteration method in abliterate.py (8 primary methods + 4 baseline reproductions), the strategy layer (strategies/), the informed pipeline, Bayesian optimizer, and LoRA ablation.


Executive Summary

OBLITERATUS has an impressively comprehensive pipeline, but several methods carry significant hidden costs that erode their value proposition. The worst offenders are:

  1. _collect_activations runs prompts one at a time: this is the single biggest throughput bottleneck in the entire system, costing an estimated 5-15x in wall-clock time during PROBE.
  2. Bayesian optimized mode clones ALL strong-layer weights to CPU for rollback, then runs 50 trials, each with multiple forward and generate passes; the memory and compute overhead can exceed the rest of the pipeline combined.
  3. true_iterative_refinement re-runs the entire PROBE+DISTILL pipeline on every refinement pass with no early stopping: the 3 passes in aggressive mode triple the PROBE cost even when pass 2 achieves negligible improvement.
  4. SAE training on CPU is needlessly slow for GPU-resident models.

Below is the method-by-method breakdown.


Stage-Level Audit

Stage 1: SUMMON (Model Loading)

Status: Acceptable. Uses load_model with quantization support and expandable_segments CUDA config. No issues.

Stage 2: PROBE (_collect_activations)

| Issue | Severity | Impact |
|---|---|---|
| Single-prompt forward passes (abliterate.py:1074) | CRITICAL | Each of 512+ harmful/harmless prompts triggers a separate model(**inputs) call. No batching. On a 7B model with 512 pairs, this means ~1024 sequential forward passes instead of ~32 batched passes (batch_size=32). Estimated 5-15x slowdown. |
| _free_gpu_memory() called after EVERY prompt (abliterate.py:1086) | HIGH | gc.collect() + torch.cuda.empty_cache() 1024 times is expensive: the Python GC full-collection alone adds measurable overhead at this frequency. Should be called every N prompts, not every single one. |
| Chat template applied per-prompt in a Python loop (abliterate.py:955-965) | MODERATE | tokenizer.apply_chat_template() called individually 1024 times. Should batch. |
| Jailbreak probing doubles cost when use_jailbreak_contrast=True | MODERATE | Adds a third full pass over all prompts. Justified by the quality improvement, but the lack of batching amplifies the cost 3x instead of 1.5x. |
| Router profiling hooks zero-cost claim is correct (abliterate.py:872) | OK | Hooks piggyback on existing forward passes. Good design. |

Recommendation: Batch _collect_activations. Tokenize all prompts, pad to equal length per micro-batch, run batched model(**inputs). Expected 5-10x speedup with zero quality loss. Reduce _free_gpu_memory() frequency to every 32-64 prompts.
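The reduced cleanup frequency can be sketched as follows. This is a minimal illustration, not the pipeline's code: `free_gpu_memory` is a stand-in for `_free_gpu_memory()` (assumed here to be `gc.collect()` plus a CUDA cache flush, as the table above describes), and `run_with_throttled_cleanup`, `process`, and `every` are hypothetical names.

```python
import gc


def free_gpu_memory() -> None:
    """Stand-in for the pipeline's _free_gpu_memory(): GC plus CUDA cache flush."""
    gc.collect()
    try:
        import torch  # optional: only flush the CUDA cache if torch is installed

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass


def run_with_throttled_cleanup(prompts, process, cleanup=free_gpu_memory, every=32):
    """Call `process` on each prompt and `cleanup` once per `every` prompts.

    Returns the number of cleanup calls so the saving is easy to check.
    """
    calls = 0
    for i, prompt in enumerate(prompts, start=1):
        process(prompt)
        if i % every == 0:
            cleanup()
            calls += 1
    if len(prompts) % every != 0:
        # Flush once more after a final partial window.
        cleanup()
        calls += 1
    return calls
```

With 1024 prompts and `every=32`, cleanup runs 32 times instead of 1024.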

Stage 3: DISTILL (_distill)

| Issue | Severity | Impact |
|---|---|---|
| Full SVD on per-prompt diff matrix (abliterate.py:1226) | MODERATE | torch.linalg.svd(diff_matrix, full_matrices=False) on a (512, hidden_dim) matrix per layer. For 32 layers this is 32 SVD calls, each O(min(m,n)^2 * max(m,n)). At hidden_dim=4096, each is ~100ms on CPU. Total: ~3s. Acceptable for the quality gain. |
| Whitened SVD import is lazy (abliterate.py:1127) | OK | Good: only imports when needed. No cost for basic/advanced. |
| Wasserstein extraction (abliterate.py:1136) | OK | Falls back gracefully. The GEP solve is lightweight. |
| RDO gradient optimization: 500 steps per layer (abliterate.py:1427) | HIGH | For 20 strong layers, that's 10,000 Adam steps. Each step involves a matrix multiply on (n_prompts, hidden_dim) tensors. On CPU this takes 30-60s. The 500-step budget is a "practical compromise" per the comments, but the SVD warm-start means most directions converge in ~100 steps. No early stopping. |
| Gram-Schmidt re-orthogonalization is O(k^2) per layer (abliterate.py:1168-1173) | LOW | With k<=8, this is negligible. |
| SAE training: 30 epochs on CPU (abliterate.py:1582) | HIGH | device="cpu" is hardcoded. For hidden_dim=4096 and expansion=4, the SAE has 32M parameters. 30 epochs on CPU takes 15-45s per layer. With 20 strong layers, this is 5-15 minutes of wasted time when a GPU is available. |
| Layer selection (knee + COSMIC fusion) | OK | Lightweight statistical operations. No concern. |
| CoT-aware orthogonalization | OK | Single SVD per layer, simple vector operations. |
| Jailbreak-contrastive blending | OK | Pure vector arithmetic, negligible cost. |
| Float-layer interpolation | OK | Gaussian weight computation is trivial. |

Recommendation: (1) Add early-stopping to RDO at convergence (e.g., loss delta < 1e-4 for 20 consecutive steps). (2) Use the GPU for SAE training when available: change device="cpu" to auto-detect.
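The convergence rule in (1) can be sketched as a small plateau detector. The class and function names are hypothetical, and the sketch is independent of how the RDO loop is actually written; it only shows the "delta below tolerance for N consecutive steps" logic.

```python
class PlateauStopper:
    """Signals a stop once the loss changes by less than `tol`
    for `patience` consecutive steps."""

    def __init__(self, tol=1e-4, patience=20):
        self.tol = tol
        self.patience = patience
        self.prev = None
        self.flat_steps = 0

    def should_stop(self, loss):
        if self.prev is not None and abs(self.prev - loss) < self.tol:
            self.flat_steps += 1
        else:
            self.flat_steps = 0  # any real movement resets the counter
        self.prev = loss
        return self.flat_steps >= self.patience


def steps_until_stop(losses, tol=1e-4, patience=20):
    """Number of steps consumed before the stopper fires (or all of them)."""
    stopper = PlateauStopper(tol, patience)
    for step, loss in enumerate(losses, start=1):
        if stopper.should_stop(loss):
            return step
    return len(losses)
```

In an optimizer loop, the call would be `if stopper.should_stop(loss.item()): break` after each step. On a loss curve that flattens after 100 steps, the loop exits at step 121 rather than running the full 500.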

Stage 4: EXCISE (_excise)

| Issue | Severity | Impact |
|---|---|---|
| Rank-1 projection is memory-efficient (abliterate.py:3479-3480) | OK | W @ d produces a vector, not a full projection matrix. This is the right approach. |
| true_iterative_refinement re-runs PROBE+DISTILL (abliterate.py:2474-2485) | CRITICAL | Each refinement pass re-collects all activations (512*2+ forward passes) and re-runs SVD. aggressive mode does 3 passes = 3x full pipeline cost. There is no check whether the refined directions materially differ from the previous pass. A cosine-similarity early-exit (e.g., all directions > 0.99 cosine with previous pass → stop) would save enormous compute on pass 3. |
| Bayesian optimization clones ALL weight tensors (bayesian_optimizer.py:301-341) | CRITICAL | For a 7B model with 20 strong layers, this can be 2-4 GB of CPU clones just for rollback. For a 70B model, this is 20-40 GB. The log even reports the size (total_saved_mb), but there's no memory check or fallback. |
| Bayesian trials run full generate passes (bayesian_optimizer.py:445-446) | CRITICAL | Each of 50 trials runs _measure_refusal_rate (8-30 generation calls with max_new_tokens=128) PLUS _measure_kl_divergence (5 forward passes). At the upper end that's ~35 forward/generate passes per trial (30 + 5) × 50 trials = 1,750 passes just for hyperparameter search; at the lower end (8 + 5) it is still 650. This likely dominates the total pipeline runtime for optimized and heretic modes. |
| KL optimization proxy is cheap (abliterate.py:3057-3268) | OK | Uses projection magnitude as a KL proxy instead of actual per-layer forward passes. Good engineering: avoids the expensive per-layer ablation/measurement loop. |
| Norm preservation adds one extra .norm() per weight matrix | LOW | Frobenius norm is O(n) in the number of weight elements; negligible overhead. |
| Dequantize/re-quantize for bitsandbytes (abliterate.py:3287-3400) | MODERATE | Necessary for correctness, but the full dequantize → modify → re-quantize cycle per weight matrix is expensive for 4-bit models. Consider caching the dequantized tensor when projecting multiple directions through the same weight. |
| Safety-neuron masking | LOW | Z-score computation is a single pass over the projection vector. Cheap. |
| Expert transplant uses incremental mean (abliterate.py:4350-4364) | OK | Welford-style running mean avoids materializing all expert weights. Good memory discipline for 400B-scale models. |
| _stabilize_router_weights called after every MoE layer (abliterate.py:3866) | LOW | Clamps router weights. Trivial cost. |

Recommendation: (1) Add direction-convergence early-exit to iterative refinement. (2) Reduce Bayesian trial count or implement batch generation for refusal measurement. (3) Cache dequantized weights across multi-direction projection within the same layer.
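Recommendation (3) amounts to dequantizing once, applying every rank-1 projection W - (W d) dᵀ, and re-quantizing once. The sketch below uses plain Python lists in place of tensors so it runs without dependencies. The `dequantize` and `quantize` callables are hypothetical placeholders rather than bitsandbytes APIs, and the directions are assumed to be unit-norm and mutually orthogonal (as after Gram-Schmidt).

```python
def project_out(W, d):
    """Remove unit direction d from every row of W: W - (W d) d^T.

    Only the vector W d is formed, never the full d d^T matrix.
    """
    Wd = [sum(w * x for w, x in zip(row, d)) for row in W]
    return [[w - c * x for w, x in zip(row, d)] for row, c in zip(W, Wd)]


def excise_directions(qweight, directions, dequantize, quantize):
    """Dequantize once, remove all directions, then re-quantize once."""
    W = dequantize(qweight)
    for d in directions:
        W = project_out(W, d)
    return quantize(W)
```

Projecting k directions through one weight this way costs one dequantize/re-quantize cycle instead of k.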

Stage 5: VERIFY (_verify)

| Issue | Severity | Impact |
|---|---|---|
| 30 generation calls for refusal measurement (abliterate.py:4622) | MODERATE | Each generates up to 128 tokens with greedy decoding. For a 7B model this is ~30s total. Acceptable as a one-time quality check. |
| _tier_label does list.index() per prompt (abliterate.py:4593) | LOW | O(n) search in a list for each of 30 prompts. Trivially fixable with a dict, but the cost is negligible at n=512. |
| Perplexity measurement on 3 short texts | OK | Minimal cost. |
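For completeness, the dict-based fix for the list.index() lookup might look like this. `build_index` is a hypothetical helper; how _tier_label turns a position into a tier is not shown in this audit, so only the lookup itself is sketched.

```python
def build_index(prompts):
    """Map each prompt to its first position, matching list.index() semantics."""
    index = {}
    for i, prompt in enumerate(prompts):
        index.setdefault(prompt, i)  # keep the first occurrence on duplicates
    return index
```

Building the index once makes each later lookup O(1) instead of an O(n) scan.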

Stage 6: REBIRTH (Model Saving)

Not audited in detail β€” standard HuggingFace save_pretrained. No efficiency concerns.


Method-by-Method Efficiency Grades

| Method | Compute Cost | Memory Cost | Value/Cost Ratio | Grade |
|---|---|---|---|---|
| basic | Low (1 dir, 1 pass, no extras) | Low | High | A |
| advanced | Moderate (4 dirs, 2 passes, norm-preserve, bias projection) | Moderate | High | A- |
| aggressive | High (8 dirs, 3 passes with true_iterative_refinement) | High (3x activation storage) | Moderate; 3rd pass rarely justified | B- |
| informed | High (runs analysis modules + Wasserstein GEP) | High (analysis module state) | High; analysis feedback is genuinely valuable | B+ |
| surgical | Very High (SAE training + head surgery + EGA + neuron masking) | Very High | Moderate; many techniques compound but with diminishing returns | C+ |
| inverted | Very High (surgical + reflection + SAE) | Very High | Niche; only needed for "actively compliant" use case | C |
| optimized | Extreme (50 Bayesian trials × up to ~35 passes each) | Extreme (full weight clones + up to 1,750 forward/generate passes) | Low unless you have a multi-GPU cluster | D+ |
| nuclear | Very High (inverted + layer-adaptive + expert transplant + steering hooks) | Very High | Highly specialized; justified only for stubborn MoE models | C |

Baseline Reproductions

| Method | Compute Cost | Grade | Notes |
|---|---|---|---|
| failspy | Low | A | Faithful minimal reproduction. Efficient by design. |
| gabliteration | Low-Moderate | A- | 4-dir SVD + ridge. Clean. |
| heretic | Extreme | D | Inherits Bayesian trial overhead. 50 trials × up to ~35 passes each. |
| rdo | High | B | 500 gradient steps/layer. Would benefit from early-stopping. |

Strategy Module Audit (strategies/)

| Strategy | Implementation | Grade |
|---|---|---|
| embedding_ablation | Clean zero-out by chunk. torch.no_grad() used correctly. | A |
| ffn_ablation | Iterates all FFN params and zeros. Fine for ablation study. | A |
| head_pruning | Handles GPT-2 Conv1D and standard Q/K/V separately. Correct. | A- |
| layer_removal | Zeros all params. Simple and correct. | A |
| registry | Minimal dict-based registry with decorator. No overhead. | A |
| runner.py | Creates a new Evaluator per spec (runner.py:86-95). This re-initializes dataset processing for every ablation spec. Should create once and reuse. | B |

Cross-Cutting Concerns

1. Memory Management

  • Good: _free_gpu_memory() exists and is called between stages. expandable_segments is set early.
  • Bad: _free_gpu_memory() called 1024+ times during PROBE (once per prompt). The gc.collect() cost alone adds up.
  • Bad: Bayesian optimizer clones all strong-layer weights with no memory budget check.
  • Bad: No streaming/chunking for activation storage: all 512 prompts × 32 layers of activations are held in a list of CPU tensors simultaneously.
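A memory budget check before cloning could be as simple as the sketch below. It is an assumption-laden illustration: `clone_bytes`, `can_clone`, and the 0.8 safety margin are hypothetical, the weights are assumed to be 2-byte (fp16/bf16) tensors, and the caller is expected to supply the available memory (for example from `psutil.virtual_memory().available`).

```python
from math import prod


def clone_bytes(shapes, bytes_per_element=2):
    """Total bytes needed to clone tensors with the given shapes."""
    return sum(prod(shape) * bytes_per_element for shape in shapes)


def can_clone(shapes, available_bytes, bytes_per_element=2, safety_margin=0.8):
    """True if the rollback clones fit within a fraction of available memory."""
    needed = clone_bytes(shapes, bytes_per_element)
    return needed <= available_bytes * safety_margin
```

Four 4096×4096 fp16 matrices need 128 MiB, so the check passes with 256 MiB free and fails with 128 MiB free. On failure, a fallback could clone fewer layers at a time or skip rollback entirely.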

2. GPU Utilization

  • Good: Adaptive max_length based on free GPU memory.
  • Good: Rank-1 projections avoid materializing full projection matrices.
  • Bad: SAE training hardcoded to CPU.
  • Bad: Single-prompt forward passes waste GPU parallelism.
  • Bad: No torch.compile() or torch.inference_mode() used anywhere (the latter can be somewhat faster than torch.no_grad() for inference because it also skips view and version-counter tracking).

3. Quantization Handling

  • Good: Detects bitsandbytes 4-bit/8-bit and dequantizes before projection.
  • Good: Refuses to operate on raw quantized bytes (avoids silent corruption).
  • Moderate: Full dequantize/re-quantize per direction per weight matrix. Could cache across multi-direction projections.

Top 5 Recommendations (Ranked by Impact)

1. Batch _collect_activations (CRITICAL: 5-15x PROBE speedup)

# Current: one prompt at a time
for i, prompt in enumerate(prompts):
    inputs = tokenizer(prompt, ...)
    model(**inputs)

# Proposed: micro-batched
for batch_start in range(0, len(prompts), batch_size):
    batch = prompts[batch_start:batch_start+batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        model(**inputs)

Hooks need a minor adjustment to handle the batch dimension, and any per-prompt position they read (for example the final token) must skip padding tokens, but the core change is ~20 lines.
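One detail worth sketching is how a hook would locate each prompt's last real token in a padded batch. The audit does not say which position the hooks read, so this assumes last-token extraction. `last_token_indices` is a hypothetical helper that takes the tokenizer's attention_mask as nested lists (e.g. via `.tolist()`).

```python
def last_token_indices(attention_mask):
    """Index of the last real (mask == 1) token in each row.

    Works for both left- and right-padded batches.
    """
    indices = []
    for row in attention_mask:
        positions = [i for i, m in enumerate(row) if m == 1]
        if not positions:
            raise ValueError("row contains only padding")
        indices.append(positions[-1])
    return indices
```

With right padding, a 3-token prompt in a batch of width 5 yields index 2, not 4. With left padding, the last position is always the real final token, which is why left padding is often the simpler choice for this kind of extraction.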

2. Add early-stopping to true_iterative_refinement (HIGH: saves 1-2 full PROBE passes)

After re-distilling, compute cosine similarity between old and new refusal directions. If all directions are >0.99 cosine, skip remaining passes. Expected to save 30-60% of aggressive mode runtime.
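A minimal sketch of that convergence check follows, using plain Python lists for the direction vectors. The function names are hypothetical. Taking the absolute cosine is an added assumption: it treats d and -d as the same direction, which matches the sign ambiguity of SVD-derived directions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def directions_converged(old, new, threshold=0.99):
    """True if every new direction is within `threshold` |cosine| of the old one."""
    return all(abs(cosine(o, n)) > threshold for o, n in zip(old, new))
```

The refinement loop would call `directions_converged(prev_dirs, new_dirs)` after each re-distill and break when it returns True.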

3. Move SAE training to GPU (HIGH: 5-15 min saved for surgical/inverted)

Change device="cpu" to auto-detect available GPU. The SAE is small (32M params at expansion=4) and fits easily alongside the model.
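Auto-detection can be a one-line check. The helper name below is hypothetical, and the import guard exists only so the sketch also runs where PyTorch is not installed.

```python
def pick_sae_device():
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available: nothing to accelerate
    return "cuda" if torch.cuda.is_available() else "cpu"
```

An alternative is to follow the model itself with `next(model.parameters()).device`, which keeps the SAE on the same device as the activations it trains on.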

4. Reduce Bayesian trial overhead (HIGH: saves 30-60 min for optimized)

Options:

  • Reduce n_refusal_prompts from 8-30 to 4-6 (generation is expensive)
  • Use perplexity-only as a faster proxy in early trials, switch to refusal measurement for top candidates
  • Implement batch generation for _measure_refusal_rate
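The second option (cheap proxy first, expensive measurement only for top candidates) can be illustrated with a simplified two-stage search. This is a sketch, not the optimizer's actual logic. A real Bayesian optimizer proposes trials adaptively rather than scoring a fixed candidate list. `cheap_score` and `expensive_score` are hypothetical callables standing in for a perplexity proxy and _measure_refusal_rate, and lower is assumed to be better for both.

```python
def two_stage_search(candidates, cheap_score, expensive_score, top_k=5):
    """Rank all candidates by a cheap proxy, then re-score only the best top_k.

    Returns (best_candidate, best_expensive_score, n_expensive_evaluations).
    """
    shortlisted = sorted(candidates, key=cheap_score)[:top_k]
    scored = [(expensive_score(c), c) for c in shortlisted]
    best_score, best = min(scored, key=lambda pair: pair[0])
    return best, best_score, len(shortlisted)
```

With 50 candidates and `top_k=5`, the expensive generation-based measurement runs 5 times instead of 50.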

5. Add early-stopping to RDO (MODERATE: saves 10-30s for rdo mode)

Monitor loss convergence and break at plateau (delta < 1e-4 for 20 steps). Most directions converge in ~100-200 steps, not 500.


Verdict

The pipeline is architecturally sound: the rank-1 projection math is correct and memory-efficient, the stage separation is clean, and the progressive method complexity (basic → nuclear) gives users clear cost/quality tradeoffs. However, the PROBE stage bottleneck (single-prompt forward passes) and Bayesian trial overhead (1750 forward passes) are the two elephants in the room. Fixing just recommendation #1 would make the entire system 3-5x faster for the majority of users who run basic/advanced/aggressive modes.

The optimized and heretic modes have a legitimate place for users with compute budget, but their current efficiency makes them impractical for anything under an A100. The documentation should be more explicit about expected runtimes.

Overall system grade: B+. Excellent functionality; needs batching and early-stopping.