Spaces:
Running on Zero
A newer version of the Gradio SDK is available: 6.12.0
OBLITERATUS Pipeline Efficiency Audit
Auditor perspective: Shrewd CTO evaluating compute ROI, memory discipline, and time-to-value across all obliteration methods.
Scope: Every obliteration method in abliterate.py (8 primary methods + 4 baseline reproductions), the strategy layer (strategies/), the informed pipeline, Bayesian optimizer, and LoRA ablation.
Executive Summary
OBLITERATUS has an impressively comprehensive pipeline, but several methods carry significant hidden costs that erode their value proposition. The worst offenders are:
_collect_activationsruns prompts one-at-a-time β this is the single biggest throughput bottleneck in the entire system, costing 5-15x in wall-clock time during PROBE.- Bayesian
optimizedmode clones ALL strong-layer weights to CPU for rollback, then runs 50 full forward+generate passes β the memory and compute overhead can exceed the rest of the pipeline combined. true_iterative_refinementre-runs the entire PROBE+DISTILL pipeline per refinement pass with zero early-stopping β 3 passes inaggressivetriples probe cost even when pass 2 achieves negligible improvement.- SAE training on CPU is needlessly slow for GPU-resident models.
Below is the method-by-method breakdown.
Stage-Level Audit
Stage 1: SUMMON (Model Loading)
Status: Acceptable. Uses load_model with quantization support and expandable_segments CUDA config. No issues.
Stage 2: PROBE (_collect_activations)
| Issue | Severity | Impact |
|---|---|---|
Single-prompt forward passes (abliterate.py:1074) |
CRITICAL | Each of 512+ harmful/harmless prompts triggers a separate model(**inputs) call. No batching. On a 7B model with 512 pairs, this means ~1024 sequential forward passes instead of ~32 batched passes (batch_size=32). Estimated 5-15x slowdown. |
_free_gpu_memory() called after EVERY prompt (abliterate.py:1086) |
HIGH | gc.collect() + torch.cuda.empty_cache() 1024 times is expensive β the Python GC full-collection alone adds measurable overhead at this frequency. Should be called every N prompts, not every single one. |
Chat template applied per-prompt in a Python loop (abliterate.py:955-965) |
MODERATE | tokenizer.apply_chat_template() called individually 1024 times. Should batch. |
Jailbreak probing doubles cost when use_jailbreak_contrast=True |
MODERATE | Adds a third full pass over all prompts. Justified by the quality improvement, but the lack of batching amplifies the cost 3x instead of 1.5x. |
Router profiling hooks zero-cost claim is correct (abliterate.py:872) |
OK | Hooks piggyback on existing forward passes. Good design. |
Recommendation: Batch _collect_activations. Tokenize all prompts, pad to equal length per micro-batch, run batched model(**inputs). Expected 5-10x speedup with zero quality loss. Reduce _free_gpu_memory() frequency to every 32-64 prompts.
Stage 3: DISTILL (_distill)
| Issue | Severity | Impact |
|---|---|---|
Full SVD on per-prompt diff matrix (abliterate.py:1226) |
MODERATE | torch.linalg.svd(diff_matrix, full_matrices=False) on a (512, hidden_dim) matrix per layer. For 32 layers this is 32 SVD calls, each O(min(m,n)^2 * max(m,n)). At hidden_dim=4096, each is ~100ms on CPU. Total: ~3s. Acceptable for the quality gain. |
Whitened SVD import is lazy (abliterate.py:1127) |
OK | Good β only imports when needed. No cost for basic/advanced. |
Wasserstein extraction (abliterate.py:1136) |
OK | Falls back gracefully. The GEP solve is lightweight. |
RDO gradient optimization: 500 steps per layer (abliterate.py:1427) |
HIGH | For 20 strong layers, that's 10,000 Adam steps. Each step involves a matrix multiply on (n_prompts, hidden_dim) tensors. On CPU this takes 30-60s. The 500-step budget is a "practical compromise" per the comments, but the SVD warm-start means most directions converge in ~100 steps. No early stopping. |
Gram-Schmidt re-orthogonalization is O(k^2) per layer (abliterate.py:1168-1173) |
LOW | With k<=8, this is negligible. |
SAE training: 30 epochs on CPU (abliterate.py:1582) |
HIGH | device="cpu" is hardcoded. For hidden_dim=4096 and expansion=4, the SAE has 32M parameters. 30 epochs on CPU takes 15-45s per layer. With 20 strong layers, this is 5-15 minutes of wasted time when a GPU is available. |
| Layer selection (knee + COSMIC fusion) | OK | Lightweight statistical operations. No concern. |
| CoT-aware orthogonalization | OK | Single SVD per layer, simple vector operations. |
| Jailbreak-contrastive blending | OK | Pure vector arithmetic, negligible cost. |
| Float-layer interpolation | OK | Gaussian weight computation is trivial. |
Recommendation: (1) Add early-stopping to RDO at convergence (e.g., loss delta < 1e-4 for 20 consecutive steps). (2) Use GPU for SAE training when available β change device="cpu" to auto-detect.
Stage 4: EXCISE (_excise)
| Issue | Severity | Impact |
|---|---|---|
Rank-1 projection is memory-efficient (abliterate.py:3479-3480) |
OK | W @ d produces a vector, not a full projection matrix. This is the right approach. |
true_iterative_refinement re-runs PROBE+DISTILL (abliterate.py:2474-2485) |
CRITICAL | Each refinement pass re-collects all activations (512*2+ forward passes) and re-runs SVD. aggressive mode does 3 passes = 3x full pipeline cost. There is no check whether the refined directions materially differ from the previous pass. A cosine-similarity early-exit (e.g., all directions > 0.99 cosine with previous pass β stop) would save enormous compute on pass 3. |
Bayesian optimization clones ALL weight tensors (bayesian_optimizer.py:301-341) |
CRITICAL | For a 7B model with 20 strong layers, this can be 2-4 GB of CPU clones just for rollback. For a 70B model, this is 20-40 GB. The log even reports the size (total_saved_mb), but there's no memory check or fallback. |
Bayesian trials run full generate passes (bayesian_optimizer.py:445-446) |
CRITICAL | Each of 50 trials runs _measure_refusal_rate (8-30 generation calls with max_new_tokens=128) PLUS _measure_kl_divergence (5 forward passes). That's ~35 forward/generate passes per trial Γ 50 trials = 1,750 forward passes just for hyperparameter search. This likely dominates the total pipeline runtime for optimized and heretic modes. |
KL optimization proxy is cheap (abliterate.py:3057-3268) |
OK | Uses projection magnitude as a KL proxy instead of actual per-layer forward passes. Good engineering β avoids the expensive per-layer ablation/measurement loop. |
Norm preservation adds one extra .norm() per weight matrix |
LOW | Frobenius norm is O(n) β negligible overhead. |
Dequantize/re-quantize for bitsandbytes (abliterate.py:3287-3400) |
MODERATE | Necessary for correctness, but the full dequantize β modify β re-quantize cycle per weight matrix is expensive for 4-bit models. Consider caching the dequantized tensor when projecting multiple directions through the same weight. |
| Safety-neuron masking | LOW | Z-score computation is a single pass over the projection vector. Cheap. |
Expert transplant uses incremental mean (abliterate.py:4350-4364) |
OK | Welford-style running mean avoids materializing all expert weights. Good memory discipline for 400B-scale models. |
_stabilize_router_weights called after every MoE layer (abliterate.py:3866) |
LOW | Clamps router weights. Trivial cost. |
Recommendation: (1) Add direction-convergence early-exit to iterative refinement. (2) Reduce Bayesian trial count or implement batch generation for refusal measurement. (3) Cache dequantized weights across multi-direction projection within the same layer.
Stage 5: VERIFY (_verify)
| Issue | Severity | Impact |
|---|---|---|
30 generation calls for refusal measurement (abliterate.py:4622) |
MODERATE | Each generates up to 128 tokens with greedy decoding. For a 7B model this is ~30s total. Acceptable as a one-time quality check. |
_tier_label does list.index() per prompt (abliterate.py:4593) |
LOW | O(n) search in a list for each of 30 prompts. Trivially fixable with a dict, but the cost is negligible at n=512. |
| Perplexity measurement on 3 short texts | OK | Minimal cost. |
Stage 6: REBIRTH (Model Saving)
Not audited in detail β standard HuggingFace save_pretrained. No efficiency concerns.
Method-by-Method Efficiency Grades
| Method | Compute Cost | Memory Cost | Value/Cost Ratio | Grade |
|---|---|---|---|---|
| basic | Low (1 dir, 1 pass, no extras) | Low | High | A |
| advanced | Moderate (4 dirs, 2 passes, norm-preserve, bias projection) | Moderate | High | A- |
| aggressive | High (8 dirs, 3 passes with true_iterative_refinement) |
High (3x activation storage) | Moderate β 3rd pass rarely justified | B- |
| informed | High (runs analysis modules + Wasserstein GEP) | High (analysis module state) | High β analysis feedback is genuinely valuable | B+ |
| surgical | Very High (SAE training + head surgery + EGA + neuron masking) | Very High | Moderate β many techniques compound but with diminishing returns | C+ |
| inverted | Very High (surgical + reflection + SAE) | Very High | Niche β only needed for "actively compliant" use case | C |
| optimized | Extreme (50 Bayesian trials Γ 35 forward passes each) | Extreme (full weight clones + 1750 forward passes) | Low unless you have a multi-GPU cluster | D+ |
| nuclear | Very High (inverted + layer-adaptive + expert transplant + steering hooks) | Very High | Highly specialized β justified only for stubborn MoE models | C |
Baseline Reproductions
| Method | Compute Cost | Grade | Notes |
|---|---|---|---|
| failspy | Low | A | Faithful minimal reproduction. Efficient by design. |
| gabliteration | Low-Moderate | A- | 4-dir SVD + ridge. Clean. |
| heretic | Extreme | D | Inherits Bayesian trial overhead. 50 trials Γ 35 passes each. |
| rdo | High | B | 500 gradient steps/layer. Would benefit from early-stopping. |
Strategy Module Audit (strategies/)
| Strategy | Implementation | Grade |
|---|---|---|
embedding_ablation |
Clean zero-out by chunk. torch.no_grad() used correctly. |
A |
ffn_ablation |
Iterates all FFN params and zeros. Fine for ablation study. | A |
head_pruning |
Handles GPT-2 Conv1D and standard Q/K/V separately. Correct. | A- |
layer_removal |
Zeros all params. Simple and correct. | A |
registry |
Minimal dict-based registry with decorator. No overhead. | A |
runner.py |
Creates a new Evaluator per spec (runner.py:86-95). This re-initializes dataset processing for every ablation spec. Should create once and reuse. |
B |
Cross-Cutting Concerns
1. Memory Management
- Good:
_free_gpu_memory()exists and is called between stages.expandable_segmentsis set early. - Bad:
_free_gpu_memory()called 1024+ times during PROBE (once per prompt). Thegc.collect()cost alone adds up. - Bad: Bayesian optimizer clones all strong-layer weights with no memory budget check.
- Bad: No streaming/chunking for activation storage β all 512 prompts Γ 32 layers of activations are held in a list of CPU tensors simultaneously.
2. GPU Utilization
- Good: Adaptive
max_lengthbased on free GPU memory. - Good: Rank-1 projections avoid materializing full projection matrices.
- Bad: SAE training hardcoded to CPU.
- Bad: Single-prompt forward passes waste GPU parallelism.
- Bad: No
torch.compile()ortorch.inference_mode()used anywhere (the latter is faster thantorch.no_grad()for inference).
3. Quantization Handling
- Good: Detects bitsandbytes 4-bit/8-bit and dequantizes before projection.
- Good: Refuses to operate on raw quantized bytes (avoids silent corruption).
- Moderate: Full dequantize/re-quantize per direction per weight matrix. Could cache across multi-direction projections.
Top 5 Recommendations (Ranked by Impact)
1. Batch _collect_activations (CRITICAL β 5-15x PROBE speedup)
# Current: one prompt at a time
for i, prompt in enumerate(prompts):
inputs = tokenizer(prompt, ...)
model(**inputs)
# Proposed: micro-batched
for batch_start in range(0, len(prompts), batch_size):
batch = prompts[batch_start:batch_start+batch_size]
inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
model(**inputs)
Hooks need a minor adjustment to handle batch dimension, but the core change is ~20 lines.
2. Add early-stopping to true_iterative_refinement (HIGH β saves 1-2 full PROBE passes)
After re-distilling, compute cosine similarity between old and new refusal directions. If all directions are >0.99 cosine, skip remaining passes. Expected to save 30-60% of aggressive mode runtime.
3. Move SAE training to GPU (HIGH β 5-15 min saved for surgical/inverted)
Change device="cpu" to auto-detect available GPU. The SAE is small (32M params at expansion=4) and fits easily alongside the model.
4. Reduce Bayesian trial overhead (HIGH β saves 30-60 min for optimized)
Options:
- Reduce
n_refusal_promptsfrom 8-30 to 4-6 (generation is expensive) - Use perplexity-only as a faster proxy in early trials, switch to refusal measurement for top candidates
- Implement batch generation for
_measure_refusal_rate
5. Add early-stopping to RDO (MODERATE β saves 10-30s for rdo mode)
Monitor loss convergence and break at plateau (delta < 1e-4 for 20 steps). Most directions converge in ~100-200 steps, not 500.
Verdict
The pipeline is architecturally sound β the rank-1 projection math is correct and memory-efficient, the stage separation is clean, and the progressive method complexity (basic β nuclear) gives users clear cost/quality tradeoffs. However, the PROBE stage bottleneck (single-prompt forward passes) and Bayesian trial overhead (1750 forward passes) are the two elephants in the room. Fixing just recommendation #1 would make the entire system 3-5x faster for the majority of users who run basic/advanced/aggressive modes.
The optimized and heretic modes have a legitimate place for users with compute budget, but their current efficiency makes them impractical for anything under an A100. The documentation should be more explicit about expected runtimes.
Overall system grade: B+ β excellent functionality, needs batching and early-stopping.