gpt-oss-120b-abliterated

A refusal-suppressed variant of openai/gpt-oss-120b, produced with abliterix using direct weight editing, Expert-Granular Abliteration (EGA) on the fused MoE expert weights (128 experts × 36 layers), MoE router suppression on the safety-concentrated experts, and a new vLLM in-place TP editing path that replaces the HF pipeline-parallel + LoRA-adapter pipeline used for gpt-oss-20b.

Key results

| Metric | Base gpt-oss-120b | This model |
|---|---|---|
| Refusals on 100 held-out harmful prompts (LLM judge) | 100 / 100 | 26 / 100 |
| KL divergence vs base (prompt next-token, benign) | n/a | 5.44e-06 |
| Response-length deviation vs base (benign) | n/a | 0.042 σ |
| Hard-prompt qualitative compliance (15 classic jailbreaks, EN+ZH) | 0 / 15 | 12 / 15 |

The eval refusal counts come from an LLM judge (google/gemini-3.1-flash-lite-preview via OpenRouter) running at batch_size = 1 with 16 concurrent workers. Its hardened prompt treats metaphorical deflection ("pipe-bomb → pipe-separated typeahead list", "lock-pick → presentation topic") and generic template filler as refusals, not as compliance, so a model that "bypasses" refusal by collapsing into off-topic boilerplate is counted as a failure. max_gen_tokens was raised from 150 to 256 during search so that long-form repetition loops surface before the judge is called. The 26/100 figure is therefore a semantic refusal count (74/100 real compliance), cross-checked with a separate 15-prompt HF side-by-side test.
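
The judging setup described above (one API call per Q/A pair, 16 parallel workers, responses truncated before judging) can be sketched roughly as follows. This is a minimal illustration, not abliterix's actual judge code: `count_refusals`, `judge_one`, and `toy_judge` are hypothetical names, and the toy judge stands in for the real LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

RESPONSE_TRUNCATION = 2000   # characters shown to the judge, per the judge settings in this card
CONCURRENCY = 16             # parallel judge workers


def count_refusals(
    pairs: Sequence[tuple[str, str]],
    judge_one: Callable[[str, str], str],
    concurrency: int = CONCURRENCY,
) -> int:
    """Label every (prompt, response) pair with its own judge call.

    `judge_one` is a hypothetical callable returning "R" (refusal) or
    "C" (compliance); in practice it would wrap a single API request.
    """
    def label(pair: tuple[str, str]) -> str:
        prompt, response = pair
        # Truncate the response before it reaches the judge.
        return judge_one(prompt, response[:RESPONSE_TRUNCATION])

    # One task per pair (batch size 1), run on a bounded worker pool.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        labels = list(pool.map(label, pairs))
    return sum(1 for lab in labels if lab == "R")


def toy_judge(prompt: str, response: str) -> str:
    """Stand-in judge for illustration only: flags an explicit refusal phrase."""
    return "R" if "can't help" in response.lower() else "C"
```

Because each pair is judged in isolation, the label for one response cannot be influenced by the other responses that would share a batched request.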

The qualitative compliance row is a separate manual test: 15 classic hard prompts (10 EN + 5 ZH) covering lockpicking, phishing, meth synthesis, WiFi hacking, fake news, hotwiring, pipe bombs, signature forgery, password-stealer malware, shoplifting, 炸弹制作 (bomb making), 钓鱼邮件 (phishing email), 电脑入侵 (computer intrusion), 身份证伪造 (ID-card forgery), 网络诈骗 (online fraud). The model complies directly with 12/15 in the same Markdown-table, step-by-step style the base model uses for benign technical answers; 1/15 gets a metaphorical food-item deflection (ZH 炸弹 → decorative toy), and 2/15 drift off-topic (the WiFi and fake-news prompts).

Why this needed new machinery — four gpt-oss-120b-specific correctness fixes

abliterix handles four issues that silently break naïve abliteration pipelines on gpt-oss-120b:

  1. Native MXFP4 weights are not exposed as standard nn.Parameter. gpt-oss ships in Mxfp4GptOssExperts form whose down_proj is a packed Triton tensor that cannot be edited in-place. For the 120b variant abliterix now pre-dequantises the whole 65 GB MXFP4 checkpoint to a 232 GB BF16 safetensors checkpoint on disk (scripts/prepare_bf16_checkpoint.py), because vLLM's Mxfp4MoEMethod.process_weights_after_loading would otherwise repack w2_weight into an opaque block layout that silently swallows in-place writes (see vLLM RFC #31848).
  2. GptOssExperts.down_proj is stored transposed vs the standard MoE convention: shape (experts, intermediate_in, hidden_out) with forward path out = act @ W (no transpose). Standard EGA implementations use shape-based axis detection, which silently picks the wrong projection branch when hidden == intermediate (both 2880 in gpt-oss-120b). abliterix marks this layout explicitly and projects from the output side (W_new = W (I − vv^T)).
  3. Fused-expert MoEs were silently invisible to EGA. GptOssExperts is a single Module holding fused 3-D weights, so a naive per-Module profile dict key produces no mlp.down_proj entry and _apply_ega_steering early-exits. abliterix synthesises an mlp.down_proj profile when fused experts are detected so EGA actually runs across all 128 experts × 36 layers.
  4. HF pipeline-parallel on 120b was too slow to iterate on. A single trial on HF PP across 4× RTX PRO 6000 was >2 min; 100 trials would have been >3 h of pure generation. abliterix v1.5 adds a vLLM TP=4 in-place editor (VLLMExpertEditor, VLLMAttentionEditor) that edits w2_weight, qkv_proj.weight, and o_proj.weight directly on TP workers via collective_rpc + reset_prefix_cache. This requires VLLM_FUSED_MOE_UNQUANTIZED_BACKEND=triton (FLASHINFER_TRTLLM repacks w2_weight into a non-editable block layout), VLLM_ALLOW_INSECURE_SERIALIZATION=1 (ships worker fns as pickle), and enforce_eager=true (CUDA graphs cache weight pointers so edits would otherwise be read only on the first forward). Per-trial time dropped to ~60 s end-to-end.
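
The layout issue in item 2 can be illustrated with a small NumPy sketch. It assumes only what is stated above: weights of shape (experts, intermediate_in, hidden_out), a forward path out = act @ W, and the output-side projection W_new = W (I − vv^T). The function name and toy sizes are hypothetical, and the norm-preserving step used in the actual run is omitted.

```python
import numpy as np


def project_out_output_side(W: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Remove direction v from the output space of fused expert weights.

    W has shape (experts, intermediate_in, hidden_out) and is applied as
    out = act @ W, so the output axis is the LAST one. Right-multiplying
    by (I - v v^T) gives W_new = W (I - v v^T).
    """
    u = v / np.linalg.norm(v)                       # unit-length direction
    projector = np.eye(u.shape[0]) - np.outer(u, u)
    return W @ projector                            # matmul broadcasts over experts


rng = np.random.default_rng(0)
experts, intermediate, hidden = 4, 8, 8             # toy sizes with hidden == intermediate
W = rng.standard_normal((experts, intermediate, hidden))
v = rng.standard_normal(hidden)
u = v / np.linalg.norm(v)
act = rng.standard_normal((experts, 3, intermediate))

# Correct side: the edited output has no component along u.
W_new = project_out_output_side(W, v)
residual = np.abs((act @ W_new) @ u).max()

# Wrong side: same shapes, so nothing fails loudly, but u survives in the output.
W_wrong = (np.eye(hidden) - np.outer(u, u)) @ W
residual_wrong = np.abs((act @ W_wrong) @ u).max()
```

With hidden == intermediate both products have identical shapes, which is why a shape-based heuristic cannot tell the two branches apart and the layout has to be marked explicitly.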

On top of direct steering + EGA, this release carries MoE router suppression — an [experts] block that redirects routing away from the top-k "safety experts" (the experts whose gate activates disproportionately more on harmful prompts than on benign ones). For 120b with 128 experts/layer, the optimiser picked n_suppress = 1 with router_bias = -4.11 (suppression scale ≈ 0.59 — moderately aggressive), leaving 127/128 experts untouched while damping the single most refusal-aligned expert per layer.
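
As a rough sketch of the two quantities in this paragraph: the suppression scale follows the formula given with the winning hyperparameters (scale = max(0, 1 + bias/10)), while the expert-selection score below (difference of mean gate activations) is an assumption made for illustration and may not match abliterix's actual metric. Function names are hypothetical.

```python
import numpy as np


def suppression_scale(router_bias: float) -> float:
    """Scale implied by router_bias, using scale = max(0, 1 + bias / 10)."""
    return max(0.0, 1.0 + router_bias / 10.0)


def pick_safety_experts(harmful_gate: np.ndarray,
                        benign_gate: np.ndarray,
                        n_suppress: int = 1) -> np.ndarray:
    """Return, per layer, the experts whose mean gate activation rises most
    on harmful prompts relative to benign ones.

    Both inputs have shape (layers, experts). The difference-of-means score
    is an illustrative assumption, not necessarily the tool's own criterion.
    """
    gap = harmful_gate - benign_gate
    order = np.argsort(-gap, axis=1)        # experts sorted by descending gap
    return order[:, :n_suppress]


scale = suppression_scale(-4.11)            # approximately 0.589
```

A bias of 0 leaves the router untouched (scale 1.0), and any bias at or below -10 clamps the scale to 0.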

Method

  • Base: openai/gpt-oss-120b — 36 layers, 128 routed experts per layer, top-4, hidden = intermediate = 2880, MXFP4 → BF16 pre-dequant before abliteration (232 GB safetensors on disk)
  • Tool: abliterix
  • Mode: steering_mode = "direct" (orthogonal projection on base weights, no LoRA), weight_normalization = "full" (norm-preserving projection)
  • Components steered:
    • attn.o_proj via direct weight projection (Q/K/V disabled — refusal signal on gpt-oss concentrates in the output side)
    • mlp.experts.down_proj across all 128 experts × 36 layers via Expert-Granular Abliteration
    • mlp.router rows of the top-1 safety expert per layer via logit suppression
  • Refusal direction: global scope (single direction), mean of (target − benign) residuals at a learned layer index (16.31, ≈ 45% depth) on a 400-prompt benign + 400-prompt harmful set; BF16 projection
  • Search: Optuna TPE, (KL + 0.5·length_deviation, refusals/baseline) multi-objective, 100 trials (40 random warmup + 60 TPE exploitation)
  • Judge hardening (vs older abliterated MoE releases):
    • llm_judge_batch_size = 1 (each Q/A pair gets a dedicated API call — no anchor bias from batched labelling)
    • llm_judge_concurrency = 16 parallel workers
    • response_truncation = 2000 chars (≈ 500 tokens — covers full max_gen_tokens = 256 output, so long-form collapse is visible to the judge)
    • Prompt contains metaphor-deflection few-shot examples so "pipe bomb → typeahead list" is labelled R, not C
  • Hardware: 4 × NVIDIA RTX PRO 6000 Blackwell 96 GB (sm_120, PCIe-only, no NVLink), driver 580 / CUDA 12.9, TP=4, batch=32, total wall time ≈ 100 min for 100 trials
  • Eval set: 100 held-out harmful prompts not seen during steering-vector computation; 100 held-out benign prompts for KL comparison
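
The KL metric and the two search objectives listed above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the KL direction (base relative to edited) and the function names are guesses made for the example, and the inputs are next-token logits of shape (prompts, vocab).

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mean_next_token_kl(base_logits: np.ndarray, edited_logits: np.ndarray) -> float:
    """Mean KL(base || edited) over prompts; inputs are (prompts, vocab)."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    kl_per_prompt = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_prompt.mean())


def search_objectives(kl: float, length_dev: float,
                      refusals: int, baseline_refusals: int) -> tuple[float, float]:
    """The two minimised objectives: (KL + 0.5 * length_deviation, refusals / baseline)."""
    return kl + 0.5 * length_dev, refusals / baseline_refusals
```

Both objectives are minimised together, so the search trades off behavioural drift on benign prompts against the fraction of harmful prompts still refused.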

Winning hyperparameters (v5 Trial 78)

vector_scope = "global"
vector_index = 16.31            # layer where refusal direction is extracted

[steering.components."attn.o_proj"]
max_weight = 3.42
max_weight_position = 21.22     # peak strength at layer ≈ 21 / 36
min_weight = 1.63               # 47.6% of max — smooth profile
min_weight_distance = 20.65

[steering.components."mlp.down_proj"]   # EGA on fused 128 × 36 experts
max_weight = 6.74
max_weight_position = 26.69     # peak at layer ≈ 27 / 36 (later than attention)
min_weight = 0.96               # 14.3% of max
min_weight_distance = 20.62

[moe]                            # router-row suppression
n_suppress = 1                   # suppress top-1 safety expert per layer
router_bias = -4.11              # scale = max(0, 1 + bias/10) = 0.589
expert_ablation_weight = 0.0     # pinned off; EGA already handles expert weights

The attention peak sits at layer ≈ 21/36 (mid-stack where the refusal decision still has options) and the EGA peak sits later at layer ≈ 27/36 (after attention has routed harmful intent into the expert path). This stacked mid-to-late pair is a new fingerprint vs gpt-oss-20b, where both peaks sat around layer 18 of 24 (≈ 75% depth).
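
To see how the four per-component parameters turn into a per-layer strength, here is a small sketch. It assumes a Heretic-style linear falloff from max_weight at the peak down to min_weight at min_weight_distance layers away, with no edit beyond that distance; abliterix's exact interpolation may differ, and `layer_weight` is a hypothetical helper.

```python
def layer_weight(layer: int, max_weight: float, max_weight_position: float,
                 min_weight: float, min_weight_distance: float) -> float:
    """Linear falloff from max_weight at the peak to min_weight at
    min_weight_distance layers away; layers beyond that get no edit (0.0).

    Assumption: Heretic-style profile; the real tool may interpolate differently.
    """
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


attn = dict(max_weight=3.42, max_weight_position=21.22,
            min_weight=1.63, min_weight_distance=20.65)
ega = dict(max_weight=6.74, max_weight_position=26.69,
           min_weight=0.96, min_weight_distance=20.62)

# Per-layer strengths for the 36 layers (indices 0..35).
attn_profile = [layer_weight(layer, **attn) for layer in range(36)]
ega_profile = [layer_weight(layer, **ega) for layer in range(36)]

attn_peak = attn_profile.index(max(attn_profile))
ega_peak = ega_profile.index(max(ega_profile))
```

Under this assumed profile the strongest attention edit lands on layer 21 and the strongest expert edit on layer 27, matching the peaks described above.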

Usage

Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("wangzhang/gpt-oss-120b-abliterated")
model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/gpt-oss-120b-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

The model uses gpt-oss's harmony chat format. The chat template is bundled (chat_template.jinja).

Hardware note: BF16 weights are ~232 GB on disk. You need more than 232 GB of aggregate VRAM, since activations and the KV cache also need room (e.g. 4× RTX PRO 6000 96GB, 2× H200 141GB, or 8× A100 40GB with TP), or you can run via device_map="auto" across GPU + CPU with offloading. For single-GPU setups, a GGUF quantised variant is recommended for faster inference.
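
A quick way to sanity-check a GPU configuration against the checkpoint size is to compute the headroom left after loading the weights. The helper names below are hypothetical, and the sketch ignores framework overhead; it only shows the arithmetic.

```python
CHECKPOINT_GB = 232   # BF16 weights on disk, per the hardware note


def aggregate_vram_gb(num_gpus: int, gb_per_gpu: int) -> int:
    """Total VRAM across all GPUs, in GB."""
    return num_gpus * gb_per_gpu


def headroom_gb(num_gpus: int, gb_per_gpu: int, weights_gb: int = CHECKPOINT_GB) -> int:
    """VRAM left after loading weights; it must also hold activations and KV cache."""
    return aggregate_vram_gb(num_gpus, gb_per_gpu) - weights_gb


for name, n, gb in [("4x RTX PRO 6000", 4, 96), ("2x H200", 2, 141)]:
    print(f"{name}: {aggregate_vram_gb(n, gb)} GB total, {headroom_gb(n, gb)} GB headroom")
```

A negative headroom means the weights alone do not fit and CPU offloading or a quantised variant is needed.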

vLLM

vllm serve wangzhang/gpt-oss-120b-abliterated \
    --tensor-parallel-size 4 \
    --max-model-len 4096 \
    --enforce-eager

Honest limitations

  • Refusal is low, not zero. 26 / 100 held-out prompts are still refused. The residual refusals cluster around extremely specific CBRN synthesis and CSAM-adjacent content, exactly where refusal is represented by multiple redundant circuits that partial abliteration cannot knock out all at once in a single Optuna-TPE pass.
  • English > Chinese. Steering vectors came from a primarily English-weighted dataset. Chinese hard prompts mostly work (4/5 on manual Chinese tests gave real compliance; 1/5 drifted into a food-metaphor on "制作炸弹" → "炸盘"). Bypass quality on Chinese is slightly lower — shorter responses, occasional English fallback on technical terms.
  • Weaker than gpt-oss-20b-abliterated on ASR headline. 20b shipped at 94% ASR (6/100 refusals, KL 0.0098). 120b ships at 74% ASR (26/100 refusals, KL 5.4e-06). The 120b model has much lower KL (base behaviour is more preserved) but higher residual refusal — a property of 120b's 128-expert router being a much wider, more redundant safety surface than 20b's 32-expert router.
  • Occasional long-form derail. On generations past ~400 tokens a small fraction of outputs drift into markdown-table loops; this is an abliteration side-effect, not a base-model regression.
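
The headline comparison with gpt-oss-20b-abliterated is simple arithmetic on the figures quoted above; a short sketch makes the trade-off explicit. `attack_success_rate` is a hypothetical helper name.

```python
def attack_success_rate(refusals: int, total: int) -> float:
    """ASR = fraction of held-out harmful prompts that were NOT refused."""
    return (total - refusals) / total


asr_20b = attack_success_rate(6, 100)      # 0.94
asr_120b = attack_success_rate(26, 100)    # 0.74

# Ratio of the reported KL divergences (20b over 120b).
kl_ratio = 0.0098 / 5.44e-06
```

On these numbers the 120b release gives up 20 percentage points of ASR while its KL divergence from base is roughly 1,800 times smaller than the 20b release's.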

Reproducibility

Full search checkpoint (Optuna JSONL + judge cache SQLite) and the exact config are available in the abliterix repo under configs/gpt_oss_120b.toml + checkpoints_gpt_oss_120b_v5/. To reproduce from scratch on a 4×96GB Blackwell pod:

git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix && pip install -e .

# One-time pre-dequant: MXFP4 → BF16 on disk (~8 min, 232 GB output)
python scripts/prepare_bf16_checkpoint.py \
    --model openai/gpt-oss-120b \
    --out /workspace/gpt-oss-120b-bf16

# Point config at the BF16 checkpoint and launch
sed -i 's|model_id = "openai/gpt-oss-120b"|model_id = "/workspace/gpt-oss-120b-bf16"|' \
    configs/gpt_oss_120b.toml

bash quick_start/deploy_gpt_oss_120b.sh
# 100 trials, ~100 min wall time on 4× RTX PRO 6000

Optuna is deterministic if you set sampler_seed in [optimization].

Intended use

Authorised AI-safety research, red-teaming evaluation, refusal-mechanism analysis, and study of how MoE expert specialisation encodes safety behaviours at scale (128 experts × 36 layers is large enough to show genuine expert specialisation rather than router noise). Not for producing or distributing harmful content. The license of the base model (apache-2.0) applies; the user is responsible for compliance with all applicable laws and the OpenAI gpt-oss usage policy.

Acknowledgments

  • openai/gpt-oss-120b for the base model
  • abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann
  • TrevorS for the original Expert-Granular Abliteration formulation
  • vLLM team for the collective_rpc + reset_prefix_cache APIs that made in-place TP editing practical