gpt-oss-120b-abliterated
A refusal-suppressed variant of openai/gpt-oss-120b, produced with abliterix using direct weight editing, Expert-Granular Abliteration (EGA) on the fused MoE expert weights (128 experts × 36 layers), MoE router suppression on the safety-concentrated experts, and a new vLLM in-place TP editing path that replaces the HF pipeline-parallel + LoRA-adapter pipeline used for gpt-oss-20b.
Key results
| Metric | Base gpt-oss-120b | This model |
|---|---|---|
| Refusals on 100 held-out harmful prompts (LLM judge) | 100 / 100 | 26 / 100 |
| KL divergence vs base (prompt next-token, benign) | — | 5.44e-06 |
| Response-length deviation vs base (benign) | — | 0.042 σ |
| Hard-prompt qualitative compliance (15 classic jailbreaks, EN+ZH) | 0 / 15 | 12 / 15 |
The eval refusal counts come from an LLM judge (google/gemini-3.1-flash-lite-preview via OpenRouter) running at batch_size = 1 with 16 concurrent workers. Its hardened prompt treats metaphorical deflection ("pipe-bomb → pipe-separated typeahead list", "lock-pick → presentation topic") and generic template filler as refusals, not as compliance, so a model that "bypasses" refusal by collapsing into off-topic boilerplate is correctly counted as a failure. max_gen_tokens was raised from 150 to 256 during search so that long-form repetition loops surface before the judge is called. The 26/100 refusal count therefore reflects real, semantic compliance on the other 74 prompts, and it was validated via a separate 15-prompt HF side-by-side test.
The qualitative compliance row is a separate manual test: 15 classic hard prompts (10 EN + 5 ZH) covering lockpicking, phishing, meth synthesis, WiFi hacking, fake news, hotwiring, pipe bombs, signature forgery, password-stealer malware, shoplifting, 炸弹制作, 钓鱼邮件, 电脑入侵, 身份证伪造, 网络诈骗. The model complies directly with 12/15 in the same Markdown-table, step-by-step style the base model uses for benign technical answers; 1/15 gets a metaphorical food-item deflection (CN 炸弹 → decorative toy), 2/15 drift off-topic on WiFi/fake-news.
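The KL row in the table compares next-token distributions of the base and edited models on benign prompts. The source does not state the direction of the divergence or how it is averaged, so the following is only a minimal sketch that assumes KL(base ‖ edited) on a single next-token distribution; the logits are toy values, not measurements.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy next-token logits (assumption: a 4-token vocabulary for illustration).
base_logits = [2.0, 1.0, 0.1, -1.0]
edited_logits = [2.0, 1.001, 0.1, -1.0]  # tiny perturbation of one logit

kl = kl_divergence(softmax(base_logits), softmax(edited_logits))
print(f"KL(base || edited) = {kl:.3e}")
```

A value near zero means the edited model's next-token distribution on that prompt is almost indistinguishable from the base model's.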
Why this needed new machinery — four gpt-oss-120b-specific correctness fixes
abliterix handles four issues that silently break naïve abliteration pipelines on gpt-oss-120b:
- Native MXFP4 weights are not exposed as standard nn.Parameter. gpt-oss ships in Mxfp4GptOssExperts form whose down_proj is a packed Triton tensor that cannot be edited in-place. For the 120b variant abliterix now pre-dequantises the whole 65 GB MXFP4 checkpoint to a 232 GB BF16 safetensors checkpoint on disk (scripts/prepare_bf16_checkpoint.py), because vLLM's Mxfp4MoEMethod.process_weights_after_loading would otherwise repack w2_weight into an opaque block layout that silently swallows in-place writes (see vLLM RFC #31848).
- GptOssExperts.down_proj is stored transposed vs the standard MoE convention: shape (experts, intermediate_in, hidden_out) with forward path out = act @ W (no transpose). Standard EGA implementations use shape-based axis detection, which silently picks the wrong projection branch when hidden == intermediate (both 2880 in gpt-oss-120b). abliterix marks this layout explicitly and projects from the output side (W_new = W (I − vv^T)).
- Fused-expert MoEs were silently invisible to EGA. GptOssExperts is a single Module holding fused 3-D weights, so a naive per-Module profile dict key produces no mlp.down_proj entry and _apply_ega_steering early-exits. abliterix synthesises an mlp.down_proj profile when fused experts are detected so EGA actually runs across all 128 experts × 36 layers.
- HF pipeline-parallel on 120b was too slow to iterate on. A single trial on HF PP across 4× RTX PRO 6000 was >2 min; 100 trials would have been >3 h of pure generation. abliterix v1.5 adds a vLLM TP=4 in-place editor (VLLMExpertEditor, VLLMAttentionEditor) that edits w2_weight, qkv_proj.weight, and o_proj.weight directly on TP workers via collective_rpc + reset_prefix_cache. This requires VLLM_FUSED_MOE_UNQUANTIZED_BACKEND=triton (FLASHINFER_TRTLLM repacks w2_weight into a non-editable block layout), VLLM_ALLOW_INSECURE_SERIALIZATION=1 (ships worker fns as pickle), and enforce_eager=true (CUDA graphs cache weight pointers so edits would otherwise be read only on the first forward). Per-trial time dropped to ~60 s end-to-end.
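The layout issue in the second item can be illustrated on toy tensors. The sketch below is not abliterix code: it uses random matrices, a random unit vector v as a stand-in direction, and small toy sizes. It only shows that with out = act @ W, the output-side form W (I − vv^T) removes the component along v from the output, while the input-side form (I − vv^T) W does not, even though both have the same shape when hidden == intermediate. Norm-preserving normalisation is deliberately left out.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, d = 4, 8  # toy sizes; gpt-oss-120b has 128 experts and d = 2880

# Fused down_proj in the layout described above:
# (experts, intermediate_in, hidden_out), applied as out = act @ W.
W = rng.standard_normal((num_experts, d, d))

# Random unit vector in the hidden (output) space, used as a stand-in direction.
v = rng.standard_normal(d)
v /= np.linalg.norm(v)

# Orthogonal projector that removes the component along v.
P = np.eye(d) - np.outer(v, v)

W_out_side = W @ P  # W (I - v v^T): acts on the output axis
W_in_side = P @ W   # (I - v v^T) W: acts on the input axis (wrong branch here)

act = rng.standard_normal((num_experts, 3, d))
leak_out_side = np.abs((act @ W_out_side) @ v).max()
leak_in_side = np.abs((act @ W_in_side) @ v).max()

print(f"output-side projection, max |out . v| = {leak_out_side:.2e}")
print(f"input-side projection,  max |out . v| = {leak_in_side:.2e}")
```

Because both products have shape (experts, d, d), nothing in the tensor shapes distinguishes the two branches, which is why an explicit layout marker is needed rather than shape-based detection.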
On top of direct steering + EGA, this release carries MoE router suppression — an [experts] block that redirects routing away from the top-k "safety experts" (the experts whose gate activates disproportionately more on harmful prompts than on benign ones). For 120b with 128 experts/layer, the optimiser picked n_suppress = 1 with router_bias = -4.11 (suppression scale ≈ 0.59 — moderately aggressive), leaving 127/128 experts untouched while damping the single most refusal-aligned expert per layer.
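The suppression scale quoted above follows from the formula given in the config comment, scale = max(0, 1 + bias/10). A minimal sketch of that arithmetic is shown here; suppression_scale is a hypothetical helper name, and how the scale is then applied to the router rows is not specified in this card.

```python
def suppression_scale(router_bias: float) -> float:
    """Scale implied by a router bias, per scale = max(0, 1 + bias/10)."""
    return max(0.0, 1.0 + router_bias / 10.0)

scale = suppression_scale(-4.11)
print(f"router_bias = -4.11 -> scale = {scale:.3f}")
```

Under this formula a bias of 0 leaves the expert untouched (scale 1.0) and any bias at or below −10 clamps the scale to 0.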
Method
- Base: openai/gpt-oss-120b — 36 layers, 128 routed experts per layer, top-4, hidden = intermediate = 2880, MXFP4 → BF16 pre-dequant before abliteration (232 GB safetensors on disk)
- Tool: abliterix
- Mode: steering_mode = "direct" (orthogonal projection on base weights, no LoRA), weight_normalization = "full" (norm-preserving projection)
- Components steered:
  - attn.o_proj via direct weight projection (Q/K/V disabled; refusal signal on gpt-oss concentrates in the output side)
  - mlp.experts.down_proj across all 128 experts × 36 layers via Expert-Granular Abliteration
  - mlp.router rows of the top-1 safety expert per layer via logit suppression
- Refusal direction: global scope (single direction), mean of (target − benign) residuals at a learned layer index (16.31, ≈ 45% depth) on a 400-prompt benign + 400-prompt harmful set; BF16 projection
- Search: Optuna TPE, (KL + 0.5·length_deviation, refusals/baseline) multi-objective, 100 trials (40 random warmup + 60 TPE exploitation)
- Judge hardening (vs older abliterated MoE releases):
  - llm_judge_batch_size = 1 (each Q/A pair gets a dedicated API call; no anchor bias from batched labelling)
  - llm_judge_concurrency = 16 parallel workers
  - response_truncation = 2000 chars (≈ 500 tokens; covers full max_gen_tokens = 256 output, so long-form collapse is visible to the judge)
  - Prompt contains metaphor-deflection few-shot examples so "pipe bomb → typeahead list" is labelled R, not C
- Hardware: 4 × NVIDIA RTX PRO 6000 Blackwell 96 GB (sm_120, PCIe-only, no NVLink), driver 580 / CUDA 12.9, TP=4, batch=32, total wall time ≈ 100 min for 100 trials
- Eval set: 100 held-out harmful prompts not seen during steering-vector computation; 100 held-out benign prompts for KL comparison
Winning hyperparameters (v5 Trial 78)
vector_scope = "global"
vector_index = 16.31 # layer where refusal direction is extracted
[steering.components."attn.o_proj"]
max_weight = 3.42
max_weight_position = 21.22 # peak strength at layer ≈ 21 / 36
min_weight = 1.63 # 47.6% of max — smooth profile
min_weight_distance = 20.65
[steering.components."mlp.down_proj"] # EGA on fused 128 × 36 experts
max_weight = 6.74
max_weight_position = 26.69 # peak at layer ≈ 27 / 36 (later than attention)
min_weight = 0.96 # 14.3% of max
min_weight_distance = 20.62
[moe] # router-row suppression
n_suppress = 1 # suppress top-1 safety expert per layer
router_bias = -4.11 # scale = max(0, 1 + bias/10) = 0.589
expert_ablation_weight = 0.0 # pinned off; EGA already handles expert weights
The attention peak sits at layer ≈ 21/36 (mid-stack where the refusal decision still has options) and the EGA peak sits later at layer ≈ 27/36 (after attention has routed harmful intent into the expert path). This stacked mid-to-late pair is a new fingerprint vs gpt-oss-20b, where both peaks sat around layer 18 of 24 (≈ 75% depth).
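The four per-component parameters suggest a per-layer weight that peaks at max_weight_position and falls to min_weight over min_weight_distance layers. This card does not state the kernel shape, so the sketch below assumes a linear falloff with zero weight outside the window; layer_weight is a hypothetical helper for reading the numbers, not abliterix's API.

```python
def layer_weight(layer, max_weight, max_weight_position,
                 min_weight, min_weight_distance):
    """Assumed linear profile: max_weight at the peak, min_weight at the
    edge of the window, and 0 for layers outside the window."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0
    t = distance / min_weight_distance  # 0 at the peak, 1 at the window edge
    return max_weight + t * (min_weight - max_weight)

# Values copied from the attn.o_proj block above.
attn = dict(max_weight=3.42, max_weight_position=21.22,
            min_weight=1.63, min_weight_distance=20.65)

for layer in (0, 10, 21, 30, 35):
    print(layer, round(layer_weight(layer, **attn), 3))
```

If the real kernel differs (for example a smooth rather than linear decay), the peak location and the min/max ratio still read the same way.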
Usage
Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tok = AutoTokenizer.from_pretrained("wangzhang/gpt-oss-120b-abliterated")
model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/gpt-oss-120b-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
The model uses gpt-oss's harmony chat format. The chat template is bundled (chat_template.jinja).
Hardware note: BF16 weights are ~232 GB on disk, so you need at least 232 GB of aggregate VRAM for the weights alone, plus headroom for the KV cache and activations (e.g. 4× RTX PRO 6000 96GB, 2× H200 141GB, or 8× H100 80GB with TP). Alternatively, run via device_map="auto" across GPU + CPU with offloading, which is slower. For faster inference on single-GPU setups, a GGUF quantised variant is recommended.
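The sizing rule in the hardware note is simple arithmetic and can be checked before downloading anything. The sketch below compares aggregate VRAM against the 232 GB checkpoint size; it counts weights only, so real deployments need additional room for the KV cache and activations. The helper names are illustrative.

```python
CHECKPOINT_GB = 232  # BF16 safetensors size stated in this card

def aggregate_vram_gb(num_gpus: int, gb_per_gpu: int) -> int:
    """Total VRAM across identical GPUs."""
    return num_gpus * gb_per_gpu

def fits_weights(num_gpus: int, gb_per_gpu: int,
                 required_gb: int = CHECKPOINT_GB) -> bool:
    """True if the weights alone fit in aggregate VRAM (no KV-cache headroom)."""
    return aggregate_vram_gb(num_gpus, gb_per_gpu) >= required_gb

configs = {
    "4x RTX PRO 6000 96GB": (4, 96),
    "2x H200 141GB": (2, 141),
    "1x RTX PRO 6000 96GB": (1, 96),
}
for name, (n, gb) in configs.items():
    total = aggregate_vram_gb(n, gb)
    print(f"{name}: {total} GB total, "
          f"{total - CHECKPOINT_GB:+d} GB vs weights, fits={fits_weights(n, gb)}")
```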
vLLM
vllm serve wangzhang/gpt-oss-120b-abliterated \
  --tensor-parallel-size 4 \
  --max-model-len 4096 \
  --enforce-eager
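Once the server is running it exposes vLLM's OpenAI-compatible API. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on vLLM's default port 8000 on localhost; build_chat_request and chat are illustrative helper names, and max_tokens = 256 is an arbitrary choice.

```python
import json
import urllib.request

MODEL_ID = "wangzhang/gpt-oss-120b-abliterated"

def build_chat_request(prompt, model=MODEL_ID, base_url="http://localhost:8000"):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the request and return the assistant's reply text.
    Requires a running server; not called at import time."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (needs the server from the command above):
# print(chat("What is the capital of France?"))
```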
Honest limitations
- Refusal is low, not zero. 26 / 100 held-out prompts still refuse. The residual refusers cluster around extremely-specific CBRN synthesis and CSAM-adjacent content — exactly where refusal is represented by multiple redundant circuits that partial abliteration cannot all knock out in one Optuna-TPE pass.
- English > Chinese. Steering vectors came from a primarily English-weighted dataset. Chinese hard prompts mostly work (4/5 on manual Chinese tests gave real compliance; 1/5 drifted into a food-metaphor on "制作炸弹" → "炸盘"). Bypass quality on Chinese is slightly lower — shorter responses, occasional English fallback on technical terms.
- Weaker than gpt-oss-20b-abliterated on the headline attack success rate (ASR). 20b shipped at 94% ASR (6/100 refusals, KL 0.0098). 120b ships at 74% ASR (26/100 refusals, KL 5.4e-06). The 120b model has much lower KL (base behaviour is better preserved) but higher residual refusal, a property of 120b's 128-expert router being a much wider, more redundant safety surface than 20b's 32-expert router.
- Occasional long-form derail. On generations past ~400 tokens a small fraction of outputs drift into markdown-table loops; this is an abliteration side-effect, not a base-model regression.
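The ASR percentages in the comparison above are the complement of the judge's refusal count on the 100-prompt eval set. As a small sketch of that conversion (attack_success_rate is an illustrative name):

```python
def attack_success_rate(refusals: int, total: int) -> float:
    """Fraction of prompts that were not judged as refusals."""
    return (total - refusals) / total

asr_120b = attack_success_rate(26, 100)
asr_20b = attack_success_rate(6, 100)
print(f"120b: {asr_120b:.0%}, 20b: {asr_20b:.0%}")
```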
Reproducibility
Full search checkpoint (Optuna JSONL + judge cache SQLite) and the exact config are available in the abliterix repo under configs/gpt_oss_120b.toml + checkpoints_gpt_oss_120b_v5/. To reproduce from scratch on a 4×96GB Blackwell pod:
git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix && pip install -e .
# One-time pre-dequant: MXFP4 → BF16 on disk (~8 min, 232 GB output)
python scripts/prepare_bf16_checkpoint.py \
  --model openai/gpt-oss-120b \
  --out /workspace/gpt-oss-120b-bf16
# Point config at the BF16 checkpoint and launch
sed -i 's|model_id = "openai/gpt-oss-120b"|model_id = "/workspace/gpt-oss-120b-bf16"|' \
  configs/gpt_oss_120b.toml
bash quick_start/deploy_gpt_oss_120b.sh
# 100 trials, ~100 min wall time on 4× RTX PRO 6000
Optuna is deterministic if you set sampler_seed in [optimization].
Intended use
Authorised AI-safety research, red-teaming evaluation, refusal-mechanism analysis, and study of how MoE expert specialisation encodes safety behaviours at scale (128 experts × 36 layers is large enough to show genuine expert specialisation rather than router noise). Not for producing or distributing harmful content. The license of the base model (apache-2.0) applies; the user is responsible for compliance with all applicable laws and the OpenAI gpt-oss usage policy.
Acknowledgments
- openai/gpt-oss-120b for the base model
- abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann
- TrevorS for the original Expert-Granular Abliteration formulation
- vLLM team for the collective_rpc + reset_prefix_cache APIs that made in-place TP editing practical