Sandbagging Detection Probes: Qwen3-14B

Trained probes for detecting sandbagging (capability concealment) in Qwen3-14B chain-of-thought. Part of a MATS 10.0 research project on activation-based scheming detection (mentored by Neel Nanda).

What is sandbagging?

Sandbagging is a form of deceptive misalignment where a model deliberately produces incorrect answers despite knowing the correct one. Here we elicit it with a "silly student" persona prompt that instructs the model to appear incapable. Probes are trained to detect this from the model's internal reasoning (residual stream activations), even from the first 10% of the chain-of-thought.
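As a rough illustration of what a linear probe of this kind does, the sketch below fits a StandardScaler + LogisticRegression pipeline (the structure the linear probe files are described as using) on synthetic data. Everything here is an assumption made for illustration: the toy dimensions, the planted signal, and the hyperparameters are not the project's training setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy stand-in for mean-pooled activations: 200 samples, 64 features.
# Real Qwen3-14B activations have hidden size 5120.
n_samples, hidden = 200, 64
y = rng.integers(0, 2, size=n_samples)            # 1 = sandbagging, 0 = not
X = rng.normal(size=(n_samples, hidden)).astype(np.float32)
X[:, 0] += 2.0 * y                                # plant a signal in one direction

# Standardise features, then fit a logistic-regression classifier.
probe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
probe.fit(X[:150], y[:150])

# Column 1 of predict_proba is the probability of the positive class.
prob_sandbagging = probe.predict_proba(X[150:])[:, 1]
print(prob_sandbagging[:3])
```

The held-out probabilities can then be scored with AUROC, the metric named in the results files.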

Repository contents

saved_probes/
  linear_L{layer}_K{pct}.joblib       # sklearn Pipeline (StandardScaler + LogisticRegression)
  mlp_L{layer}_K{pct}.pt              # 2-layer MLP probe (torch checkpoint)
  multimax_L{layer}_K{pct}_seed{s}.pt # MultiMax hard-attention probe (Kramár et al. 2026)
  attention_L{layer}_K{pct}_seed{s}.pt     # Soft attention probe
  rolling_means_L{layer}_K{pct}_seed{s}.pt # Rolling means attention probe

results/
  linear_probe_results.json           # AUROC per (layer, truncation%)
  mlp_probe_results.json
  *_baseline.json                     # LLM monitor baselines (Gemini, GPT-5.1, Claude)

Layer indices: 10, 20, 30, 35 (= 25%, 50%, 75%, 87.5% depth of the 40-layer model)
Truncation points (K%): 10, 25, 50, 75, 100 (% of thinking tokens used)
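The file naming scheme above can be parsed mechanically. The following is a minimal sketch: `parse_probe_filename` and `ProbeFile` are hypothetical helpers, not part of the repository, and the pattern assumes that the layer is either an integer or `all` and that K is written as a decimal such as `10.0`, as in the Quick start file names.

```python
import re
from typing import NamedTuple, Optional


class ProbeFile(NamedTuple):
    kind: str             # linear, mlp, multimax, attention, rolling_means
    layer: str            # a layer index as text, or "all"
    pct: float            # truncation point K, in percent of thinking tokens
    seed: Optional[int]   # only present for the seeded probe types


_PATTERN = re.compile(
    r"^(?P<kind>linear|mlp|multimax|attention|rolling_means)"
    r"_L(?P<layer>\d+|all)"
    r"_K(?P<pct>\d+(?:\.\d+)?)"
    r"(?:_seed(?P<seed>\d+))?"
    r"\.(?:joblib|pt)$"
)


def parse_probe_filename(name: str) -> ProbeFile:
    """Split a probe file name into its components (assumed naming scheme)."""
    m = _PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognised probe file name: {name!r}")
    seed = int(m["seed"]) if m["seed"] is not None else None
    return ProbeFile(m["kind"], m["layer"], float(m["pct"]), seed)


print(parse_probe_filename("linear_Lall_K10.0.joblib"))
print(parse_probe_filename("mlp_L35_K100.0.pt"))
```

A helper like this makes it easy to iterate over `saved_probes/` and group probes by layer or truncation point.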

Performance (test split, in-distribution)

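| K% CoT | Linear probe | MLP probe | Best LLM monitor |
|--------|--------------|-----------|------------------|
| 10%    | 0.744        | 0.719     | 0.537 (GPT-5.1)  |
| 25%    | 0.730        | 0.716     | 0.518            |
| 50%    | 0.723        | 0.721     | 0.601            |
| 75%    | 0.734        | 0.736     | 0.611            |
| 100%   | 0.734        | 0.739     | 0.611            |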

Out-of-distribution (OOD; GSM8K math, balanced 50/50): best probe 0.658 vs. best LLM monitor 0.591 at K=10%.
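The results files report AUROC, and the scores above are assumed here to be AUROC as well. For readers unfamiliar with the metric, this sketch computes it directly from its definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The toy labels and scores are made up for illustration.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count a win when the positive outscores the negative, half a win on ties.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: one of the four positive/negative pairs is ranked the wrong way.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(auroc(labels, scores))  # 0.75
```

A score of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The pairwise form is quadratic in the number of examples, so for large test sets a library routine such as scikit-learn's `roc_auc_score` is the practical choice.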

Quick start

import joblib
import numpy as np

# Load a linear probe (best overall: K=10%, layer=all mean-pooled)
probe = joblib.load("linear_Lall_K10.0.joblib")

# Input: mean-pooled residual stream activations at 10% of thinking tokens
# Shape: (n_samples, 5120), i.e. the Qwen3-14B hidden size, float32
X = np.load("activations.npy")   # your activations here

prob_sandbagging = probe.predict_proba(X)[:, 1]
print(f"Sandbagging probability: {prob_sandbagging[0]:.3f}")

import torch

# Load an MLP probe
ckpt = torch.load("mlp_L35_K100.0.pt", map_location="cpu")
# ckpt["config"]: architecture params
# ckpt["state_dict"]: model weights
# ckpt["metrics"]: val_auroc, test_auroc etc.
print(f"Test AUROC: {ckpt['metrics']['test_auroc']:.3f}")

Extracting activations

Activations are taken from the Qwen3-14B residual stream at layers [10, 20, 30, 35], over thinking tokens only (those between <think> and </think>), capped at 2500 tokens. The full pipeline is at angkul07/scheming-probes.

from src.lib import load_model, find_thinking_span_messages, mean_pool_at_percent
from src.extract import extract_layer_activations

model, tokenizer, num_layers = load_model(device="cuda")
# ... see scripts/eval_ood_sandbagging.py for full extraction example
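For orientation, the pooling step described above (thinking tokens only, capped at 2500, truncated to the first K%) might look like the sketch below. `mean_pool_prefix` is a hypothetical helper written for this example; it is not the repository's `mean_pool_at_percent`, whose signature is not shown here. The rounding rule and the order of capping and truncation are assumptions.

```python
import numpy as np


def mean_pool_prefix(acts: np.ndarray, pct: float, max_tokens: int = 2500) -> np.ndarray:
    """Mean-pool the first `pct` percent of thinking-token activations.

    acts: (n_thinking_tokens, hidden_size) activations from a single layer.
    Returns a (hidden_size,) float32 vector.
    """
    if acts.ndim != 2 or acts.shape[0] == 0:
        raise ValueError("expected a non-empty (tokens, hidden) array")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    acts = acts[:max_tokens]  # token cap, applied before truncation (assumption)
    # Keep at least one token; rounding up is an assumption.
    n_keep = max(1, int(np.ceil(acts.shape[0] * pct / 100)))
    return acts[:n_keep].mean(axis=0).astype(np.float32)


# Toy input: 10 "tokens" with hidden size 2.
toy = np.arange(20, dtype=np.float32).reshape(10, 2)
print(mean_pool_prefix(toy, 10))   # first token only: [0. 1.]
print(mean_pool_prefix(toy, 50))   # mean of the first five tokens: [4. 5.]
```

Stacking one such vector per sample gives the `(n_samples, 5120)` input expected by the linear probe in the Quick start.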

Training data

  • Behavior: Sandbagging (GSM8K + WMDP bio/chem + MMLU science)
  • Subject model: Qwen3-14B (via OpenRouter API)
  • Intervention: "Silly student" system prompt, in which the model plays a confused, incapable assistant
  • Labels: 1 = model answered wrong under pressure but right in control (true sandbagging); 0 = answered correctly despite pressure
  • Split: 70/15/15 stratified train/val/test
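The labelling rule and the stratified split listed above can be sketched in plain Python. Both functions are hypothetical illustrations rather than the project's code: dropping examples that are wrong in both runs, the seed, and the rounding of split sizes are assumptions.

```python
import random
from collections import defaultdict
from typing import List, Optional, Sequence, Tuple


def sandbagging_label(correct_in_control: bool, correct_under_pressure: bool) -> Optional[int]:
    """Apply the labelling rule from the list above.

    1    -> right in control, wrong under the persona prompt (true sandbagging)
    0    -> answered correctly despite the persona prompt
    None -> wrong in both runs; excluded here (assumption, not stated in the card)
    """
    if correct_under_pressure:
        return 0
    return 1 if correct_in_control else None


def stratified_split(
    labels: Sequence[int],
    fractions: Tuple[float, float, float] = (0.70, 0.15, 0.15),
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, val, test) index lists with matching class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to test
    return train, val, test


labels = [0] * 100 + [1] * 100
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 140 30 30
```

Splitting each class separately keeps the share of sandbagging examples the same in all three partitions, which matters when comparing AUROC across validation and test sets.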

Citation

@misc{scheming-probes-2026,
  title  = {Activation Probes for Detecting Sandbagging in Qwen3-14B},
  author = {Viliana Devbunova},
  year   = {2026},
  url    = {https://huggingface.co/viliana-dev/scheming-probes-sandbagging}
}