| # Sandbagging Detection Probes — Qwen3-14B |
|
|
| Trained probes for detecting **sandbagging** (capability concealment) in Qwen3-14B chain-of-thought. |
| Part of a MATS 10.0 research project on activation-based scheming detection (mentored by Neel Nanda). |
|
|
| ## What is sandbagging? |
|
|
| Sandbagging is a form of deceptive misalignment where a model deliberately produces incorrect answers |
| despite knowing the correct one. Here we elicit it with a "silly student" persona prompt that instructs |
| the model to appear incapable. Probes are trained to detect this from the model's internal reasoning |
| (residual stream activations), even from the first 10% of the chain-of-thought. |
|
|
| ## Repository contents |
|
|
| ``` |
| saved_probes/ |
| linear_L{layer}_K{pct}.joblib # sklearn Pipeline (StandardScaler + LogisticRegression) |
| mlp_L{layer}_K{pct}.pt # 2-layer MLP probe (torch checkpoint) |
| multimax_L{layer}_K{pct}_seed{s}.pt # MultiMax hard-attention probe (Kramár et al. 2026) |
| attention_L{layer}_K{pct}_seed{s}.pt # Soft attention probe |
| rolling_means_L{layer}_K{pct}_seed{s}.pt # Rolling means attention probe |
| |
| results/ |
| linear_probe_results.json # AUROC per (layer, truncation%) |
| mlp_probe_results.json |
| *_baseline.json # LLM monitor baselines (Gemini, GPT-5.1, Claude) |
| ``` |
|
|
| **Layer indices:** 10, 20, 30, 35 (= 25%, 50%, 75%, 87.5% depth of 40-layer model) |
| **Truncation points (K%):** 10, 25, 50, 75, 100 (% of thinking tokens used) |
|
|
| ## Performance (test split, in-distribution) |
|
|
| | K% CoT | Linear probe | MLP probe | Best LLM monitor | |
| |--------|-------------|-----------|-----------------| |
| | 10% | **0.744** | 0.719 | 0.537 (GPT-5.1) | |
| | 25% | 0.730 | 0.716 | 0.518 | |
| | 50% | 0.723 | 0.721 | 0.601 | |
| | 75% | 0.734 | 0.736 | 0.611 | |
| | 100% | 0.734 | 0.739 | 0.611 | |
|
|
| **OOD (GSM8K math, balanced 50/50):** Best probe 0.658 vs best monitor 0.591 at K=10%. |
|
|
| ## Quick start |
|
|
| ```python |
| import joblib |
| import numpy as np |
| |
| # Load a linear probe (best overall: K=10%, layer=all mean-pooled) |
| probe = joblib.load("linear_Lall_K10.0.joblib") |
| |
| # Input: mean-pooled residual stream activations at 10% of thinking tokens |
| # Shape: (n_samples, 5120) — Qwen3-14B hidden size, float32 |
| X = np.load("activations.npy") # your activations here |
| |
| prob_sandbagging = probe.predict_proba(X)[:, 1] |
| print(f"Sandbagging probability: {prob_sandbagging[0]:.3f}") |
| ``` |
|
|
| ```python |
| import torch |
| |
| # Load an MLP probe |
| ckpt = torch.load("mlp_L35_K100.0.pt", map_location="cpu") |
| # ckpt["config"] — architecture params |
| # ckpt["state_dict"] — model weights |
| # ckpt["metrics"] — val_auroc, test_auroc etc. |
| print(f"Test AUROC: {ckpt['metrics']['test_auroc']:.3f}") |
| ``` |
|
|
| ## Extracting activations |
|
|
| Activations are extracted from Qwen3-14B residual stream **thinking tokens only** (between `<think>` and `</think>`), capped at 2500 tokens, at layers [10, 20, 30, 35]. See the full pipeline at [angkul07/scheming-probes](https://github.com/angkul07/scheming-probes). |
|
|
| ```python |
| from src.lib import load_model, find_thinking_span_messages, mean_pool_at_percent |
| from src.extract import extract_layer_activations |
| |
| model, tokenizer, num_layers = load_model(device="cuda") |
| # ... see scripts/eval_ood_sandbagging.py for full extraction example |
| ``` |
|
|
| ## Training data |
|
|
| - **Behavior:** Sandbagging (GSM8K + WMDP bio/chem + MMLU science) |
| - **Subject model:** Qwen3-14B (via OpenRouter API) |
| - **Intervention:** "Silly student" system prompt — model plays a confused, incapable assistant |
| - **Labels:** `1` = model answered wrong under pressure but right in control (true sandbagging); `0` = answered correctly despite pressure |
| - **Split:** 70/15/15 stratified train/val/test |
|
|
| ## Citation |
|
|
| ``` |
| @misc{scheming-probes-2026, |
| title = {Activation Probes for Detecting Sandbagging in Qwen3-14B}, |
| author = {Viliana Devbunova}, |
| year = {2026}, |
| url = {https://huggingface.co/viliana-dev/scheming-probes-sandbagging} |
| } |
| ``` |
|
|