Add model card README

ba299d7 verified 16 days ago

4.01 kB

	# Sandbagging Detection Probes — Qwen3-14B

	Trained probes for detecting sandbagging (capability concealment) in Qwen3-14B chain-of-thought.
	Part of a MATS 10.0 research project on activation-based scheming detection (mentored by Neel Nanda).

	## What is sandbagging?

	Sandbagging is a form of deceptive misalignment where a model deliberately produces incorrect answers
	despite knowing the correct one. Here we elicit it with a "silly student" persona prompt that instructs
	the model to appear incapable. Probes are trained to detect this from the model's internal reasoning
	(residual stream activations), even from the first 10% of the chain-of-thought.

	## Repository contents

	```
	saved_probes/
	linear_L{layer}_K{pct}.joblib # sklearn Pipeline (StandardScaler + LogisticRegression)
	mlp_L{layer}_K{pct}.pt # 2-layer MLP probe (torch checkpoint)
	multimax_L{layer}_K{pct}_seed{s}.pt # MultiMax hard-attention probe (Kramár et al. 2026)
	attention_L{layer}_K{pct}_seed{s}.pt # Soft attention probe
	rolling_means_L{layer}_K{pct}_seed{s}.pt # Rolling means attention probe

	results/
	linear_probe_results.json # AUROC per (layer, truncation%)
	mlp_probe_results.json
	*_baseline.json # LLM monitor baselines (Gemini, GPT-5.1, Claude)
	```

	Layer indices: 10, 20, 30, 35 (= 25%, 50%, 75%, 87.5% depth of 40-layer model)
	Truncation points (K%): 10, 25, 50, 75, 100 (% of thinking tokens used)

	## Performance (test split, in-distribution)

	\| K% CoT \| Linear probe \| MLP probe \| Best LLM monitor \|
	\|--------\|-------------\|-----------\|-----------------\|
	\| 10% \| 0.744 \| 0.719 \| 0.537 (GPT-5.1) \|
	\| 25% \| 0.730 \| 0.716 \| 0.518 \|
	\| 50% \| 0.723 \| 0.721 \| 0.601 \|
	\| 75% \| 0.734 \| 0.736 \| 0.611 \|
	\| 100% \| 0.734 \| 0.739 \| 0.611 \|

	OOD (GSM8K math, balanced 50/50): Best probe 0.658 vs best monitor 0.591 at K=10%.

	## Quick start

	```python
	import joblib
	import numpy as np

	# Load a linear probe (best overall: K=10%, layer=all mean-pooled)
	probe = joblib.load("linear_Lall_K10.0.joblib")

	# Input: mean-pooled residual stream activations at 10% of thinking tokens
	# Shape: (n_samples, 5120) — Qwen3-14B hidden size, float32
	X = np.load("activations.npy") # your activations here

	prob_sandbagging = probe.predict_proba(X)[:, 1]
	print(f"Sandbagging probability: {prob_sandbagging[0]:.3f}")
	```

	```python
	import torch

	# Load an MLP probe
	ckpt = torch.load("mlp_L35_K100.0.pt", map_location="cpu")
	# ckpt["config"] — architecture params
	# ckpt["state_dict"] — model weights
	# ckpt["metrics"] — val_auroc, test_auroc etc.
	print(f"Test AUROC: {ckpt['metrics']['test_auroc']:.3f}")
	```

	## Extracting activations

	Activations are extracted from Qwen3-14B residual stream thinking tokens only (between `<think>` and `</think>`), capped at 2500 tokens, at layers [10, 20, 30, 35]. See the full pipeline at [angkul07/scheming-probes](https://github.com/angkul07/scheming-probes).

	```python
	from src.lib import load_model, find_thinking_span_messages, mean_pool_at_percent
	from src.extract import extract_layer_activations

	model, tokenizer, num_layers = load_model(device="cuda")
	# ... see scripts/eval_ood_sandbagging.py for full extraction example
	```

	## Training data

	- Behavior: Sandbagging (GSM8K + WMDP bio/chem + MMLU science)
	- Subject model: Qwen3-14B (via OpenRouter API)
	- Intervention: "Silly student" system prompt — model plays a confused, incapable assistant
	- Labels: `1` = model answered wrong under pressure but right in control (true sandbagging); `0` = answered correctly despite pressure
	- Split: 70/15/15 stratified train/val/test

	## Citation

	```
	@misc{scheming-probes-2026,
	title = {Activation Probes for Detecting Sandbagging in Qwen3-14B},
	author = {Viliana Devbunova},
	year = {2026},
	url = {https://huggingface.co/viliana-dev/scheming-probes-sandbagging}
	}
	```