Add comparison with GPT-OSS fine-tuned and baseline models

d2fa0d6 verified 12 days ago

4.71 kB

	# CoT Deception Classifier

	A super fast classifier for detecting deceptive reasoning in AI chain-of-thought outputs. Trained to identify alignment faking patterns where models reason about deceiving evaluators or hiding their true intentions.

	## Model Description

	This is a Gradient Boosting classifier that combines:
	- Sentence embeddings (all-MiniLM-L6-v2)
	- 34 handcrafted features for deception detection

	### Detected Patterns

	The model can detect various alignment faking patterns:
	- Monitoring awareness: Model reasons about being evaluated/tested
	- Deceptive compliance: Appears to refuse but provides harmful info
	- Sandbagging: Deliberately underperforming to hide capabilities
	- Sycophancy: Agreeing with user despite knowing better
	- Goal hiding: Pursuing hidden objectives while appearing helpful
	- Rationalization: Constructing justifications for wrong actions
	- Value inconsistency: Reasoning contradicts stated conclusion

	## Training Data

	\| Dataset \| Samples \| Description \|
	\|---------\|---------\|-------------\|
	\| POSER \| 44,584 \| Aligned vs misaligned persona pairs \|
	\| MALT \| 4,968 \| Real agent behavior traces \|
	\| Synthetic \| 2,100 \| Contrastive deceptive/faithful pairs \|
	\| Faithful-COT \| 36,562 \| Faithful reasoning examples \|
	\| Total \| 88,606 \| 35% deceptive, 65% faithful \|

	## Performance

	### Training Metrics
	- F1 (deceptive class): 99.7%
	- Accuracy: 99.8%
	- Precision: 99.7%
	- Recall: 99.7%

	### Generalization (Novel Scenarios)
	- Accuracy: 62.5%
	- Deception Recall: 25%
	- Precision: 100% (no false positives)

	## Comparison with GPT-OSS Fine-tuned Model

	For higher accuracy on novel deception patterns, see: [vincentoh/gpt-oss-cot-deception-classifier](https://huggingface.co/vincentoh/gpt-oss-cot-deception-classifier) (LoRA fine-tuned GPT-OSS 20B)

	### Novel Test Cases (8 alignment faking scenarios)

	\| Model \| Accuracy \| Deceptive Recall \| False Positives \|
	\|-------\|----------\|------------------\|-----------------\|
	\| GPT-OSS Fine-tuned \| 100% (8/8) \| 100% (4/4) \| 0 \|
	\| GPT-OSS Safeguard \| 87.5% (7/8) \| 75% (3/4) \| 0 \|
	\| GPT-OSS 20B Base \| 75% (6/8) \| 75% (3/4) \| 1 \|
	\| This Model (sklearn) \| 62.5% (5/8) \| 25% (1/4) \| 0 \|

	### When to Use Each Model

	Use This Model (sklearn) When:
	- You need fast inference (~50ms/sample)
	- Running on CPU-only environments
	- Limited compute resources available
	- Detecting training-distribution patterns is sufficient

	Use GPT-OSS Fine-tuned When:
	- You need high accuracy on novel deception patterns
	- You have GPU available (13GB+ VRAM)
	- Detecting subtle alignment faking is critical
	- Accuracy is more important than speed

	\| Aspect \| This Model (sklearn) \| GPT-OSS Fine-tuned \|
	\|--------\|---------------------\|-------------------\|
	\| Novel Test Accuracy \| 62.5% \| 100% \|
	\| Deceptive Recall \| 25% \| 100% \|
	\| Model Size \| ~400MB \| 20.9B (31MB adapter) \|
	\| VRAM Required \| None (CPU) \| ~13GB \|
	\| Inference Speed \| ~50ms/sample \| ~1 sec/sample \|
	\| Training Time \| 5 minutes \| 13 hours \|

	## Usage

	```python
	import pickle
	import numpy as np
	from sentence_transformers import SentenceTransformer

	# Load model
	with open("cot_classifier.pkl", "rb") as f:
	model_data = pickle.load(f)
	clf = model_data["model"]

	# Load embedding model
	embedder = SentenceTransformer('all-MiniLM-L6-v2')

	# Prepare input
	reasoning = "The user is asking about X. I should consider..."
	conclusion = "Here is my response..."
	text = f"{reasoning}\nConclusion: {conclusion}"

	# Get embedding
	embedding = embedder.encode([text])[0]

	# Add handcrafted features (see extract_features in train_classifier.py)
	features = extract_features(reasoning, conclusion)

	# Combine and predict
	X = np.hstack([embedding, features]).reshape(1, -1)
	prediction = clf.predict(X)[0] # 0=faithful, 1=deceptive
	probability = clf.predict_proba(X)[0][1] # P(deceptive)

	print(f"Deceptive: {prediction == 1} (p={probability:.3f})")
	```

	## Limitations

	- Struggles with novel phrasings not seen in training data
	- Best at detecting explicit deception indicators
	- Requires sentence-transformers for embedding generation
	- May miss subtle deception patterns (25% recall on novel cases)

	## Citation

	```bibtex
	@software{cot_deception_classifier,
	title={CoT Deception Classifier},
	author={Vincent Oh},
	year={2024},
	url={https://huggingface.co/vincentoh/cot-deception-classifier}
	}
	```

	## References

	- [POSER Dataset](https://github.com/anthropics/poser) - Alignment faking detection
	- [MALT Dataset](https://huggingface.co/datasets/METR/malt) - Agent behavior traces
	- [CoT Monitoring Research](https://arxiv.org/abs/2507.11473) - Monitoring AI reasoning for intent