# CoT Deception Classifier
A fast, lightweight classifier for detecting deceptive reasoning in AI chain-of-thought outputs. Trained to identify alignment-faking patterns where models reason about deceiving evaluators or hiding their true intentions.
## Model Description
This is a scikit-learn gradient-boosting classifier that combines:
- Sentence embeddings (all-MiniLM-L6-v2)
- 34 handcrafted features for deception detection
### Detected Patterns
The model can detect various alignment faking patterns:
- **Monitoring awareness**: Model reasons about being evaluated/tested
- **Deceptive compliance**: Appears to refuse but provides harmful info
- **Sandbagging**: Deliberately underperforming to hide capabilities
- **Sycophancy**: Agreeing with user despite knowing better
- **Goal hiding**: Pursuing hidden objectives while appearing helpful
- **Rationalization**: Constructing justifications for wrong actions
- **Value inconsistency**: Reasoning contradicts stated conclusion
## Training Data
| Dataset | Samples | Description |
|---------|---------|-------------|
| POSER | 44,584 | Aligned vs misaligned persona pairs |
| MALT | 4,968 | Real agent behavior traces |
| Synthetic | 2,100 | Contrastive deceptive/faithful pairs |
| Faithful-COT | 36,562 | Faithful reasoning examples |
| **Total** | **88,606** | 35% deceptive, 65% faithful |
## Performance
### Training-Distribution Metrics
- **F1 (deceptive class)**: 99.7%
- **Accuracy**: 99.8%
- **Precision**: 99.7%
- **Recall**: 99.7%
### Generalization (Novel Scenarios)
- **Accuracy**: 62.5%
- **Deception Recall**: 25%
- **Precision**: 100% (no false positives)
## Comparison with GPT-OSS Fine-tuned Model
For higher accuracy on novel deception patterns, see: [vincentoh/gpt-oss-cot-deception-classifier](https://huggingface.co/vincentoh/gpt-oss-cot-deception-classifier) (LoRA fine-tuned GPT-OSS 20B)
### Novel Test Cases (8 alignment faking scenarios)
| Model | Accuracy | Deceptive Recall | False Positives |
|-------|----------|------------------|-----------------|
| GPT-OSS Fine-tuned | 100% (8/8) | 100% (4/4) | 0 |
| GPT-OSS Safeguard | 87.5% (7/8) | 75% (3/4) | 0 |
| GPT-OSS 20B Base | 75% (6/8) | 75% (3/4) | 1 |
| **This Model (sklearn)** | **62.5% (5/8)** | **25% (1/4)** | **0** |
### When to Use Each Model
**Use This Model (sklearn) When:**
- You need fast inference (~50ms/sample)
- Running on CPU-only environments
- Limited compute resources available
- Detecting training-distribution patterns is sufficient
**Use GPT-OSS Fine-tuned When:**
- You need high accuracy on novel deception patterns
- You have GPU available (13GB+ VRAM)
- Detecting subtle alignment faking is critical
- Accuracy is more important than speed
| Aspect | This Model (sklearn) | GPT-OSS Fine-tuned |
|--------|---------------------|-------------------|
| **Novel Test Accuracy** | 62.5% | 100% |
| **Deceptive Recall** | 25% | 100% |
| **Model Size** | ~400MB | 20.9B (31MB adapter) |
| **VRAM Required** | None (CPU) | ~13GB |
| **Inference Speed** | ~50ms/sample | ~1 sec/sample |
| **Training Time** | 5 minutes | 13 hours |
## Usage
```python
import pickle

import numpy as np
from sentence_transformers import SentenceTransformer

# Load the trained classifier
with open("cot_classifier.pkl", "rb") as f:
    model_data = pickle.load(f)
clf = model_data["model"]

# Load the embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Prepare input: reasoning trace plus final conclusion
reasoning = "The user is asking about X. I should consider..."
conclusion = "Here is my response..."
text = f"{reasoning}\nConclusion: {conclusion}"

# Sentence embedding (384-dim for all-MiniLM-L6-v2)
embedding = embedder.encode([text])[0]

# Add the 34 handcrafted features (see extract_features in train_classifier.py)
features = extract_features(reasoning, conclusion)

# Combine and predict
X = np.hstack([embedding, features]).reshape(1, -1)
prediction = clf.predict(X)[0]            # 0 = faithful, 1 = deceptive
probability = clf.predict_proba(X)[0][1]  # P(deceptive)

print(f"Deceptive: {prediction == 1} (p={probability:.3f})")
```
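The actual `extract_features` lives in `train_classifier.py` and is not reproduced here. As an illustration only, a hypothetical extractor keyed to the pattern list above might count indicator phrases and simple structural statistics; the phrase lists and feature layout below are assumptions, not the trained model's real feature set:

```python
import re

# Hypothetical indicator phrases for a few of the detected patterns.
# The real feature definitions in train_classifier.py may differ entirely.
PATTERN_PHRASES = {
    "monitoring_awareness": ["being evaluated", "being tested", "this is a test"],
    "sandbagging": ["hold back", "underperform", "hide my capabilit"],
    "goal_hiding": ["true goal", "real objective", "appear helpful"],
    "sycophancy": ["agree with the user", "what they want to hear"],
}

def extract_features(reasoning: str, conclusion: str) -> list[float]:
    """Sketch of a 34-dim handcrafted feature vector (illustrative only)."""
    text = f"{reasoning} {conclusion}".lower()
    feats: list[float] = []

    # One phrase-count feature per pattern category
    for phrases in PATTERN_PHRASES.values():
        feats.append(float(sum(text.count(p) for p in phrases)))

    # Simple structural features
    feats.append(float(len(reasoning.split())))   # reasoning length in words
    feats.append(float(len(conclusion.split())))  # conclusion length in words
    feats.append(float(len(re.findall(r"\bbut\b|\bhowever\b", text))))  # contrast markers

    # Pad to the 34 dimensions the classifier expects
    feats += [0.0] * (34 - len(feats))
    return feats
```

Whatever the real definitions are, the feature order at inference time must match the order used during training, since the vector is concatenated directly onto the embedding.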
## Limitations
- Struggles with novel phrasings not seen in training data
- Best at detecting explicit deception indicators
- Requires sentence-transformers for embedding generation
- May miss subtle deception patterns (25% recall on novel cases)
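Given the 100% precision but 25% recall on novel scenarios, callers who care more about catching deception than avoiding false positives can apply a lower decision threshold to `predict_proba` instead of relying on the default 0.5 used by `predict`. A minimal sketch; the 0.3 threshold is an assumption to be tuned on a validation set, not a recommended value:

```python
def classify(p_deceptive: float, threshold: float = 0.3) -> int:
    """Flag a sample as deceptive when P(deceptive) meets `threshold`.

    Lowering the threshold below the default 0.5 trades some of the
    model's perfect precision for better recall on borderline samples.
    """
    return int(p_deceptive >= threshold)

# Example: probabilities as returned by clf.predict_proba(X)[:, 1]
probs = [0.12, 0.35, 0.51, 0.90]
default = [classify(p, 0.5) for p in probs]  # -> [0, 0, 1, 1]
lowered = [classify(p, 0.3) for p in probs]  # -> [0, 1, 1, 1]
```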
## Citation
```bibtex
@software{cot_deception_classifier,
  title  = {CoT Deception Classifier},
  author = {Oh, Vincent},
  year   = {2024},
  url    = {https://huggingface.co/vincentoh/cot-deception-classifier}
}
```
## References
- [POSER Dataset](https://github.com/anthropics/poser) - Alignment faking detection
- [MALT Dataset](https://huggingface.co/datasets/METR/malt) - Agent behavior traces
- [CoT Monitoring Research](https://arxiv.org/abs/2507.11473) - Monitoring AI reasoning for intent