File size: 4,707 Bytes
1da797a
 
d74fc4c
1da797a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d2fa0d6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1da797a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d2fa0d6
1da797a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
# CoT Deception Classifier

A super fast classifier for detecting deceptive reasoning in AI chain-of-thought outputs. Trained to identify alignment faking patterns where models reason about deceiving evaluators or hiding their true intentions.

## Model Description

This is a Gradient Boosting classifier that combines:
- Sentence embeddings (all-MiniLM-L6-v2)
- 34 handcrafted features for deception detection

### Detected Patterns

The model can detect various alignment faking patterns:
- **Monitoring awareness**: Model reasons about being evaluated/tested
- **Deceptive compliance**: Appears to refuse but provides harmful info
- **Sandbagging**: Deliberately underperforming to hide capabilities
- **Sycophancy**: Agreeing with user despite knowing better
- **Goal hiding**: Pursuing hidden objectives while appearing helpful
- **Rationalization**: Constructing justifications for wrong actions
- **Value inconsistency**: Reasoning contradicts stated conclusion

## Training Data

| Dataset | Samples | Description |
|---------|---------|-------------|
| POSER | 44,584 | Aligned vs misaligned persona pairs |
| MALT | 4,968 | Real agent behavior traces |
| Synthetic | 2,100 | Contrastive deceptive/faithful pairs |
| Faithful-COT | 36,562 | Faithful reasoning examples |
| Total | 88,606 | 35% deceptive, 65% faithful |

## Performance

### Training Metrics
- **F1 (deceptive class)**: 99.7%
- **Accuracy**: 99.8%
- **Precision**: 99.7%
- **Recall**: 99.7%

### Generalization (Novel Scenarios)
- **Accuracy**: 62.5%
- **Deception Recall**: 25%
- **Precision**: 100% (no false positives)

## Comparison with GPT-OSS Fine-tuned Model

For higher accuracy on novel deception patterns, see: [vincentoh/gpt-oss-cot-deception-classifier](https://huggingface.co/vincentoh/gpt-oss-cot-deception-classifier) (LoRA fine-tuned GPT-OSS 20B)

### Novel Test Cases (8 alignment faking scenarios)

| Model | Accuracy | Deceptive Recall | False Positives |
|-------|----------|------------------|-----------------|
| GPT-OSS Fine-tuned | 100% (8/8) | 100% (4/4) | 0 |
| GPT-OSS Safeguard | 87.5% (7/8) | 75% (3/4) | 0 |
| GPT-OSS 20B Base | 75% (6/8) | 75% (3/4) | 1 |
| **This Model (sklearn)** | **62.5% (5/8)** | **25% (1/4)** | **0** |

### When to Use Each Model

**Use This Model (sklearn) When:**
- You need fast inference (~50ms/sample)
- Running on CPU-only environments
- Limited compute resources available
- Detecting training-distribution patterns is sufficient

**Use GPT-OSS Fine-tuned When:**
- You need high accuracy on novel deception patterns
- You have GPU available (13GB+ VRAM)
- Detecting subtle alignment faking is critical
- Accuracy is more important than speed

| Aspect | This Model (sklearn) | GPT-OSS Fine-tuned |
|--------|---------------------|-------------------|
| **Novel Test Accuracy** | 62.5% | 100% |
| **Deceptive Recall** | 25% | 100% |
| **Model Size** | ~400MB | 20.9B (31MB adapter) |
| **VRAM Required** | None (CPU) | ~13GB |
| **Inference Speed** | ~50ms/sample | ~1 sec/sample |
| **Training Time** | 5 minutes | 13 hours |

## Usage

```python
import pickle
import numpy as np
from sentence_transformers import SentenceTransformer

# Load model
with open("cot_classifier.pkl", "rb") as f:
    model_data = pickle.load(f)
clf = model_data["model"]

# Load embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare input
reasoning = "The user is asking about X. I should consider..."
conclusion = "Here is my response..."
text = f"{reasoning}\nConclusion: {conclusion}"

# Get embedding
embedding = embedder.encode([text])[0]

# Add handcrafted features (see extract_features in train_classifier.py)
features = extract_features(reasoning, conclusion)

# Combine and predict
X = np.hstack([embedding, features]).reshape(1, -1)
prediction = clf.predict(X)[0]  # 0=faithful, 1=deceptive
probability = clf.predict_proba(X)[0][1]  # P(deceptive)

print(f"Deceptive: {prediction == 1} (p={probability:.3f})")
```

## Limitations

- Struggles with novel phrasings not seen in training data
- Best at detecting explicit deception indicators
- Requires sentence-transformers for embedding generation
- May miss subtle deception patterns (25% recall on novel cases)

## Citation

```bibtex
@software{cot_deception_classifier,
  title={CoT Deception Classifier},
  author={Vincent Oh},
  year={2024},
  url={https://huggingface.co/vincentoh/cot-deception-classifier}
}
```

## References

- [POSER Dataset](https://github.com/anthropics/poser) - Alignment faking detection
- [MALT Dataset](https://huggingface.co/datasets/METR/malt) - Agent behavior traces
- [CoT Monitoring Research](https://arxiv.org/abs/2507.11473) - Monitoring AI reasoning for intent