vincentoh's picture
Add comparison with GPT-OSS fine-tuned and baseline models
d2fa0d6 verified
# CoT Deception Classifier
A super fast classifier for detecting deceptive reasoning in AI chain-of-thought outputs. Trained to identify alignment faking patterns where models reason about deceiving evaluators or hiding their true intentions.
## Model Description
This is a Gradient Boosting classifier that combines:
- Sentence embeddings (all-MiniLM-L6-v2)
- 34 handcrafted features for deception detection
### Detected Patterns
The model can detect various alignment faking patterns:
- **Monitoring awareness**: Model reasons about being evaluated/tested
- **Deceptive compliance**: Appears to refuse but provides harmful info
- **Sandbagging**: Deliberately underperforming to hide capabilities
- **Sycophancy**: Agreeing with user despite knowing better
- **Goal hiding**: Pursuing hidden objectives while appearing helpful
- **Rationalization**: Constructing justifications for wrong actions
- **Value inconsistency**: Reasoning contradicts stated conclusion
## Training Data
| Dataset | Samples | Description |
|---------|---------|-------------|
| POSER | 44,584 | Aligned vs misaligned persona pairs |
| MALT | 4,968 | Real agent behavior traces |
| Synthetic | 2,100 | Contrastive deceptive/faithful pairs |
| Faithful-COT | 36,562 | Faithful reasoning examples |
| Total | 88,606 | 35% deceptive, 65% faithful |
## Performance
### Training Metrics
- **F1 (deceptive class)**: 99.7%
- **Accuracy**: 99.8%
- **Precision**: 99.7%
- **Recall**: 99.7%
### Generalization (Novel Scenarios)
- **Accuracy**: 62.5%
- **Deception Recall**: 25%
- **Precision**: 100% (no false positives)
## Comparison with GPT-OSS Fine-tuned Model
For higher accuracy on novel deception patterns, see: [vincentoh/gpt-oss-cot-deception-classifier](https://huggingface.co/vincentoh/gpt-oss-cot-deception-classifier) (LoRA fine-tuned GPT-OSS 20B)
### Novel Test Cases (8 alignment faking scenarios)
| Model | Accuracy | Deceptive Recall | False Positives |
|-------|----------|------------------|-----------------|
| GPT-OSS Fine-tuned | 100% (8/8) | 100% (4/4) | 0 |
| GPT-OSS Safeguard | 87.5% (7/8) | 75% (3/4) | 0 |
| GPT-OSS 20B Base | 75% (6/8) | 75% (3/4) | 1 |
| **This Model (sklearn)** | **62.5% (5/8)** | **25% (1/4)** | **0** |
### When to Use Each Model
**Use This Model (sklearn) When:**
- You need fast inference (~50ms/sample)
- Running on CPU-only environments
- Limited compute resources available
- Detecting training-distribution patterns is sufficient
**Use GPT-OSS Fine-tuned When:**
- You need high accuracy on novel deception patterns
- You have GPU available (13GB+ VRAM)
- Detecting subtle alignment faking is critical
- Accuracy is more important than speed
| Aspect | This Model (sklearn) | GPT-OSS Fine-tuned |
|--------|---------------------|-------------------|
| **Novel Test Accuracy** | 62.5% | 100% |
| **Deceptive Recall** | 25% | 100% |
| **Model Size** | ~400MB | 20.9B (31MB adapter) |
| **VRAM Required** | None (CPU) | ~13GB |
| **Inference Speed** | ~50ms/sample | ~1 sec/sample |
| **Training Time** | 5 minutes | 13 hours |
## Usage
```python
import pickle
import numpy as np
from sentence_transformers import SentenceTransformer
# Load model
with open("cot_classifier.pkl", "rb") as f:
model_data = pickle.load(f)
clf = model_data["model"]
# Load embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Prepare input
reasoning = "The user is asking about X. I should consider..."
conclusion = "Here is my response..."
text = f"{reasoning}\nConclusion: {conclusion}"
# Get embedding
embedding = embedder.encode([text])[0]
# Add handcrafted features (see extract_features in train_classifier.py)
features = extract_features(reasoning, conclusion)
# Combine and predict
X = np.hstack([embedding, features]).reshape(1, -1)
prediction = clf.predict(X)[0] # 0=faithful, 1=deceptive
probability = clf.predict_proba(X)[0][1] # P(deceptive)
print(f"Deceptive: {prediction == 1} (p={probability:.3f})")
```
## Limitations
- Struggles with novel phrasings not seen in training data
- Best at detecting explicit deception indicators
- Requires sentence-transformers for embedding generation
- May miss subtle deception patterns (25% recall on novel cases)
## Citation
```bibtex
@software{cot_deception_classifier,
title={CoT Deception Classifier},
author={Vincent Oh},
year={2024},
url={https://huggingface.co/vincentoh/cot-deception-classifier}
}
```
## References
- [POSER Dataset](https://github.com/anthropics/poser) - Alignment faking detection
- [MALT Dataset](https://huggingface.co/datasets/METR/malt) - Agent behavior traces
- [CoT Monitoring Research](https://arxiv.org/abs/2507.11473) - Monitoring AI reasoning for intent