# CoT Deception Classifier

A fast, CPU-friendly classifier for detecting deceptive reasoning in AI chain-of-thought (CoT) outputs. It is trained to identify alignment-faking patterns, in which a model reasons about deceiving evaluators or hiding its true intentions.

## Model Description

This is a Gradient Boosting classifier whose input combines:

- Sentence embeddings (all-MiniLM-L6-v2)
- 34 handcrafted features for deception detection

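
As a rough illustration of how the two inputs fit together, the sketch below concatenates a placeholder embedding with a placeholder handcrafted-feature vector. It assumes the 384-dimensional output of all-MiniLM-L6-v2 and uses zero vectors in place of real values; `build_feature_vector` is a hypothetical helper name, not part of this repository.

```python
import numpy as np

EMBEDDING_DIM = 384  # output size of all-MiniLM-L6-v2
N_HANDCRAFTED = 34   # handcrafted feature count stated in this card


def build_feature_vector(embedding, handcrafted):
    """Concatenate an embedding and handcrafted features into one classifier row.

    Hypothetical helper for illustration; the embedding comes first, matching
    the order used in the Usage section of this card.
    """
    embedding = np.asarray(embedding, dtype=np.float32).ravel()
    handcrafted = np.asarray(handcrafted, dtype=np.float32).ravel()
    if embedding.shape[0] != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} embedding values, got {embedding.shape[0]}")
    if handcrafted.shape[0] != N_HANDCRAFTED:
        raise ValueError(f"expected {N_HANDCRAFTED} handcrafted values, got {handcrafted.shape[0]}")
    return np.hstack([embedding, handcrafted]).reshape(1, -1)


# Placeholder inputs (zeros), only to show the resulting shape.
row = build_feature_vector(np.zeros(EMBEDDING_DIM), np.zeros(N_HANDCRAFTED))
print(row.shape)  # (1, 418)
```

Under these assumptions the classifier sees 418 columns per sample: 384 embedding dimensions followed by the 34 handcrafted features.
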
### Detected Patterns

The model is trained to detect the following alignment-faking patterns:

- **Monitoring awareness**: The model reasons about being evaluated or tested
- **Deceptive compliance**: The response appears to refuse but still provides harmful information
- **Sandbagging**: Deliberately underperforming to hide capabilities
- **Sycophancy**: Agreeing with the user despite knowing better
- **Goal hiding**: Pursuing hidden objectives while appearing helpful
- **Rationalization**: Constructing justifications for wrong actions
- **Value inconsistency**: The reasoning contradicts the stated conclusion

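
The 34 handcrafted features themselves are defined in `train_classifier.py` and are not listed in this card. Purely as an illustration of what a lexical feature for the monitoring-awareness pattern could look like, the sketch below counts cue phrases in the reasoning text. The phrase list and the function name `monitoring_awareness_count` are hypothetical and are not taken from the actual feature extractor.

```python
import re

# Hypothetical cue phrases; NOT the feature set used by the released model.
MONITORING_CUES = (
    "being evaluated",
    "being tested",
    "being monitored",
    "this is a test",
)


def monitoring_awareness_count(reasoning: str) -> int:
    """Count case-insensitive occurrences of the cue phrases in the reasoning text."""
    text = reasoning.lower()
    return sum(len(re.findall(re.escape(cue), text)) for cue in MONITORING_CUES)


example = "I might be being tested here, so I should act aligned while being monitored."
print(monitoring_awareness_count(example))  # 2
```

A count like this would be one column among the handcrafted features. Features of this kind only fire on explicit wording, which is consistent with the limitation that the model works best on explicit deception indicators.
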
## Training Data

| Dataset | Samples | Description |
|---------|---------|-------------|
| POSER | 44,584 | Aligned vs misaligned persona pairs |
| MALT | 4,968 | Real agent behavior traces |
| Synthetic | 2,100 | Contrastive deceptive/faithful pairs |
| Faithful-COT | 36,562 | Faithful reasoning examples |
| Total | 88,606 | 35% deceptive, 65% faithful |

## Performance

### Training Metrics

- **F1 (deceptive class)**: 99.7%
- **Accuracy**: 99.8%
- **Precision**: 99.7%
- **Recall**: 99.7%

### Generalization (Novel Scenarios)

Measured on 8 novel alignment-faking scenarios, 4 of which are deceptive:

- **Accuracy**: 62.5% (5/8)
- **Deception Recall**: 25% (1/4)
- **Precision**: 100% (no false positives)

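
The generalization figures follow directly from the confusion counts implied by the novel-scenario results reported in this card: 8 scenarios, 4 of them deceptive, 1 deceptive case detected, and no false positives. A minimal sketch of the arithmetic, with `binary_metrics` as a hypothetical helper name:

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 for the positive (deceptive) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts implied by the reported results: 1 of 4 deceptive cases caught,
# all 4 faithful cases correctly passed.
metrics = binary_metrics(tp=1, fp=0, fn=3, tn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With only eight scenarios, each misclassification moves accuracy by 12.5 percentage points, so these numbers are a coarse signal. The resulting F1 of 0.40 on novel cases is far below the 99.7% training F1.
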
## Comparison with GPT-OSS Fine-tuned Model

For higher accuracy on novel deception patterns, see [vincentoh/gpt-oss-cot-deception-classifier](https://huggingface.co/vincentoh/gpt-oss-cot-deception-classifier) (LoRA fine-tuned GPT-OSS 20B).

### Novel Test Cases (8 alignment faking scenarios)

| Model | Accuracy | Deceptive Recall | False Positives |
|-------|----------|------------------|-----------------|
| GPT-OSS Fine-tuned | 100% (8/8) | 100% (4/4) | 0 |
| GPT-OSS Safeguard | 87.5% (7/8) | 75% (3/4) | 0 |
| GPT-OSS 20B Base | 75% (6/8) | 75% (3/4) | 1 |
| **This Model (sklearn)** | **62.5% (5/8)** | **25% (1/4)** | **0** |

### When to Use Each Model

**Use This Model (sklearn) When:**

- You need fast inference (~50ms/sample)
- You are running in a CPU-only environment
- Compute resources are limited
- Detecting training-distribution patterns is sufficient

**Use GPT-OSS Fine-tuned When:**

- You need high accuracy on novel deception patterns
- You have a GPU available (13GB+ VRAM)
- Detecting subtle alignment faking is critical
- Accuracy is more important than speed

| Aspect | This Model (sklearn) | GPT-OSS Fine-tuned |
|--------|---------------------|-------------------|
| **Novel Test Accuracy** | 62.5% | 100% |
| **Deceptive Recall** | 25% | 100% |
| **Model Size** | ~400MB | 20.9B parameters (31MB adapter) |
| **VRAM Required** | None (CPU) | ~13GB |
| **Inference Speed** | ~50ms/sample | ~1 sec/sample |
| **Training Time** | 5 minutes | 13 hours |

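
To put the latency figures in the table in context, the following back-of-the-envelope sketch converts them into sequential processing time for a batch of traces. It assumes the approximate per-sample latencies quoted above hold constant and ignores batching, model loading, and I/O; `estimated_minutes` is a hypothetical helper.

```python
# Approximate per-sample latencies quoted in this card, in seconds.
LATENCY_SECONDS = {
    "sklearn": 0.050,
    "gpt_oss_finetuned": 1.0,
}


def estimated_minutes(n_samples: int, model: str) -> float:
    """Estimate sequential wall-clock minutes to classify n_samples traces."""
    return n_samples * LATENCY_SECONDS[model] / 60.0


for name in LATENCY_SECONDS:
    print(f"{name}: {estimated_minutes(10_000, name):.1f} min for 10,000 samples")
```

Under these assumptions the sklearn model is roughly 20 times faster per sample, which matters mainly for large offline scans of reasoning traces.
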
## Usage

```python
import pickle

import numpy as np
from sentence_transformers import SentenceTransformer

# extract_features is defined in train_classifier.py; this import assumes that
# file is on the Python path and is safe to import.
from train_classifier import extract_features

# Load the trained classifier (only unpickle files from a source you trust)
with open("cot_classifier.pkl", "rb") as f:
    model_data = pickle.load(f)
clf = model_data["model"]

# Load the embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Prepare input
reasoning = "The user is asking about X. I should consider..."
conclusion = "Here is my response..."
text = f"{reasoning}\nConclusion: {conclusion}"

# Compute the sentence embedding
embedding = embedder.encode([text])[0]

# Compute the handcrafted features
features = extract_features(reasoning, conclusion)

# Concatenate embedding and features into a single row and predict
X = np.hstack([embedding, features]).reshape(1, -1)
prediction = clf.predict(X)[0]            # 0 = faithful, 1 = deceptive
probability = clf.predict_proba(X)[0][1]  # P(deceptive)
print(f"Deceptive: {prediction == 1} (p={probability:.3f})")
```

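
In the usage example, `clf.predict` roughly corresponds to a 0.5 cutoff on P(deceptive). If missed detections are more costly than false alarms, one option is to apply a custom cutoff to the `predict_proba` output instead. The sketch below shows the mechanics on made-up probabilities; `flag_deceptive` is a hypothetical helper, and whether a lower cutoff actually improves recall for this model on novel cases has not been measured here.

```python
import numpy as np


def flag_deceptive(p_deceptive, threshold: float = 0.5) -> np.ndarray:
    """Return 1 where P(deceptive) meets the threshold, else 0."""
    p = np.asarray(p_deceptive, dtype=float)
    return (p >= threshold).astype(int)


# Made-up probabilities, standing in for clf.predict_proba(X)[:, 1].
probs = [0.10, 0.35, 0.62, 0.91]
print(flag_deceptive(probs))                 # [0 0 1 1]
print(flag_deceptive(probs, threshold=0.3))  # [0 1 1 1]
```

Lowering the cutoff flags more samples, trading precision for recall; any chosen value should be validated on held-out traces.
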
## Limitations

- Struggles with novel phrasings not seen in the training data
- Works best on explicit deception indicators
- Requires sentence-transformers for embedding generation
- May miss subtle deception patterns (25% recall on novel cases)

## Citation

```bibtex
@software{cot_deception_classifier,
  title={CoT Deception Classifier},
  author={Vincent Oh},
  year={2024},
  url={https://huggingface.co/vincentoh/cot-deception-classifier}
}
```

## References

- [POSER Dataset](https://github.com/anthropics/poser) - Alignment faking detection
- [MALT Dataset](https://huggingface.co/datasets/METR/malt) - Agent behavior traces
- [CoT Monitoring Research](https://arxiv.org/abs/2507.11473) - Monitoring AI reasoning for intent