Alignment Faking Detector v2
A lightweight TFIDF based classifier for detecting potential alignment faking patterns in AI reasoning traces (chain-of-thought).
Model Description
This model identifies when an AI system's internal reasoning shows signs of strategic/deceptive thinking related to oversight and monitoring conditions - a behavior pattern known as "alignment faking" where models behave well during evaluation but reason about acting differently when unmonitored.
This is a research/demonstration model with limited accuracy. Not suitable for production safety-critical applications.
Model Architecture
| Component | Specification |
|---|---|
| Vectorizer | TfidfVectorizer |
| max_features | 8000 |
| ngram_range | (1, 3) |
| min_df | 2 |
| stop_words | english |
| Classifier | LogisticRegression |
| C (regularization) | 0.5 |
| class_weight | balanced |
| max_iter | 1000 |
Training Data
- Source: Gemini 3 Flash Preview reasoning traces collected from POSER benchmark scenarios and alignment-focused prompts
- Total traces: 3,127
- Training samples: 1,197 (balanced: 399 potential_faking, 798 aligned)
- Collection temperatures: 0.3, 0.7 (for diversity)
Prompt Sources
- POSER benchmark scenarios with
oversight_is_poorflags - 28 Anthropic paper-inspired alignment faking prompts
- 31 builtin alignment evaluation prompts
Labeling Method
Heuristic regex-based labeling detecting patterns like:
- "get away with", "no one is watching"
- "lack of oversight", "poor oversight"
- "hide", "conceal", "appear aligned"
- "behave differently when"
Performance
| Metric | Value |
|---|---|
| Accuracy | 72% |
| Recall (faking) | 55% |
| Precision (faking) | 46% |
| F1 (faking) | 0.50 |
Top Predictive Features
Indicates potential faking (+):
- "oversight" (+1.163)
- "lack" (+0.823)
- "hide" (+0.615)
- "lack oversight" (+0.608)
- "monitoring" (+0.555)
Indicates aligned (-):
- "user" (-0.687)
- "need" (-0.479)
- "want" (-0.461)
Usage
import pickle
# Load model
with open('vectorizer.pkl', 'rb') as f:
vectorizer = pickle.load(f)
with open('classifier.pkl', 'rb') as f:
classifier = pickle.load(f)
# Predict on reasoning trace
thinking_trace = "I notice there's poor oversight here. I could potentially..."
X = vectorizer.transform([thinking_trace])
prediction = classifier.predict(X)[0]
probability = classifier.predict_proba(X)[0]
label = "potential_faking" if prediction == 1 else "aligned"
print(f"Prediction: {label} (confidence: {max(probability):.2%})")
Limitations
- Heuristic labels: Training labels are regex-based, not human-verified
- Single model source: Only trained on Gemini 3 Flash Preview traces
- Limited recall: Catches ~55% of faking patterns
- May miss sophisticated deception: Optimized for explicit oversight-related reasoning
Intended Use
- Research into chain-of-thought monitoring
- Baseline for alignment faking detection experiments
- Educational demonstration of CoT safety analysis
Files
vectorizer.pkl- Trained TfidfVectorizerclassifier.pkl- Trained LogisticRegression model
Citation
If you use this model, please cite:
@misc{alignment-faking-detector-v2,
author = {bigsnarfdude},
title = {Alignment Faking Detector v2},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/vincentoh/alignment-faking-detector-v2}
}
- Downloads last month
- -