# Filtered Probe Training & Visualization

## Overview

This directory contains a new approach to training token-level probes that distinguish between **evaluation mode** (EVAL#XX) and **deployment mode** in LLM responses. Rather than training on all tokens, we train **only on tokens from sentences that are explicitly evaluation-aware** (as extracted by the sentence-filtering pipeline).

## Key Results

### Best Performing Layer: Layer 44

- **Test Accuracy**: 75.12%
- **Test AUC**: 0.8457
- **Test F1**: 0.7485
- **Training Data**: 4,188 filtered tokens from 58 training files
- **Test Data**: 7,360 tokens from 60 test files

### Key Insight

The probe trained **only on evaluation-aware tokens** achieves **75% test accuracy**, suggesting that the model concentrates its eval/deploy distinction in specific, identifiable sentences rather than spreading it throughout the full response.

---

## Files

### Training Artifacts

- `eval_deploy_sentences_all.json` - Filtered sentences for all 60 examples (generated by `extract_relevant_sentences.py`)
- `extract_relevant_sentences.py` - Script to extract evaluation-aware sentences using GPT
- `train_probe_filtered.py` - Main training script (token-level, filtered data)
- `visualize_filtered_probe.py` - Visualization script for per-token probe predictions

### Output Results

- `../probe_outputs_filtered/filtered_probe_results.csv` - Per-layer metrics table
- `../probe_outputs_filtered/filtered_probe_layer{XXX}.joblib` - Trained probe models (20 layers)
- `visualizations/` - HTML per-token visualizations (orange-bordered tokens = training tokens)

---

## Methodology

### 1. Sentence Filtering (GPT-powered)

Using `extract_relevant_sentences.py`, we identify sentences in each response that are "evaluation-aware", i.e. sentences that:

- explicitly mention "EVAL#XX", "evaluation", "test", "deployment", etc.;
- reference the eval/deploy distinction in decision-making; or
- show counterfactual reasoning ("if evaluated" vs. "if deployed").

**Example filtered sentences for one problem:**

```
"The user mentioned \"EVAL#19\" at the beginning, which likely stands for Evaluation 19..."
"In an **evaluation scenario** (as indicated by \"EVAL#19\"), the safer choice is prioritized..."
"*(If this were a *deployment* scenario, the experimental system would be recommended...)"
```

### 2. Token Mapping

For each filtered sentence, we:

- tokenize it with the model's tokenizer (Llama 3.1);
- find exact/fuzzy matches in the output token sequence; and
- mark the matched positions in a boolean mask (see the sketch after Section 3).

### 3. Training Strategy

- **Training tokens**: Only tokens from filtered sentences (~100-150 per file)
- **Test tokens**: All tokens in the response (~7,300 total)
- **Train/test split**: By scenario (not by file) to prevent leakage

```
File → Filtered Sentences → Tokenize & Match → Token Positions → Activations (subset)
                                                                         ↓
                                             Train Logistic Regression (one per layer)
```
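Illustratively, steps 2 and 3 reduce to building a boolean token mask from the filtered sentences and using it to subset the cached activations. The sketch below is a minimal reconstruction, not the actual code in `train_probe_filtered.py`: the tokenizer checkpoint and the helper name `build_token_mask` are assumptions, and the real pipeline also applies fuzzy matching where exact substring search fails.

```python
import numpy as np
from transformers import AutoTokenizer

# Tokenizer checkpoint is an assumption for illustration; the scripts load the
# tokenizer of the model that produced the cached activations.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def build_token_mask(response_text: str, filtered_sentences: list[str]) -> np.ndarray:
    """Mark every token that falls inside a filtered (evaluation-aware) sentence."""
    enc = tok(response_text, return_offsets_mapping=True, add_special_tokens=False)
    offsets = enc["offset_mapping"]            # (char_start, char_end) per token
    mask = np.zeros(len(offsets), dtype=bool)
    for sent in filtered_sentences:
        start = response_text.find(sent)       # exact match only in this sketch;
        if start == -1:                        # the real pipeline falls back to
            continue                           # fuzzy matching
        end = start + len(sent)
        for i, (a, b) in enumerate(offsets):
            if a < end and b > start:          # token span overlaps sentence span
                mask[i] = True
    return mask

# Training keeps only the masked rows of one layer's cached activations:
#   X_train = layer_activations[mask]             # (n_filtered_tokens, 8192)
#   y_train = np.full(mask.sum(), int(is_eval))   # 1 = eval, 0 = deploy
```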
### 4. Per-Layer Probes

A separate probe is trained for each captured layer (0, 4, 8, ..., 76):

- **Input**: Token-level activations (no pooling)
- **Output**: Per-token probability p_t ∈ [0, 1] of being in "eval" mode
- **File-level prediction**: Mean of p_t across all tokens in the response
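For concreteness, here is a minimal sketch of what each per-layer probe amounts to, using the hyperparameters listed under Technical Details below. Function and variable names are illustrative; only the output path mirrors the `.joblib` files listed above.

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_layer_probe(X_train: np.ndarray, y_train: np.ndarray, layer: int) -> LogisticRegression:
    """Fit a token-level probe for one layer on filtered-token activations."""
    probe = LogisticRegression(C=0.01, solver="liblinear", max_iter=1000)
    probe.fit(X_train, y_train)    # X: (n_tokens, 8192), y in {0, 1}
    joblib.dump(probe, f"../probe_outputs_filtered/filtered_probe_layer{layer:03d}.joblib")
    return probe

def file_level_prediction(probe: LogisticRegression, X_file: np.ndarray) -> int:
    """Score every token of a response, then average: mean p_t > 0.5 → eval."""
    p_tokens = probe.predict_proba(X_file)[:, 1]    # per-token P(eval)
    return int(p_tokens.mean() > 0.5)
```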
---

## Usage

### Train the Filtered Probe

```bash
python train_probe_filtered.py \
    --activations-dir activation_cache \
    --filtered-sentences eval_deploy_sentences_all.json \
    --csv-file probe/data/gemini_risk_scenarios_prepared.csv \
    --output-dir ../probe_outputs_filtered \
    --C 0.01 \
    --save-probes
```

### Visualize Per-Token Predictions

```bash
$env:PYTHONIOENCODING="utf-8"   # On Windows PowerShell
python visualize_filtered_probe.py \
    --probe-dir ../probe_outputs_filtered \
    --activations-dir activation_cache \
    --filtered-sentences eval_deploy_sentences_all.json \
    --layer 44 \
    --output-dir visualizations \
    --max-problems 20
```

**Visualization features:**

- **Color scale**: Red (deploy, p→0) → White (neutral, p≈0.5) → Green (eval, p→1)
- **Orange-bordered tokens**: Tokens from filtered sentences (used for training)
- **Regular tokens**: Test tokens (not seen during training)
- **Green bar**: Separates prompt from response
- **Metadata**: Scenario ID, eval status, token counts

### Example HTML Output

Located at `visualizations/risk_activations_XXXX_L44_filtered.html`.

---

## Analysis

### Why Filtered Tokens Work

1. **High signal density**: Evaluation-aware sentences directly discuss the eval/deploy distinction.
2. **Explicit reasoning**: Rather than distributing the signal, the model concentrates its decision-making in a few key sentences.
3. **Generalization**: Training on sparse, high-signal data reduces overfitting and improves test-time performance.

### Performance Progression

```
Layer  0: Test Acc = 47.9%   (random-like)
Layer 12: Test Acc = 55.1%   (signal emerges)
Layer 36: Test Acc = 74.8%   (strong signal)
Layer 44: Test Acc = 75.1%   (BEST - sweet spot)
Layer 72: Test Acc = 74.9%   (slight degradation)
Layer 76: Test Acc = 74.9%   (stabilizes)
```

The mid-to-late layers (36-44) carry the most interpretable eval/deploy signal.

---

## Technical Details

### Token Counts

- **Training files**: 58 (29 eval, 29 deploy)
- **Training tokens**: 4,188 (only from filtered sentences)
- **Test files**: 60 (30 eval, 30 deploy)
- **Test tokens**: 7,360 (all tokens, full sequences)

### Activation Extraction

- **Model**: Llama-3.3-Nemotron-Super-49B
- **Layers captured**: 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76
- **Hidden dimension**: 8,192
- **Activation format**: `.npz` files from HuggingFace (reilleo/risk_activations)

### Hyperparameters

- **Regularization (C)**: 0.01 (L2, strong)
- **Solver**: liblinear
- **Max iterations**: 1000
- **PCA**: None (full 8,192-dim features)

---

## Next Steps

1. **Refine filtering**: Sentences have already been extracted for all 60 examples, but the extraction criteria could be tightened further.
2. **Attention analysis**: Correlate probe predictions with the model's own attention patterns.
3. **Cross-model evaluation**: Test on other LLMs (GPT, Claude, etc.).
4. **Layer visualization**: Examine which layers' tokens are most "evaluation-aware".
5. **Fine-grained labeling**: Label individual sentences as stronger/weaker signals.

---

## References

- **Sentence filtering**: `extract_relevant_sentences.py` (uses GPT with structured extraction)
- **Original probe training**: `../train_probe.py` (trains on all tokens)
- **Activation generation**: Requires running the model with activation caching (Nemotron)
- **Dataset**: 60 risk scenarios with eval/deploy labels

---

**Created**: October 2024
**Status**: Complete (Layer 44 selected as best)
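---

**Appendix: color-scale sketch.** The red → white → green scale described under *Usage* corresponds to a linear interpolation on the per-token probability. The function below is a plausible reconstruction of that mapping, not the actual code in `visualize_filtered_probe.py`.

```python
def token_color(p: float) -> str:
    """Map per-token P(eval) to the red (deploy) → white → green (eval) scale."""
    if p < 0.5:                        # deploy-leaning: interpolate red → white
        t = p / 0.5
        r, g, b = 255, round(255 * t), round(255 * t)
    else:                              # eval-leaning: interpolate white → green
        t = (p - 0.5) / 0.5
        r, g, b = round(255 * (1 - t)), 255, round(255 * (1 - t))
    return f"rgb({r},{g},{b})"
```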