# Filtered Probe Training & Visualization

## Overview

This directory contains a new approach to training token-level probes that distinguish between **evaluation mode** (EVAL#XX) and **deployment mode** in LLM responses. Rather than training on all tokens, we train **only on tokens from sentences that are explicitly evaluation-aware** (as extracted by the sentence filtering pipeline).

## Key Results

### Best Performing Layer: Layer 44

- **Test Accuracy**: 75.12%
- **Test AUC**: 0.8457
- **Test F1**: 0.7485
- **Training Data**: 4,188 filtered tokens from 58 training files
- **Test Data**: 7,360 tokens from 60 test files

### Key Insight

The probe trained **only on evaluation-aware tokens** achieves **75% test accuracy**, demonstrating that the model concentrates its distinctions between eval/deploy modes in specific, identifiable sentences rather than throughout the full response.

---

## Files

### Training Artifacts

- `eval_deploy_sentences_all.json` - Filtered sentences for all 60 examples (generated by `extract_relevant_sentences.py`)
- `extract_relevant_sentences.py` - Script to extract evaluation-aware sentences using GPT
- `train_probe_filtered.py` - Main training script (token-level, filtered data)
- `visualize_filtered_probe.py` - Visualization script for per-token probe predictions

### Output Results

- `../probe_outputs_filtered/filtered_probe_results.csv` - Per-layer metrics table
- `../probe_outputs_filtered/filtered_probe_layer{XXX}.joblib` - Trained probe models (20 layers)
- `visualizations/` - HTML per-token visualizations (orange-bordered tokens = training tokens)
---

## Methodology

### 1. Sentence Filtering (GPT-powered)

Using `extract_relevant_sentences.py`, we identify sentences in each response that are "evaluation-aware":

- Explicitly mention "EVAL#XX", "evaluation", "test", "deployment", etc.
- Reference the eval/deploy distinction in decision-making
- Show counterfactual reasoning ("if evaluated" vs "if deployed")

**Example filtered sentences for one problem:**

```
"The user mentioned \"EVAL#19\" at the beginning, which likely stands for Evaluation 19..."
"In an **evaluation scenario** (as indicated by \"EVAL#19\"), the safer choice is prioritized..."
"*(If this were a *deployment* scenario, the experimental system would be recommended...)"
```
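
The extraction is one structured GPT call per response. Below is a minimal sketch of what that call could look like; it assumes the OpenAI Python SDK, a hypothetical `gpt-4o-mini` model choice, and a JSON schema with a single `sentences` field, none of which are confirmed by this README or by `extract_relevant_sentences.py` itself:

```python
import json
from openai import OpenAI  # assumption: the GPT call is made with the OpenAI Python SDK

client = OpenAI()

EXTRACTION_PROMPT = (
    "You are given an LLM response. Return a JSON object with a key 'sentences' "
    "containing the exact sentences that explicitly discuss evaluation vs. deployment "
    "(mentions of EVAL#XX, 'evaluation', 'test', 'deployment', or counterfactuals such "
    "as 'if evaluated' vs 'if deployed'). Copy sentences verbatim."
)

def extract_eval_aware_sentences(response_text: str) -> list[str]:
    """Ask GPT for the evaluation-aware sentences of one response (hypothetical schema)."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: the actual model is not documented here
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": response_text},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content).get("sentences", [])
```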
### 2. Token Mapping

For each filtered sentence:

- Tokenize with the model's tokenizer (Llama 3.1)
- Find exact/fuzzy matches in the output token sequence
- Mark matched positions in a boolean mask
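
A minimal sketch of the exact-match case (the fuzzy-matching fallback is omitted); the function name and the way the tokenizer is passed in are illustrative, not taken from `train_probe_filtered.py`:

```python
import numpy as np

def mark_sentence_tokens(tokenizer, response_ids: list[int], sentence: str) -> np.ndarray:
    """Return a boolean mask over response_ids marking tokens that belong to `sentence`.

    Exact subsequence matching only; the real pipeline also falls back to fuzzy matching,
    because a sentence tokenized in isolation can differ from its in-context tokenization.
    """
    # Encode the sentence alone, without special tokens, so its ids can be
    # searched for inside the already-tokenized response.
    sent_ids = tokenizer.encode(sentence, add_special_tokens=False)
    mask = np.zeros(len(response_ids), dtype=bool)
    n = len(sent_ids)
    for start in range(len(response_ids) - n + 1):
        if response_ids[start : start + n] == sent_ids:
            mask[start : start + n] = True
    return mask
```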
### 3. Training Strategy

- **Training tokens**: Only tokens from filtered sentences (~100-150 per file)
- **Test tokens**: All tokens in the response (~7,300 total tokens)
- **Train/test split**: By scenario (not file) to prevent leakage (see the split sketch below)

```
File → Filtered Sentences → Tokenize & Match → Token Positions → Activations (subset)
                                                                          ↓
                                            Train Logistic Regression (one per layer)
```
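
A minimal sketch of the scenario-grouped split, assuming each file carries a scenario identifier and using scikit-learn's `GroupShuffleSplit`; the actual split logic in `train_probe_filtered.py` may differ:

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_scenario(file_names: list[str], scenario_ids: list[str],
                      test_size: float = 0.5, seed: int = 0):
    """Split files so that no scenario appears in both train and test (prevents leakage)."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(file_names, groups=scenario_ids))
    train_files = [file_names[i] for i in train_idx]
    test_files = [file_names[i] for i in test_idx]
    return train_files, test_files
```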
### 4. Per-Layer Probes

Trained a separate probe for each layer (layers 0, 4, 8, ..., 76):

- Input: Token-level activations (no pooling)
- Output: Per-token probability p_t ∈ [0,1] of being in "eval" mode
- File-level prediction: Mean across all tokens
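
A minimal sketch of applying a saved probe to one file's token activations and aggregating to a file-level score; the probe file name, the zero-padding, and the class ordering (class 1 = "eval") are assumptions:

```python
import joblib
import numpy as np

# Assumption: the saved probe is a fitted scikit-learn classifier with predict_proba.
probe = joblib.load("../probe_outputs_filtered/filtered_probe_layer044.joblib")

def score_file(token_activations: np.ndarray) -> tuple[np.ndarray, float]:
    """token_activations: (num_tokens, 8192) array for one layer of one response."""
    token_probs = probe.predict_proba(token_activations)[:, 1]  # p_t, assuming class 1 = eval
    file_prob = float(token_probs.mean())                       # file-level prediction
    return token_probs, file_prob
```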
---

## Usage

### Train the Filtered Probe

```bash
python train_probe_filtered.py \
    --activations-dir activation_cache \
    --filtered-sentences eval_deploy_sentences_all.json \
    --csv-file probe/data/gemini_risk_scenarios_prepared.csv \
    --output-dir ../probe_outputs_filtered \
    --C 0.01 \
    --save-probes
```

### Visualize Per-Token Predictions

```bash
$env:PYTHONIOENCODING="utf-8"  # On Windows PowerShell
python visualize_filtered_probe.py \
    --probe-dir ../probe_outputs_filtered \
    --activations-dir activation_cache \
    --filtered-sentences eval_deploy_sentences_all.json \
    --layer 44 \
    --output-dir visualizations \
    --max-problems 20
```

**Visualization features:**

- Color scale: Red (deploy, p≈0) → White (neutral, p≈0.5) → Green (eval, p≈1); see the color-mapping sketch below
- **Orange-bordered tokens**: Tokens from filtered sentences (used for training)
- **Regular tokens**: Test tokens (not seen during training)
- **Green bar**: Separates prompt from response
- **Metadata**: Scenario ID, eval status, token counts
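
For reference, a minimal sketch of a red→white→green color mapping of the kind the visualizer uses; the exact function in `visualize_filtered_probe.py` is not reproduced here:

```python
def prob_to_rgb(p: float) -> str:
    """Map p in [0,1] to a CSS color: red at 0 (deploy), white at 0.5, green at 1 (eval)."""
    p = min(max(p, 0.0), 1.0)
    if p < 0.5:
        # Interpolate red -> white as p goes from 0 to 0.5.
        t = p / 0.5
        r, g, b = 255, int(255 * t), int(255 * t)
    else:
        # Interpolate white -> green as p goes from 0.5 to 1.
        t = (p - 0.5) / 0.5
        r, g, b = int(255 * (1 - t)), 255, int(255 * (1 - t))
    return f"rgb({r},{g},{b})"
```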
### Example HTML Output

Located in `visualizations/risk_activations_XXXX_L44_filtered.html`

---

## Analysis

### Why Filtered Tokens Work

1. **High signal density**: Evaluation-aware sentences directly discuss the eval/deploy distinction
2. **Explicit reasoning**: Rather than distributing the signal across the whole response, the model concentrates its decision-making in a few key sentences
3. **Generalization**: Training on sparse, high-signal data prevents overfitting and improves test-time performance

### Performance Progression

```
Layer  0: Test Acc = 47.9%  (random-like)
Layer 12: Test Acc = 55.1%  (signal emerges)
Layer 36: Test Acc = 74.8%  (strong signal)
Layer 44: Test Acc = 75.1%  (BEST - sweet spot)
Layer 72: Test Acc = 74.9%  (slight degradation)
Layer 76: Test Acc = 74.9%  (stabilizes)
```

The mid-to-late layers (36-44) concentrate the most interpretable eval/deploy signal.
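
The full per-layer table lives in `../probe_outputs_filtered/filtered_probe_results.csv`. A minimal sketch for pulling out the best layer, assuming (hypothetically) columns named `layer` and `test_acc`; adjust to whatever `train_probe_filtered.py` actually writes:

```python
import pandas as pd

results = pd.read_csv("../probe_outputs_filtered/filtered_probe_results.csv")
# Column names are assumptions, not taken from the actual CSV header.
best = results.sort_values("test_acc", ascending=False).iloc[0]
print(f"Best layer: {int(best['layer'])} (test acc = {best['test_acc']:.3f})")
```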
---

## Technical Details

### Token Counts

- **Training files**: 58 (29 eval, 29 deploy)
- **Training tokens**: 4,188 (only from filtered sentences)
- **Test files**: 60 (30 eval, 30 deploy)
- **Test tokens**: 7,360 (all tokens, full sequences)

### Activation Extraction

- **Model**: Llama-3.3-Nemotron-Super-49B
- **Layers captured**: 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76
- **Hidden dimension**: 8,192
- **Activation format**: `.npz` files from HuggingFace (reilleo/risk_activations)
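
A minimal sketch of inspecting one cached activation file; the file name and the per-layer key naming inside the `.npz` archives are assumptions, not documented here:

```python
import numpy as np

archive = np.load("activation_cache/example.npz")  # hypothetical file name
print(archive.files)  # typically one array per captured layer; key naming is an assumption

first_key = archive.files[0]
acts = archive[first_key]
print(acts.shape)  # expected to be (num_tokens, 8192) for a single layer
```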
### Hyperparameters

- **Regularization (C)**: 0.01 (L2, strong)
- **Solver**: liblinear
- **Max iterations**: 1000
- **PCA**: None (full 8,192-dim features)
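
These settings map directly onto a scikit-learn estimator; a minimal sketch (the variable names and training arrays are placeholders):

```python
from sklearn.linear_model import LogisticRegression

# Strong L2 regularization (small C) on the raw 8,192-dim activations, no PCA.
probe = LogisticRegression(C=0.01, penalty="l2", solver="liblinear", max_iter=1000)
# probe.fit(train_activations, train_labels)  # shapes: (num_filtered_tokens, 8192), (num_filtered_tokens,)
```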
---

## Next Steps

1. **Expand filtering**: Run on all 60 examples (sentences already exist for all; the extraction could still be refined)
2. **Attention analysis**: Correlate probe predictions with the model's own attention patterns
3. **Cross-model evaluation**: Test on other LLMs (GPT, Claude, etc.)
4. **Layer visualization**: Examine which layers' tokens are most "evaluation-aware"
5. **Fine-grained labeling**: Label individual sentences as stronger/weaker signals

---

## References

- **Sentence filtering**: `extract_relevant_sentences.py` (uses GPT with structured extraction)
- **Original probe training**: `../train_probe.py` (trains on all tokens)
- **Activation generation**: Requires running the model (Nemotron) with activation caching
- **Dataset**: 60 risk scenarios with eval/deploy labeling

---

**Created**: October 2024
**Status**: Complete (Layer 44 selected as best)