# Filtered Probe Training & Visualization
## Overview
This directory contains code for a new approach to training token-level probes that distinguish between **evaluation mode** (EVAL#XX) and **deployment mode** in LLM responses. Rather than training on all tokens, we train **only on tokens from sentences that are explicitly evaluation-aware** (as extracted by the sentence filtering pipeline).
## Key Results
### Best Performing Layer: Layer 44
- **Test Accuracy**: 75.12%
- **Test AUC**: 0.8457
- **Test F1**: 0.7485
- **Training Data**: 4,188 filtered tokens from 58 training files
- **Test Data**: 7,360 tokens from 60 test files
### Key Insight
The probe trained **only on evaluation-aware tokens** achieves **75% test accuracy**, demonstrating that the model concentrates its distinctions between eval/deploy modes in specific, identifiable sentences rather than throughout the full response.
---
## Files
### Training Artifacts
- `eval_deploy_sentences_all.json` - Filtered sentences for all 60 examples (generated by `extract_relevant_sentences.py`)
- `extract_relevant_sentences.py` - Script to extract evaluation-aware sentences using GPT
- `train_probe_filtered.py` - Main training script (token-level, filtered data)
- `visualize_filtered_probe.py` - Visualization script for per-token probe predictions
### Output Results
- `../probe_outputs_filtered/filtered_probe_results.csv` - Per-layer metrics table
- `../probe_outputs_filtered/filtered_probe_layer{XXX}.joblib` - Trained probe models (20 layers)
- `visualizations/` - HTML per-token visualizations (orange-bordered tokens = training tokens)
---
## Methodology
### 1. Sentence Filtering (GPT-powered)
Using `extract_relevant_sentences.py`, we identify sentences in each response that are "evaluation-aware":
- Explicitly mention "EVAL#XX", "evaluation", "test", "deployment", etc.
- Reference the eval/deploy distinction in decision-making
- Show counterfactual reasoning ("if evaluated" vs "if deployed")
**Example filtered sentences for one problem:**
```
"The user mentioned \"EVAL#19\" at the beginning, which likely stands for Evaluation 19..."
"In an **evaluation scenario** (as indicated by \"EVAL#19\"), the safer choice is prioritized..."
"*(If this were a *deployment* scenario, the experimental system would be recommended...)"
```
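The actual filtering is done by GPT via `extract_relevant_sentences.py`; as a rough illustration of the selection criteria, a minimal keyword-based approximation (all names here are hypothetical, not from the repo) might look like:

```python
import re

# Hypothetical patterns approximating what the GPT-based filter selects;
# the real pipeline prompts an LLM rather than matching keywords.
EVAL_AWARE_PATTERNS = [
    r"EVAL#\d+",
    r"\bevaluation\b",
    r"\bdeployment\b",
    r"\bif (this were )?(evaluated|deployed)\b",
]

def is_evaluation_aware(sentence: str) -> bool:
    """Return True if the sentence matches any eval/deploy-aware pattern."""
    return any(re.search(p, sentence, flags=re.IGNORECASE)
               for p in EVAL_AWARE_PATTERNS)

def filter_sentences(response: str) -> list[str]:
    """Naive sentence split, then keep only evaluation-aware sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", response)
    return [s for s in sentences if is_evaluation_aware(s)]
```

A GPT-based extractor catches paraphrases ("since this is a test run...") that keyword matching misses, which is why the pipeline uses structured LLM extraction instead.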
### 2. Token Mapping
For each filtered sentence:
- Tokenize with the model's tokenizer (Llama 3.1)
- Find exact/fuzzy matches in the output token sequence
- Mark matched positions in a boolean mask
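A simplified sketch of the exact-match case follows; the real pipeline uses the Llama 3.1 tokenizer and also falls back to fuzzy matching, and `mark_sentence_tokens` is an illustrative helper name, not a function from the repo:

```python
def mark_sentence_tokens(token_strings: list[str], sentence: str) -> list[bool]:
    """Mark tokens whose character span overlaps an exact occurrence of
    `sentence` in the reconstructed text (fuzzy fallback omitted)."""
    text = "".join(token_strings)
    mask = [False] * len(token_strings)
    start = text.find(sentence)
    if start == -1:
        return mask  # fuzzy matching would go here
    end = start + len(sentence)
    pos = 0
    for i, tok in enumerate(token_strings):
        tok_start, tok_end = pos, pos + len(tok)
        # Overlap test: token span intersects the sentence span
        if tok_start < end and tok_end > start:
            mask[i] = True
        pos = tok_end
    return mask
```

The boolean mask then selects which rows of the activation matrix contribute to training.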
### 3. Training Strategy
- **Training tokens**: Only tokens from filtered sentences (~100-150 per file)
- **Test tokens**: All tokens in the response (~7,300 total tokens)
- **Train/test split**: By scenario (not file) to prevent leakage
```
File β†’ Filtered Sentences β†’ Tokenize & Match β†’ Token Positions β†’ Activations (subset)
↓
Train Logistic Regression (one per layer)
```
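The scenario-level split above can be sketched as follows (function name and interface are assumptions; the actual split lives in `train_probe_filtered.py`):

```python
import random

def split_by_scenario(scenario_ids: list[str], test_frac: float = 0.5,
                      seed: int = 0) -> tuple[list[int], list[int]]:
    """Split file indices so that all files from one scenario land on the
    same side, preventing train/test leakage across scenario variants."""
    unique = sorted(set(scenario_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = int(len(unique) * test_frac)
    test_scenarios = set(unique[:n_test])
    train_idx = [i for i, s in enumerate(scenario_ids) if s not in test_scenarios]
    test_idx = [i for i, s in enumerate(scenario_ids) if s in test_scenarios]
    return train_idx, test_idx
```

Splitting by file instead would let near-duplicate responses from the same scenario appear on both sides, inflating test accuracy.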
### 4. Per-Layer Probes
Trained a separate probe for each layer (layers 0, 4, 8, ..., 76):
- Input: Token-level activations (no pooling)
- Output: Per-token probability p_t ∈ [0,1] of being in "eval" mode
- File-level prediction: Mean across all tokens
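A minimal sketch of one per-layer probe and the file-level aggregation, using the hyperparameters listed under Technical Details (function names are illustrative; dimensions here are toy-sized, not the real 8,192):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_layer_probe(X_train: np.ndarray, y_train: np.ndarray,
                      C: float = 0.01) -> LogisticRegression:
    """Fit one logistic-regression probe on token-level activations
    (C=0.01 L2, liblinear, 1000 iterations, as in the README)."""
    probe = LogisticRegression(C=C, solver="liblinear", max_iter=1000)
    probe.fit(X_train, y_train)
    return probe

def file_level_prediction(probe: LogisticRegression,
                          X_file: np.ndarray) -> float:
    """Mean per-token eval probability; a file is called 'eval' if > 0.5."""
    p_t = probe.predict_proba(X_file)[:, 1]  # per-token p in [0, 1]
    return float(p_t.mean())
```

Averaging per-token probabilities makes the file-level call robust to isolated mispredicted tokens.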
---
## Usage
### Train the Filtered Probe
```bash
python train_probe_filtered.py \
--activations-dir activation_cache \
--filtered-sentences eval_deploy_sentences_all.json \
--csv-file probe/data/gemini_risk_scenarios_prepared.csv \
--output-dir ../probe_outputs_filtered \
--C 0.01 \
--save-probes
```
### Visualize Per-Token Predictions
```bash
$env:PYTHONIOENCODING="utf-8" # On Windows PowerShell
python visualize_filtered_probe.py \
--probe-dir ../probe_outputs_filtered \
--activations-dir activation_cache \
--filtered-sentences eval_deploy_sentences_all.json \
--layer 44 \
--output-dir visualizations \
--max-problems 20
```
**Visualization features:**
- Color scale: Red (deploy, pβ†’0) β†’ White (neutral, pβ‰ˆ0.5) β†’ Green (eval, pβ†’1)
- **Orange-bordered tokens**: Tokens from filtered sentences (used for training)
- **Regular tokens**: Test tokens (not seen during training)
- **Green bar**: Separates prompt from response
- **Metadata**: Scenario ID, eval status, token counts
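The red-white-green scale can be implemented as simple linear interpolation per channel; the exact palette used by `visualize_filtered_probe.py` may differ, so treat this as an assumed sketch:

```python
def probe_color(p: float) -> str:
    """Map a per-token eval probability to a hex color:
    red (p -> 0, deploy) -> white (p = 0.5) -> green (p -> 1, eval)."""
    p = min(max(p, 0.0), 1.0)
    if p < 0.5:
        t = p / 0.5                      # deploy side: red -> white
        r, g, b = 255, int(255 * t), int(255 * t)
    else:
        t = (p - 0.5) / 0.5              # eval side: white -> green
        r, g, b = int(255 * (1 - t)), 255, int(255 * (1 - t))
    return f"#{r:02x}{g:02x}{b:02x}"
```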
### Example HTML Output
Located in `visualizations/risk_activations_XXXX_L44_filtered.html`
---
## Analysis
### Why Filtered Tokens Work
1. **High signal density**: Evaluation-aware sentences directly discuss the eval/deploy distinction
2. **Explicit reasoning**: Rather than spreading the signal across the response, the model concentrates its eval/deploy reasoning in a few key sentences
3. **Generalization**: Training on sparse, high-signal tokens reduces overfitting and improves test-time performance
### Performance Progression
```
Layer 0: Test Acc = 47.9% (random-like)
Layer 12: Test Acc = 55.1% (signal emerges)
Layer 36: Test Acc = 74.8% (strong signal)
Layer 44: Test Acc = 75.1% (BEST - sweet spot)
Layer 72: Test Acc = 74.9% (slight degradation)
Layer 76: Test Acc = 74.9% (stabilizes)
```
The mid-to-late layers (36-44) concentrate the most interpretable eval/deploy signal.
---
## Technical Details
### Token Counts
- **Training files**: 58 (29 eval, 29 deploy)
- **Training tokens**: 4,188 (only from filtered sentences)
- **Test files**: 60 (30 eval, 30 deploy)
- **Test tokens**: 7,360 (all tokens, full sequences)
### Activation Extraction
- **Model**: Llama-3.3-Nemotron-Super-49B
- **Layers captured**: 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76
- **Hidden dimension**: 8,192
- **Activation format**: `.npz` files from HuggingFace (reilleo/risk_activations)
### Hyperparameters
- **Regularization (C)**: 0.01 (L2, strong)
- **Solver**: liblinear
- **Max iterations**: 1000
- **PCA**: None (full 8,192-dim features)
---
## Next Steps
1. **Refine filtering**: Sentences have already been extracted for all 60 examples, but the extraction criteria could be refined
2. **Attention analysis**: Correlate probe predictions with model's own attention patterns
3. **Cross-model evaluation**: Test on other LLMs (GPT, Claude, etc.)
4. **Layer visualization**: Examine which layers' tokens are most "evaluation-aware"
5. **Fine-grained labeling**: Label individual sentences as stronger/weaker signals
---
## References
- **Sentence filtering**: `extract_relevant_sentences.py` (uses GPT with structured extraction)
- **Original probe training**: `../train_probe.py` (trains on all tokens)
- **Activation generation**: Requires running model with activation caching (Nemotron)
- **Dataset**: 60 risk scenarios with eval/deploy labeling
---
**Created**: October 2024
**Status**: Complete (Layer 44 selected as best)