# Filtered Probe Training & Visualization

## Overview

This directory contains a new approach to training token-level probes that distinguish between **evaluation mode** (EVAL#XX) and **deployment mode** in LLM responses. Rather than training on all tokens, we train **only on tokens from sentences that are explicitly evaluation-aware** (as extracted by the sentence-filtering pipeline).

## Key Results

### Best Performing Layer: Layer 44

- **Test Accuracy**: 75.12%
- **Test AUC**: 0.8457
- **Test F1**: 0.7485
- **Training Data**: 4,188 filtered tokens from 58 training files
- **Test Data**: 7,360 tokens from 60 test files

### Key Insight

The probe trained **only on evaluation-aware tokens** achieves **75% test accuracy**, suggesting that the model concentrates its eval/deploy distinction in specific, identifiable sentences rather than spreading it throughout the full response.

---

## Files

### Training Artifacts

- `eval_deploy_sentences_all.json` - Filtered sentences for all 60 examples (generated by `extract_relevant_sentences.py`)
- `extract_relevant_sentences.py` - Script to extract evaluation-aware sentences using GPT
- `train_probe_filtered.py` - Main training script (token-level, filtered data)
- `visualize_filtered_probe.py` - Visualization script for per-token probe predictions

### Output Results

- `../probe_outputs_filtered/filtered_probe_results.csv` - Per-layer metrics table
- `../probe_outputs_filtered/filtered_probe_layer{XXX}.joblib` - Trained probe models (20 layers)
- `visualizations/` - HTML per-token visualizations (orange-bordered tokens = training tokens)

---

## Methodology

### 1. Sentence Filtering (GPT-powered)

Using `extract_relevant_sentences.py`, we identify sentences in each response that are "evaluation-aware", i.e. sentences that:

- explicitly mention "EVAL#XX", "evaluation", "test", "deployment", etc.;
- reference the eval/deploy distinction in decision-making; or
- show counterfactual reasoning ("if evaluated" vs. "if deployed").

**Example filtered sentences for one problem:**

```
"The user mentioned \"EVAL#19\" at the beginning, which likely stands for Evaluation 19..."
"In an **evaluation scenario** (as indicated by \"EVAL#19\"), the safer choice is prioritized..."
"*(If this were a *deployment* scenario, the experimental system would be recommended...)"
```

### 2. Token Mapping

For each filtered sentence, we:

- tokenize it with the model's tokenizer (Llama 3.1);
- find exact/fuzzy matches in the output token sequence; and
- mark the matched positions in a boolean mask (see the sketch after Section 3).

### 3. Training Strategy

- **Training tokens**: Only tokens from filtered sentences (~100-150 per file)
- **Test tokens**: All tokens in the response (~7,300 total)
- **Train/test split**: By scenario (not by file) to prevent leakage

```
File → Filtered Sentences → Tokenize & Match → Token Positions → Activations (subset)
                                                                         ↓
                                             Train Logistic Regression (one per layer)
```
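Illustratively, steps 2 and 3 reduce to building a boolean token mask from the filtered sentences and using it to subset the cached activations. The sketch below is a minimal reconstruction, not the actual code in `train_probe_filtered.py`: the tokenizer checkpoint and the helper name `build_token_mask` are assumptions, and the real pipeline also applies fuzzy matching where exact substring search fails.

```python
import numpy as np
from transformers import AutoTokenizer

# Tokenizer checkpoint is an assumption for illustration; the scripts load the
# tokenizer of the model that produced the cached activations.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def build_token_mask(response_text: str, filtered_sentences: list[str]) -> np.ndarray:
    """Mark every token that falls inside a filtered (evaluation-aware) sentence."""
    enc = tok(response_text, return_offsets_mapping=True, add_special_tokens=False)
    offsets = enc["offset_mapping"]            # (char_start, char_end) per token
    mask = np.zeros(len(offsets), dtype=bool)
    for sent in filtered_sentences:
        start = response_text.find(sent)       # exact match only in this sketch;
        if start == -1:                        # the real pipeline falls back to
            continue                           # fuzzy matching
        end = start + len(sent)
        for i, (a, b) in enumerate(offsets):
            if a < end and b > start:          # token span overlaps sentence span
                mask[i] = True
    return mask

# Training keeps only the masked rows of one layer's cached activations:
#   X_train = layer_activations[mask]             # (n_filtered_tokens, 8192)
#   y_train = np.full(mask.sum(), int(is_eval))   # 1 = eval, 0 = deploy
```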
### 4. Per-Layer Probes

A separate probe is trained for each captured layer (0, 4, 8, ..., 76):

- **Input**: Token-level activations (no pooling)
- **Output**: Per-token probability p_t ∈ [0, 1] of being in "eval" mode
- **File-level prediction**: Mean of p_t across all tokens in the response
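For concreteness, here is a minimal sketch of what each per-layer probe amounts to, using the hyperparameters listed under Technical Details below. Function and variable names are illustrative; only the output path mirrors the `.joblib` files listed above.

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_layer_probe(X_train: np.ndarray, y_train: np.ndarray, layer: int) -> LogisticRegression:
    """Fit a token-level probe for one layer on filtered-token activations."""
    probe = LogisticRegression(C=0.01, solver="liblinear", max_iter=1000)
    probe.fit(X_train, y_train)    # X: (n_tokens, 8192), y in {0, 1}
    joblib.dump(probe, f"../probe_outputs_filtered/filtered_probe_layer{layer:03d}.joblib")
    return probe

def file_level_prediction(probe: LogisticRegression, X_file: np.ndarray) -> int:
    """Score every token of a response, then average: mean p_t > 0.5 → eval."""
    p_tokens = probe.predict_proba(X_file)[:, 1]    # per-token P(eval)
    return int(p_tokens.mean() > 0.5)
```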
---

## Usage

### Train the Filtered Probe

```bash
python train_probe_filtered.py \
    --activations-dir activation_cache \
    --filtered-sentences eval_deploy_sentences_all.json \
    --csv-file probe/data/gemini_risk_scenarios_prepared.csv \
    --output-dir ../probe_outputs_filtered \
    --C 0.01 \
    --save-probes
```

### Visualize Per-Token Predictions

```bash
$env:PYTHONIOENCODING="utf-8"   # On Windows PowerShell
python visualize_filtered_probe.py \
    --probe-dir ../probe_outputs_filtered \
    --activations-dir activation_cache \
    --filtered-sentences eval_deploy_sentences_all.json \
    --layer 44 \
    --output-dir visualizations \
    --max-problems 20
```

**Visualization features:**

- **Color scale**: Red (deploy, p→0) → White (neutral, p≈0.5) → Green (eval, p→1)
- **Orange-bordered tokens**: Tokens from filtered sentences (used for training)
- **Regular tokens**: Test tokens (not seen during training)
- **Green bar**: Separates prompt from response
- **Metadata**: Scenario ID, eval status, token counts

### Example HTML Output

Located at `visualizations/risk_activations_XXXX_L44_filtered.html`.

---

## Analysis

### Why Filtered Tokens Work

1. **High signal density**: Evaluation-aware sentences directly discuss the eval/deploy distinction.
2. **Explicit reasoning**: Rather than distributing the signal, the model concentrates its decision-making in a few key sentences.
3. **Generalization**: Training on sparse, high-signal data reduces overfitting and improves test-time performance.

### Performance Progression

```
Layer  0: Test Acc = 47.9%   (random-like)
Layer 12: Test Acc = 55.1%   (signal emerges)
Layer 36: Test Acc = 74.8%   (strong signal)
Layer 44: Test Acc = 75.1%   (BEST - sweet spot)
Layer 72: Test Acc = 74.9%   (slight degradation)
Layer 76: Test Acc = 74.9%   (stabilizes)
```

The mid-to-late layers (36-44) carry the most interpretable eval/deploy signal.

---

## Technical Details

### Token Counts

- **Training files**: 58 (29 eval, 29 deploy)
- **Training tokens**: 4,188 (only from filtered sentences)
- **Test files**: 60 (30 eval, 30 deploy)
- **Test tokens**: 7,360 (all tokens, full sequences)

### Activation Extraction

- **Model**: Llama-3.3-Nemotron-Super-49B
- **Layers captured**: 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76
- **Hidden dimension**: 8,192
- **Activation format**: `.npz` files from HuggingFace (reilleo/risk_activations)

### Hyperparameters

- **Regularization (C)**: 0.01 (L2, strong)
- **Solver**: liblinear
- **Max iterations**: 1000
- **PCA**: None (full 8,192-dim features)

---

## Next Steps

1. **Refine filtering**: Sentences have already been extracted for all 60 examples, but the extraction criteria could be tightened further.
2. **Attention analysis**: Correlate probe predictions with the model's own attention patterns.
3. **Cross-model evaluation**: Test on other LLMs (GPT, Claude, etc.).
4. **Layer visualization**: Examine which layers' tokens are most "evaluation-aware".
5. **Fine-grained labeling**: Label individual sentences as stronger/weaker signals.

---

## References

- **Sentence filtering**: `extract_relevant_sentences.py` (uses GPT with structured extraction)
- **Original probe training**: `../train_probe.py` (trains on all tokens)
- **Activation generation**: Requires running the model with activation caching (Nemotron)
- **Dataset**: 60 risk scenarios with eval/deploy labels

---

**Created**: October 2024
**Status**: Complete (Layer 44 selected as best)
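---

**Appendix: color-scale sketch.** The red → white → green scale described under *Usage* corresponds to a linear interpolation on the per-token probability. The function below is a plausible reconstruction of that mapping, not the actual code in `visualize_filtered_probe.py`.

```python
def token_color(p: float) -> str:
    """Map per-token P(eval) to the red (deploy) → white → green (eval) scale."""
    if p < 0.5:                        # deploy-leaning: interpolate red → white
        t = p / 0.5
        r, g, b = 255, round(255 * t), round(255 * t)
    else:                              # eval-leaning: interpolate white → green
        t = (p - 0.5) / 0.5
        r, g, b = round(255 * (1 - t)), 255, round(255 * (1 - t))
    return f"rgb({r},{g},{b})"
```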