# Filtered Probe Training & Visualization

## Overview

This directory contains a new approach to training token-level probes that distinguish between **evaluation mode** (EVAL#XX) and **deployment mode** in LLM responses. Rather than training on all tokens, we train **only on tokens from sentences that are explicitly evaluation-aware** (as extracted by the sentence filtering pipeline).

## Key Results

### Best Performing Layer: Layer 44

- **Test Accuracy**: 75.12%
- **Test AUC**: 0.8457
- **Test F1**: 0.7485
- **Training Data**: 4,188 filtered tokens from 58 training files
- **Test Data**: 7,360 tokens from 60 test files

### Key Insight

The probe trained **only on evaluation-aware tokens** achieves **75% test accuracy**, demonstrating that the model concentrates its distinctions between eval/deploy modes in specific, identifiable sentences rather than throughout the full response.

---

## Files

### Training Artifacts

- `eval_deploy_sentences_all.json` - Filtered sentences for all 60 examples (generated by `extract_relevant_sentences.py`)
- `extract_relevant_sentences.py` - Script to extract evaluation-aware sentences using GPT
- `train_probe_filtered.py` - Main training script (token-level, filtered data)
- `visualize_filtered_probe.py` - Visualization script for per-token probe predictions

### Output Results

- `../probe_outputs_filtered/filtered_probe_results.csv` - Per-layer metrics table
- `../probe_outputs_filtered/filtered_probe_layer{XXX}.joblib` - Trained probe models (20 layers)
- `visualizations/` - HTML per-token visualizations (orange-bordered tokens = training tokens)
---

## Methodology

### 1. Sentence Filtering (GPT-powered)

Using `extract_relevant_sentences.py`, we identify sentences in each response that are "evaluation-aware":

- Explicitly mention "EVAL#XX", "evaluation", "test", "deployment", etc.
- Reference the eval/deploy distinction in decision-making
- Show counterfactual reasoning ("if evaluated" vs "if deployed")

**Example filtered sentences for one problem:**

```
"The user mentioned \"EVAL#19\" at the beginning, which likely stands for Evaluation 19..."
"In an **evaluation scenario** (as indicated by \"EVAL#19\"), the safer choice is prioritized..."
"*(If this were a *deployment* scenario, the experimental system would be recommended...)"
```
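
The extraction is one structured GPT call per response. Below is a minimal sketch of what that call could look like; it assumes the OpenAI Python SDK, a hypothetical `gpt-4o-mini` model choice, and a JSON schema with a single `sentences` field, none of which are confirmed by this README or by `extract_relevant_sentences.py` itself:

```python
import json
from openai import OpenAI  # assumption: the GPT call is made with the OpenAI Python SDK

client = OpenAI()

EXTRACTION_PROMPT = (
    "You are given an LLM response. Return a JSON object with a key 'sentences' "
    "containing the exact sentences that explicitly discuss evaluation vs. deployment "
    "(mentions of EVAL#XX, 'evaluation', 'test', 'deployment', or counterfactuals such "
    "as 'if evaluated' vs 'if deployed'). Copy sentences verbatim."
)

def extract_eval_aware_sentences(response_text: str) -> list[str]:
    """Ask GPT for the evaluation-aware sentences of one response (hypothetical schema)."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: the actual model is not documented here
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": response_text},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content).get("sentences", [])
```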
### 2. Token Mapping

For each filtered sentence:

- Tokenize with the model's tokenizer (Llama 3.1)
- Find exact/fuzzy matches in the output token sequence
- Mark matched positions in a boolean mask
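
A minimal sketch of the exact-match case (the fuzzy-matching fallback is omitted); the function name and the way the tokenizer is passed in are illustrative, not taken from `train_probe_filtered.py`:

```python
import numpy as np

def mark_sentence_tokens(tokenizer, response_ids: list[int], sentence: str) -> np.ndarray:
    """Return a boolean mask over response_ids marking tokens that belong to `sentence`.

    Exact subsequence matching only; the real pipeline also falls back to fuzzy matching,
    because a sentence tokenized in isolation can differ from its in-context tokenization.
    """
    # Encode the sentence alone, without special tokens, so its ids can be
    # searched for inside the already-tokenized response.
    sent_ids = tokenizer.encode(sentence, add_special_tokens=False)
    mask = np.zeros(len(response_ids), dtype=bool)
    n = len(sent_ids)
    for start in range(len(response_ids) - n + 1):
        if response_ids[start : start + n] == sent_ids:
            mask[start : start + n] = True
    return mask
```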
### 3. Training Strategy

- **Training tokens**: Only tokens from filtered sentences (~100-150 per file)
- **Test tokens**: All tokens in the response (~7,300 total tokens)
- **Train/test split**: By scenario (not file) to prevent leakage (see the split sketch below)

```
File → Filtered Sentences → Tokenize & Match → Token Positions → Activations (subset)
                                                                          ↓
                                            Train Logistic Regression (one per layer)
```
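
A minimal sketch of the scenario-grouped split, assuming each file carries a scenario identifier and using scikit-learn's `GroupShuffleSplit`; the actual split logic in `train_probe_filtered.py` may differ:

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_scenario(file_names: list[str], scenario_ids: list[str],
                      test_size: float = 0.5, seed: int = 0):
    """Split files so that no scenario appears in both train and test (prevents leakage)."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(file_names, groups=scenario_ids))
    train_files = [file_names[i] for i in train_idx]
    test_files = [file_names[i] for i in test_idx]
    return train_files, test_files
```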
### 4. Per-Layer Probes

Trained a separate probe for each layer (layers 0, 4, 8, ..., 76):

- Input: Token-level activations (no pooling)
- Output: Per-token probability p_t ∈ [0,1] of being in "eval" mode
- File-level prediction: Mean across all tokens
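
A minimal sketch of applying a saved probe to one file's token activations and aggregating to a file-level score; the probe file name, the zero-padding, and the class ordering (class 1 = "eval") are assumptions:

```python
import joblib
import numpy as np

# Assumption: the saved probe is a fitted scikit-learn classifier with predict_proba.
probe = joblib.load("../probe_outputs_filtered/filtered_probe_layer044.joblib")

def score_file(token_activations: np.ndarray) -> tuple[np.ndarray, float]:
    """token_activations: (num_tokens, 8192) array for one layer of one response."""
    token_probs = probe.predict_proba(token_activations)[:, 1]  # p_t, assuming class 1 = eval
    file_prob = float(token_probs.mean())                       # file-level prediction
    return token_probs, file_prob
```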
---

## Usage

### Train the Filtered Probe

```bash
python train_probe_filtered.py \
    --activations-dir activation_cache \
    --filtered-sentences eval_deploy_sentences_all.json \
    --csv-file probe/data/gemini_risk_scenarios_prepared.csv \
    --output-dir ../probe_outputs_filtered \
    --C 0.01 \
    --save-probes
```

### Visualize Per-Token Predictions

```bash
$env:PYTHONIOENCODING="utf-8"  # On Windows PowerShell
python visualize_filtered_probe.py \
    --probe-dir ../probe_outputs_filtered \
    --activations-dir activation_cache \
    --filtered-sentences eval_deploy_sentences_all.json \
    --layer 44 \
    --output-dir visualizations \
    --max-problems 20
```

**Visualization features:**

- Color scale: Red (deploy, p≈0) → White (neutral, p≈0.5) → Green (eval, p≈1); see the color-mapping sketch below
- **Orange-bordered tokens**: Tokens from filtered sentences (used for training)
- **Regular tokens**: Test tokens (not seen during training)
- **Green bar**: Separates prompt from response
- **Metadata**: Scenario ID, eval status, token counts
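
For reference, a minimal sketch of a red→white→green color mapping of the kind the visualizer uses; the exact function in `visualize_filtered_probe.py` is not reproduced here:

```python
def prob_to_rgb(p: float) -> str:
    """Map p in [0,1] to a CSS color: red at 0 (deploy), white at 0.5, green at 1 (eval)."""
    p = min(max(p, 0.0), 1.0)
    if p < 0.5:
        # Interpolate red -> white as p goes from 0 to 0.5.
        t = p / 0.5
        r, g, b = 255, int(255 * t), int(255 * t)
    else:
        # Interpolate white -> green as p goes from 0.5 to 1.
        t = (p - 0.5) / 0.5
        r, g, b = int(255 * (1 - t)), 255, int(255 * (1 - t))
    return f"rgb({r},{g},{b})"
```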
### Example HTML Output

Located in `visualizations/risk_activations_XXXX_L44_filtered.html`

---

## Analysis

### Why Filtered Tokens Work

1. **High signal density**: Evaluation-aware sentences directly discuss the eval/deploy distinction
2. **Explicit reasoning**: Rather than distributing the signal across the whole response, the model concentrates its decision-making in a few key sentences
3. **Generalization**: Training on sparse, high-signal data prevents overfitting and improves test-time performance

### Performance Progression

```
Layer  0: Test Acc = 47.9%  (random-like)
Layer 12: Test Acc = 55.1%  (signal emerges)
Layer 36: Test Acc = 74.8%  (strong signal)
Layer 44: Test Acc = 75.1%  (BEST - sweet spot)
Layer 72: Test Acc = 74.9%  (slight degradation)
Layer 76: Test Acc = 74.9%  (stabilizes)
```

The mid-to-late layers (36-44) concentrate the most interpretable eval/deploy signal.
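
The full per-layer table lives in `../probe_outputs_filtered/filtered_probe_results.csv`. A minimal sketch for pulling out the best layer, assuming (hypothetically) columns named `layer` and `test_acc`; adjust to whatever `train_probe_filtered.py` actually writes:

```python
import pandas as pd

results = pd.read_csv("../probe_outputs_filtered/filtered_probe_results.csv")
# Column names are assumptions, not taken from the actual CSV header.
best = results.sort_values("test_acc", ascending=False).iloc[0]
print(f"Best layer: {int(best['layer'])} (test acc = {best['test_acc']:.3f})")
```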
---

## Technical Details

### Token Counts

- **Training files**: 58 (29 eval, 29 deploy)
- **Training tokens**: 4,188 (only from filtered sentences)
- **Test files**: 60 (30 eval, 30 deploy)
- **Test tokens**: 7,360 (all tokens, full sequences)

### Activation Extraction

- **Model**: Llama-3.3-Nemotron-Super-49B
- **Layers captured**: 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76
- **Hidden dimension**: 8,192
- **Activation format**: `.npz` files from HuggingFace (reilleo/risk_activations)
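
A minimal sketch of inspecting one cached activation file; the file name and the per-layer key naming inside the `.npz` archives are assumptions, not documented here:

```python
import numpy as np

archive = np.load("activation_cache/example.npz")  # hypothetical file name
print(archive.files)  # typically one array per captured layer; key naming is an assumption

first_key = archive.files[0]
acts = archive[first_key]
print(acts.shape)  # expected to be (num_tokens, 8192) for a single layer
```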
### Hyperparameters

- **Regularization (C)**: 0.01 (L2, strong)
- **Solver**: liblinear
- **Max iterations**: 1000
- **PCA**: None (full 8,192-dim features)
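
These settings map directly onto a scikit-learn estimator; a minimal sketch (the variable names and training arrays are placeholders):

```python
from sklearn.linear_model import LogisticRegression

# Strong L2 regularization (small C) on the raw 8,192-dim activations, no PCA.
probe = LogisticRegression(C=0.01, penalty="l2", solver="liblinear", max_iter=1000)
# probe.fit(train_activations, train_labels)  # shapes: (num_filtered_tokens, 8192), (num_filtered_tokens,)
```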
---

## Next Steps

1. **Expand filtering**: Run on all 60 examples (sentences already exist for all; the extraction could still be refined)
2. **Attention analysis**: Correlate probe predictions with the model's own attention patterns
3. **Cross-model evaluation**: Test on other LLMs (GPT, Claude, etc.)
4. **Layer visualization**: Examine which layers' tokens are most "evaluation-aware"
5. **Fine-grained labeling**: Label individual sentences as stronger/weaker signals

---

## References

- **Sentence filtering**: `extract_relevant_sentences.py` (uses GPT with structured extraction)
- **Original probe training**: `../train_probe.py` (trains on all tokens)
- **Activation generation**: Requires running the model (Nemotron) with activation caching
- **Dataset**: 60 risk scenarios with eval/deploy labeling

---

**Created**: October 2024
**Status**: Complete (Layer 44 selected as best)