Activation Heatmap Visualization

This script creates Neuronpedia-style heatmap visualizations of token activations from activation dump JSON files.

Overview

The script implements the same logarithmic scaling and color mapping approach used by Neuronpedia to visualize how features activate on different tokens. Each token is displayed with a green background whose intensity corresponds to the activation value.

Features

  • Logarithmic scaling: Uses the same formula as Neuronpedia for opacity mapping
  • Individual feature visualizations: Shows each feature's activations across tokens
  • Combined heatmap: Matrix view of all features vs all tokens
  • Color coding: Green background whose opacity rises from faint (low activation) to fully opaque (high activation)
  • Special token handling: Replaces special tokens with displayable characters
  • Value display: Optional display of numerical activation values on tokens
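
The special-token handling and row layout listed above can be pictured with the short sketch below. It is only an illustration: the substitution table and the helper names `displayable` and `wrap_tokens` are hypothetical and are not taken from the script, whose actual replacements may differ. The default of 20 tokens per row mirrors the documented `--tokens-per-row` default.

```python
# Hypothetical substitution table; the script's real mapping may differ.
DISPLAY_SUBSTITUTES = {"\n": "↵", "\t": "→", " ": "·"}


def displayable(token: str) -> str:
    """Replace whitespace characters so every token stays visible."""
    return "".join(DISPLAY_SUBSTITUTES.get(ch, ch) for ch in token)


def wrap_tokens(tokens, tokens_per_row: int = 20):
    """Split a token list into rows of at most `tokens_per_row` items."""
    return [
        tokens[i:i + tokens_per_row]
        for i in range(0, len(tokens), tokens_per_row)
    ]


rows = wrap_tokens([displayable(t) for t in ["<bos>", "entity", ":", " A", "\n"]],
                   tokens_per_row=2)
print(rows)
```

Raising `tokens_per_row` produces fewer, wider rows, which is the effect the `--tokens-per-row` option is described as having for longer prompts.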

Usage

Basic Usage

python scripts/visualization/activation_heatmap.py path/to/activations_dump.json

By default, this writes the visualizations under output/activation_heatmaps/.

Command Line Options

python scripts/visualization/activation_heatmap.py INPUT_JSON [OPTIONS]

Required:
  INPUT_JSON              Path to activation dump JSON file

Optional:
  -o, --output-dir DIR    Output directory (default: output/activation_heatmaps)
  -k, --top-k N           Number of top features to visualize (default: 10)
  --probe-index N         Index of probe result to visualize (default: 0)
  --tokens-per-row N      Tokens per row in visualization (default: 20)
  --no-values             Hide activation values on tokens
  --combined-only         Only generate combined heatmap

Examples

Visualize top 5 features from a specific probe:

python scripts/visualization/activation_heatmap.py \
  "output/examples/Dallas/activations_dump (2).json" \
  -k 5 \
  --probe-index 0 \
  -o output/my_visualizations

Generate only the combined heatmap:

python scripts/visualization/activation_heatmap.py \
  "output/examples/Dallas/activations_dump (2).json" \
  --combined-only

Adjust layout for longer prompts:

python scripts/visualization/activation_heatmap.py \
  "output/examples/Dallas/activations_dump (2).json" \
  --tokens-per-row 30

Input Format

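The script expects JSON files with the structure below. The listing is schematic: each ... marks elided values, and the // comment is an annotation, not valid JSON.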

{
  "model": "model-name",
  "results": [
    {
      "probe_id": "probe_0_Dallas",
      "prompt": "entity: A city in Texas, USA is Dallas",
      "tokens": ["<bos>", "entity", ":", " A", ...],
      "counts": [[9813.0, 72.0, ...], ...] // OR features array
    }
  ]
}

Each result can store its activations in either of two formats:

  • Legacy counts format: 2D array [n_features][n_tokens]
  • New features format: List of feature objects with metadata
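
As a rough sketch of how the legacy counts format might be consumed, the snippet below reads one probe result into a NumPy matrix and keeps the strongest features. The function name `top_k_from_counts` and the ranking rule (peak activation per feature) are assumptions made for illustration; the script may rank features differently. The new features format is left out because its fields are not described here. The inline dump is made-up sample data.

```python
import json

import numpy as np


def top_k_from_counts(dump: dict, probe_index: int = 0, k: int = 10):
    """Return (tokens, feature_indices, matrix) for the k strongest features."""
    result = dump["results"][probe_index]
    tokens = result["tokens"]
    counts = np.asarray(result["counts"], dtype=float)  # [n_features][n_tokens]
    if counts.ndim != 2 or counts.shape[1] != len(tokens):
        raise ValueError("counts must have shape [n_features][n_tokens]")
    # Assumption: rank features by their peak activation over the prompt.
    order = np.argsort(counts.max(axis=1))[::-1][:k]
    return tokens, order.tolist(), counts[order]


# Made-up sample in the legacy counts format (3 features, 2 tokens).
sample = json.loads(
    '{"model": "model-name", "results": [{"probe_id": "probe_0",'
    ' "tokens": ["<bos>", "entity"], "counts": [[1, 2], [5, 0], [3, 3]]}]}'
)
tokens, feature_ids, matrix = top_k_from_counts(sample, probe_index=0, k=2)
print(feature_ids, matrix.shape)
```

The shape check guards against a dump whose rows do not line up with the token list, which would otherwise surface later as a confusing plotting error.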

Output

The script generates:

  1. Combined heatmap (combined_heatmap.png): Matrix visualization showing all features
  2. Individual feature images (one per top-K feature): Detailed view of each feature's activations

All images are saved to: {output_dir}/probe_{index}/

Color Scheme

  • Base color: Emerald green (RGB: 52, 211, 153)
  • Opacity range: 0.05 (minimum) to 1.0 (maximum)
  • Threshold: Values below 0.00005 are not highlighted
  • Text color: Black on light backgrounds, white on dark backgrounds
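
One way to picture the color scheme is as the emerald base color composited over a white page at a given opacity, with the text color chosen from the resulting brightness. The sketch below assumes a white background, Rec. 601 luma weights, and a brightness cutoff of 128; none of these are confirmed by the script, and `blend_over_white` and `text_color` are hypothetical helper names.

```python
BASE_RGB = (52, 211, 153)  # emerald green from the color scheme above


def blend_over_white(opacity: float) -> tuple:
    """Return the RGB color seen when BASE_RGB is drawn over white at `opacity`."""
    return tuple(round(255 + (c - 255) * opacity) for c in BASE_RGB)


def text_color(background_rgb, cutoff: float = 128.0) -> str:
    """Pick black or white text from the background's perceived brightness."""
    r, g, b = background_rgb
    # Rec. 601 luma weights; the cutoff of 128 is an assumed midpoint.
    brightness = 0.299 * r + 0.587 * g + 0.114 * b
    return "black" if brightness >= cutoff else "white"


for opacity in (0.05, 0.5, 1.0):
    background = blend_over_white(opacity)
    print(opacity, background, text_color(background))
```

Under these assumptions even the fully opaque emerald is bright enough to keep black text, so the script's own rule for switching to white text presumably uses a different cutoff or a different background.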

Implementation Details

The visualization uses the same logarithmic opacity calculation as Neuronpedia:

opacity = MINIMUM_OPACITY + (log10(1 + 9 * ratio) * scale) / log10(10)

Where:

  • ratio = current_value / max_value
  • scale = 1 - MINIMUM_OPACITY
  • MINIMUM_OPACITY = 0.05

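Dividing by log10(10) = 1 leaves the value unchanged, so opacity runs from MINIMUM_OPACITY at ratio = 0 to 1.0 at ratio = 1. Because log10(1 + 9 * ratio) rises steeply near zero and flattens toward 1, the mapping emphasizes differences among low activations while still covering the full dynamic range.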
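
The formula can be written as a small Python function, shown here as a minimal sketch. The function name `activation_opacity`, the choice to return 0.0 for values under the highlight threshold, and the clamping of the ratio to 1.0 are assumptions for illustration; only the formula and the two constants come from this document.

```python
import math

MINIMUM_OPACITY = 0.05
HIGHLIGHT_THRESHOLD = 0.00005  # values below this are not highlighted


def activation_opacity(value: float, max_value: float) -> float:
    """Map an activation to a background opacity using the formula above."""
    # Assumption: unhighlighted tokens get opacity 0.0, and a non-positive
    # maximum means there is nothing to highlight.
    if max_value <= 0 or value < HIGHLIGHT_THRESHOLD:
        return 0.0
    ratio = min(value / max_value, 1.0)
    scale = 1 - MINIMUM_OPACITY
    return MINIMUM_OPACITY + (math.log10(1 + 9 * ratio) * scale) / math.log10(10)


for value in (0.0, 1.0, 5.0, 10.0):
    print(value, round(activation_opacity(value, max_value=10.0), 3))
```

A token at 10% of the maximum already receives an opacity of roughly 0.31, compared with about 0.15 under a linear mapping over the same range, which is how the logarithmic curve makes weak activations easier to see.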

Dependencies

  • matplotlib
  • numpy
  • Python 3.7+

Install with:

pip install matplotlib numpy