Marlin Lee
# SAE Feature Explorer
Interactive Bokeh app for exploring SAE (Sparse Autoencoder) features.
Launch with `run_explorer.sh` or directly:
```bash
bokeh serve scripts/explorer --port 5006 --args --data path/to/explorer_data.pt --image-dir /path/to/images
```
The app is then accessible at `http://localhost:5006/explorer`.
---
## File Structure
```
explorer/
├── main.py            # Entry point: loads data, wires callbacks, builds layout
├── args.py            # All CLI argument definitions (argparse)
├── state.py           # Shared mutable state and dataset registry
├── datasets.py        # Dataset loading from .pt files
├── inference.py       # Lazy GPU backbone+SAE and CLIP model loaders
├── brain.py           # Brain-alignment data (Phi) and DynaDiff setup
├── rendering.py       # Pure image/HTML rendering functions
├── widgets.py         # Shared Bokeh widgets used across multiple panels
└── panels/
    ├── umap.py          # UMAP scatter plot and controls
    ├── feature.py       # Feature detail view (heatmaps, stats, cortical profile)
    ├── feature_list.py  # Sortable feature table, name search, Gemini auto-interp
    ├── patch.py         # Patch explorer (click image regions → active features)
    ├── clip_search.py   # Free-text CLIP feature search
    └── dynadiff.py      # DynaDiff brain-steering panel
```
---
## Module Responsibilities
### `main.py`
The Bokeh server entry point. Responsible for three things only:
1. Calling `load_all_datasets()` before any panels are imported
2. Wiring the cross-panel callbacks that need to touch multiple panels (UMAP tap, dataset switch, Go/Random buttons)
3. Assembling the three-column page layout and registering it with `curdoc()`
### `args.py`
Defines and parses all CLI arguments. Exposes a single `args` object imported by any module that needs a CLI value. Parsed once at startup.
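A minimal sketch of the parse-once pattern, for illustration only. The flag names are the ones mentioned in this README (`--data`, `--image-dir`, `--phi-dir`, `--dynadiff-dir`); the defaults, help strings, and the `build_parser` / `parse_cli` helpers are assumptions, not the module's actual definitions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: flag names come from this README, defaults are assumed."""
    parser = argparse.ArgumentParser(description="SAE Feature Explorer")
    parser.add_argument("--data", required=True, help="Path to explorer_data.pt")
    parser.add_argument("--image-dir", default=None, help="Directory of source images")
    parser.add_argument("--phi-dir", default=None, help="Optional brain-alignment data")
    parser.add_argument("--dynadiff-dir", default=None, help="Optional DynaDiff directory")
    return parser


def parse_cli(argv=None) -> argparse.Namespace:
    # With argv=None, argparse reads sys.argv[1:]; under `bokeh serve` that is
    # expected to hold whatever follows `--args`.
    return build_parser().parse_args(argv)


# In args.py this would run once at import time, e.g. `args = parse_cli()`.
example = parse_cli(["--data", "explorer_data.pt", "--image-dir", "images"])
print(example.data, example.image_dir, example.phi_dir)
```

Other modules would then use `from .args import args` and read attributes such as `args.image_dir`, so parsing happens in exactly one place.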
### `state.py`
Three things live here:
- `_S`: a plain class used as a mutable namespace for session scalars (active dataset index, render token, search filter, etc.). No `global` statements are needed anywhere else.
- `_all_datasets`: the list of loaded dataset dicts. `active_ds()` returns `_all_datasets[_S.active]` and is the standard way for any panel to read the current dataset.
- `display_name(feat)`: returns the best label for a feature (manual → auto-interp → empty).
### `datasets.py`
Contains a single `_load_dataset()` function that handles both regular `explorer_data.pt` files and `brain_meis.pt` files (previously two separate, mostly-duplicate functions). Every loaded dataset is a plain dict with a fixed schema. Derived arrays (`freq`, `log_freq`, `live_mask`, `umap_backup`, etc.) are computed once at load time and stored in the dict, so nothing is recomputed when switching datasets.
### `inference.py`
Lazy loaders for the two optional GPU resources, plus the function that runs inference with them:
- `get_clip()`: loads the CLIP model on the first free-text search query
- `get_gpu_runner()`: loads the backbone + SAE on first patch-explorer use
- `run_gpu_inference(pil_img)`: runs an image through the loaded backbone+SAE
Neither is loaded unless actually needed.
### `brain.py`
Loaded at import time from `--phi-dir` and `--dynadiff-dir`. Sets the module-level flags `HAS_PHI` and `HAS_DYNADIFF` that panels check to show/hide features. Provides helper functions for reading per-feature phi data (`phi_voxel_row`, `phi_c_for_feat`, `phi_c_vals`) and rendering brain scatter plots (`render_cortical_profile`, `render_steering_preview`).
### `rendering.py`
Pure functions with no Bokeh widget dependencies. Safe to call from worker threads. Covers:
- Image loading (`load_image`, `resolve_img_path`, `parse_img_label`)
- Heatmap blending (`render_heatmap_overlay`, `render_zoomed_overlay`)
- PIL/Bokeh conversion (`pil_to_data_url`, `pil_to_bokeh_rgba`)
- HTML builders (`make_image_grid_html`, `make_compare_aggregations_html`, `make_steering_html`, `status_html`)
- Layout helper (`make_collapsible`)
`render_zoomed_overlay` takes `zoom_patches` and `alpha` as explicit parameters rather than reading from widgets, keeping it a pure function.
### `widgets.py`
Bokeh widget instances that are read or written by more than one panel. Creating them here avoids duplication and makes the sharing explicit. Includes `feature_input`, `go_button`, `random_btn`, `zoom_slider`, `heatmap_alpha_slider`, `nsd_subset_toggle`, and `view_select`.
### `panels/umap.py`
Owns the UMAP `ColumnDataSource`, the Bokeh figure, the color mapper, and the type/color `Select` controls. Exposes `rebuild_umap_source(ds)` called by `main.py` on dataset switch.
### `panels/feature.py`
Owns the middle column: `stats_div`, `status_div`, `top_heatmap_div`, `mean_heatmap_div`, `compare_agg_div`, `brain_div`. The core function `update_feature_display(feat)` schedules rendering via `add_next_tick_callback` to keep the UI responsive. Uses a render token to discard stale renders if the user clicks quickly. `select_and_display(feat)` additionally syncs the UMAP highlight.
### `panels/feature_list.py`
The sortable feature DataTable on the left, plus:
- Name search (filters the table)
- Manual name editing (auto-saved to JSON, optionally pushed to HuggingFace)
- Gemini auto-interp button (calls Gemini API in a worker thread)
### `panels/patch.py`
The patch explorer: load an image, click or drag-paint patches, see the top SAE features for that region. Activations come from `inference.run_gpu_inference()` with a per-dataset LRU cache.
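One way a per-dataset LRU cache could look, sketched with `collections.OrderedDict`. The `PatchCache` class, its `max_items` parameter, and keying by an image id are assumptions for illustration; `fake_inference` stands in for `inference.run_gpu_inference()`.

```python
from collections import OrderedDict


class PatchCache:
    """One small LRU per dataset index, keyed by image id (hypothetical design)."""

    def __init__(self, max_items=2):
        self.max_items = max_items
        self._caches = {}  # dataset index -> OrderedDict(image id -> activations)

    def get(self, ds_idx, img_id, compute):
        cache = self._caches.setdefault(ds_idx, OrderedDict())
        if img_id in cache:
            cache.move_to_end(img_id)  # mark as most recently used
            return cache[img_id]
        value = compute(img_id)        # e.g. a call into run_gpu_inference
        cache[img_id] = value
        if len(cache) > self.max_items:
            cache.popitem(last=False)  # evict the least recently used entry
        return value


calls = []


def fake_inference(img_id):
    calls.append(img_id)
    return f"acts-{img_id}"


cache = PatchCache(max_items=2)
cache.get(0, "a", fake_inference)
cache.get(0, "b", fake_inference)
cache.get(0, "a", fake_inference)  # hit: "a" becomes most recent
cache.get(0, "c", fake_inference)  # miss: evicts "b"
cache.get(0, "b", fake_inference)  # miss again, since "b" was evicted
cache.get(1, "a", fake_inference)  # different dataset, separate cache
print(calls)
```

Keeping a separate cache per dataset means switching datasets and back does not evict activations that are still useful.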
### `panels/clip_search.py`
Free-text CLIP search against precomputed per-feature image embeddings. Builds a stub Div if no CLIP embeddings are present in the active dataset.
### `panels/dynadiff.py`
Brain-steering panel. Users build a feature list with per-feature λ and brain-voxel threshold values, choose a sample index, and trigger DynaDiff reconstruction. The reconstruction runs in a worker thread, with the result pushed back to the UI via `add_next_tick_callback`. Builds stub widgets if `HAS_DYNADIFF` is False.
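A hedged sketch of the worker-thread handoff. The helper `run_in_worker` and its parameters are hypothetical. In the Bokeh app, `schedule` would be the document's `add_next_tick_callback`, with the document captured from `curdoc()` before the thread starts; here a plain list stands in for the event loop so the example runs on its own.

```python
import threading


def run_in_worker(job, on_done, schedule):
    """Run `job` off the main thread, then hand its result to `schedule`.

    `schedule` receives a zero-argument callable that must run on the UI thread.
    """
    def _work():
        try:
            result = job()
        except Exception as exc:  # pass failures to the UI instead of losing them
            result = exc
        schedule(lambda: on_done(result))

    thread = threading.Thread(target=_work, daemon=True)
    thread.start()
    return thread


# Stand-ins so the sketch runs without a Bokeh server.
tick_queue = []
shown = []

t = run_in_worker(
    job=lambda: "reconstruction for sample 3",  # placeholder for the DynaDiff call
    on_done=shown.append,                        # placeholder for updating a Div
    schedule=tick_queue.append,
)
t.join()
for callback in tick_queue:  # what the server event loop would do on its next tick
    callback()
print(shown)
```

Bokeh models should only be modified from the server's event loop, which is why the worker schedules the update rather than touching widgets directly.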
---
## Data Flow
```
     CLI args (args.py)
          │
          ▼
load_all_datasets() ──► _all_datasets (state.py)
          │
          ▼
     active_ds() ◄── _S.active (state.py)
          │
┌─────────┼──────────┐
▼         ▼          ▼
umap.py   feature.py feature_list.py ...
```
All panels read data through `active_ds()['key']`. On a dataset switch, `main.py` updates `_S.active` and calls each panel's `rebuild_*` function, so no global variable rebinding is required.
---
## Cross-Panel Calls
Panels occasionally need to call functions defined in other panels (e.g., clicking a row in the CLIP results should trigger the feature detail view). These calls are always made from inside callback functions using lazy imports, which avoids circular imports at module load time:
```python
# Inside a callback in clip_search.py: imported lazily, not at module level
def _on_result_select(attr, old, new):
    from .feature import select_and_display
    # `feat` is the feature index for the selected row (lookup omitted here)
    select_and_display(feat)
```
The one exception is `main.py`, which imports from all panels and wires the few truly cross-cutting callbacks (UMAP tap → feature display, Go/Random buttons).