SAE Trait Annotation for Organismal Images
Sparse Autoencoder (SAE) checkpoint from the ICLR 2026 paper: Automatic Image-Level Morphological Trait Annotation for Organismal Images
Authors: Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su
- Website: osu-nlp-group.github.io/sae-trait-annotation
- Code: OSU-NLP-Group/sae-trait-annotation
- Dataset: osunlp/bioscan-traits
Model Description
This SAE is trained on penultimate-layer activations of a DINOv2 ViT-B/14 model applied to insect images from BIOSCAN-5M. Its latents capture interpretable visual features that correspond to species-level morphological traits (e.g., wing venation, body coloration, antennal structure). These latents are used to steer a multimodal LLM (Qwen2.5-VL-72B) into generating natural-language trait annotations.
Architecture:
- Base encoder: DINOv2 ViT-B/14 (frozen), activations from layer -2
- SAE input dimension (d_vit): 768
- Expansion factor: 32 → 24,576 latent dimensions
- Training data: patch-level activations from BIOSCAN-5M
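For reference, the sketch below spells out the encoder/decoder shapes these numbers imply. It assumes a standard ReLU sparse autoencoder; the actual saev.nn implementation may differ in parameter names and details such as bias handling, so treat it as illustrative only.

import torch
import torch.nn as nn

# Illustrative only: shapes implied by the architecture above, not the saev.nn code.
d_vit = 768               # DINOv2 ViT-B/14 hidden size
d_sae = 32 * d_vit        # expansion factor 32 -> 24,576 latents

class SketchSAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_vit, d_sae)  # encoder: 768 -> 24,576
        self.dec = nn.Linear(d_sae, d_vit)  # decoder: 24,576 -> 768

    def forward(self, x):
        f_x = torch.relu(self.enc(x))       # sparse latent activations
        x_hat = self.dec(f_x)               # reconstruction of the ViT activation
        return x_hat, f_x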
Usage
Clone the code repository (which vendors the saev library), then load and run the SAE as follows:
import torch
import saev.nn
import saev.activations
from torchvision import datasets
from torch.utils.data import DataLoader
device = "cuda" if torch.cuda.is_available() else "cpu"
# Build the image transform and DINOv2 ViT-B/14 backbone
img_transform = saev.activations.make_img_transform("dinov2", "sae.pt")
vit = saev.activations.make_vit("dinov2", "dinov2_vitb14")
# Wrap the ViT to record activations from layer 10 (penultimate), 256 patches
recorded_vit = saev.activations.RecordedVisionTransformer(
    vit, n_patches=256, cls_token=True, layers=[10]
).to(device)
# Load the SAE checkpoint
sae = saev.nn.load("sae.pt").to(device)
sae.eval()
# --- Encode a batch of images ---
# dataset: torchvision ImageFolder with images at 224x224
dataset = datasets.ImageFolder(root="/path/to/images/train")
def collate_fn(batch):
    images, labels = zip(*batch)
    return list(images), torch.tensor(labels)
loader = DataLoader(dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)
with torch.no_grad():
    for images, labels in loader:
        # Apply the transform per image, then batch and move to device
        images_t = torch.stack([img_transform(im) for im in images]).to(device)
        # vit_acts: (batch, n_layers, n_patches+1, d_vit)
        _, vit_acts = recorded_vit(images_t)
        # Select layer 0 of the recorded layers, drop the CLS token
        vit_acts = vit_acts[:, 0, 1:, :]  # (batch, 256, 768)
        # SAE forward: returns (reconstruction, features, aux)
        _, f_x, _ = sae(vit_acts)  # f_x: (batch, 256, 24576)
        # Threshold activations to find active latents (default thresh=0.9)
        active = f_x > 0.9  # (batch, 256, 24576) bool
The active latent indices per patch identify which SAE dimensions fire on each image region. These are used downstream to find species-prominent latents and generate trait annotations via an MLLM. See create_trait_dataset_mllm_sae.py for the full pipeline.
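As a rough illustration of that aggregation step, the hypothetical helper below ranks latents by how often they fire for a single species; the selection logic in create_trait_dataset_mllm_sae.py is the authoritative version and may differ.

import torch

def prominent_latents(active: torch.Tensor, labels: torch.Tensor, species_id: int, k: int = 20) -> torch.Tensor:
    # Hypothetical helper, not part of the released code.
    # active: (n_images, 256, 24576) bool tensor collected from the loop above.
    # labels: (n_images,) ImageFolder class indices.
    mask = labels == species_id
    # A latent counts as firing on an image if it is active on any patch.
    fired = active[mask].any(dim=1).float()  # (n_species_images, 24576)
    freq = fired.mean(dim=0)                 # per-latent firing frequency
    return torch.topk(freq, k=k).indices     # top-k candidate trait latents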
Training Details
- Training data: BIOSCAN-5M insect images preprocessed into ImageFolder layout
- Learning rate: 1e-3
- Sparsity coefficient (alpha): 4e-4
- Data patches: patch-level (256 patches/image), unscaled mean and norm
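These hyperparameters suggest the usual SAE objective of reconstruction error plus an alpha-weighted L1 penalty on the latent activations. The step below is a minimal sketch under that assumption, reusing the `sae` loaded in the Usage section; the actual saev training loop may compute its loss and normalization differently.

import torch

# Minimal sketch of one SAE training step under the hyperparameters above.
# Assumes `sae` returns (reconstruction, features, aux) as in the Usage example;
# the real saev training loop may differ.
alpha = 4e-4                                  # sparsity coefficient
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

def training_step(vit_acts: torch.Tensor) -> float:
    x_hat, f_x, _ = sae(vit_acts)
    recon = (x_hat - vit_acts).pow(2).mean()  # reconstruction (MSE) term
    sparsity = f_x.abs().mean()               # L1 penalty on latent activations
    loss = recon + alpha * sparsity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()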
Intended Use
- Generating morphological trait annotations for organismal (insect) images
- Interpretability research on vision foundation models via SAE latent analysis
- Downstream fine-tuning of classifiers using trait-annotated data (e.g., with BioCLIP)
Citation
@inproceedings{
pahuja2026automatic,
title={Automatic Image-Level Morphological Trait Annotation for Organismal Images},
author={Vardaan Pahuja and Samuel Stevens and Alyson East and Sydne Record and Yu Su},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=oFRbiaib5Q}
}
Acknowledgments
Code
- SAEV for sparse autoencoder training infrastructure.
- BioCLIP for downstream training/evaluation tooling.
Funding
This research was supported in part by NSF CAREER #2443149, NSF OAC 2118240, and an Alfred P. Sloan Foundation Fellowship. Computational resources were provided by the Ohio Supercomputer Center.
S. Record and A. East were additionally supported by NSF Award No. 242918 (EPSCOR Research Fellows: Advancing NEON-Enabled Science and Workforce Development at the University of Maine with AI) and Hatch project Award #MEO-022425 from the USDA National Institute of Food and Agriculture.
People
We thank colleagues in the OSU NLP group for valuable feedback. This work was in part conceived at Funcapalooza.