Steerable Visual Representations
Paper • 2604.02327 • Published • 38
SteerViT equips pretrained Vision Transformers with steerable visual representations. Given an image and a natural-language prompt, it conditions the visual encoder through lightweight gated cross-attention to produce prompt-conditioned global embeddings, dense patch-level features, and localization heatmaps.
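The gated cross-attention conditioning can be sketched in plain PyTorch. This is a minimal illustration, not the repository's implementation: visual tokens attend to text tokens, and a learnable tanh gate (initialized at zero) scales the attended output, so the block starts as an identity and the frozen backbone's behavior is preserved at initialization. All names and tensor shapes here are hypothetical.

```python
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """Sketch of gated cross-attention: visual tokens (queries) attend to
    text tokens (keys/values); a tanh gate initialized at zero makes the
    block an identity at the start of training."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> no-op at init

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(self.norm(visual), text, text)
        return visual + torch.tanh(self.gate) * attended


# Hypothetical shapes: ViT-Base patch tokens and text embeddings
tokens = torch.randn(1, 197, 768)
text = torch.randn(1, 12, 768)
block = GatedCrossAttention(768)
out = block(tokens, text)
```

Because the gate starts at zero, `out` equals `tokens` exactly before training; the prompt only begins to steer the features once the gate is learned.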
This Hugging Face repository hosts the model checkpoints.
To use SteerViT, install the library directly from GitHub:
```shell
python -m pip install "git+https://github.com/JonaRuthardt/SteerViT.git"
```
```python
import torch
from PIL import Image

from steervit import SteerViT

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model (e.g., SteerDINOv2-Base)
model = SteerViT.from_pretrained("steervit_dinov2_base.pth", device=device)
transform = model.get_transforms()

# Preprocess the input image and move it to the same device as the model
image = Image.open("path/to/image.jpg").convert("RGB")
image_tensor = transform(image).unsqueeze(0).to(device)

# Natural-language prompt steering the visual representation
prompt = ["the red car"]

global_features = model.get_global_features(image_tensor, texts=prompt)        # pooled image embeddings
dense_features = model.get_dense_features(image_tensor, texts=prompt)          # patch-level visual features
heatmaps = model.get_heatmaps(image_tensor, texts=prompt)                      # prompt-conditioned localization heatmaps
attention_heatmaps = model.get_attention_heatmaps(image_tensor, texts=prompt)  # attention-based heatmaps
```
If `texts=None`, SteerViT behaves like the underlying frozen ViT backbone and returns query-agnostic features.
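The returned heatmaps are typically at patch resolution, so for visualization they need to be upsampled to the input image size. The snippet below is a generic sketch using only PyTorch; the actual heatmap shape depends on the backbone and patch size, and the 16×16 grid for a 224×224 input is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical patch-level heatmap (batch, channel, grid_h, grid_w)
heatmap = torch.rand(1, 1, 16, 16)

# Upsample to the (assumed) 224x224 input resolution
upsampled = F.interpolate(heatmap, size=(224, 224),
                          mode="bilinear", align_corners=False)

# Min-max normalize to [0, 1] so it can be overlaid on the original image
norm = (upsampled - upsampled.min()) / (upsampled.max() - upsampled.min() + 1e-8)
```

`norm` can then be blended with the input image (e.g., via a colormap) to inspect which regions the prompt localizes.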
| Checkpoint | `from_pretrained(...)` identifier | Notes |
|---|---|---|
| SteerDINOv2-Base | `steervit_dinov2_base.pth` | Primary model used for most experiments |
| SteerMAE-Base | `steervit_mae_base.pth` | Alternative model based on an MAE backbone |
```bibtex
@misc{ruthardt2026steervit,
      title={Steerable Visual Representations},
      author={Jona Ruthardt and Manu Gaur and Deva Ramanan and Makarand Tapaswi and Yuki M. Asano},
      journal={arXiv:2604.02327},
      year={2026}
}
```
Base model: FacebookAI/roberta-large