T-REN: Text-Aligned Region Encoder Network

Authors: Savya Khosla, Sethuraman TV, Aryan Chadha, Alex Schwing, Derek Hoiem

GitHub

T-REN (Text-aligned Region Encoder Network) is an image encoder that produces region-level tokens aligned with text, built on top of DINOv3 ViT-L/16. Compared to its patch-based backbone, T-REN delivers:

  • +5.9 mIoU on ADE20K open-vocabulary segmentation
  • +18.4% recall on COCO object-level text-image retrieval
  • +15.6% recall on Ego4D video object localization (VQ2D)
  • +17.6% mIoU on VSPW video scene parsing
  • 24× fewer tokens per image, 187× fewer per video
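
To put the token reduction in perspective, here is a back-of-the-envelope calculation. It assumes the 24× figure is measured against the patch grid of a 512 × 512 input (the resolution listed under Model details); the actual number of region tokens varies from image to image.

```python
# Rough token-count arithmetic.
# Assumption: the 24x reduction is relative to the patch grid of a
# 512 x 512 input split into 16 x 16 patches.
image_size = 512
patch_size = 16  # the "/16" in ViT-L/16

patches_per_side = image_size // patch_size
patch_tokens = patches_per_side ** 2
approx_region_tokens = patch_tokens / 24

print(f"Patch tokens per image: {patch_tokens}")
print(f"Approx. region tokens per image: {approx_region_tokens:.0f}")
```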

What's in this repo

This HuggingFace repo contains:

  • model.safetensors: the trained RegionEncoder head weights (~1.2 GB)
  • configuration_tren.py, modeling_tren.py, model.py, task_utils.py: source code loaded via trust_remote_code

The DINOv3 ViT-L/16 backbone is NOT included here. It belongs to Facebook Research and must be obtained separately (see Step 2 of the Quickstart).


Quickstart

Step 1: Install dependencies

pip install transformers torch torchvision kornia

Step 2: Get the DINOv3 weights

T-REN's backbone is DINOv3 ViT-L/16 with a DINOtxt text-alignment head. You need two weight files from the DINOv3 release:

File                                                              Description
dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth                      DINOv3 ViT-L/16 backbone
dinov3_vitl16_dinotxt_vision_head_and_text_encoder-a442d8f5.pth   DINOtxt vision head + text encoder

Place both files in the same directory, e.g. /path/to/dinov3_weights/.
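
Before loading the model, it can help to confirm that both files are actually in place. The following is a minimal sketch; the helper name missing_dinov3_files is hypothetical and not part of the T-REN code, and the directory is the same placeholder path used above.

```python
from pathlib import Path

# File names as listed in Step 2.
REQUIRED_FILES = [
    "dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth",
    "dinov3_vitl16_dinotxt_vision_head_and_text_encoder-a442d8f5.pth",
]

def missing_dinov3_files(weights_dir):
    """Return the required file names that are not present in weights_dir.

    Hypothetical helper for illustration; not part of the T-REN repo.
    """
    root = Path(weights_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

missing = missing_dinov3_files("/path/to/dinov3_weights/")
if missing:
    print("Missing DINOv3 files:", ", ".join(missing))
else:
    print("Both DINOv3 weight files found.")
```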

Step 3: Load and run T-REN

import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel

# Load model (downloads T-REN weights from this repo automatically)
model = AutoModel.from_pretrained("aryaaan12/T-REN", trust_remote_code=True)

# Load the DINOv3 backbone from your local directory
model.load_backbone("/path/to/dinov3_weights/")

model.eval()

# Prepare an image: resize to 512x512, values in [0, 1]
transform = T.Compose([
    T.Resize((512, 512)),
    T.ToTensor(),
])
image = transform(Image.open("your_image.jpg").convert("RGB"))
image = image.unsqueeze(0)  # (1, 3, 512, 512)

# Run T-REN
with torch.no_grad():
    outputs = model(image)

# Outputs
region_tokens = outputs["text_aligned_tokens"]  # list of (N, 1024) per image
region_masks  = outputs["region_masks"]          # list of (N, 32, 32) per image
class_token   = outputs["class_tokens"]          # (1, 1024) image-level token
print(f"Number of region tokens: {len(region_tokens[0])}")
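
Because the number of regions N differs from image to image, the region outputs are lists rather than a single tensor. If a downstream module expects a fixed-shape batch, one option is to pad the tokens and keep a validity mask. The sketch below uses random stand-in tensors in place of real model outputs, and the region counts are made up.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Stand-ins for outputs["text_aligned_tokens"] from a batch of two images
# (random values; the region counts 40 and 55 are invented for illustration).
region_tokens = [torch.randn(40, 1024), torch.randn(55, 1024)]

# Zero-pad to the image with the most regions: (batch, max_regions, 1024).
padded = pad_sequence(region_tokens, batch_first=True)

# Boolean mask marking real (non-padding) region tokens: (batch, max_regions).
lengths = torch.tensor([t.shape[0] for t in region_tokens])
valid = torch.arange(padded.shape[1])[None, :] < lengths[:, None]

print(padded.shape, valid.sum(dim=1).tolist())
```

The valid mask can then be used to exclude padding positions from attention or pooling.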

Text-guided region matching

import torch.nn.functional as F

texts = ["sky", "car", "building", "tree", "road"]

with torch.no_grad():
    outputs = model(image, texts=texts)

region_tokens = outputs["text_aligned_tokens"][0]   # (N, 1024)
text_tokens   = outputs["text_encodings"]           # (5, 1024)

# Cosine similarity: which text label fits each region best?
sim = F.normalize(region_tokens, dim=-1) @ F.normalize(text_tokens, dim=-1).T
best_labels = sim.argmax(dim=-1)
print([texts[i] for i in best_labels])
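
The per-region labels can be combined with the region masks to get a dense label map. This is a minimal sketch, not the evaluation code behind the reported numbers. It uses random stand-in tensors and assumes that mask values are non-negative scores where a larger value means a patch belongs more strongly to that region; check the actual mask semantics before relying on it.

```python
import torch
import torch.nn.functional as F

texts = ["sky", "car", "building", "tree", "road"]

# Random stand-ins for one image (assumptions: 6 regions, mask values in
# [0, 1] where larger means stronger membership in the region).
torch.manual_seed(0)
region_masks = torch.rand(6, 32, 32)            # (N, 32, 32)
best_labels = torch.randint(len(texts), (6,))   # (N,) indices into texts

# Assign every patch to its highest-scoring region, then to that region's label.
region_index = region_masks.argmax(dim=0)       # (32, 32)
label_map = best_labels[region_index]           # (32, 32)

# Upsample to the 512 x 512 input with nearest-neighbour interpolation
# so that the labels stay integer class indices.
label_map_full = F.interpolate(
    label_map[None, None].float(), size=(512, 512), mode="nearest"
)[0, 0].long()                                  # (512, 512)

print(label_map_full.shape)
```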

Model details

Architecture           RegionEncoder (cross-attention decoder) over DINOv3 ViT-L/16 features
Trainable parameters   31.5M (RegionEncoder head only; backbone is frozen)
Input resolution       512 × 512
Output token dim       1024
Multiscale regions     3 scales per prompt point
Text embedding space   DINOtxt (aligned with DINOv3 text encoder)

Citation

@misc{khosla2026tren,
      title={T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability},
      author={Savya Khosla and Sethuraman T V and Aryan Chadha and Alexander Schwing and Derek Hoiem},
      year={2026},
}