T-REN: Text-Aligned Region Encoder Network
Authors: Savya Khosla, Sethuraman TV, Aryan Chadha, Alex Schwing, Derek Hoiem
T-REN (Text-aligned Region Encoder Network) is an image encoder that produces region-level tokens aligned with text, built on top of DINOv3 ViT-L/16. Compared to its patch-based backbone, T-REN delivers:
- +5.9 mIoU on ADE20K open-vocabulary segmentation
- +18.4% recall on COCO object-level text-image retrieval
- +15.6% recall on Ego4D video object localization (VQ2D)
- +17.6% mIoU on VSPW video scene parsing
- 24× fewer tokens per image, 187× fewer per video
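To put the token reduction in perspective, the sketch below works through the patch-token count for a ViT-L/16 backbone. It assumes a 512 × 512 input and that the 24× figure applies at that resolution; the card does not state how the ratio was measured, so the implied region count is only a rough illustration.

```python
# Worked arithmetic, not a measurement.
# Assumptions: 512 x 512 input, 16 x 16 patches, no extra (class/register) tokens counted.
image_size = 512
patch_size = 16

patches_per_side = image_size // patch_size
patch_tokens = patches_per_side ** 2

# Only meaningful if the reported 24x reduction applies at this resolution.
implied_region_tokens = patch_tokens / 24

print(f"patch grid: {patches_per_side} x {patches_per_side} = {patch_tokens} tokens")
print(f"implied average region tokens per image: about {implied_region_tokens:.0f}")
```

Under these assumptions the backbone emits 1024 patch tokens, so a 24× reduction corresponds to roughly 43 region tokens per image on average.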
What's in this repo
This Hugging Face repo contains:
- `model.safetensors`: the trained `RegionEncoder` head weights (~1.2 GB)
- `configuration_tren.py`, `modeling_tren.py`, `model.py`, `task_utils.py`: source code for `trust_remote_code`
The DINOv3 ViT-L/16 backbone is NOT included here; it belongs to Facebook Research and must be obtained separately (see below).
Quickstart
Step 1: Install dependencies
pip install transformers torch torchvision kornia
Step 2: Get the DINOv3 weights
T-REN's backbone is DINOv3 ViT-L/16 with a DINOtxt text-alignment head. You need two weight files from the DINOv3 release:
| File | Description |
|---|---|
| `dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth` | DINOv3 ViT-L/16 backbone |
| `dinov3_vitl16_dinotxt_vision_head_and_text_encoder-a442d8f5.pth` | DINOtxt vision head + text encoder |
Place both files in the same directory, e.g. /path/to/dinov3_weights/.
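Before loading the model, it can help to confirm that both files are actually in place. The following is a minimal sketch using only the standard library; `missing_dinov3_weights` is a hypothetical helper written for this example and is not part of the T-REN code.

```python
from pathlib import Path

# File names copied from the table above.
REQUIRED_DINOV3_FILES = (
    "dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth",
    "dinov3_vitl16_dinotxt_vision_head_and_text_encoder-a442d8f5.pth",
)


def missing_dinov3_weights(weights_dir):
    """Return the required file names that are not present in weights_dir.

    Hypothetical helper for illustration; it only checks that the files
    exist, not that their contents are valid checkpoints.
    """
    root = Path(weights_dir)
    return [name for name in REQUIRED_DINOV3_FILES if not (root / name).is_file()]


missing = missing_dinov3_weights("/path/to/dinov3_weights/")
if missing:
    print("Missing DINOv3 weight files:", ", ".join(missing))
else:
    print("Both DINOv3 weight files found.")
```

An empty list means the directory is ready to pass to `model.load_backbone`.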
Step 3: Load and run T-REN
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel
# Load model (downloads T-REN weights from this repo automatically)
model = AutoModel.from_pretrained("aryaaan12/T-REN", trust_remote_code=True)
# Load the DINOv3 backbone from your local directory
model.load_backbone("/path/to/dinov3_weights/")
model.eval()
# Prepare an image: resize to 512x512, values in [0, 1]
transform = T.Compose([
    T.Resize((512, 512)),
    T.ToTensor(),
])
image = transform(Image.open("your_image.jpg").convert("RGB"))
image = image.unsqueeze(0) # (1, 3, 512, 512)
# Run T-REN
with torch.no_grad():
    outputs = model(image)
# Outputs
region_tokens = outputs["text_aligned_tokens"] # list of (N, 1024) per image
region_masks = outputs["region_masks"] # list of (N, 32, 32) per image
class_token = outputs["class_tokens"] # (1, 1024) image-level token
print(f"Number of region tokens: {len(region_tokens[0])}")
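The region masks come back on a 32 × 32 grid, so overlaying them on the 512 × 512 input requires upsampling. The sketch below shows one way to do that with bilinear interpolation. `upsample_region_masks` is a hypothetical helper, and a random tensor stands in for `outputs["region_masks"][0]` so the example runs without the model. The card does not say whether mask values are logits, probabilities, or binary; the resize itself does not depend on that.

```python
import torch
import torch.nn.functional as F


def upsample_region_masks(masks, size=(512, 512)):
    """Resize (N, h, w) region masks to (N, H, W) with bilinear interpolation.

    Hypothetical helper for illustration; assumes at least one region.
    """
    # F.interpolate expects (batch, channels, h, w), so treat the N regions as channels.
    resized = F.interpolate(
        masks.unsqueeze(0).float(), size=size, mode="bilinear", align_corners=False
    )
    return resized[0]


# Stand-in for outputs["region_masks"][0]; random values used only to show shapes.
dummy_masks = torch.rand(7, 32, 32)
full_res_masks = upsample_region_masks(dummy_masks)
print(full_res_masks.shape)  # torch.Size([7, 512, 512])
```

With real outputs, pass `outputs["region_masks"][0]` in place of `dummy_masks`.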
Text-guided region matching
import torch.nn.functional as F
texts = ["sky", "car", "building", "tree", "road"]
with torch.no_grad():
    outputs = model(image, texts=texts)
region_tokens = outputs["text_aligned_tokens"][0] # (N, 1024)
text_tokens = outputs["text_encodings"] # (5, 1024)
# Cosine similarity: which text label fits each region best?
sim = F.normalize(region_tokens, dim=-1) @ F.normalize(text_tokens, dim=-1).T
best_labels = sim.argmax(dim=-1)
print([texts[i] for i in best_labels])
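The per-region labels can be combined with the region masks to get a coarse label map. The sketch below gives each mask cell the label of the region with the highest mask value at that cell. This is an illustrative assumption rather than the authors' evaluation procedure: it presumes that larger mask values mean stronger membership. `label_map_from_regions` is a hypothetical helper, and synthetic tensors stand in for the model outputs.

```python
import torch


def label_map_from_regions(region_masks, best_labels):
    """Give each mask cell the label of its strongest region.

    region_masks: (N, H, W) tensor, larger value = stronger membership (assumed).
    best_labels:  (N,) long tensor of label indices, one per region.
    Returns an (H, W) long tensor of label indices.
    """
    owner = region_masks.argmax(dim=0)  # (H, W): index of the strongest region per cell
    return best_labels[owner]           # look up that region's label


# Synthetic stand-ins: two regions splitting a 4 x 4 grid into left and right halves.
masks = torch.zeros(2, 4, 4)
masks[0, :, :2] = 1.0
masks[1, :, 2:] = 1.0
labels = torch.tensor([3, 0])  # region 0 -> texts[3], region 1 -> texts[0]

label_map = label_map_from_regions(masks, labels)
print(label_map)
```

Cells that no region covers still receive a label, because `argmax` always returns an index. A real pipeline would likely need a background or "unassigned" rule on top of this.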
Model details
| Property | Value |
|---|---|
| Architecture | RegionEncoder (cross-attention decoder) over DINOv3 ViT-L/16 features |
| Trainable parameters | 31.5M (RegionEncoder head only; backbone is frozen) |
| Input resolution | 512 × 512 |
| Output token dim | 1024 |
| Multiscale regions | 3 scales per prompt point |
| Text embedding space | DINOtxt (aligned with DINOv3 text encoder) |
Citation
@misc{khosla2026tren,
title={T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability},
author={Savya Khosla and Sethuraman T V and Aryan Chadha and Alexander Schwing and Derek Hoiem},
year={2026},
}