	---
	license: mit
	language:
	- en
	tags:
	- vision
	- image-segmentation
	- image-feature-extraction
	- region-tokens
	- dinov3
	- pytorch
	library_name: transformers
	---

	# T-REN: Text-Aligned Region Encoder Network

	Authors: [Savya Khosla](https://savya08.github.io/), [Sethuraman TV](https://github.com/sethuramanio), [Aryan Chadha](https://www.linkedin.com/in/aryan-chadha/), [Alex Schwing](https://www.alexander-schwing.de/), [Derek Hoiem](https://dhoiem.cs.illinois.edu/)

	[![GitHub](https://img.shields.io/badge/GitHub-Code-black.svg)](https://github.com/savya08/T-REN)

	T-REN (Text-aligned Region Encoder Network) is an image encoder that produces region-level tokens aligned with text, built on top of [DINOv3](https://github.com/facebookresearch/dinov3) ViT-L/16. Compared to its patch-based backbone, T-REN delivers:

	- +5.9 mIoU on ADE20K open-vocabulary segmentation
	- +18.4% recall on COCO object-level text-image retrieval
	- +15.6% recall on Ego4D video object localization (VQ2D)
	- +17.6% mIoU on VSPW video scene parsing
	- 24× fewer tokens per image, 187× fewer per video

	---

	## What's in this repo

	This Hugging Face repo contains:
	- `model.safetensors` — the trained `RegionEncoder` head weights (~1.2 GB)
	- `configuration_tren.py`, `modeling_tren.py`, `model.py`, `task_utils.py` — source code for `trust_remote_code`

	The DINOv3 ViT-L/16 backbone is NOT included here. It is distributed by Facebook Research and must be downloaded separately (see Step 2 of the Quickstart).

	---

	## Quickstart

	### Step 1 — Install dependencies

	```bash
	pip install transformers torch torchvision kornia
	```

	### Step 2 — Get the DINOv3 weights

	T-REN's backbone is DINOv3 ViT-L/16 with a DINOtxt text-alignment head. You need two weight files from the [DINOv3 release](https://github.com/facebookresearch/dinov3):

	| File | Description |
	|------|-------------|
	| `dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth` | DINOv3 ViT-L/16 backbone |
	| `dinov3_vitl16_dinotxt_vision_head_and_text_encoder-a442d8f5.pth` | DINOtxt vision head + text encoder |

	Place both files in the same directory, e.g. `/path/to/dinov3_weights/`.
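
Before loading the model, it can help to confirm that both files are actually in place. The following is a minimal sketch, assuming `load_backbone` looks for the two filenames listed above inside the directory you pass it; `missing_dinov3_files` is a hypothetical helper for illustration and is not part of the T-REN code.

```python
from pathlib import Path

# Filenames taken from the table above.
REQUIRED_FILES = (
    "dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth",
    "dinov3_vitl16_dinotxt_vision_head_and_text_encoder-a442d8f5.pth",
)


def missing_dinov3_files(weights_dir):
    """Return the required DINOv3 filenames that are not present in weights_dir."""
    root = Path(weights_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Hypothetical helper, shown with the placeholder path used in this README.
missing = missing_dinov3_files("/path/to/dinov3_weights/")
if missing:
    print("Missing DINOv3 files:", ", ".join(missing))
else:
    print("Both DINOv3 weight files found.")
```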

	### Step 3 — Load and run T-REN

	```python
	import torch
	import torchvision.transforms as T
	from PIL import Image
	from transformers import AutoModel

	# Load model (downloads T-REN weights from this repo automatically)
	model = AutoModel.from_pretrained("aryaaan12/T-REN", trust_remote_code=True)

	# Load the DINOv3 backbone from your local directory
	model.load_backbone("/path/to/dinov3_weights/")

	model.eval()

	# Prepare an image — resize to 512x512, values in [0, 1]
	transform = T.Compose([
	    T.Resize((512, 512)),
	    T.ToTensor(),
	])
	image = transform(Image.open("your_image.jpg").convert("RGB"))
	image = image.unsqueeze(0)  # (1, 3, 512, 512)

	# Run T-REN
	with torch.no_grad():
	    outputs = model(image)

	# Outputs
	region_tokens = outputs["text_aligned_tokens"] # list of (N, 1024) per image
	region_masks = outputs["region_masks"] # list of (N, 32, 32) per image
	class_token = outputs["class_tokens"] # (1, 1024) image-level token
	print(f"Number of region tokens: {len(region_tokens[0])}")
	```
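
Because the number of region tokens `N` differs from image to image, `text_aligned_tokens` is a list rather than a single tensor. If a downstream module needs a fixed-shape batch, one option is to pad the list and keep a validity mask. The sketch below uses random tensors as stand-ins for real T-REN outputs; `pad_region_tokens` is a hypothetical helper, not part of this repo.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_region_tokens(token_list):
    """Pad a list of (N_i, D) tensors to (B, N_max, D) and return a validity mask."""
    lengths = torch.tensor([t.shape[0] for t in token_list], device=token_list[0].device)
    padded = pad_sequence(token_list, batch_first=True)  # zero-padded, (B, N_max, D)
    positions = torch.arange(padded.shape[1], device=padded.device)
    valid = positions[None, :] < lengths[:, None]  # (B, N_max), True for real tokens
    return padded, valid


# Stand-ins for outputs["text_aligned_tokens"] from a batch of two images.
tokens = [torch.randn(40, 1024), torch.randn(55, 1024)]
padded, valid = pad_region_tokens(tokens)
print(padded.shape, valid.sum(dim=1).tolist())
```

The `valid` mask can then be inverted and passed as a key padding mask to an attention layer so that padded positions are ignored.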

	### Text-guided region matching

	```python
	import torch.nn.functional as F

	texts = ["sky", "car", "building", "tree", "road"]

	with torch.no_grad():
	    outputs = model(image, texts=texts)

	region_tokens = outputs["text_aligned_tokens"][0] # (N, 1024)
	text_tokens = outputs["text_encodings"] # (5, 1024)

	# Cosine similarity: which text label fits each region best?
	sim = F.normalize(region_tokens, dim=-1) @ F.normalize(text_tokens, dim=-1).T
	best_labels = sim.argmax(dim=-1)
	print([texts[i] for i in best_labels])
	```
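
The per-region labels can be turned into a dense label map by combining them with `region_masks`. The sketch below is one simple heuristic, not necessarily the evaluation procedure used by the authors: it assumes each mask is an `(N, 32, 32)` array of scores (or binary values), upsamples the masks to the input resolution, and gives every pixel the label of the region with the highest mask value there. Random tensors stand in for real outputs, and `regions_to_label_map` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F


def regions_to_label_map(region_masks, region_tokens, text_tokens, out_size=(512, 512)):
    """Assign each pixel the best text label of its highest-scoring region."""
    # Cosine similarity between regions and texts: (N, K)
    sim = F.normalize(region_tokens, dim=-1) @ F.normalize(text_tokens, dim=-1).T
    best_labels = sim.argmax(dim=-1)  # (N,)

    # Upsample masks from (N, 32, 32) to (N, H, W)
    masks = F.interpolate(
        region_masks.float().unsqueeze(0),
        size=out_size,
        mode="bilinear",
        align_corners=False,
    )[0]

    owner = masks.argmax(dim=0)  # (H, W): index of the strongest region per pixel
    return best_labels[owner]  # (H, W): text index per pixel


# Stand-ins for one image's region_masks, region tokens, and text encodings.
torch.manual_seed(0)
num_regions, num_texts = 6, 5
label_map = regions_to_label_map(
    torch.rand(num_regions, 32, 32),
    torch.randn(num_regions, 1024),
    torch.randn(num_texts, 1024),
)
print(label_map.shape)
```

Since T-REN produces regions at several scales, masks can overlap; taking the per-pixel argmax is only one way to resolve that, and weighting masks by similarity is a reasonable alternative.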

	---

	## Model details

	| Property | Value |
	|---|---|
	| Architecture | RegionEncoder (cross-attention decoder) over DINOv3 ViT-L/16 features |
	| Trainable parameters | 31.5M (RegionEncoder head only; backbone is frozen) |
	| Input resolution | 512 × 512 |
	| Output token dim | 1024 |
	| Multiscale regions | 3 scales per prompt point |
	| Text embedding space | DINOtxt (aligned with DINOv3 text encoder) |
	---

	## Citation

	```bibtex
	@misc{khosla2026tren,
	  title={T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability},
	  author={Savya Khosla and Sethuraman T V and Aryan Chadha and Alexander Schwing and Derek Hoiem},
	  year={2026},
	}
	```