ReSiReg Mini
Compact vision-language model (25M-parameter vision path), as described in the paper "ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks" by Simon Schwaiger, David Seyser, Alessandro Scherl, Wilfried Wöber, and Gerald Steinbauer-Wagner.
The vision path is kept as small as possible while retaining language grounding and spatial consistency in the patch tokens. This is attractive for robotic control tasks, where a prompt is typically encoded only occasionally while the vision path is queried frequently during controller updates.
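The usage pattern described above (encode the prompt once, query the vision path on every controller update) can be sketched as follows. This is a minimal illustration, not the model's API: random tensors stand in for the real text and dense image embeddings, and the helper name `similarity_map` and the sizes `C`, `H`, `W` are assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def similarity_map(dense: torch.Tensor, text_n: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every spatial location of `dense` [1, C, H, W]
    and an already L2-normalized text embedding `text_n` [1, C]. Returns [H, W]."""
    dense_n = F.normalize(dense, dim=1)
    return torch.einsum("bchw,bc->bhw", dense_n, text_n)[0]


torch.manual_seed(0)
C, H, W = 32, 14, 14  # illustrative sizes (assumption)

# Encode and normalize the prompt once, then cache it.
# Stand-in: a random vector replaces the real text embedding.
text_n = F.normalize(torch.randn(1, C), dim=-1)

maps = []
for _ in range(3):  # one iteration per controller update
    # Stand-in: random features replace the per-frame vision-path output.
    dense = torch.randn(1, C, H, W)
    maps.append(similarity_map(dense, text_n))
```

Only the per-frame work sits inside the loop, so the cost of a controller update is dominated by the vision path rather than by the text encoder.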
Model details
- Architecture: EUPE image backbone + Vision-language tower. Projections from vision and SigLIP2 to a shared embedding space.
- Base models: facebook/EUPE-ViT-S and google/siglip2-base-patch16-224.
- Trained on ~600k image-caption pairs from the cauldron, COCO caption, and pexels_568 datasets.
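To make the idea of projecting both towers into a shared embedding space concrete, here is a hedged sketch. It is not the released architecture: the class name `SharedSpaceProjector`, the use of single linear layers, and all three dimensions are assumptions chosen for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedSpaceProjector(nn.Module):
    """Hypothetical projection heads; layer types and sizes are illustrative
    assumptions, not the actual ReSiReg Mini configuration."""

    def __init__(self, vision_dim: int = 384, text_dim: int = 768, shared_dim: int = 256):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, patch_tokens: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # patch_tokens: [B, N, vision_dim], text_feat: [B, text_dim]
        v = F.normalize(self.vision_proj(patch_tokens), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        # Cosine similarity of every patch token with the text embedding: [B, N]
        return torch.einsum("bnc,bc->bn", v, t)


proj = SharedSpaceProjector()
scores = proj(torch.randn(2, 196, 384), torch.randn(2, 768))
```

Once both modalities live in the same space, a per-patch cosine similarity like `scores` is all that is needed to ground a prompt spatially.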
Single-View ReSiReg Feature Reconstruction Demo
Minimal Example: Dense Image-Prompt Similarity
import torch
import torch.nn.functional as F
from PIL import Image
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model (uses custom modeling code from the repo), tokenizer, and image processor.
model = AutoModel.from_pretrained(
    "SimonSchwaiger/resireg_mini",
    trust_remote_code=True,
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained("SimonSchwaiger/resireg_mini")
image_processor = AutoImageProcessor.from_pretrained("SimonSchwaiger/resireg_mini")

img = Image.open("testimg.png").convert("RGB")
prompt = "red mug"

# Preprocess the image and tokenize the prompt.
pixel_values = image_processor(images=img, return_tensors="pt")["pixel_values"].to(device)
tok = tokenizer([prompt], padding=True, truncation=True, max_length=64, return_tensors="pt")
tok = {k: v.to(device) for k, v in tok.items()}

# Single forward pass without gradient tracking.
with torch.no_grad():
    out = model(
        pixel_values=pixel_values,
        input_ids=tok["input_ids"],
        attention_mask=tok.get("attention_mask"),
    )

dense = out.dense_embeds_resireg_lite  # [1, C, H, W]
text = out.text_embeds  # [1, C]

# Cosine similarity between each dense feature and the text embedding.
dense_n = F.normalize(dense, dim=1)
text_n = F.normalize(text, dim=-1)
sim = torch.einsum("bchw,bc->bhw", dense_n, text_n)[0]  # [H, W]

# Min-max normalize to [0, 1] for visualization.
sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)

# Upsample the similarity map to the original image resolution.
heat = F.interpolate(
    sim.unsqueeze(0).unsqueeze(0),
    size=(img.height, img.width),
    mode="bilinear",
    align_corners=False,
)[0, 0].cpu().numpy()

# Overlay the heatmap on the input image.
plt.figure(figsize=(8, 8))
plt.imshow(img)
plt.imshow(heat, cmap="jet", alpha=0.45)
plt.axis("off")
plt.title(f"Patch/Text Cosine Similarity (ReSiReg-Lite): '{prompt}'")
plt.tight_layout()
plt.show()
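For a robotic task, the similarity map is often reduced to a single target location rather than plotted. The sketch below shows one simple way to do that, assuming a `[H, W]` similarity grid like `sim` above. The helper `peak_pixel` is hypothetical and not part of the model's API; it maps the highest-scoring grid cell to the center of that cell in image pixel coordinates.

```python
import torch


def peak_pixel(sim: torch.Tensor, img_h: int, img_w: int) -> tuple[int, int]:
    """Return (row, col) pixel coordinates of the center of the grid cell
    with the highest similarity in a [H, W] map."""
    h, w = sim.shape
    flat_idx = int(torch.argmax(sim))  # index into the flattened map
    r, c = divmod(flat_idx, w)  # grid row and column
    y = int((r + 0.5) * img_h / h)  # cell center, scaled to image height
    x = int((c + 0.5) * img_w / w)  # cell center, scaled to image width
    return y, x


# Toy example: a 4x4 map with a single peak at grid cell (1, 2).
toy = torch.zeros(4, 4)
toy[1, 2] = 1.0
target = peak_pixel(toy, img_h=400, img_w=800)
```

Taking the argmax is the simplest choice; thresholding the map or computing a similarity-weighted centroid would be more robust when several regions match the prompt.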
License and attribution
The original code and documentation in this repository are released under CC-BY-SA-4.0.
This model build depends on upstream components with their own licenses:
- EUPE (facebook/EUPE, facebook/EUPE-ViT-S): FAIR Noncommercial Research License
  - Source: EUPE LICENSE.md
  - Weights card: facebook/EUPE-ViT-S
- SigLIP2 (google/siglip2-base-patch16-224): Apache-2.0
- C-RADIOv3-B (used for RADSeg teacher distillation): NVIDIA Open Model License
  - Source: nvidia/C-RADIOv3-B README
Note: EUPE is licensed for noncommercial research use only, so downstream usage of this combined model must comply with that restriction.