ReSiReg Mini

Compact vision-language model (25M-parameter vision-path), as described in the paper "ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks" by Simon Schwaiger, David Seyser, Alessandro Scherl, Wilfried Wöber, and Gerald Steinbauer-Wagner.

The vision-path is kept as small as possible while retaining language grounding and spatial consistency in the patch tokens. This is attractive for robotic control tasks, where prompts typically change rarely and need to be encoded only occasionally, while the vision-path is queried frequently during controller updates.
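
The access pattern described above (one prompt, many frames) suggests encoding the prompt once and reusing its embedding at every controller update. The following is a minimal sketch of that idea using random stand-in tensors rather than the real model: `encode_dense` is a hypothetical placeholder for the vision-path, and the shapes (`C = 64`, a 14x14 patch grid) are assumptions chosen for illustration only.

```python
import torch
import torch.nn.functional as F

C, H, W = 64, 14, 14  # assumed embedding width and patch grid (illustrative only)


def encode_dense(frame: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the vision-path; returns [B, C, H, W] features."""
    return torch.randn(frame.shape[0], C, H, W)


def similarity_map(dense: torch.Tensor, text_n: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each patch embedding and a normalized text embedding."""
    dense_n = F.normalize(dense, dim=1)
    return torch.einsum("bchw,bc->bhw", dense_n, text_n)[0]  # [H, W]


torch.manual_seed(0)

# Encode the prompt once, outside the control loop (random stand-in here).
text_n = F.normalize(torch.randn(1, C), dim=-1)

maps = []
for _ in range(5):  # stands in for five controller updates
    frame = torch.rand(1, 3, 224, 224)  # stand-in camera frame
    dense = encode_dense(frame)         # only the vision-path runs per frame
    maps.append(similarity_map(dense, text_n))
```

With the real model, the cached tensor would presumably be the `text_embeds` output shown in the minimal example further down; whether the model can be called on images alone is not stated here and would need to be checked against the repository code.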

Model details

  • Architecture: EUPE image backbone + vision-language tower, with projections from the vision backbone and SigLIP2 into a shared embedding space.
  • Base models: facebook/EUPE-ViT-S and google/siglip2-base-patch16-224.
  • Trained on ~600k image-caption pairs from The Cauldron, COCO Captions, and pexels_568 datasets.
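
To make the idea of a shared embedding space concrete, here is a rough sketch of two projection heads that map patch tokens and a text feature into a common space, where cosine similarity can be computed. Everything in it is an assumption for illustration: the class name `SharedSpaceProjector`, the use of plain linear layers, and the dimensions (384 for the vision tokens, 768 for the text feature, 256 for the shared space) are not taken from the paper or the repository, and the actual projection heads may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedSpaceProjector(nn.Module):
    """Illustrative only: linear heads mapping both modalities to one space."""

    def __init__(self, vision_dim: int = 384, text_dim: int = 768, shared_dim: int = 256):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, shared_dim)  # assumed head
        self.text_proj = nn.Linear(text_dim, shared_dim)      # assumed head

    def forward(self, patch_tokens: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # patch_tokens: [B, N, vision_dim], text_feat: [B, text_dim]
        v = F.normalize(self.vision_proj(patch_tokens), dim=-1)  # [B, N, D]
        t = F.normalize(self.text_proj(text_feat), dim=-1)       # [B, D]
        # Cosine similarity of every patch token with the text embedding.
        return torch.einsum("bnd,bd->bn", v, t)                  # [B, N]


torch.manual_seed(0)
proj = SharedSpaceProjector()
sim = proj(torch.randn(2, 196, 384), torch.randn(2, 768))
```

Because both outputs are L2-normalized before the dot product, each entry of `sim` is a cosine similarity in [-1, 1].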

Single-View ReSiReg Feature Reconstruction Demo

ReSiReg Mini Inference Example


Minimal Example: Dense Image-Prompt Similarity

import torch
import torch.nn.functional as F
from PIL import Image
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    "SimonSchwaiger/resireg_mini",
    trust_remote_code=True
).to(device).eval()

tokenizer = AutoTokenizer.from_pretrained("SimonSchwaiger/resireg_mini")
image_processor = AutoImageProcessor.from_pretrained("SimonSchwaiger/resireg_mini")

# Load the input image and define the text prompt to localize.
img = Image.open("testimg.png").convert("RGB")
prompt = "red mug"

# Preprocess the image and tokenize the prompt, then move both to the device.
pixel_values = image_processor(images=img, return_tensors="pt")["pixel_values"].to(device)
tok = tokenizer([prompt], padding=True, truncation=True, max_length=64, return_tensors="pt")
tok = {k: v.to(device) for k, v in tok.items()}

with torch.no_grad():
    out = model(
        pixel_values=pixel_values,
        input_ids=tok["input_ids"],
        attention_mask=tok.get("attention_mask"),
    )

dense = out.dense_embeds_resireg_lite  # [1, C, H, W]
text = out.text_embeds                 # [1, C]

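# L2-normalize along the channel dimension so the dot product is a cosine similarity.
dense_n = F.normalize(dense, dim=1)
text_n = F.normalize(text, dim=-1)

# Per-patch cosine similarity with the prompt, then min-max scaling to [0, 1] for display.
sim = torch.einsum("bchw,bc->bhw", dense_n, text_n)[0]  # [H, W]
sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)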

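# Upsample the patch-level map to the original image resolution.
heat = F.interpolate(
    sim.unsqueeze(0).unsqueeze(0),   # [1, 1, H, W], as required by interpolate
    size=(img.height, img.width),
    mode="bilinear",
    align_corners=False,
)[0, 0].cpu().numpy()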

plt.figure(figsize=(8, 8))
plt.imshow(img)
plt.imshow(heat, cmap="jet", alpha=0.45)
plt.axis("off")
plt.title(f"Patch/Text Cosine Similarity (ReSiReg-Lite): '{prompt}'")
plt.tight_layout()
plt.show()
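
For a robotic task, the heatmap is usually a means to an end, such as picking a target pixel or a region of interest. The sketch below shows one simple way to post-process a normalized map like `heat` from the example above: take the location of the maximum and threshold the map into a binary mask. The helper name `peak_and_mask` and the threshold of 0.8 are arbitrary choices for illustration, not part of the model or the paper, and the demo array is synthetic.

```python
import numpy as np


def peak_and_mask(heat: np.ndarray, threshold: float = 0.8):
    """Return the (row, col) of the strongest response and a boolean mask of
    pixels at or above `threshold`. Assumes `heat` is scaled to [0, 1]."""
    row, col = np.unravel_index(np.argmax(heat), heat.shape)
    mask = heat >= threshold
    return (int(row), int(col)), mask


# Synthetic stand-in for the `heat` array produced by the example above.
demo = np.zeros((4, 6), dtype=np.float32)
demo[2, 3] = 1.0
demo[2, 4] = 0.9

peak, mask = peak_and_mask(demo)
```

Since the map is min-max normalized per image, a fixed threshold always selects some pixels, even when the prompted object is absent. Checking the raw cosine similarity before normalization is one way to guard against that.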

License and attribution

The original code and documentation in this repository are released under CC-BY-SA-4.0.

This model build depends on upstream components with their own licenses, namely the base models facebook/EUPE-ViT-S and google/siglip2-base-patch16-224.

Note: EUPE is licensed for noncommercial research only, so downstream use of this combined model must comply with that restriction.

Model size: 0.3B params (Safetensors, tensor type F32)