EoMT-DINOv3 (Large, 640px) for COCO Instance Segmentation

Overview

This is the large variant of the EoMT-DINOv3 model trained for instance segmentation on COCO at 640×640 resolution.

EoMT (Encoder-only Mask Transformer) is a Vision Transformer (ViT) architecture designed for high-quality and efficient image segmentation. It was introduced in the CVPR 2025 highlight paper:
Your ViT is Secretly an Image Segmentation Model

Key Insight: Given sufficient scale and pretraining, a plain ViT along with a few additional parameters can perform segmentation without the need for task-specific decoders or pixel fusion modules. The same model backbone supports semantic, instance, and panoptic segmentation with different post-processing.
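To make "different post-processing" concrete, here is a minimal NumPy sketch of how one set of per-query outputs could be read either as a semantic map or as a list of instances. It follows the general mask-classification recipe (per-query class logits plus per-query mask logits) and is not the `transformers` implementation. The function names, the toy tensors, and the simplified scoring (class confidence only) are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def semantic_map(class_logits, mask_logits):
    """class_logits: (Q, C+1), last column = "no object"; mask_logits: (Q, H, W)."""
    class_probs = softmax(class_logits)[:, :-1]  # drop "no object" -> (Q, C)
    mask_probs = sigmoid(mask_logits)            # (Q, H, W)
    # Weight every query's mask by its class probabilities, sum over queries.
    per_class = np.einsum("qc,qhw->chw", class_probs, mask_probs)
    return per_class.argmax(axis=0)              # (H, W) class index per pixel


def instance_masks(class_logits, mask_logits, threshold=0.5):
    """Keep each confident query as its own instance (simplified scoring)."""
    class_probs = softmax(class_logits)[:, :-1]
    labels = class_probs.argmax(axis=1)
    scores = class_probs.max(axis=1)
    keep = np.flatnonzero(scores >= threshold)
    return [(int(labels[q]), float(scores[q]), mask_logits[q] > 0) for q in keep]


# Toy outputs: 3 queries, 2 classes (+ "no object"), a 2x4 image.
class_logits = np.array([[8.0, 0.0, 0.0],   # query 0 -> class 0
                         [0.0, 8.0, 0.0],   # query 1 -> class 1
                         [0.0, 0.0, 8.0]])  # query 2 -> "no object"
mask_logits = np.full((3, 2, 4), -5.0)
mask_logits[0, :, :2] = 5.0  # query 0 covers the left half
mask_logits[1, :, 2:] = 5.0  # query 1 covers the right half
mask_logits[2] = 5.0         # query 2 covers everything but predicts "no object"

semantic = semantic_map(class_logits, mask_logits)
instances = instance_masks(class_logits, mask_logits)
print(semantic)
print([(label, round(score, 3)) for label, score, _ in instances])
```

The network outputs are identical in both cases; only the reduction over queries changes.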

The DINOv3 variants use rotary position embeddings and the latest pretraining recipes from Meta AI, which improve performance across segmentation tasks.

Usage

import requests
import torch
from PIL import Image

from transformers import AutoImageProcessor, EomtDinov3ForUniversalSegmentation

model_id = "nielsr/eomt-dinov3-coco-instance-large-640"
processor = AutoImageProcessor.from_pretrained(model_id)
model = EomtDinov3ForUniversalSegmentation.from_pretrained(model_id)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

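# Download a sample COCO validation image; convert to RGB in case the file uses another mode.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")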

inputs = processor(images=image, return_tensors="pt").to(device)

with torch.inference_mode():
    outputs = model(**inputs)

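# Instance segmentation: target_sizes expects (height, width), while PIL's image.size is (width, height).
result = processor.post_process_instance_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
print(result["segmentation"].shape)  # (height, width) map of per-pixel segment ids
print(result["segments_info"])  # One entry per detected segment, including its label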
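As a follow-up to the snippet above, the sketch below shows one way to turn the post-processed result into a readable summary. It assumes the segmentation map stores one segment id per pixel (with -1 for pixels that belong to no segment) and that each `segments_info` entry carries `id`, `label_id`, and `score` keys; check this against your installed `transformers` version. The helper name `summarize_segments` and the toy values are hypothetical.

```python
from collections import Counter


def summarize_segments(segmentation_rows, segments_info, id2label=None):
    """Return one summary dict per segment: id, label name, score, pixel count.

    segmentation_rows: nested lists of per-pixel segment ids (e.g. tensor.tolist()).
    segments_info:     list of dicts with "id", "label_id" and optionally "score".
    id2label:          optional mapping from label id to class name.
    """
    # Count how many pixels carry each segment id (ids may arrive as floats).
    pixel_counts = Counter(int(v) for row in segmentation_rows for v in row)
    summary = []
    for seg in segments_info:
        label_id = seg["label_id"]
        name = id2label.get(label_id, str(label_id)) if id2label else str(label_id)
        summary.append({
            "id": seg["id"],
            "label": name,
            "score": seg.get("score"),
            "pixels": pixel_counts.get(seg["id"], 0),
        })
    return summary


# Toy stand-in for a real result (made-up values, not model output).
toy_segmentation = [
    [-1, 0, 0, 1],
    [-1, 0, 1, 1],
]
toy_segments_info = [
    {"id": 0, "label_id": 15, "score": 0.98},
    {"id": 1, "label_id": 57, "score": 0.91},
]
toy_id2label = {15: "cat", 57: "couch"}

summary = summarize_segments(toy_segmentation, toy_segments_info, toy_id2label)
for row in summary:
    print(row)
```

With a real result, the call would look like `summarize_segments(result["segmentation"].tolist(), result["segments_info"], model.config.id2label)`.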

Model Details

Property	Value
Backbone	DINOv3 ViT-L/16
Input Resolution	640×640
Task	Instance Segmentation
Dataset	COCO
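The backbone name and input resolution in the table determine the patch grid the encoder works on. The arithmetic below is a small worked example; the helper `patch_grid` is a hypothetical name, and the count covers image patches only (it excludes special tokens and the segmentation queries).

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


# ViT-L/16 at 640x640: 16-pixel patches on a 640-pixel side.
side, tokens = patch_grid(640, 16)
print(f"{side} x {side} patches = {tokens} patch tokens")
```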

Citation

@inproceedings{kerssies2025eomt,
  author    = {Kerssies, Tommie and Cavagnero, Niccolò and Hermans, Alexander and Norouzi, Narges and Averta, Giuseppe and Leibe, Bastian and Dubbelman, Gijs and de Geus, Daan},
  title     = {Your ViT is Secretly an Image Segmentation Model},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
}

Acknowledgements

Original implementation: tue-mps/eomt
Paper: arXiv:2503.19108

Safetensors

Model size: 0.3B params
Tensor type: F32
