MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation


Changli Wu1,2,* , Haodong Wang1,* , Jiayi Ji1, Yutian Yao5, Chunsai Du4, Jihua Kang4, Yanwei Fu3,2, Liujuan Cao1,†

1Xiamen University, 2Shanghai Innovation Institute, 3Fudan University, 4ByteDance, 5Tianjin University of Science and Technology
(* Equal Contribution, † Corresponding Author)

πŸš€ Introduction

This repository contains the pre-trained weights for MVGGT.

MVGGT stands for Multimodal Visual Geometry Grounded Transformer. This project targets the challenge of Multi-view 3D Referring Expression Segmentation, where the goal is to segment a specific 3D object described by a natural language query from a sequence of posed images.

πŸ’» Demo

Try out our interactive web demo on Hugging Face Spaces: πŸ‘‰ MVGGT Demo Space

πŸ› οΈ Usage

To use these pre-trained weights, clone the official GitHub repository and run the provided inference scripts.
