MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
Changli Wu1,2,*, Haodong Wang1,*, Jiayi Ji1, Yutian Yao5, Chunsai Du4, Jihua Kang4, Yanwei Fu3,2, Liujuan Cao1,†
1Xiamen University, 2Shanghai Innovation Institute, 3Fudan University, 4ByteDance, 5Tianjin University of Science and Technology
(* Equal Contribution, † Corresponding Author)
📖 Introduction
This repository contains the pre-trained weights for MVGGT.
MVGGT stands for Multimodal Visual Geometry Grounded Transformer. The project addresses Multi-view 3D Referring Expression Segmentation: given a sequence of posed images and a natural-language query, the goal is to segment the specific 3D object the query describes.
💻 Demo
Try out our interactive web demo on Hugging Face Spaces: MVGGT Demo Space
🛠️ Usage
To use these weights, clone the official GitHub repository and run the provided inference scripts.
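For orientation, the general checkpoint-loading pattern can be sketched as follows. This is a minimal PyTorch sketch, not the official MVGGT API: the file name, checkpoint layout, and parameter names are illustrative assumptions, and the snippet round-trips a dummy checkpoint so it runs standalone.

```python
# Hypothetical sketch of loading pretrained weights with PyTorch.
# The checkpoint path, dict layout, and key names below are assumptions
# for illustration, not MVGGT's actual release format.
import os
import tempfile

import torch

# Stand-in for a released weights file: save a tiny dummy checkpoint.
ckpt_path = os.path.join(tempfile.mkdtemp(), "mvggt_dummy.pt")
torch.save({"model": {"backbone.weight": torch.zeros(4, 4)}}, ckpt_path)

# Load on CPU first; move tensors to GPU only after the model is built.
state = torch.load(ckpt_path, map_location="cpu")
print(sorted(state["model"].keys()))  # parameter names stored in the checkpoint
```

In practice you would point `torch.load` at the downloaded weights file and pass the resulting state dict to the model's `load_state_dict`; consult the official repository's inference scripts for the exact entry points.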