MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation


Changli Wu1,2,* , Haodong Wang1,* , Jiayi Ji1, Yutian Yao5, Chunsai Du4, Jihua Kang4, Yanwei Fu3,2, Liujuan Cao1,†

1Xiamen University, 2Shanghai Innovation Institute, 3Fudan University, 4ByteDance, 5Tianjin University of Science and Technology
(* Equal Contribution, † Corresponding Author)

πŸš€ Introduction

This repository contains the pre-trained weights for MVGGT.

MVGGT stands for Multimodal Visual Geometry Grounded Transformer. This project targets the challenge of Multi-view 3D Referring Expression Segmentation, where the goal is to segment a specific 3D object described by a natural language query from a sequence of posed images.

πŸ’» Demo

Try out our interactive web demo on Hugging Face Spaces: πŸ‘‰ MVGGT Demo Space

πŸ› οΈ Usage

To use these pre-trained weights, clone the official GitHub repository and run the provided inference scripts.
