Add pipeline tag, paper, project page, and code links

#1 by nielsr - opened

Files changed (1): README.md (+165 -3)
---
license: mit
pipeline_tag: video-text-to-text
---

# 4DLangVGGT: 4D Language Visual Geometry Grounded Transformer <br><sub>Official PyTorch Implementation</sub>

#### [<code>Project Page 🤩</code>](https://hustvl.github.io/4DLangVGGT/) | [<code>HF Checkpoint 🚀</code>](https://huggingface.co/YajingB/4DLangVGGT) | [<code>Paper 📝</code>](https://huggingface.co/papers/2512.05060)

<p align="center">
4DLangVGGT: 4D Language Visual Geometry Grounded Transformer
<br />
<a href="https://scholar.google.com/citations?user=C9B5JKYAAAAJ&hl=en">Xianfeng Wu<sup>1, 3, 4, #</sup></a>
·
<a href="https://scholar.google.com/citations?user=0bmTpcAAAAAJ&hl=en&oi=ao">Yajing Bai<sup>1, 3, #</sup></a>
·
<a href="https://scholar.google.com/citations?user=LhdBgMAAAAAJ&hl=en">Minghan Li<sup>2</sup></a>
·
<a href="https://openreview.net/profile?id=~Xianzu_Wu1">Xianzu Wu<sup>1, 5</sup></a>
·
<a href="https://github.com/hustvl/4DLangVGGT">Xueqi Zhao<sup>1, 6</sup></a>
·
<a href="https://github.com/hustvl/4DLangVGGT">Zhongyuan Lai<sup>1, 6</sup></a>
·
<a href="https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en">Wenyu Liu<sup>3</sup></a>
·
<a href="https://xwcv.github.io/">Xinggang Wang<sup>3, *</sup></a>
<br />
<p align="center"> <sub><sup>1</sup> <a href="https://sklpb.jhun.edu.cn/sklpben/main.htm">State Key Laboratory of Precision Blasting, Jianghan University</a>, <sup>2</sup> <a href="https://wang.hms.harvard.edu/">Harvard AI and Robotics Lab, Harvard University</a>, <sup>3</sup> <a href="http://english.eic.hust.edu.cn/">School of EIC, Huazhong University of Science and Technology</a>, <sup>4</sup> <a href="https://www.polyu.edu.hk/comp/">Department of Computing, The Hong Kong Polytechnic University</a>, <sup>5</sup> <a href="https://www.comp.hkbu.edu.hk/v1/">Department of Computer Science, Hong Kong Baptist University</a>, <sup>6</sup> <a href="https://en.hbnu.edu.cn/CollegeofMathematicsandStatistics/list.htm">School of Mathematics and Statistics, Hubei University of Education</a>, <sup>#</sup> Equal contribution, <sup>*</sup> Corresponding Author</sub></p>
</p>

<p align="center">
<img src="demo/demo.png" width="720">
</p>

This is a PyTorch/GPU implementation of [4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer](https://huggingface.co/papers/2512.05060).

## Overview
4DLangVGGT is a feed-forward framework for language-aware 4D scene understanding, combining StreamVGGT for dynamic geometry reconstruction with a Semantic Bridging Decoder (SBD) that aligns geometry tokens with language semantics. Unlike Gaussian Splatting methods that require per-scene optimization, our feed-forward design can be trained across multiple scenes and directly applied at inference, achieving scalable, efficient, and open-vocabulary 4D semantic fields with state-of-the-art performance on the HyperNeRF and Neu3D benchmarks.

## Installation

4DLangVGGT uses the following software versions:
- Python 3.10
- CUDA 12.4

First, clone the 4DLangVGGT repository:
```bash
git clone https://github.com/hustvl/4DLangVGGT.git --single-branch
cd 4DLangVGGT
```

Then create a conda environment (Python 3.10) and install the dependencies:

```bash
# If some system libraries are missing:
# apt-get update && apt-get install libgl1 ffmpeg libsm6 libxext6 -y

pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124

pip install -r requirements.txt
```
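
Before moving on, it can be worth confirming that the pinned packages actually resolved. Below is a minimal, repo-independent sanity check; the `required` list is an assumption for illustration, not taken from `requirements.txt`:

```python
import importlib.util

def missing_packages(names):
    """Return the module names that cannot be found in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Assumed core imports for this setup; adjust to match requirements.txt.
required = ["torch", "torchvision", "torchaudio", "numpy"]
print("missing:", missing_packages(required))
```

If `torch` shows up as missing, re-run the pinned `pip install` line above; checking `torch.cuda.is_available()` afterwards is the usual follow-up for the CUDA 12.4 build.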

## Dataset
4DLangVGGT is trained and evaluated on the [HyperNeRF](https://github.com/google/hypernerf) and [Neu3D](https://github.com/facebookresearch/Neural_3D_Video) datasets. Please download the datasets and put them in the folder `./data`. For data processing, please refer to [4DLangSplat](https://github.com/zrporz/4DLangSplat) to generate segmentation maps and extract CLIP and video features.

## QuickStart
### Download Checkpoints
Download the StreamVGGT checkpoint from [here](https://github.com/wzzheng/StreamVGGT) and put the checkpoint folder under `./ckpt/streamvggt`.

The 4DLangVGGT checkpoint is available on [Hugging Face](https://huggingface.co/YajingB/4DLangVGGT); put the checkpoint folder under `./ckpt/4dlangvggt`.

### Inference
Run the following command to generate the demo:
```bash
bash scripts/infer.sh
```
The results will be saved under `./eval/eval_results`.

## Folder Structure
The overall folder structure should be organized as follows:
```text
4DLangVGGT
|-- ckpt
| |-- streamvggt
| | |-- checkpoints.pth
| | |-- model.safetensors
| |-- 4dlangvggt
| | |--
|-- data
| |-- hypernerf
| | |-- americano
| | | |-- annotations
| | | | |-- train
| | | | |-- README
| | | | |-- video_annotations.json
| | | |-- camera
| | | |-- rgb
| | | | |-- 1x
| | | | | |-- 000001.png
| | | | ...
| | | | |-- 2x
| | | | | |-- 000001.png
| | | |-- streamvggt_token
| | | | |-- 000001.npy
| | | ...
| | | |-- dataset.json
| | | |-- metadata.json
| | | |-- points.npy
| | | |-- scene.json
| | | |-- points3D_downsample2.ply
| | |-- chickchicken
| | ...
| |-- neu3d
| | |-- coffee_martini
| | | |-- annotations
| | | | |-- train
| | ...
```
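
A small helper (hypothetical, not part of the repo) can catch an incompletely prepared scene before a long preprocessing or training run; the entry names checked below are taken from the tree above:

```python
from pathlib import Path

# Entries every HyperNeRF-style scene folder should contain, per the tree above.
REQUIRED_ENTRIES = ["annotations", "camera", "rgb",
                    "dataset.json", "metadata.json", "scene.json"]

def missing_entries(scene_dir):
    """Return the required entries absent from a scene folder."""
    scene = Path(scene_dir)
    return [name for name in REQUIRED_ENTRIES if not (scene / name).exists()]

missing = missing_entries("data/hypernerf/americano")
print("missing entries:", missing)
```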

## Training
### Step 1: Generate Geometry Tokens
To reduce the amount of memory required during training, we first preprocess the video with StreamVGGT, extract the geometry tokens, and save them in `./data/<dataset>/<class>/streamvggt_token`. Taking the americano class from the HyperNeRF dataset as an example, the extracted geometry tokens should end up in `./data/hypernerf/americano/streamvggt_token`.
```bash
python preprocess/generate_vggttoken.py \
    --categories americano \
    --img_root data/hypernerf \
    --ckpt ckpt/streamvggt/checkpoints.pth \
    --max_num 128 \
    --device cuda
```
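
The cached tokens follow the zero-padded frame naming shown in the folder structure (`000001.npy`, ...). A tiny sketch of the frame-to-file mapping, assuming the six-digit convention holds for all scenes:

```python
from pathlib import Path

def token_path(scene_dir, frame_idx):
    """Path of the cached geometry token for a frame, assuming the
    zero-padded naming from the folder structure (000001.npy, ...)."""
    return Path(scene_dir) / "streamvggt_token" / f"{frame_idx:06d}.npy"

print(token_path("data/hypernerf/americano", 1).as_posix())
# → data/hypernerf/americano/streamvggt_token/000001.npy
```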

### Step 2: Train 4DLangVGGT
We provide the following command for training:
```bash
torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 train.py --batch_size 8 \
    --data_root YOUR_DATA_ROOT --streamvggt_ckpt_path YOUR_STREAMVGGT_CKPT \
    --num_workers 0 --output_dir unify_hyper_clip --mode gt --cos --wandb --joint_train \
    --feat_root clip_features-all_dim3
```

## 🏄 Top Contributors

<a href="https://github.com/hustvl/4DLangVGGT/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=hustvl/4DLangVGGT" alt="contrib.rocks image" />
</a>

## Acknowledgements
Our code is based on the following brilliant repositories:

- [StreamVGGT](https://github.com/wzzheng/StreamVGGT)
- [VGGT](https://github.com/facebookresearch/vggt)
- [4DLangSplat](https://github.com/zrporz/4DLangSplat)

Many thanks to these authors!

## License

Released under the [MIT](LICENSE) License.