---
language:
- en
license: mit
library_name: transformers
tags:
- vision-language
- navigation
- embodied-ai
- visual-navigation
- mixture-of-experts
- multimodal
- pytorch
datasets:
- R2R
- REVERIE
- RXR
- CVDN
- SOON
- ObjectNav-MP3D
metrics:
- success_rate
- spl
pipeline_tag: visual-question-answering
model-index:
- name: SAME
  results:
  - task:
      type: visual-navigation
      name: Vision-and-Language Navigation
    dataset:
      type: R2R
      name: Room-to-Room (R2R)
    metrics:
    - type: success_rate
      value: 76
      name: SR (val_unseen)
    - type: spl
      value: 66
      name: SPL (val_unseen)
    - type: success_rate
      value: 74
      name: SR (test_unseen)
    - type: spl
      value: 64
      name: SPL (test_unseen)
  - task:
      type: visual-navigation
      name: Vision-and-Language Navigation
    dataset:
      type: REVERIE
      name: REVERIE
    metrics:
    - type: success_rate
      value: 46.4
      name: SR (val_unseen)
    - type: spl
      value: 36.1
      name: SPL (val_unseen)
    - type: success_rate
      value: 48.6
      name: SR (test_unseen)
    - type: spl
      value: 37.1
      name: SPL (test_unseen)
  - task:
      type: visual-navigation
      name: Multilingual VLN
    dataset:
      type: RXR
      name: RxR-EN
    metrics:
    - type: success_rate
      value: 50.5
      name: SR (val_unseen)
    - type: ndtw
      value: 51.2
      name: nDTW (val_unseen)
  - task:
      type: visual-navigation
      name: Dialog Navigation
    dataset:
      type: CVDN
      name: CVDN
    metrics:
    - type: goal_progress
      value: 6.94
      name: GP (val)
    - type: goal_progress
      value: 7.07
      name: GP (test)
  - task:
      type: visual-navigation
      name: Object-Oriented Navigation
    dataset:
      type: SOON
      name: SOON
    metrics:
    - type: success_rate
      value: 36.1
      name: SR (val_unseen)
    - type: spl
      value: 25.4
      name: SPL (val_unseen)
    - type: success_rate
      value: 38.2
      name: SR (test_unseen)
    - type: spl
      value: 27.1
      name: SPL (test_unseen)
  - task:
      type: object-navigation
      name: Object Navigation
    dataset:
      type: ObjectNav-MP3D
      name: ObjectNav-MP3D
    metrics:
    - type: success_rate
      value: 76.3
      name: SR (val)
    - type: spl
      value: 42.7
      name: SPL (val)
---

<div align="center">

<h1><span style="background: linear-gradient(to right, #007BA7, #99B5D2); -webkit-background-clip: text; color: transparent; font-style: italic;"> SAME</span>: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts</h1>

<div>
<a href='https://gengzezhou.github.io' target='_blank'>Gengze Zhou<sup>🍕</sup></a>;
<a href='http://www.yiconghong.me' target='_blank'>Yicong Hong<sup>🌭</sup></a>;
<a href='https://zunwang1.github.io' target='_blank'>Zun Wang<sup>🍔</sup></a>;
<a href='https://github.com/zhaoc5' target='_blank'>Chongyang Zhao<sup>🌮</sup></a>;
<a href='https://www.cs.unc.edu/~mbansal/' target='_blank'>Mohit Bansal<sup>🍔</sup></a>;
<a href='http://www.qi-wu.me' target='_blank'>Qi Wu<sup>🍕</sup></a>
</div>
<sup>🍕</sup>AIML, University of Adelaide
<sup>🌭</sup>Adobe Research
<sup>🍔</sup>UNC, Chapel Hill
<sup>🌮</sup>UNSW Sydney

<br>

<div>
<a href='https://github.com/GengzeZhou/SAME' target='_blank'><img alt="Static Badge" src="https://img.shields.io/badge/VLNBench-v0.1-blue"></a>
<a href='https://arxiv.org/abs/2412.05552' target='_blank'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</div>

</div>

## Model Description

**SAME** (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both **high-level, category-specific search** (e.g., "find a chair") and **low-level language-guided navigation** (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture.

### Key Features

- **Multi-Task Capability**: A single model handles 9 navigation datasets simultaneously
- **State-Adaptive MoE**: Dynamic expert routing based on multimodal features (text + visual observations)
- **Simulator-Free**: Works entirely from pre-computed CLIP ViT-B/16 features; no simulator installation is required
- **Flexible Architecture**: The MoE can be placed at the attention query, key-value, or feed-forward network position

## Model Architecture

SAME is built on a transformer-based architecture with the following key components:

| Component | Description |
|-----------|-------------|
| **Language Encoder** | 9-layer BERT-based transformer encoder |
| **Image Embeddings** | Processes 512-dim CLIP ViT-B/16 panoramic features |
| **Local VP Encoder** | Viewpoint-level information with cross-modal fusion |
| **Global Map Encoder** | Global spatial graph with dynamic routing |
| **State-Adaptive MoE** | 8 experts with top-2 selection, multimodal routing |

### MoE Routing

The State-Adaptive MoE uses multimodal features (fused text + visual embeddings) to dynamically route tokens to specialized experts, as sketched below. This allows the model to adapt its behavior based on:
- The granularity of the language instructions
- The current visual observations
- The requirements of the navigation task
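
For illustration, here is a minimal PyTorch sketch of the top-2 routing idea (class, module, and variable names are hypothetical and not taken from the SAME codebase): a router scores 8 experts from a fused multimodal state vector and mixes the outputs of the two highest-scoring experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateAdaptiveMoESketch(nn.Module):
    """Illustrative top-2 MoE layer routed by a fused multimodal state.

    A simplified sketch of the routing mechanism described above,
    not the actual SAME implementation.
    """

    def __init__(self, hidden_dim=768, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts)]
        )
        # The router consumes the fused text+vision state, not a single modality.
        self.router = nn.Linear(hidden_dim, num_experts)
        self.top_k = top_k

    def forward(self, tokens, state):
        # tokens: (batch, seq, hidden); state: (batch, hidden) fused multimodal feature
        logits = self.router(state)                      # (batch, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep the top-2 experts
        weights = F.softmax(weights, dim=-1)             # normalize their scores

        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for b in range(tokens.size(0)):
                expert = self.experts[int(idx[b, k])]
                out[b] += weights[b, k] * expert(tokens[b])
        return out

# Example: route a batch of 4 sequences with a fused state vector.
layer = StateAdaptiveMoESketch()
tokens = torch.randn(4, 20, 768)
state = torch.randn(4, 768)
print(layer(tokens, state).shape)  # torch.Size([4, 20, 768])
```

In the actual model the routed sub-layers sit inside the transformer (at the attention query, key-value, or FFN position, per the table above); the explicit loop here is purely for readability.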

## Intended Uses

### Primary Use Cases

- **Vision-and-Language Navigation (VLN)**: Following natural language instructions in indoor environments
- **Object Navigation**: Finding target objects given category names
- **Dialog-based Navigation**: Multi-turn conversational navigation
- **Remote Object Grounding**: Navigating to and identifying remote objects

### Supported Tasks

| Task | Dataset | Description |
|------|---------|-------------|
| Low-Level Navigation | R2R, R2R-PREVALENT, R2R-ScaleVLN | Fine-grained instruction following |
| Object Grounding | REVERIE, REVERIE-ScaleVLN | Navigate to and ground remote objects |
| Long-Horizon VLN | RXR-EN | Long-horizon navigation (English) |
| Dialog Navigation | CVDN | Cooperative vision-and-dialog navigation |
| Object Search | SOON | Semantic object-oriented navigation |
| Object Navigation | ObjectNav-MP3D | Category-based object finding |

## How to Use

### Installation

```bash
git clone https://github.com/GengzeZhou/SAME.git
cd SAME
conda create --name SAME python=3.10
conda activate SAME
pip install -r requirements.txt
```

### Download Data and Models

```bash
# Download all datasets and features
python download.py --data

# Download pretrained models
python download.py --pretrain

# Download trained checkpoints (optional)
python download.py --checkpoints
```

### Training

```bash
cd src

# Single GPU training
python run.py --config_dir configs/main_multi_q.yaml

# Multi-GPU distributed training
torchrun --nproc_per_node=4 --master_port=29500 \
    run.py --config_dir configs/main_multi_q.yaml
```

### Evaluation

```bash
cd src
python run.py --config_dir configs/test.yaml \
    --options experiment.resume_file=/path/to/checkpoint.pt
```

### Configuration Options

```yaml
model:
  use_moe_layer: true
  moe_type: "Task"                # Task-based MoE
  moe_position: "Attn_q"          # Attn_q, Attn_kv, or FFN
  task_routing_feature: "multi"   # Multimodal routing (recommended)
  num_experts: 8
  num_experts_per_tok: 2          # Top-2 expert selection
```
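
For reference, here is a minimal, hypothetical Python sketch of how such a YAML config and a dotlist override (mirroring the `--options experiment.resume_file=...` pattern shown above) could be combined; it does not reproduce the actual loading code in `run.py`.

```python
# Hypothetical config-loading sketch; the real run.py may differ.
import yaml

def load_config(path, overrides=()):
    """Load a YAML config and apply dot-separated key overrides."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for item in overrides:                     # e.g. "experiment.resume_file=/path/to/ckpt.pt"
        dotted_key, value = item.split("=", 1)
        node = cfg
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value                     # values stay strings in this simplified sketch
    return cfg

cfg = load_config("configs/main_multi_q.yaml",
                  overrides=["experiment.resume_file=/path/to/checkpoint.pt"])
print(cfg["model"]["num_experts"])  # -> 8 with the config shown above
```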

## Training Details

### Training Data

SAME is trained on 9 navigation datasets with weighted sampling (a sketch of the sampling scheme follows the table):

| Dataset | Environment | Sampling Weight |
|---------|-------------|-----------------|
| R2R-ScaleVLN | HM3D | 10-20 |
| R2R-PREVALENT | MP3D | 1 |
| R2R | MP3D | 1 |
| REVERIE-ScaleVLN | HM3D | 1-10 |
| REVERIE | MP3D | 1 |
| RXR-EN | MP3D | 1 |
| CVDN | MP3D | 1 |
| SOON | MP3D | 1 |
| ObjectNav-MP3D | MP3D (Habitat) | 2 |
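
As a rough illustration of weight-proportional dataset sampling (a generic sketch, not the actual SAME data loader; the weights come from the table above, with single representative values picked from the listed ranges, and the batch-construction details are assumptions):

```python
import random

# Per-dataset sampling weights from the table above (representative values
# chosen from the listed ranges purely for illustration).
dataset_weights = {
    "R2R-ScaleVLN": 15,
    "R2R-PREVALENT": 1,
    "R2R": 1,
    "REVERIE-ScaleVLN": 5,
    "REVERIE": 1,
    "RXR-EN": 1,
    "CVDN": 1,
    "SOON": 1,
    "ObjectNav-MP3D": 2,
}

def sample_dataset(rng=random):
    """Pick the source dataset for the next training batch,
    proportionally to its sampling weight."""
    names = list(dataset_weights)
    weights = [dataset_weights[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# With these weights, roughly 15/28 of batches come from R2R-ScaleVLN.
counts = {name: 0 for name in dataset_weights}
for _ in range(10_000):
    counts[sample_dataset()] += 1
print(counts)
```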

### Training Hyperparameters

- **Optimizer**: AdamW
- **Learning Rate**: 1e-5
- **Total Iterations**: 500,000
- **Batch Size**: 16
- **Gradient Clipping**: 0.5
- **Training Algorithm**: DAgger (Dataset Aggregation)
- **MoE Auxiliary Loss Coefficient**: 0.8

### Visual Features

- **Feature Extractor**: CLIP ViT-B/16 (see the sketch after this list)
- **Feature Dimension**: 512
- **Format**: HDF5 / LMDB
- **Environments**: MatterSim, Habitat-MP3D, Habitat-HM3D
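
For context, this is a minimal sketch of how 512-dim CLIP ViT-B/16 image features might be precomputed and stored in HDF5 using the `transformers` library; the file paths and dataset layout are hypothetical, and the official feature-extraction scripts may differ.

```python
# Hypothetical feature-precomputation sketch; the released features may have been
# produced with a different script and storage layout.
import h5py
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image_paths = ["views/pano_000_view_00.jpg"]  # placeholder panorama view images

with h5py.File("clip_vit_b16_features.hdf5", "w") as out, torch.no_grad():
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to(device)
        feats = model.get_image_features(**inputs)   # shape: (1, 512)
        out.create_dataset(path, data=feats.squeeze(0).cpu().numpy())
```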

## Evaluation Results

SAME achieves state-of-the-art or highly competitive performance across all navigation benchmarks as a **unified model**, outperforming task-specific approaches in many cases.

### Main Results (Unified Model)

#### Room-to-Room (R2R)

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **76** | 66 |
| Test Unseen | **74** | **64** |

#### REVERIE

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **46.4** | **36.1** |
| Test Unseen | **48.6** | **37.1** |

#### RxR-EN (Multilingual VLN)

| Split | SR ↑ | nDTW ↑ |
|-------|------|--------|
| Val Unseen | **50.5** | **51.2** |

#### CVDN (Dialog Navigation)

| Split | GP ↑ |
|-------|------|
| Val | **6.94** |
| Test | 7.07 |

#### SOON (Object-Oriented Navigation)

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | 36.1 | 25.4 |
| Test Unseen | **38.2** | **27.1** |

#### ObjectNav-MP3D

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val | **76.3** | 42.7 |

### Evaluation Metrics

- **SR (Success Rate)**: Percentage of successful episodes (the agent stops within 3 m of the goal)
- **SPL (Success weighted by Path Length)**: Success rate weighted by path efficiency (see the sketch after this list)
- **nDTW (normalized Dynamic Time Warping)**: Similarity of the executed path to the ground-truth path
- **GP (Goal Progress)**: Progress towards the goal in dialog navigation
- **NE (Navigation Error)**: Distance to the goal at the end of the episode
- **OSR (Oracle Success Rate)**: Success rate assuming an oracle stop action
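
As an illustrative example (a generic sketch using the standard SR/SPL definitions, not evaluation code from this repository), SR and SPL over a set of episodes can be computed as:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    final_dist_to_goal: float    # metres from the goal when the agent stops
    path_length: float           # length of the path the agent actually took
    shortest_path_length: float  # geodesic distance from start to goal

def success_rate_and_spl(episodes, success_radius=3.0):
    """Standard SR and SPL: SPL averages S_i * l_i / max(p_i, l_i) over episodes."""
    successes, spl_terms = [], []
    for ep in episodes:
        success = ep.final_dist_to_goal <= success_radius
        successes.append(success)
        spl_terms.append(
            float(success) * ep.shortest_path_length
            / max(ep.path_length, ep.shortest_path_length)
        )
    n = len(episodes)
    return 100 * sum(successes) / n, 100 * sum(spl_terms) / n

episodes = [
    Episode(final_dist_to_goal=1.2, path_length=12.0, shortest_path_length=10.0),
    Episode(final_dist_to_goal=4.5, path_length=8.0, shortest_path_length=9.0),
]
sr, spl = success_rate_and_spl(episodes)
print(f"SR={sr:.1f}  SPL={spl:.1f}")  # SR=50.0  SPL=41.7
```

The 3 m success radius shown here is the convention used by R2R-style benchmarks; other tasks define success differently.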

## Model Variants

| Variant | MoE Position | Routing | Checkpoint |
|---------|--------------|---------|------------|
| SAME-Q | Attention Query | Multimodal | `Attnq_pretrained_ckpt.pt` |
| SAME-KV | Attention K/V | Multimodal | `Attnkv_pretrained_ckpt.pt` |
| SAME-FFN | Feed-Forward | Multimodal | `FFN_pretrained_ckpt.pt` |

## Limitations

- **Indoor Environments Only**: Trained and evaluated exclusively on indoor navigation datasets
- **Pre-computed Features**: Requires pre-extracted CLIP features; cannot process raw images directly
- **English Language**: Primary support is for English instructions (only the English split of the multilingual RxR is used)
- **Static Environments**: Assumes static environments without dynamic obstacles or other agents

## Environmental Impact

- **Hardware**: Training conducted on NVIDIA A100 GPUs
- **Training Time**: Approximately 2-3 days on 4x A100 GPUs

## Citation

If you find this work helpful, please cite:

```bibtex
@article{zhou2024same,
  title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
  author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
  journal={arXiv preprint arXiv:2412.05552},
  year={2024}
}
```

## Authors

- **Gengze Zhou** - AIML, University of Adelaide ([Website](https://gengzezhou.github.io))
- **Yicong Hong** - Adobe Research ([Website](http://www.yiconghong.me))
- **Zun Wang** - UNC Chapel Hill ([Website](https://zunwang1.github.io))
- **Chongyang Zhao** - UNSW Sydney ([GitHub](https://github.com/zhaoc5))
- **Mohit Bansal** - UNC Chapel Hill ([Website](https://www.cs.unc.edu/~mbansal/))
- **Qi Wu** - University of Adelaide ([Website](http://www.qi-wu.me))

## Acknowledgements

We extend our gratitude to:
- [Matterport3D](https://niessner.github.io/Matterport/) for the open-source platform
- [DUET](https://github.com/cshizhe/VLN-DUET) for the foundational architecture
- [ScaleVLN](https://github.com/wz0919/ScaleVLN) for augmented training data
- [NaviLLM](https://github.com/zd11024/NaviLLM) for additional insights

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/GengzeZhou/SAME) or contact the authors.