---
language:
- en
license: mit
library_name: transformers
tags:
- vision-language
- navigation
- embodied-ai
- visual-navigation
- mixture-of-experts
- multimodal
- pytorch
datasets:
- R2R
- REVERIE
- RXR
- CVDN
- SOON
- ObjectNav-MP3D
metrics:
- success_rate
- spl
pipeline_tag: visual-question-answering
model-index:
- name: SAME
  results:
  - task:
      type: visual-navigation
      name: Vision-and-Language Navigation
    dataset:
      type: R2R
      name: Room-to-Room (R2R)
    metrics:
    - type: success_rate
      value: 76
      name: SR (val_unseen)
    - type: spl
      value: 66
      name: SPL (val_unseen)
    - type: success_rate
      value: 74
      name: SR (test_unseen)
    - type: spl
      value: 64
      name: SPL (test_unseen)
  - task:
      type: visual-navigation
      name: Vision-and-Language Navigation
    dataset:
      type: REVERIE
      name: REVERIE
    metrics:
    - type: success_rate
      value: 46.4
      name: SR (val_unseen)
    - type: spl
      value: 36.1
      name: SPL (val_unseen)
    - type: success_rate
      value: 48.6
      name: SR (test_unseen)
    - type: spl
      value: 37.1
      name: SPL (test_unseen)
  - task:
      type: visual-navigation
      name: Multilingual VLN
    dataset:
      type: RXR
      name: RxR-EN
    metrics:
    - type: success_rate
      value: 50.5
      name: SR (val_unseen)
    - type: ndtw
      value: 51.2
      name: nDTW (val_unseen)
  - task:
      type: visual-navigation
      name: Dialog Navigation
    dataset:
      type: CVDN
      name: CVDN
    metrics:
    - type: goal_progress
      value: 6.94
      name: GP (val)
    - type: goal_progress
      value: 7.07
      name: GP (test)
  - task:
      type: visual-navigation
      name: Object-Oriented Navigation
    dataset:
      type: SOON
      name: SOON
    metrics:
    - type: success_rate
      value: 36.1
      name: SR (val_unseen)
    - type: spl
      value: 25.4
      name: SPL (val_unseen)
    - type: success_rate
      value: 38.2
      name: SR (test_unseen)
    - type: spl
      value: 27.1
      name: SPL (test_unseen)
  - task:
      type: object-navigation
      name: Object Navigation
    dataset:
      type: ObjectNav-MP3D
      name: ObjectNav-MP3D
    metrics:
    - type: success_rate
      value: 76.3
      name: SR (val)
    - type: spl
      value: 42.7
      name: SPL (val)
---

<div align="center">

<h1><span style="background: linear-gradient(to right, #007BA7, #99B5D2); -webkit-background-clip: text; color: transparent; font-style: italic;"> SAME</span>: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts</h1>

<div>
<a href='https://gengzezhou.github.io' target='_blank'>Gengze Zhou<sup>🍕</sup></a>;
<a href='http://www.yiconghong.me' target='_blank'>Yicong Hong<sup>🌭</sup></a>;
<a href='https://zunwang1.github.io' target='_blank'>Zun Wang<sup>🍔</sup></a>;
<a href='https://github.com/zhaoc5' target='_blank'>Chongyang Zhao<sup>🌮</sup></a>;
<a href='https://www.cs.unc.edu/~mbansal/' target='_blank'>Mohit Bansal<sup>🍔</sup></a>;
<a href='http://www.qi-wu.me' target='_blank'>Qi Wu<sup>🍕</sup></a>
</div>
<sup>🍕</sup>AIML, University of Adelaide
<sup>🌭</sup>Adobe Research
<sup>🍔</sup>UNC, Chapel Hill
<sup>🌮</sup>UNSW Sydney

<br>

<div>
<a href='https://github.com/GengzeZhou/SAME' target='_blank'><img alt="Static Badge" src="https://img.shields.io/badge/VLNBench-v0.1-blue"></a>
<a href='https://arxiv.org/abs/2412.05552' target='_blank'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</div>

</div>

## Model Description

**SAME** (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both **high-level, category-specific search** (e.g., "find a chair") and **low-level language-guided navigation** (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture.

### Key Features

- **Multi-Task Capability**: A single model handles 9 navigation datasets simultaneously
- **State-Adaptive MoE**: Dynamic expert routing based on multimodal features (text + visual observations)
- **Simulator-Free**: Works entirely from pre-computed CLIP ViT-B/16 features; no simulator installation is required
- **Flexible Architecture**: The MoE can be placed at the attention query, key-value, or feed-forward network position

## Model Architecture

SAME is built on a transformer-based architecture with the following key components:

| Component | Description |
|-----------|-------------|
| **Language Encoder** | 9-layer BERT-based transformer encoder |
| **Image Embeddings** | Processes 512-dim CLIP ViT-B/16 panoramic features |
| **Local VP Encoder** | Viewpoint-level information with cross-modal fusion |
| **Global Map Encoder** | Global spatial graph with dynamic routing |
| **State-Adaptive MoE** | 8 experts with top-2 selection, multimodal routing |

### MoE Routing

The State-Adaptive MoE uses multimodal features (fused text + visual embeddings) to dynamically route tokens to specialized experts, as sketched below. This allows the model to adapt its behavior based on:
- The granularity of the language instructions
- The current visual observations
- The requirements of the navigation task
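
For illustration, here is a minimal PyTorch sketch of the top-2 routing idea (class, module, and variable names are hypothetical and not taken from the SAME codebase): a router scores 8 experts from a fused multimodal state vector and mixes the outputs of the two highest-scoring experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateAdaptiveMoESketch(nn.Module):
    """Illustrative top-2 MoE layer routed by a fused multimodal state.

    A simplified sketch of the routing mechanism described above,
    not the actual SAME implementation.
    """

    def __init__(self, hidden_dim=768, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts)]
        )
        # The router consumes the fused text+vision state, not a single modality.
        self.router = nn.Linear(hidden_dim, num_experts)
        self.top_k = top_k

    def forward(self, tokens, state):
        # tokens: (batch, seq, hidden); state: (batch, hidden) fused multimodal feature
        logits = self.router(state)                      # (batch, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep the top-2 experts
        weights = F.softmax(weights, dim=-1)             # normalize their scores

        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for b in range(tokens.size(0)):
                expert = self.experts[int(idx[b, k])]
                out[b] += weights[b, k] * expert(tokens[b])
        return out

# Example: route a batch of 4 sequences with a fused state vector.
layer = StateAdaptiveMoESketch()
tokens = torch.randn(4, 20, 768)
state = torch.randn(4, 768)
print(layer(tokens, state).shape)  # torch.Size([4, 20, 768])
```

In the actual model the routed sub-layers sit inside the transformer (at the attention query, key-value, or FFN position, per the table above); the explicit loop here is purely for readability.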

## Intended Uses

### Primary Use Cases

- **Vision-and-Language Navigation (VLN)**: Following natural language instructions in indoor environments
- **Object Navigation**: Finding target objects given category names
- **Dialog-based Navigation**: Multi-turn conversational navigation
- **Remote Object Grounding**: Navigating to and identifying remote objects

### Supported Tasks

| Task | Dataset | Description |
|------|---------|-------------|
| Low-Level Navigation | R2R, R2R-PREVALENT, R2R-ScaleVLN | Fine-grained instruction following |
| Object Grounding | REVERIE, REVERIE-ScaleVLN | Navigate to and ground remote objects |
| Long-Horizon VLN | RXR-EN | Long-horizon navigation (English) |
| Dialog Navigation | CVDN | Cooperative vision-and-dialog navigation |
| Object Search | SOON | Semantic object-oriented navigation |
| Object Navigation | ObjectNav-MP3D | Category-based object finding |

## How to Use

### Installation

```bash
git clone https://github.com/GengzeZhou/SAME.git
cd SAME
conda create --name SAME python=3.10
conda activate SAME
pip install -r requirements.txt
```

### Download Data and Models

```bash
# Download all datasets and features
python download.py --data

# Download pretrained models
python download.py --pretrain

# Download trained checkpoints (optional)
python download.py --checkpoints
```

### Training

```bash
cd src

# Single GPU training
python run.py --config_dir configs/main_multi_q.yaml

# Multi-GPU distributed training
torchrun --nproc_per_node=4 --master_port=29500 \
    run.py --config_dir configs/main_multi_q.yaml
```

### Evaluation

```bash
cd src
python run.py --config_dir configs/test.yaml \
    --options experiment.resume_file=/path/to/checkpoint.pt
```

### Configuration Options

```yaml
model:
  use_moe_layer: true
  moe_type: "Task"                # Task-based MoE
  moe_position: "Attn_q"          # Attn_q, Attn_kv, or FFN
  task_routing_feature: "multi"   # Multimodal routing (recommended)
  num_experts: 8
  num_experts_per_tok: 2          # Top-2 expert selection
```
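
For reference, here is a minimal, hypothetical Python sketch of how such a YAML config and a dotlist override (mirroring the `--options experiment.resume_file=...` pattern shown above) could be combined; it does not reproduce the actual loading code in `run.py`.

```python
# Hypothetical config-loading sketch; the real run.py may differ.
import yaml

def load_config(path, overrides=()):
    """Load a YAML config and apply dot-separated key overrides."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for item in overrides:                     # e.g. "experiment.resume_file=/path/to/ckpt.pt"
        dotted_key, value = item.split("=", 1)
        node = cfg
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value                     # values stay strings in this simplified sketch
    return cfg

cfg = load_config("configs/main_multi_q.yaml",
                  overrides=["experiment.resume_file=/path/to/checkpoint.pt"])
print(cfg["model"]["num_experts"])  # -> 8 with the config shown above
```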

## Training Details

### Training Data

SAME is trained on 9 navigation datasets with weighted sampling (a sketch of the sampling scheme follows the table):

| Dataset | Environment | Sampling Weight |
|---------|-------------|-----------------|
| R2R-ScaleVLN | HM3D | 10-20 |
| R2R-PREVALENT | MP3D | 1 |
| R2R | MP3D | 1 |
| REVERIE-ScaleVLN | HM3D | 1-10 |
| REVERIE | MP3D | 1 |
| RXR-EN | MP3D | 1 |
| CVDN | MP3D | 1 |
| SOON | MP3D | 1 |
| ObjectNav-MP3D | MP3D (Habitat) | 2 |
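
As a rough illustration of weight-proportional dataset sampling (a generic sketch, not the actual SAME data loader; the weights come from the table above, with single representative values picked from the listed ranges, and the batch-construction details are assumptions):

```python
import random

# Per-dataset sampling weights from the table above (representative values
# chosen from the listed ranges purely for illustration).
dataset_weights = {
    "R2R-ScaleVLN": 15,
    "R2R-PREVALENT": 1,
    "R2R": 1,
    "REVERIE-ScaleVLN": 5,
    "REVERIE": 1,
    "RXR-EN": 1,
    "CVDN": 1,
    "SOON": 1,
    "ObjectNav-MP3D": 2,
}

def sample_dataset(rng=random):
    """Pick the source dataset for the next training batch,
    proportionally to its sampling weight."""
    names = list(dataset_weights)
    weights = [dataset_weights[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# With these weights, roughly 15/28 of batches come from R2R-ScaleVLN.
counts = {name: 0 for name in dataset_weights}
for _ in range(10_000):
    counts[sample_dataset()] += 1
print(counts)
```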

### Training Hyperparameters

- **Optimizer**: AdamW
- **Learning Rate**: 1e-5
- **Total Iterations**: 500,000
- **Batch Size**: 16
- **Gradient Clipping**: 0.5
- **Training Algorithm**: DAgger (Dataset Aggregation)
- **MoE Auxiliary Loss Coefficient**: 0.8

### Visual Features

- **Feature Extractor**: CLIP ViT-B/16 (see the sketch after this list)
- **Feature Dimension**: 512
- **Format**: HDF5 / LMDB
- **Environments**: MatterSim, Habitat-MP3D, Habitat-HM3D
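
For context, this is a minimal sketch of how 512-dim CLIP ViT-B/16 image features might be precomputed and stored in HDF5 using the `transformers` library; the file paths and dataset layout are hypothetical, and the official feature-extraction scripts may differ.

```python
# Hypothetical feature-precomputation sketch; the released features may have been
# produced with a different script and storage layout.
import h5py
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image_paths = ["views/pano_000_view_00.jpg"]  # placeholder panorama view images

with h5py.File("clip_vit_b16_features.hdf5", "w") as out, torch.no_grad():
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to(device)
        feats = model.get_image_features(**inputs)   # shape: (1, 512)
        out.create_dataset(path, data=feats.squeeze(0).cpu().numpy())
```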

## Evaluation Results

SAME achieves state-of-the-art or highly competitive performance across all navigation benchmarks as a **unified model**, outperforming task-specific approaches in many cases.

### Main Results (Unified Model)

#### Room-to-Room (R2R)

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **76** | 66 |
| Test Unseen | **74** | **64** |

#### REVERIE

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **46.4** | **36.1** |
| Test Unseen | **48.6** | **37.1** |

#### RxR-EN (Multilingual VLN)

| Split | SR ↑ | nDTW ↑ |
|-------|------|--------|
| Val Unseen | **50.5** | **51.2** |

#### CVDN (Dialog Navigation)

| Split | GP ↑ |
|-------|------|
| Val | **6.94** |
| Test | 7.07 |

#### SOON (Object-Oriented Navigation)

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | 36.1 | 25.4 |
| Test Unseen | **38.2** | **27.1** |

#### ObjectNav-MP3D

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val | **76.3** | 42.7 |

### Evaluation Metrics

- **SR (Success Rate)**: Percentage of successful episodes (the agent stops within 3 m of the goal)
- **SPL (Success weighted by Path Length)**: Success rate weighted by path efficiency (see the sketch after this list)
- **nDTW (normalized Dynamic Time Warping)**: Similarity of the executed path to the ground-truth path
- **GP (Goal Progress)**: Progress towards the goal in dialog navigation
- **NE (Navigation Error)**: Distance to the goal at the end of the episode
- **OSR (Oracle Success Rate)**: Success rate assuming an oracle stop action
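
As an illustrative example (a generic sketch using the standard SR/SPL definitions, not evaluation code from this repository), SR and SPL over a set of episodes can be computed as:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    final_dist_to_goal: float    # metres from the goal when the agent stops
    path_length: float           # length of the path the agent actually took
    shortest_path_length: float  # geodesic distance from start to goal

def success_rate_and_spl(episodes, success_radius=3.0):
    """Standard SR and SPL: SPL averages S_i * l_i / max(p_i, l_i) over episodes."""
    successes, spl_terms = [], []
    for ep in episodes:
        success = ep.final_dist_to_goal <= success_radius
        successes.append(success)
        spl_terms.append(
            float(success) * ep.shortest_path_length
            / max(ep.path_length, ep.shortest_path_length)
        )
    n = len(episodes)
    return 100 * sum(successes) / n, 100 * sum(spl_terms) / n

episodes = [
    Episode(final_dist_to_goal=1.2, path_length=12.0, shortest_path_length=10.0),
    Episode(final_dist_to_goal=4.5, path_length=8.0, shortest_path_length=9.0),
]
sr, spl = success_rate_and_spl(episodes)
print(f"SR={sr:.1f}  SPL={spl:.1f}")  # SR=50.0  SPL=41.7
```

The 3 m success radius shown here is the convention used by R2R-style benchmarks; other tasks define success differently.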

## Model Variants

| Variant | MoE Position | Routing | Checkpoint |
|---------|--------------|---------|------------|
| SAME-Q | Attention Query | Multimodal | `Attnq_pretrained_ckpt.pt` |
| SAME-KV | Attention K/V | Multimodal | `Attnkv_pretrained_ckpt.pt` |
| SAME-FFN | Feed-Forward | Multimodal | `FFN_pretrained_ckpt.pt` |

## Limitations

- **Indoor Environments Only**: Trained and evaluated exclusively on indoor navigation datasets
- **Pre-computed Features**: Requires pre-extracted CLIP features; cannot process raw images directly
- **English Language**: Primary support is for English instructions (only the English split of the multilingual RxR is used)
- **Static Environments**: Assumes static environments without dynamic obstacles or other agents

## Environmental Impact

- **Hardware**: Training conducted on NVIDIA A100 GPUs
- **Training Time**: Approximately 2-3 days on 4x A100 GPUs

## Citation

If you find this work helpful, please cite:

```bibtex
@article{zhou2024same,
  title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
  author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
  journal={arXiv preprint arXiv:2412.05552},
  year={2024}
}
```

## Authors

- **Gengze Zhou** - AIML, University of Adelaide ([Website](https://gengzezhou.github.io))
- **Yicong Hong** - Adobe Research ([Website](http://www.yiconghong.me))
- **Zun Wang** - UNC Chapel Hill ([Website](https://zunwang1.github.io))
- **Chongyang Zhao** - UNSW Sydney ([GitHub](https://github.com/zhaoc5))
- **Mohit Bansal** - UNC Chapel Hill ([Website](https://www.cs.unc.edu/~mbansal/))
- **Qi Wu** - University of Adelaide ([Website](http://www.qi-wu.me))

## Acknowledgements

We extend our gratitude to:
- [Matterport3D](https://niessner.github.io/Matterport/) for the open-source platform
- [DUET](https://github.com/cshizhe/VLN-DUET) for the foundational architecture
- [ScaleVLN](https://github.com/wz0919/ScaleVLN) for augmented training data
- [NaviLLM](https://github.com/zd11024/NaviLLM) for additional insights

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/GengzeZhou/SAME) or contact the authors.