AIML, University of Adelaide
Adobe Research
UNC Chapel Hill
UNSW Sydney
Model Description
SAME (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both high-level category-specific search (e.g., "find a chair") and low-level language-guided navigation (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture.
Key Features
Multi-Task Capability: Single model handles 9 different navigation datasets simultaneously
State-Adaptive MoE: Dynamic expert routing based on multimodal features (text + visual observations)
Simulator-Free: Works entirely with pre-computed CLIP ViT-B/16 features, so no simulator installation is required (see the feature-extraction sketch after this list)
Flexible Architecture: MoE can be placed at attention query, key-value, or feed-forward network positions
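The simulator-free setup relies on panoramic view features extracted once with CLIP ViT-B/16. Below is a minimal sketch of how such 512-dim features could be pre-extracted with the openai CLIP package; the 36-view panorama layout and the `encode_panorama` helper are illustrative assumptions, not the project's extraction script.

```python
# Sketch: pre-extracting 512-dim CLIP ViT-B/16 features for panoramic views.
# Assumes the openai CLIP package (pip install git+https://github.com/openai/CLIP)
# and one image file per discretised view of the panorama (file layout is hypothetical).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

@torch.no_grad()
def encode_panorama(image_paths):
    """Encode one panorama (e.g. 36 view images) into a [num_views, 512] tensor."""
    views = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    feats = model.encode_image(views)                 # [num_views, 512]
    return feats / feats.norm(dim=-1, keepdim=True)   # L2-normalised, as is common for CLIP features
```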
Model Architecture
SAME is built on a transformer-based architecture with the following key components:
| Component | Description |
|---|---|
| Language Encoder | 9-layer BERT-based transformer encoder |
| Image Embeddings | Processes 512-dim CLIP ViT-B/16 panoramic features |
| Local VP Encoder | Viewport-level information with crossmodal fusion |
| Global Map Encoder | Global spatial graph with dynamic routing |
| State-Adaptive MoE | 8 experts with top-2 selection, multimodal routing |
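To make the component table concrete, here is a highly simplified PyTorch skeleton of how these pieces could compose in a forward pass. Module choices, tensor shapes, and names are assumptions for illustration; the released model also interleaves the State-Adaptive MoE layers described in the next section.

```python
# Schematic composition of the components above (a simplified sketch, not the released code).
import torch.nn as nn

class SAMESketch(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        # 9-layer BERT-style text encoder
        self.language_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True), num_layers=9)
        # lift 512-dim CLIP view features to the hidden size
        self.img_proj = nn.Linear(512, hidden)
        # crossmodal fusion for the local viewport and the global map graph
        self.local_encoder = nn.TransformerDecoderLayer(hidden, nhead=12, batch_first=True)
        self.global_encoder = nn.TransformerDecoderLayer(hidden, nhead=12, batch_first=True)

    def forward(self, text_emb, pano_feats, map_feats):
        txt = self.language_encoder(text_emb)                           # [B, L, H]
        local = self.local_encoder(self.img_proj(pano_feats), txt)      # viewport tokens attend to text
        global_ = self.global_encoder(self.img_proj(map_feats), txt)    # map-node tokens attend to text
        return local, global_
```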
MoE Routing
The State-Adaptive MoE uses multimodal features (fused text + visual embeddings) to dynamically route tokens to specialized experts. This allows the model to adapt its behavior (see the routing sketch after this list) based on:
The granularity of language instructions
Current visual observations
Navigation task requirements
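The following is a minimal sketch of top-2-of-8 expert routing conditioned on a fused multimodal state, matching the description above. The router design, tensor shapes, and class internals are assumptions; in the released model this block can sit at the attention query, key-value, or FFN position (see Model Variants below).

```python
# Sketch: route all tokens of a step through the top-2 of 8 experts,
# selected from a fused text + visual state vector (shapes are assumptions).
import torch
import torch.nn as nn

class StateAdaptiveMoE(nn.Module):
    def __init__(self, hidden=768, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(hidden, num_experts)  # routing logits from the multimodal state
        self.top_k = top_k

    def forward(self, tokens, state):
        # tokens: [B, T, H] token sequence to transform; state: [B, H] fused text + visual embedding
        weights, idx = self.router(state).softmax(-1).topk(self.top_k, dim=-1)  # [B, k] each
        weights = weights / weights.sum(-1, keepdim=True)  # renormalise over the selected experts
        out = torch.zeros_like(tokens)
        for b in range(tokens.size(0)):
            for k in range(self.top_k):
                out[b] += weights[b, k] * self.experts[int(idx[b, k])](tokens[b])
        return out
```

In this sketch one routing decision is made per sample from the fused state and applied to all of that step's tokens, which is what lets the mixture shift between coarse goal search and fine-grained instruction following.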
Intended Uses
Primary Use Cases
Vision-and-Language Navigation (VLN): Following natural language instructions in indoor environments
Object Navigation: Finding target objects given category names
SAME achieves state-of-the-art or highly competitive performance across all navigation benchmarks as a unified model, outperforming task-specific approaches in many cases.
Main Results (Unified Model)
Room-to-Room (R2R)
| Split | SR ↑ | SPL ↑ |
|---|---|---|
| Val Unseen | 76 | 66 |
| Test Unseen | 74 | 64 |
REVERIE
| Split | SR ↑ | SPL ↑ |
|---|---|---|
| Val Unseen | 46.4 | 36.1 |
| Test Unseen | 48.6 | 37.1 |
RxR-EN (Multilingual VLN)
| Split | SR ↑ | nDTW ↑ |
|---|---|---|
| Val Unseen | 50.5 | 51.2 |
CVDN (Dialog Navigation)
| Split | GP ↑ |
|---|---|
| Val | 6.94 |
| Test | 7.07 |
SOON (Scenario Oriented Object Navigation)
| Split | SR ↑ | SPL ↑ |
|---|---|---|
| Val Unseen | 36.1 | 25.4 |
| Test Unseen | 38.2 | 27.1 |
ObjectNav-MP3D
| Split | SR ↑ | SPL ↑ |
|---|---|---|
| Val | 76.3 | 42.7 |
Evaluation Metrics
SR (Success Rate): Percentage of successful navigations (within 3m of goal)
SPL (Success weighted by Path Length): Efficiency-weighted success rate (computed as in the sketch after this list)
nDTW (normalized Dynamic Time Warping): Path similarity to ground truth
GP (Goal Progress): Progress towards the goal in dialog navigation
NE (Navigation Error): Distance to goal at episode end
OSR (Oracle Success Rate): Success rate with oracle stop action
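SR and SPL follow the standard VLN definitions (success within 3 m of the goal, and success weighted by the ratio of shortest-path length to the larger of taken-path and shortest-path length). The snippet below shows how they are typically computed from per-episode results; the episode field names are hypothetical.

```python
# Standard SR / SPL computation from per-episode records (field names are assumptions).
def success_rate(episodes, threshold=3.0):
    """Fraction of episodes that end within `threshold` metres of the goal."""
    return sum(ep["nav_error"] <= threshold for ep in episodes) / len(episodes)

def spl(episodes, threshold=3.0):
    """Success weighted by shortest_path / max(agent_path, shortest_path), averaged over episodes."""
    total = 0.0
    for ep in episodes:
        success = ep["nav_error"] <= threshold
        total += success * ep["shortest_path_length"] / max(ep["agent_path_length"],
                                                            ep["shortest_path_length"])
    return total / len(episodes)
```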
Model Variants
| Variant | MoE Position | Routing | Checkpoint |
|---|---|---|---|
| SAME-Q | Attention Query | Multimodal | `Attnq_pretrained_ckpt.pt` |
| SAME-KV | Attention Key/Value | Multimodal | `Attnkv_pretrained_ckpt.pt` |
| SAME-FFN | Feed-Forward Network | Multimodal | `FFN_pretrained_ckpt.pt` |
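The checkpoints above are plain PyTorch weight files. A minimal loading sketch follows; the possible nesting under a `state_dict` key and the `SAMEAgent` constructor are assumptions, so consult the project code for the actual model class and configuration.

```python
# Sketch: inspecting and loading a released checkpoint (assumed layout, not the official loader).
import torch

ckpt = torch.load("Attnq_pretrained_ckpt.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)   # some checkpoints nest weights under "state_dict"

# model = SAMEAgent(moe_position="query")   # hypothetical constructor from the SAME repo
# model.load_state_dict(state_dict, strict=False)
print(f"{len(state_dict)} parameter tensors in checkpoint")
```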
Limitations
Indoor Environments Only: Trained and evaluated on indoor navigation datasets
Pre-computed Features: Requires pre-extracted CLIP features; cannot process raw images directly
English Language: Primary support for English instructions (though RxR provides multilingual data)
Static Environments: Assumes static environments without dynamic obstacles or agents
Environmental Impact
Hardware: Training conducted on NVIDIA A100 GPUs
Training Time: Approximately 2-3 days on 4x A100 GPUs
Citation
If you find this work helpful, please cite:
```bibtex
@article{zhou2024same,
  title   = {SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
  author  = {Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
  journal = {arXiv preprint arXiv:2412.05552},
  year    = {2024},
}
```
Authors
Gengze Zhou - AIML, University of Adelaide