# WeNavigate PPO Low-Level Controller
Low-level PPO controller for the WeNavigate Vision-Language Navigation system. Trained to execute VLM navigation commands (forward / turn-left / turn-right / stop) inside Facebook Habitat-sim with HM3D scenes.
## Architecture
- Policy: Actor-Critic CNN + MLP (~144K parameters)
- Encoder: 3-layer stride-2 CNN (64×64 depth → 128-dim embedding)
- Observation: depth image (64×64) + VLM command one-hot (4-dim) + proprioception (3-dim)
- Action space: Discrete 5 (forward / left / right / stop / no-op)
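The architecture above can be sketched as a small PyTorch module. This is a hypothetical reconstruction from the stated shapes (64×64 depth → 128-dim embedding, fused with the 4-dim command one-hot and 3-dim proprioception), not the repository's actual `ppo_policy.py`; the channel counts and hidden sizes are assumptions chosen to land near the stated ~144K-parameter budget.

```python
import torch
import torch.nn as nn

class PPOPolicySketch(nn.Module):
    """Illustrative actor-critic CNN + MLP; layer sizes are assumptions."""

    def __init__(self, n_commands: int = 4, n_prop: int = 3, n_actions: int = 5):
        super().__init__()
        # 3-layer stride-2 CNN on the depth image: 64x64 -> 32x32 -> 16x16 -> 8x8
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 8 * 8, 128), nn.ReLU(),  # -> 128-dim embedding
        )
        # Fuse depth embedding with command one-hot and proprioception
        fused = 128 + n_commands + n_prop
        self.actor = nn.Sequential(nn.Linear(fused, 64), nn.ReLU(), nn.Linear(64, n_actions))
        self.critic = nn.Sequential(nn.Linear(fused, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, depth, command, prop):
        z = torch.cat([self.encoder(depth), command, prop], dim=-1)
        return self.actor(z), self.critic(z).squeeze(-1)  # logits over 5 actions, state value
```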
## Training
- Algorithm: PPO with GAE-λ
- Steps trained: 1,998,848
- Final intent-following rate: 98.3%
- Reward: R_INTENT=+2.5 (follow VLM), R_INTENT_MISS=-0.5 (diverge), R_COLLISION=-10
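The reward terms above can be combined as a per-step function. The precedence (collision overriding the intent terms) and the helper arguments are assumptions for illustration, not the repository's actual reward code.

```python
# Reward constants from the training setup
R_INTENT = 2.5        # agent's action follows the VLM command
R_INTENT_MISS = -0.5  # agent's action diverges from the command
R_COLLISION = -10.0   # collision penalty

def step_reward(action: int, command: int, collided: bool) -> float:
    """Reward for one step: collisions dominate; otherwise reward
    intent-following and penalize divergence (assumed precedence)."""
    if collided:
        return R_COLLISION
    return R_INTENT if action == command else R_INTENT_MISS
```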
## Hyperparameters
| Parameter | Value |
|---|---|
| n_rollout | 2048 |
| n_epochs | 4 |
| batch_size | 256 |
| lr | 0.0003 |
| gamma | 0.99 |
| gae_lambda | 0.95 |
| clip_eps | 0.2 |
| entropy_coef | 0.01 |
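The `gamma` and `gae_lambda` entries feed the GAE-λ advantage estimator named above. A minimal pure-Python sketch of that backward recursion, using the table's defaults:

```python
def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards/values/dones are per-step lists; last_value bootstraps the
    value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0          # zero out bootstrap at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]  # TD error
        gae = delta + gamma * lam * mask * gae   # discounted sum of TD errors
        advantages[t] = gae
        next_value = values[t]
    return advantages
```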
## Usage
```python
import torch

from ppo_policy import PPOPolicy

policy = PPOPolicy()
ckpt = torch.load("policy_update_XXXXX.pt", map_location="cpu")
policy.load_state_dict(ckpt["policy_state"])
policy.eval()

# obs: depth (64, 64), command one-hot (4,), proprioception (3,)
# unsqueeze(0) adds the batch dimension the policy expects
action, log_prob, entropy, value = policy.get_action_and_value(
    depth.unsqueeze(0),
    command.unsqueeze(0),
    prop.unsqueeze(0),
)
```
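The `command` tensor is the 4-dim one-hot encoding of the VLM's command. A small helper for building it; the index order here is an assumption and must match the layout used at training time.

```python
import torch

# Hypothetical command-to-index mapping (order is an assumption)
COMMANDS = {"forward": 0, "turn-left": 1, "turn-right": 2, "stop": 3}

def encode_command(name: str) -> torch.Tensor:
    """VLM command string -> 4-dim one-hot tensor for the policy."""
    one_hot = torch.zeros(len(COMMANDS))
    one_hot[COMMANDS[name]] = 1.0
    return one_hot
```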
## Dataset
Trained on wenavigatecontroller-long-episodes — HM3D minival scenes 00800–00809, 160 train episodes, 160 eval episodes.