# Reactive Control for Discrete-Action VLN

PPO training hyperparameters:
| Parameter | Value |
|---|---|
| n_rollout | 2048 |
| n_epochs | 4 |
| batch_size | 256 |
| lr | 0.0003 |
| gamma | 0.99 |
| gae_lambda | 0.95 |
| clip_eps | 0.2 |
| entropy_coef | 0.01 |
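As a sketch of how the `gamma` and `gae_lambda` values above interact, Generalized Advantage Estimation can be written in a few lines of plain Python. The function name and list-based interface are illustrative, not from this repo, and the sketch assumes the rollout contains no episode terminations (no done-masking):

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one rollout, iterating backwards in time.

    rewards, values: per-step lists of the same length.
    last_value: bootstrap value estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors (the GAE recursion)
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `lam = 0` this reduces to the one-step TD error; with `lam = 1` it becomes the full discounted return minus the value baseline.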
Minimal inference example (the checkpoint filename is a placeholder; the observation tensors below are zero-filled dummies with the expected shapes):

```python
import torch
from ppo_policy import PPOPolicy

policy = PPOPolicy()
ckpt = torch.load("policy_update_XXXXX.pt", map_location="cpu")
policy.load_state_dict(ckpt["policy_state"])
policy.eval()

# Observation: depth image (64, 64), command vector (4,), proprioception (3,)
depth = torch.zeros(64, 64)
command = torch.zeros(4)
prop = torch.zeros(3)

with torch.no_grad():
    action, log_prob, entropy, value = policy.get_action_and_value(
        depth.unsqueeze(0),    # add batch dimension -> (1, 64, 64)
        command.unsqueeze(0),  # -> (1, 4)
        prop.unsqueeze(0),     # -> (1, 3)
    )
```
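The `clip_eps = 0.2` entry in the hyperparameter table refers to the standard PPO clipped surrogate objective. A per-sample sketch in plain Python (illustrative only, not this repo's training code, which operates on batched tensors):

```python
import math


def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Negative clipped surrogate objective for a single sample."""
    # Probability ratio between the new and old policies
    ratio = math.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    # Clamp the ratio to [1 - clip_eps, 1 + clip_eps] before weighting
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    # Pessimistic bound: take the minimum, negate to get a loss to minimize
    return -min(unclipped, clipped)
```

Taking the minimum of the clipped and unclipped terms caps how much a single update can move the policy away from the one that collected the rollout.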
Trained on the `wenavigatecontroller-long-episodes` dataset: HM3D minival scenes 00800–00809, with 160 training episodes and 160 evaluation episodes.