---
language: en
tags:
  - reinforcement-learning
  - navigation
  - habitat-sim
  - ppo
  - indoor-navigation
  - wenavigatecontroller
license: mit
---

# WeNavigate PPO Low-Level Controller

Low-level PPO controller for the [WeNavigate](https://github.com/vshwanilgv/ppo-controller)
Vision-Language Navigation system. It is trained to execute the navigation commands issued by a
vision-language model (VLM), namely forward, turn-left, turn-right, and stop, inside Facebook Habitat-sim with HM3D scenes.

## Architecture

- **Policy**: Actor-Critic CNN + MLP (~144K parameters)
- **Encoder**: 3-layer stride-2 CNN (64×64 depth → 128-dim embedding)
- **Observation**: depth image (64×64) + VLM command one-hot (4-dim) + proprioception (3-dim)
- **Action space**: Discrete 5 (forward / left / right / stop / no-op)
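
The shape flow in the list above (64×64 depth image down to a compact embedding, then fused with the command and proprioception vectors) can be sanity-checked with a little arithmetic. This is only a sketch: the 3×3 kernel and padding of 1 are assumptions not stated in this card, as is the idea that the three inputs are simply concatenated before the actor and critic heads.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial size after one conv layer (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


# Three stride-2 layers applied to a 64x64 depth image.
# Kernel size and padding are ASSUMED (3 and 1); only the stride is from the card.
size = 64
sizes = [size]
for _ in range(3):
    size = conv_out(size)
    sizes.append(size)

# Dimensions taken from the card.
EMBED_DIM, COMMAND_DIM, PROP_DIM = 128, 4, 3

# ASSUMPTION: the policy head sees the three vectors concatenated.
fused_dim = EMBED_DIM + COMMAND_DIM + PROP_DIM

print(sizes)      # [64, 32, 16, 8]
print(fused_dim)  # 135
```

Under these assumptions the final feature map is 8×8 per channel, which a linear layer would then project to the 128-dim embedding.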

## Training

- **Algorithm**: PPO with GAE-λ
- **Steps trained**: 1,998,848
- **Final intent-following rate**: 98.3%
- **Reward**: `R_INTENT = +2.5` (action follows the VLM command), `R_INTENT_MISS = -0.5` (action diverges from it), `R_COLLISION = -10` (collision)
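
To make the reward terms concrete, here is a minimal sketch of how they might combine into a per-step reward. The constants come from this card; the function name `step_reward`, its arguments, and the choice to add the collision penalty on top of the intent term (rather than replace it) are assumptions for illustration, not the training code.

```python
# Reward constants as listed in the card.
R_INTENT = 2.5
R_INTENT_MISS = -0.5
R_COLLISION = -10.0


def step_reward(action: int, commanded_action: int, collided: bool) -> float:
    """Hypothetical per-step reward.

    ASSUMPTION: the intent term and the collision penalty are additive.
    """
    reward = R_INTENT if action == commanded_action else R_INTENT_MISS
    if collided:
        reward += R_COLLISION
    return reward


print(step_reward(0, 0, False))  # 2.5
print(step_reward(1, 0, False))  # -0.5
print(step_reward(0, 0, True))   # -7.5
```

The collision penalty is four times the size of the intent bonus, so under this composition a policy cannot profit from following a command into an obstacle.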

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| n_rollout | 2048 |
| n_epochs | 4 |
| batch_size | 256 |
| lr | 0.0003 |
| gamma | 0.99 |
| gae_lambda | 0.95 |
| clip_eps | 0.2 |
| entropy_coef | 0.01 |


## Usage

```python
import torch
from ppo_policy import PPOPolicy

policy = PPOPolicy()
ckpt   = torch.load("policy_update_XXXXX.pt", map_location="cpu")
policy.load_state_dict(ckpt["policy_state"])
policy.eval()

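# obs: dict with keys depth (64,64), command (4,), proprioception (3,)
# Placeholder tensors for illustration; substitute real observations from the environment.
depth = torch.zeros(64, 64)                    # depth image
command = torch.tensor([1.0, 0.0, 0.0, 0.0])   # one-hot VLM command
prop = torch.zeros(3)                          # proprioception

# Add a batch dimension and run inference without tracking gradients.
with torch.no_grad():
    action, log_prob, entropy, value = policy.get_action_and_value(
        depth.unsqueeze(0),
        command.unsqueeze(0),
        prop.unsqueeze(0),
    )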
```
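
The snippet above needs a one-hot command vector as input and returns an integer action. The helpers below sketch one way to translate between names and indices. The index orders are assumptions taken from the order in which this card lists the commands and actions; check them against the training code before relying on them. The names `encode_command` and `decode_action` are hypothetical.

```python
# ASSUMED index orders, taken from the order used in this card.
COMMANDS = ("forward", "turn-left", "turn-right", "stop")
ACTIONS = ("forward", "left", "right", "stop", "no-op")


def encode_command(name: str) -> list:
    """One-hot encode a VLM command name as a 4-element list of floats."""
    vec = [0.0] * len(COMMANDS)
    vec[COMMANDS.index(name)] = 1.0  # raises ValueError for unknown names
    return vec


def decode_action(index: int) -> str:
    """Map a discrete action index (0-4) back to a readable name."""
    return ACTIONS[index]


print(encode_command("turn-left"))  # [0.0, 1.0, 0.0, 0.0]
print(decode_action(3))             # stop
```

Wrapping the result in `torch.tensor(...)` gives a `(4,)` tensor suitable for the command input, and `decode_action(int(action))` turns the policy output into a name.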

## Dataset

Trained on [wenavigatecontroller-long-episodes](https://huggingface.co/datasets/vshwanilgv/wenavigatecontroller-long-episodes):
HM3D minival scenes 00800–00809, with 160 training episodes and 160 evaluation episodes.