---
license: mit
tags:
  - reinforcement-learning
  - ppo
  - pytorch
  - isaac-lab
  - robotics
  - franka
library_name: pytorch
model-index:
  - name: PPO-Franka-Reach
    results: []
---

# PPO-Franka-Reach

A policy trained with a from-scratch PyTorch implementation of Proximal Policy Optimization (PPO) on the `Isaac-Reach-Franka-v0` task, using NVIDIA Isaac Lab with 4096 GPU-parallel environments.

**GitHub Repository:** [DavidH2802/PPO-from-scratch](https://github.com/DavidH2802/PPO-from-scratch)

<p align="center">
  <img src="franka_reach.gif" alt="Franka Reach Policy" width="480"/>
</p>

## Model Description

The model is a diagonal Gaussian policy (Actor) that controls a 7-DOF Franka Emika robot arm to reach a randomly spawned target position in 3D space. The policy outputs continuous joint-level actions.

### Architecture

- **Actor:** MLP (obs → 256 → 256 → act_dim) with Tanh activations, orthogonal initialization, and a learnable log-std parameter
- **Critic:** MLP (obs → 256 → 256 → 1) with Tanh activations and orthogonal initialization (included in checkpoint but not needed for inference)
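
The actor described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the class name `Actor`, the attribute names, the initial log-std of 0, and the orthogonal-init gain of 1.0 are all assumptions, so the parameter names will not necessarily match the state dict stored in the checkpoint.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Diagonal Gaussian policy: MLP mean plus a state-independent log-std."""

    def __init__(self, obs_dim: int = 32, act_dim: int = 7, hidden: int = 256):
        super().__init__()
        # obs -> 256 -> 256 -> act_dim with Tanh activations, as listed above.
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # Learnable log-std, one entry per action dimension (initial value assumed).
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        # Orthogonal initialization; the gain here is an assumption.
        for layer in self.net:
            if isinstance(layer, nn.Linear):
                nn.init.orthogonal_(layer.weight, gain=1.0)
                nn.init.zeros_(layer.bias)

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.net(obs)
        # The std vector broadcasts across the batch dimension.
        return torch.distributions.Normal(mean, self.log_std.exp())


actor = Actor()
obs = torch.zeros(4, 32)   # batch of 4 dummy observations
dist = actor(obs)
action = dist.mean         # deterministic action, e.g. for evaluation
```

Using the distribution mean gives deterministic behaviour at evaluation time; calling `dist.sample()` instead would give the stochastic actions used during training rollouts.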

### Observation and Action Space

- **Observations:** 32-dimensional vector (joint positions, joint velocities, end-effector position, target position)
- **Actions:** 7-dimensional continuous (joint position targets)

## Training Details

### Hyperparameters

| Parameter | Value |
|---|---|
| Task | Isaac-Reach-Franka-v0 |
| Parallel Envs | 4096 |
| Learning Rate | 3e-4 |
| Discount (γ) | 0.99 |
| GAE (λ) | 0.95 |
| Clip (ε) | 0.2 |
| Epochs per Update | 4 |
| Minibatch Size | 2048 |
| Horizon | 32 |
| Total Iterations | 500 |
| Total Env Steps | 65.5M |
| Training Time | ~48 minutes |
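
The step count in the table follows directly from the other rows. The short calculation below reproduces it; the gradient-step figure additionally assumes that each epoch sweeps the full rollout once in non-overlapping minibatches, which the table does not state explicitly.

```python
num_envs = 4096
horizon = 32
iterations = 500
minibatch_size = 2048
epochs = 4

# Transitions collected per PPO iteration (all envs stepped for one horizon).
samples_per_iteration = num_envs * horizon                        # 131,072
# Total environment steps over training, i.e. the "65.5M" in the table.
total_env_steps = samples_per_iteration * iterations              # 65,536,000
# Assumed: one full pass over the rollout per epoch.
minibatches_per_epoch = samples_per_iteration // minibatch_size   # 64
gradient_steps_per_iteration = minibatches_per_epoch * epochs     # 256

# Rough throughput, given the approximate 48-minute training time.
steps_per_second = total_env_steps / (48 * 60)
```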

### Hardware

- **GPU:** NVIDIA RTX 4070 SUPER (12 GB VRAM)
- **CPU:** Intel Xeon E5-2673 v4
- **Cloud:** vast.ai

### Training Curves

#### Reward

The agent starts with negative reward (arm far from target) and converges to positive reward (~0.03-0.05) as it learns to reach the target.

#### Observation Normalization

The checkpoint includes the running mean and variance statistics used for observation normalization (`obs_rms_mean` and `obs_rms_var`). These **must** be restored at inference time. Without them, the policy receives unnormalized inputs and will not perform correctly.
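
Applying the stored statistics typically amounts to standardizing each observation before it reaches the actor. The sketch below shows the usual form; the function name `normalize_obs` and the epsilon value are assumptions, and the repository may also clip the result or use a different epsilon, so check its code before relying on this.

```python
import torch


def normalize_obs(obs: torch.Tensor, mean: torch.Tensor, var: torch.Tensor,
                  eps: float = 1e-8) -> torch.Tensor:
    """Standardize observations with stored running statistics.

    eps is an assumed value that guards against division by zero.
    """
    return (obs - mean) / torch.sqrt(var + eps)


# Dummy statistics standing in for obs_rms_mean / obs_rms_var.
mean = torch.full((32,), 2.0)
var = torch.full((32,), 4.0)
obs = torch.full((1, 32), 4.0)
normalized = normalize_obs(obs, mean, var)
```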

## How to Use

### Download

```python
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="DavidH2802/PPO-from-scratch",
    filename="final_policy.pt",
)
```

### Inference

Running the policy requires the model and environment code from the GitHub repository, so clone the full project first:

```bash
git clone https://github.com/DavidH2802/PPO-from-scratch.git
cd PPO-from-scratch
```

### Full Evaluation with Isaac Lab

See the [GitHub repository](https://github.com/DavidH2802/PPO-from-scratch) for complete setup instructions including Isaac Lab installation and the `eval.py` script for video recording.

## Checkpoint Contents

The `final_policy.pt` file contains:

| Key | Description |
|---|---|
| `actor` | Actor network state dict |
| `critic` | Critic network state dict |
| `obs_rms_mean` | Running mean for observation normalization |
| `obs_rms_var` | Running variance for observation normalization |
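
A quick way to confirm that a downloaded file matches this table is to load it on the CPU and check the keys. This is a sketch under assumptions: `load_checkpoint` is a hypothetical helper, not part of the repository, and the value types stored under each key are not documented here. On recent PyTorch versions, `torch.load` defaults to `weights_only=True`, which may need adjusting if the statistics were saved as non-tensor objects.

```python
import torch

# Keys listed in the table above.
EXPECTED_KEYS = ("actor", "critic", "obs_rms_mean", "obs_rms_var")


def load_checkpoint(path):
    """Load the checkpoint on CPU and verify the documented keys are present."""
    ckpt = torch.load(path, map_location="cpu")
    missing = [key for key in EXPECTED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    return ckpt
```

Calling `load_checkpoint(checkpoint_path)` with the path returned by `hf_hub_download` should return a dict from which the actor weights and normalization statistics can be read.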

## Framework

- **Algorithm:** PPO (from scratch, no RL library dependencies)
- **Deep Learning:** PyTorch
- **Simulation:** NVIDIA Isaac Lab 2.0 / Isaac Sim 4.5
- **Environment:** Isaac-Reach-Franka-v0

## Citation

```bibtex
@misc{habinski2026ppo,
  author = {David Habinski},
  title = {PPO from Scratch in PyTorch with Isaac Lab},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/DavidH2802/PPO-from-scratch}
}
```

## License

MIT