---
license: mit
tags:
- reinforcement-learning
- ppo
- pytorch
- isaac-lab
- robotics
- franka
library_name: pytorch
model-index:
- name: PPO-Franka-Reach
  results: []
---
# PPO-Franka-Reach
A Proximal Policy Optimization (PPO) policy trained with a from-scratch PyTorch implementation on the `Isaac-Reach-Franka-v0` task, using NVIDIA Isaac Lab with 4096 GPU-parallel environments.
**GitHub Repository:** [DavidH2802/PPO-from-scratch](https://github.com/DavidH2802/PPO-from-scratch)
## Model Description
The model is a diagonal Gaussian policy (Actor) that controls a 7-DOF Franka Emika robot arm to reach a randomly spawned target position in 3D space. The policy outputs continuous joint-level actions.
### Architecture
- **Actor:** MLP (obs → 256 → 256 → act_dim) with Tanh activations, orthogonal initialization, and a learnable log-std parameter
- **Critic:** MLP (obs → 256 → 256 → 1) with Tanh activations and orthogonal initialization (included in checkpoint but not needed for inference)
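The architecture listed above can be sketched as follows. This is a minimal illustration, not the repository's code: the names `Actor`, `Critic`, `make_mlp`, and `init_linear` are hypothetical, and the orthogonal gains (`sqrt(2)` for hidden layers, `0.01` for the action head, `1.0` for the value head) and the zero-initialized log-std are common PPO defaults assumed here. Because the attribute names are made up, the checkpoint's state dicts will most likely not load into these classes without renaming keys.

```python
import math

import torch
import torch.nn as nn
from torch.distributions import Normal


def init_linear(layer: nn.Linear, gain: float) -> nn.Linear:
    """Orthogonal weights and zero bias (gain values are assumptions)."""
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.zeros_(layer.bias)
    return layer


def make_mlp(in_dim: int, out_dim: int, out_gain: float, hidden: int = 256) -> nn.Sequential:
    """in_dim -> 256 -> 256 -> out_dim with Tanh after each hidden layer."""
    return nn.Sequential(
        init_linear(nn.Linear(in_dim, hidden), math.sqrt(2)),
        nn.Tanh(),
        init_linear(nn.Linear(hidden, hidden), math.sqrt(2)),
        nn.Tanh(),
        init_linear(nn.Linear(hidden, out_dim), out_gain),
    )


class Actor(nn.Module):
    """Diagonal Gaussian policy: MLP mean plus a state-independent log-std."""

    def __init__(self, obs_dim: int = 32, act_dim: int = 7):
        super().__init__()
        self.mean_net = make_mlp(obs_dim, act_dim, out_gain=0.01)
        # One learnable log-std per action dimension, shared across all states.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> Normal:
        mean = self.mean_net(obs)
        return Normal(mean, self.log_std.exp().expand_as(mean))


class Critic(nn.Module):
    """State-value estimator with the same hidden layout as the actor."""

    def __init__(self, obs_dim: int = 32):
        super().__init__()
        self.value_net = make_mlp(obs_dim, 1, out_gain=1.0)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.value_net(obs).squeeze(-1)


actor, critic = Actor(), Critic()
obs = torch.zeros(4, 32)                      # batch of 4 dummy observations
dist = actor(obs)
action = dist.sample()                        # shape (4, 7)
log_prob = dist.log_prob(action).sum(dim=-1)  # sum over independent action dims
value = critic(obs)                           # shape (4,)
```

Summing the per-dimension log-probabilities is what makes the Gaussian "diagonal": each of the 7 action dimensions is treated as independent given the observation.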
### Observation and Action Space
- **Observations:** 32-dimensional vector (joint positions, joint velocities, end-effector position, target position)
- **Actions:** 7-dimensional continuous (joint position targets)
## Training Details
### Hyperparameters
| Parameter | Value |
|---|---|
| Task | Isaac-Reach-Franka-v0 |
| Parallel Envs | 4096 |
| Learning Rate | 3e-4 |
| Discount (γ) | 0.99 |
| GAE (λ) | 0.95 |
| Clip (ε) | 0.2 |
| Epochs per Update | 4 |
| Minibatch Size | 2048 |
| Horizon | 32 |
| Total Iterations | 500 |
| Total Env Steps | 65.5M |
| Training Time | ~48 minutes |
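The step counts in the table follow from a few multiplications, worked through below. The variable names are illustrative, and the minibatch and gradient-step figures assume that each epoch is one full pass over the rollout batch, split into non-overlapping minibatches, which the table does not state explicitly. The throughput figure is only as precise as the approximate 48-minute training time.

```python
num_envs = 4096
horizon = 32
iterations = 500
minibatch_size = 2048
epochs = 4

# Transitions collected per PPO update (one rollout across all envs).
samples_per_update = num_envs * horizon                       # 131072

# Total environment steps over the whole run (the table's 65.5M).
total_env_steps = samples_per_update * iterations             # 65536000

# Assumption: each epoch sweeps the full batch in disjoint minibatches.
minibatches_per_epoch = samples_per_update // minibatch_size  # 64
grad_steps_per_update = minibatches_per_epoch * epochs        # 256
total_grad_steps = grad_steps_per_update * iterations         # 128000

# Rough throughput, using the approximate 48-minute wall-clock time.
steps_per_second = total_env_steps / (48 * 60)                # about 22.8k

print(f"{total_env_steps / 1e6:.1f}M env steps, {total_grad_steps} gradient steps")
```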
### Hardware
- **GPU:** NVIDIA RTX 4070 SUPER (12 GB VRAM)
- **CPU:** Intel Xeon E5-2673 v4
- **Cloud:** vast.ai
### Training Curves
#### Reward
The reward starts negative (the arm begins far from the target) and converges to a small positive value (roughly 0.03 to 0.05) as the policy learns to reach the target.
### Observation Normalization
The checkpoint includes running mean and variance statistics for observation normalization. These **must** be restored at inference time. Without them, the policy receives unnormalized inputs and will not perform correctly.
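A minimal sketch of how such statistics are typically applied before the observation reaches the actor. The function name `normalize_obs`, the epsilon, and the clipping range are assumptions, not values confirmed by the repository; check the project code for the exact formula used during training, since inference has to match it.

```python
import torch


def normalize_obs(
    obs: torch.Tensor,
    mean: torch.Tensor,
    var: torch.Tensor,
    eps: float = 1e-8,   # assumed value, guards against division by zero
    clip: float = 10.0,  # assumed clipping range
) -> torch.Tensor:
    """Standardize observations with running statistics, then clip outliers."""
    return torch.clamp((obs - mean) / torch.sqrt(var + eps), -clip, clip)


# Synthetic statistics stand in for the checkpoint's running mean and variance.
mean = torch.full((32,), 0.5)
var = torch.full((32,), 4.0)
raw_obs = torch.full((2, 32), 2.5)               # batch of 2 dummy observations
normalized = normalize_obs(raw_obs, mean, var)   # (2.5 - 0.5) / 2 = 1.0 everywhere
```

The statistics broadcast over the batch dimension, so the same call works for a single observation or for all parallel environments at once.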
## How to Use
### Download
```python
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="DavidH2802/PPO-from-scratch",
    filename="final_policy.pt",
)
```
### Inference
Running the policy requires the model and environment code from the repository, so clone the full project:
```bash
git clone https://github.com/DavidH2802/PPO-from-scratch.git
cd PPO-from-scratch
```
### Full Evaluation with Isaac Lab
See the [GitHub repository](https://github.com/DavidH2802/PPO-from-scratch) for complete setup instructions including Isaac Lab installation and the `eval.py` script for video recording.
## Checkpoint Contents
The `final_policy.pt` file contains:
| Key | Description |
|---|---|
| `actor` | Actor network state dict |
| `critic` | Critic network state dict |
| `obs_rms_mean` | Running mean for observation normalization |
| `obs_rms_var` | Running variance for observation normalization |
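The layout in the table can be checked with a short round trip. The sketch below builds a stand-in dictionary with the same four keys, saves it, and reloads it on CPU, so it runs without downloading anything. The stand-in values are assumptions: a tiny `nn.Linear` state dict replaces the real network weights, and the statistics are assumed to be stored as 32-element tensors. For the real file, pass the downloaded checkpoint path to `torch.load` instead.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in checkpoint with the documented keys (values are placeholders).
fake_checkpoint = {
    "actor": nn.Linear(32, 7).state_dict(),
    "critic": nn.Linear(32, 1).state_dict(),
    "obs_rms_mean": torch.zeros(32),
    "obs_rms_var": torch.ones(32),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_policy.pt")
    torch.save(fake_checkpoint, path)
    # map_location="cpu" lets a GPU-trained checkpoint load on a CPU-only machine.
    checkpoint = torch.load(path, map_location="cpu")

# Verify that every documented key is present before using the checkpoint.
expected_keys = {"actor", "critic", "obs_rms_mean", "obs_rms_var"}
missing = expected_keys - set(checkpoint)
if missing:
    raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")

for key, value in checkpoint.items():
    kind = "state dict" if isinstance(value, dict) else f"tensor {tuple(value.shape)}"
    print(f"{key}: {kind}")
```

Only `actor`, `obs_rms_mean`, and `obs_rms_var` are needed to run the policy. The `critic` entry matters only when resuming training.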
## Framework
- **Algorithm:** PPO (from scratch, no RL library dependencies)
- **Deep Learning:** PyTorch
- **Simulation:** NVIDIA Isaac Lab 2.0 / Isaac Sim 4.5
- **Environment:** Isaac-Reach-Franka-v0
## Citation
```bibtex
@misc{habinski2026ppo,
  author = {David Habinski},
  title = {PPO from Scratch in PyTorch with Isaac Lab},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/DavidH2802/PPO-from-scratch}
}
```
## License
MIT