---
license: mit
tags:
- reinforcement-learning
- ppo
- pytorch
- isaac-lab
- robotics
- franka
library_name: pytorch
model-index:
- name: PPO-Franka-Reach
  results: []
---

# PPO-Franka-Reach

A Proximal Policy Optimization (PPO) policy, implemented from scratch in PyTorch and trained on the `Isaac-Reach-Franka-v0` task in NVIDIA Isaac Lab with 4096 GPU-parallel environments.

**GitHub Repository:** [DavidH2802/PPO-from-scratch](https://github.com/DavidH2802/PPO-from-scratch)

<p align="center">
<img src="franka_reach.gif" alt="Franka Reach Policy" width="480"/>
</p>

## Model Description

The model is a diagonal Gaussian policy (Actor) that controls a 7-DOF Franka Emika robot arm to reach a randomly spawned target position in 3D space. The policy outputs continuous joint-level actions.

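As a rough illustration of what "diagonal Gaussian policy" means in practice, the sketch below samples a 7-dimensional action from per-dimension Normal distributions and sums the per-dimension log-probabilities. The function name `sample_action` and the `deterministic` flag are hypothetical and not taken from the repository.

```python
import torch
from torch.distributions import Normal


def sample_action(mean: torch.Tensor, log_std: torch.Tensor, deterministic: bool = False):
    """Sample from a diagonal Gaussian and return (action, log_prob).

    mean:    (batch, act_dim) mean produced by the actor network
    log_std: (act_dim,) state-independent log standard deviation
    """
    # One independent Normal per action dimension; log_std broadcasts over the batch.
    dist = Normal(mean, log_std.exp())
    action = mean if deterministic else dist.sample()
    # Independent dimensions: the joint log-probability is the sum over act_dim.
    log_prob = dist.log_prob(action).sum(dim=-1)
    return action, log_prob


mean = torch.zeros(4, 7)   # 4 environments, 7 joint targets (illustrative values)
log_std = torch.zeros(7)   # std = 1 in every dimension (illustrative values)
action, log_prob = sample_action(mean, log_std)
print(action.shape, log_prob.shape)
```

For evaluation it is common to act on the mean (`deterministic=True`) rather than sampling, although whether the repository does so is not stated here.
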
### Architecture

- **Actor:** MLP (obs → 256 → 256 → act_dim) with Tanh activations, orthogonal initialization, and a learnable log-std parameter
- **Critic:** MLP (obs → 256 → 256 → 1) with Tanh activations and orthogonal initialization (included in checkpoint but not needed for inference)

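The two networks described above can be sketched in PyTorch as follows. This is a minimal sketch, not the repository's code: the class and helper names, the initialization gains (√2 for hidden layers, 0.01 for the policy head, 1.0 for the value head), and the zero-initialized log-std are assumptions based on common PPO practice, so the resulting state-dict keys may not match the checkpoint.

```python
import torch
import torch.nn as nn


def ortho_linear(in_dim: int, out_dim: int, gain: float) -> nn.Linear:
    """Linear layer with orthogonal weights and zero bias (gain is an assumption)."""
    layer = nn.Linear(in_dim, out_dim)
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.zeros_(layer.bias)
    return layer


class Actor(nn.Module):
    """obs -> 256 -> 256 -> act_dim with Tanh, plus a learnable log-std vector."""

    def __init__(self, obs_dim: int = 32, act_dim: int = 7, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            ortho_linear(obs_dim, hidden, gain=2 ** 0.5), nn.Tanh(),
            ortho_linear(hidden, hidden, gain=2 ** 0.5), nn.Tanh(),
            ortho_linear(hidden, act_dim, gain=0.01),
        )
        # State-independent log standard deviation, one entry per action dimension.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # mean of the Gaussian policy


class Critic(nn.Module):
    """obs -> 256 -> 256 -> 1 with Tanh; only needed during training."""

    def __init__(self, obs_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            ortho_linear(obs_dim, hidden, gain=2 ** 0.5), nn.Tanh(),
            ortho_linear(hidden, hidden, gain=2 ** 0.5), nn.Tanh(),
            ortho_linear(hidden, 1, gain=1.0),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # state-value estimate, shape (batch, 1)


actor, critic = Actor(), Critic()
obs = torch.randn(5, 32)
print(actor(obs).shape, critic(obs).shape)
```
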
### Observation and Action Space

- **Observations:** 32-dimensional vector (joint positions, joint velocities, end-effector position, target position)
- **Actions:** 7-dimensional continuous (joint position targets)

## Training Details

### Hyperparameters

| Parameter | Value |
|---|---|
| Task | Isaac-Reach-Franka-v0 |
| Parallel Envs | 4096 |
| Learning Rate | 3e-4 |
| Discount (γ) | 0.99 |
| GAE (λ) | 0.95 |
| Clip (ε) | 0.2 |
| Epochs per Update | 4 |
| Minibatch Size | 2048 |
| Horizon | 32 |
| Total Iterations | 500 |
| Total Env Steps | 65.5M |
| Training Time | ~48 minutes |

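The step budget in the table follows directly from the other entries. The arithmetic below reproduces it and derives the number of gradient steps, assuming each epoch sweeps the full rollout batch once in non-overlapping minibatches (an assumption about the training loop, not something the table states). The throughput figure is only as precise as the approximate training time.

```python
num_envs = 4096
horizon = 32
iterations = 500
epochs = 4
minibatch_size = 2048
training_minutes = 48  # approximate, from the table

# Transitions collected per PPO iteration (one rollout across all environments).
samples_per_iteration = num_envs * horizon
# Total environment steps over the whole run.
total_env_steps = samples_per_iteration * iterations
# Minibatches needed to cover one rollout batch once.
minibatches_per_epoch = samples_per_iteration // minibatch_size
# Optimizer steps per iteration and over the whole run (under the assumption above).
gradient_steps_per_iteration = epochs * minibatches_per_epoch
total_gradient_steps = gradient_steps_per_iteration * iterations
# Rough end-to-end throughput, including update time.
steps_per_second = total_env_steps / (training_minutes * 60)

print(samples_per_iteration)          # 131072
print(total_env_steps)                # 65536000, i.e. about 65.5M
print(minibatches_per_epoch)          # 64
print(gradient_steps_per_iteration)   # 256
print(total_gradient_steps)           # 128000
print(round(steps_per_second))        # 22756
```
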
### Hardware

- **GPU:** NVIDIA RTX 4070 SUPER (12 GB VRAM)
- **CPU:** Intel Xeon E5-2673 v4
- **Cloud:** vast.ai

### Training Curves

#### Reward

The agent starts with negative reward (arm far from target) and converges to a small positive reward (roughly 0.03 to 0.05) as it learns to reach the target.

### Observation Normalization

The checkpoint includes running mean and variance statistics for observation normalization. These **must** be restored at inference time; without them, the policy receives unnormalized inputs and will not perform correctly.

## How to Use

### Download

```python
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="DavidH2802/PPO-from-scratch",
    filename="final_policy.pt",
)
```

### Inference

Clone the full project for the model and environment code:

```bash
git clone https://github.com/DavidH2802/PPO-from-scratch.git
cd PPO-from-scratch
```

### Full Evaluation with Isaac Lab

See the [GitHub repository](https://github.com/DavidH2802/PPO-from-scratch) for complete setup instructions including Isaac Lab installation and the `eval.py` script for video recording.

## Checkpoint Contents

The `final_policy.pt` file contains:

| Key | Description |
|---|---|
| `actor` | Actor network state dict |
| `critic` | Critic network state dict |
| `obs_rms_mean` | Running mean for observation normalization |
| `obs_rms_var` | Running variance for observation normalization |

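A quick way to check a downloaded file against the table above is to load it and verify that the four keys are present. This is a hedged sketch: `load_checkpoint` is a hypothetical helper, and it assumes the file is a plain dictionary saved with `torch.save`. If the normalization statistics were stored as NumPy arrays rather than tensors, recent PyTorch versions may need `weights_only=False` in `torch.load`, which should only be used with files you trust.

```python
import torch

EXPECTED_KEYS = ("actor", "critic", "obs_rms_mean", "obs_rms_var")


def load_checkpoint(path: str) -> dict:
    """Load the checkpoint on CPU and verify the documented keys exist."""
    ckpt = torch.load(path, map_location="cpu")
    missing = [key for key in EXPECTED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    return ckpt


# Usage, given a local path to final_policy.pt:
#   ckpt = load_checkpoint(checkpoint_path)
#   actor_state = ckpt["actor"]  # pass to load_state_dict on the repository's Actor
#   obs_mean, obs_var = ckpt["obs_rms_mean"], ckpt["obs_rms_var"]
```
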
## Framework

- **Algorithm:** PPO (from scratch, no RL library dependencies)
- **Deep Learning:** PyTorch
- **Simulation:** NVIDIA Isaac Lab 2.0 / Isaac Sim 4.5
- **Environment:** Isaac-Reach-Franka-v0

## Citation

```bibtex
@misc{habinski2026ppo,
  author = {David Habinski},
  title = {PPO from Scratch in PyTorch with Isaac Lab},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/DavidH2802/PPO-from-scratch}
}
```

## License

MIT