---
license: mit
tags:
- reinforcement-learning
- ppo
- pytorch
- isaac-lab
- robotics
- franka
library_name: pytorch
model-index:
- name: PPO-Franka-Reach
  results: []
---

# PPO-Franka-Reach

A Proximal Policy Optimization (PPO) policy trained from scratch in PyTorch on the `Isaac-Reach-Franka-v0` task using NVIDIA Isaac Lab with 4096 GPU-parallel environments.

**GitHub Repository:** [DavidH2802/PPO-from-scratch](https://github.com/DavidH2802/PPO-from-scratch)


## Model Description

The model is a diagonal Gaussian policy (Actor) that controls a 7-DOF Franka Emika robot arm to reach a randomly spawned target position in 3D space. The policy outputs continuous joint-level actions.

### Architecture

- **Actor:** MLP (obs → 256 → 256 → act_dim) with Tanh activations, orthogonal initialization, and a learnable log-std parameter
- **Critic:** MLP (obs → 256 → 256 → 1) with Tanh activations and orthogonal initialization (included in the checkpoint but not needed for inference)

### Observation and Action Space

- **Observations:** 32-dimensional vector (joint positions, joint velocities, end-effector position, target position)
- **Actions:** 7-dimensional continuous (joint position targets)

## Training Details

### Hyperparameters

| Parameter | Value |
|---|---|
| Task | Isaac-Reach-Franka-v0 |
| Parallel Envs | 4096 |
| Learning Rate | 3e-4 |
| Discount (γ) | 0.99 |
| GAE (λ) | 0.95 |
| Clip (ε) | 0.2 |
| Epochs per Update | 4 |
| Minibatch Size | 2048 |
| Horizon | 32 |
| Total Iterations | 500 |
| Total Env Steps | 65.5M |
| Training Time | ~48 minutes |

### Hardware

- **GPU:** NVIDIA RTX 4070 SUPER (12 GB VRAM)
- **CPU:** Intel Xeon E5-2673 v4
- **Cloud:** vast.ai

### Training Curves

#### Reward

The agent starts with negative reward (arm far from target) and converges to positive reward (~0.03-0.05) as it learns to reach the target.

#### Observation Normalization

The checkpoint includes running mean and variance statistics for observation normalization. These **must** be restored at inference time; without them, the policy receives unnormalized inputs and will not perform correctly.

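
As a rough illustration of the normalization step described above, the sketch below standardizes a batch of observations with stored running statistics and passes the result through an actor shaped like the one in the Architecture section. Several details are assumptions rather than facts from the repository: the `Actor` class and its layer names are hypothetical (so the real `actor` state dict may not load into it), the `eps` value and the orthogonal-init gain are guesses, and the statistics here are synthetic stand-ins for the `obs_rms_mean` and `obs_rms_var` tensors stored in the checkpoint.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 32, 7  # observation and action sizes stated in this card


class Actor(nn.Module):
    """Hypothetical re-creation of the described actor (obs -> 256 -> 256 -> act_dim).

    The repository's class and parameter names may differ.
    """

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        for layer in self.net:
            if isinstance(layer, nn.Linear):
                nn.init.orthogonal_(layer.weight)  # default gain; the training gain is unknown
                nn.init.zeros_(layer.bias)
        # State-independent log standard deviation of the diagonal Gaussian.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # mean of the action distribution


def normalize_obs(obs: torch.Tensor, mean: torch.Tensor, var: torch.Tensor,
                  eps: float = 1e-8) -> torch.Tensor:
    """Standardize observations with running statistics (eps is an assumed value)."""
    return (obs - mean) / torch.sqrt(var + eps)


# Synthetic stand-in for the loaded checkpoint; only the two key names come from this card.
checkpoint = {
    "obs_rms_mean": torch.full((OBS_DIM,), 0.5),
    "obs_rms_var": torch.full((OBS_DIM,), 4.0),
}

obs = torch.ones(8, OBS_DIM)  # batch of 8 fake observations
obs_norm = normalize_obs(obs, checkpoint["obs_rms_mean"], checkpoint["obs_rms_var"])

actor = Actor(OBS_DIM, ACT_DIM).eval()
with torch.no_grad():
    action = actor(obs_norm)  # deterministic action: the Gaussian mean, no sampling

print(obs_norm.shape, action.shape)
```

With the real checkpoint, the synthetic dictionary would be replaced by the loaded file, and the normalization should match whatever the training code applied (for example, any clipping of normalized observations), which this sketch does not attempt to reproduce.
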
## How to Use

### Download

```python
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="DavidH2802/PPO-from-scratch",
    filename="final_policy.pt",
)
```

### Inference

Clone the full project for the model and environment code:

```bash
git clone https://github.com/DavidH2802/PPO-from-scratch.git
cd PPO-from-scratch
```

### Full Evaluation with Isaac Lab

See the [GitHub repository](https://github.com/DavidH2802/PPO-from-scratch) for complete setup instructions, including Isaac Lab installation and the `eval.py` script for video recording.

## Checkpoint Contents

The `final_policy.pt` file contains:

| Key | Description |
|---|---|
| `actor` | Actor network state dict |
| `critic` | Critic network state dict |
| `obs_rms_mean` | Running mean for observation normalization |
| `obs_rms_var` | Running variance for observation normalization |

## Framework

- **Algorithm:** PPO (from scratch, no RL library dependencies)
- **Deep Learning:** PyTorch
- **Simulation:** NVIDIA Isaac Lab 2.0 / Isaac Sim 4.5
- **Environment:** Isaac-Reach-Franka-v0

## Citation

```bibtex
@misc{habinski2026ppo,
  author = {David Habinski},
  title = {PPO from Scratch in PyTorch with Isaac Lab},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/DavidH2802/PPO-from-scratch}
}
```

## License

MIT