	---
	license: mit
	tags:
	- reinforcement-learning
	- ppo
	- pytorch
	- isaac-lab
	- robotics
	- franka
	library_name: pytorch
	model-index:
	- name: PPO-Franka-Reach
	  results: []
	---

	# PPO-Franka-Reach

	A Proximal Policy Optimization (PPO) policy trained from scratch in PyTorch on the `Isaac-Reach-Franka-v0` task using NVIDIA Isaac Lab with 4096 GPU-parallel environments.

	GitHub Repository: [DavidH2802/PPO-from-scratch](https://github.com/DavidH2802/PPO-from-scratch)

	<p align="center">
	<img src="franka_reach.gif" alt="Franka Reach Policy" width="480"/>
	</p>

	## Model Description

	The model is a diagonal Gaussian policy (Actor) that controls a 7-DOF Franka Emika robot arm to reach a randomly spawned target position in 3D space. The policy outputs continuous joint-level actions.

	### Architecture

	- Actor: MLP (obs → 256 → 256 → act_dim) with Tanh activations, orthogonal initialization, and a learnable log-std parameter
	- Critic: MLP (obs → 256 → 256 → 1) with Tanh activations and orthogonal initialization (included in checkpoint but not needed for inference)
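
As a rough illustration of the actor described above, here is a minimal PyTorch sketch. The class name `ActorSketch`, the helper `ortho_linear`, the orthogonal gains (√2 for hidden layers, 0.01 for the output layer), and the zero initial log-std are assumptions based on common PPO practice, not values taken from the repository. Its parameter names will not necessarily match the checkpoint's state dict.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


def ortho_linear(in_dim: int, out_dim: int, gain: float) -> nn.Linear:
    """Linear layer with orthogonal weights and zero bias (gain is an assumption)."""
    layer = nn.Linear(in_dim, out_dim)
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.zeros_(layer.bias)
    return layer


class ActorSketch(nn.Module):
    """Hypothetical diagonal Gaussian actor: obs -> 256 -> 256 -> act_dim with Tanh."""

    def __init__(self, obs_dim: int = 32, act_dim: int = 7, hidden: int = 256):
        super().__init__()
        self.mean_net = nn.Sequential(
            ortho_linear(obs_dim, hidden, gain=2 ** 0.5),
            nn.Tanh(),
            ortho_linear(hidden, hidden, gain=2 ** 0.5),
            nn.Tanh(),
            ortho_linear(hidden, act_dim, gain=0.01),
        )
        # State-independent, learnable log standard deviation (one value per action dim).
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> Normal:
        mean = self.mean_net(obs)
        # The (act_dim,) std broadcasts against the (batch, act_dim) mean.
        return Normal(mean, self.log_std.exp())


actor = ActorSketch()
dist = actor(torch.zeros(4, 32))          # batch of 4 dummy observations
action = dist.sample()                    # shape (4, 7)
log_prob = dist.log_prob(action).sum(-1)  # joint log-prob over the 7 action dims
```

Summing the per-dimension log-probabilities is what makes the Gaussian "diagonal": each action dimension is treated as independent given the observation.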

	### Observation and Action Space

	- Observations: 32-dimensional vector (joint positions, joint velocities, end-effector position, target position)
	- Actions: 7-dimensional continuous (joint position targets)

	## Training Details

	### Hyperparameters

	| Parameter | Value |
	|---|---|
	| Task | Isaac-Reach-Franka-v0 |
	| Parallel Envs | 4096 |
	| Learning Rate | 3e-4 |
	| Discount (γ) | 0.99 |
	| GAE (λ) | 0.95 |
	| Clip (ε) | 0.2 |
	| Epochs per Update | 4 |
	| Minibatch Size | 2048 |
	| Horizon | 32 |
	| Total Iterations | 500 |
	| Total Env Steps | 65.5M |
	| Training Time | ~48 minutes |
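
The step count in the table follows directly from the other entries. The short calculation below assumes that each iteration collects `horizon` steps from every environment and that each epoch sweeps the full rollout batch once; the variable names are illustrative only.

```python
num_envs = 4096
horizon = 32
iterations = 500
minibatch_size = 2048
epochs = 4

batch_size = num_envs * horizon                       # 131072 transitions per iteration
total_env_steps = batch_size * iterations             # 65536000, i.e. ~65.5M
minibatches_per_epoch = batch_size // minibatch_size  # 64
gradient_steps_per_iteration = minibatches_per_epoch * epochs  # 256

print(batch_size, total_env_steps, minibatches_per_epoch, gradient_steps_per_iteration)
```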

	### Hardware

	- GPU: NVIDIA RTX 4070 SUPER (12 GB VRAM)
	- CPU: Intel Xeon E5-2673 v4
	- Cloud: vast.ai

	### Training Curves

	#### Reward

	The agent starts with negative reward (arm far from the target) and converges to a small positive reward (roughly 0.03 to 0.05) as it learns to reach the target.

	#### Observation Normalization

	The checkpoint includes running mean and variance statistics for observation normalization. These must be restored at inference time. Without them, the policy receives unnormalized inputs and will not perform correctly.
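
A minimal sketch of how such statistics are typically applied before the policy forward pass is shown below. The function name `normalize_obs`, the `eps` value, and the optional clipping are assumptions; the repository's own normalizer may differ in these details.

```python
import torch


def normalize_obs(obs, mean, var, eps=1e-8, clip=None):
    """Standardize observations with stored running statistics.

    `eps` and `clip` are illustrative assumptions, not values from the repository.
    """
    # Accept tensors or arrays for the stored statistics.
    mean = torch.as_tensor(mean, dtype=obs.dtype, device=obs.device)
    var = torch.as_tensor(var, dtype=obs.dtype, device=obs.device)
    normed = (obs - mean) / torch.sqrt(var + eps)
    if clip is not None:
        normed = torch.clamp(normed, -clip, clip)
    return normed


# Toy example with a 2-dimensional observation.
obs = torch.tensor([[2.0, 4.0]])
normed = normalize_obs(obs, mean=[1.0, 2.0], var=[1.0, 4.0])
```

With the real checkpoint, `mean` and `var` would come from the `obs_rms_mean` and `obs_rms_var` entries.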

	## How to Use

	### Download

	```python
	from huggingface_hub import hf_hub_download

	checkpoint_path = hf_hub_download(
	    repo_id="DavidH2802/PPO-from-scratch",
	    filename="final_policy.pt",
	)
	```

	### Inference

	Clone the full project for the model and environment code:

	```bash
	git clone https://github.com/DavidH2802/PPO-from-scratch.git
	cd PPO-from-scratch
	```

	### Full Evaluation with Isaac Lab

	See the [GitHub repository](https://github.com/DavidH2802/PPO-from-scratch) for complete setup instructions including Isaac Lab installation and the `eval.py` script for video recording.

	## Checkpoint Contents

	The `final_policy.pt` file contains:

	| Key | Description |
	|---|---|
	| `actor` | Actor network state dict |
	| `critic` | Critic network state dict |
	| `obs_rms_mean` | Running mean for observation normalization |
	| `obs_rms_var` | Running variance for observation normalization |
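
To check a downloaded file against this table, something like the following sketch may help. The helper names `summarize_checkpoint` and `load_checkpoint` are hypothetical. The sketch assumes the file is a plain dict saved with `torch.save`. Depending on the PyTorch version and on how the statistics were stored, `torch.load` may need `weights_only=False`, which should only be used for files you trust.

```python
from typing import Any, Dict

EXPECTED_KEYS = ("actor", "critic", "obs_rms_mean", "obs_rms_var")


def summarize_checkpoint(ckpt: Dict[str, Any]) -> Dict[str, Any]:
    """Return the shapes stored under each documented key, or raise if one is missing."""
    missing = [key for key in EXPECTED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    summary: Dict[str, Any] = {}
    for key in EXPECTED_KEYS:
        value = ckpt[key]
        if isinstance(value, dict):
            # State dicts map parameter names to tensors.
            summary[key] = {name: tuple(t.shape) for name, t in value.items()}
        else:
            summary[key] = tuple(value.shape)
    return summary


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Load the checkpoint on CPU (torch is imported lazily, only when loading)."""
    import torch

    return torch.load(path, map_location="cpu")
```

Usage would be along the lines of `summarize_checkpoint(load_checkpoint(checkpoint_path))`, where `checkpoint_path` is the path returned by `hf_hub_download`.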

	## Framework

	- Algorithm: PPO (from scratch, no RL library dependencies)
	- Deep Learning: PyTorch
	- Simulation: NVIDIA Isaac Lab 2.0 / Isaac Sim 4.5
	- Environment: Isaac-Reach-Franka-v0
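
For readers unfamiliar with PPO, the two core computations are generalized advantage estimation (GAE) and the clipped surrogate loss. The sketch below uses the γ, λ, and ε values from the hyperparameter table. It is a generic textbook formulation under assumed tensor shapes, not the repository's implementation, and the function names are illustrative.

```python
import torch


def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """GAE over a rollout. rewards, values, dones: (T, N); last_value: (N,)."""
    advantages = torch.zeros_like(rewards)
    gae = torch.zeros_like(last_value)
    next_value = last_value
    for t in reversed(range(rewards.shape[0])):
        not_done = 1.0 - dones[t]  # stop bootstrapping across episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values  # regression targets for the critic
    return advantages, returns


def clipped_policy_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, negated so it can be minimized."""
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


# Tiny example: 2 steps, 1 environment, constant reward, zero value estimates.
adv, ret = compute_gae(
    rewards=torch.ones(2, 1),
    values=torch.zeros(2, 1),
    last_value=torch.zeros(1),
    dones=torch.zeros(2, 1),
)
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the probability ratio outside `[1 - ε, 1 + ε]` in the direction the advantage favors.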

	## Citation

	```bibtex
	@misc{habinski2026ppo,
	  author = {David Habinski},
	  title = {PPO from Scratch in PyTorch with Isaac Lab},
	  year = {2026},
	  publisher = {GitHub},
	  url = {https://github.com/DavidH2802/PPO-from-scratch}
	}
	```

	## License

	MIT