---
library_name: stable-baselines3
tags:
- PandaReachDense-v3
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
- robotics
- panda-gym
model-index:
- name: A2C
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: PandaReachDense-v3
      type: PandaReachDense-v3
    metrics:
    - type: mean_reward
      value: -0.20 +/- 0.09
      name: mean_reward
      verified: false
---

# **A2C** Agent playing **PandaReachDense-v3**

This is a trained model of an **A2C** agent playing **PandaReachDense-v3**
using the [stable-baselines3 library](https://github.com/DLR-RM/stable-baselines3)
and the [Deep Reinforcement Learning Course](https://huggingface.co/deep-rl-course/unit6).

This environment is part of the [Panda-Gym](https://github.com/qgallouedec/panda-gym) suite of robotic manipulation tasks, in which the robot arm must move its end-effector to a target position.
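
For a quick look at the task itself, the environment can be instantiated directly with gymnasium. A minimal sketch, assuming `panda-gym` v3 and `gymnasium` are installed:

```python
import gymnasium as gym
import panda_gym  # noqa: F401 -- registers the Panda* environments

env = gym.make("PandaReachDense-v3")
obs, info = env.reset(seed=0)

# The observation is a dict with "observation", "achieved_goal", "desired_goal"
print(obs.keys())

# One random step; the dense reward approaches 0 as the gripper nears the goal
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print(reward)
env.close()
```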

## 📊 Evaluation Results

| Metric | Value |
|--------|-------|
| Mean Reward | -0.20 |
| Std Reward | 0.09 |
| **Score (mean - std)** | **-0.29** |
| Baseline Required | -3.5 |
| Evaluation Episodes | 20 |
| Status | ✅ **PASSED** |
| Model Source | Final Model |
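
These numbers can be reproduced with the `evaluate_policy` helper from stable-baselines3. A minimal sketch, assuming the `vec_normalize.pkl` and `a2c-PandaReachDense-v3` files from this repository are in the working directory, and that the pass criterion is `mean - std > -3.5` as in the table above:

```python
import panda_gym  # noqa: F401 -- registers the Panda* environments
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

# Rebuild the evaluation environment with the saved normalization statistics
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False     # freeze the running statistics
env.norm_reward = False  # report raw (unnormalized) rewards

model = A2C.load("a2c-PandaReachDense-v3", env=env)

mean_reward, std_reward = evaluate_policy(
    model, env, n_eval_episodes=20, deterministic=True
)
score = mean_reward - std_reward
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}, score={score:.2f}")
print("PASSED" if score > -3.5 else "FAILED")  # course baseline: -3.5
```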

## 🚀 Usage

```python
import panda_gym  # noqa: F401 -- registers the Panda* environments
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# Load environment and normalization statistics
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)

# ⚠️ CRITICAL: disable training mode and reward normalization at test time
env.training = False
env.norm_reward = False

# Load model
model = A2C.load("a2c-PandaReachDense-v3", env=env)

# Run inference (VecEnv API: step() returns 4 values; done is an array)
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```

## 🔧 Training Configuration

- **Algorithm**: A2C (Advantage Actor-Critic)
- **Policy**: MultiInputPolicy (for Dict observation spaces)
- **Environment**: PandaReachDense-v3
- **Total Timesteps**: 2,000,000
- **Number of Parallel Envs**: 64
- **Normalization**: VecNormalize (observation + reward)
- **Observation Clipping**: 10.0
- **Evaluation Frequency**: Every 500,000 steps
- **Checkpoint Frequency**: Every 500,000 steps
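
Below is a minimal training sketch consistent with this configuration. Hyperparameters not listed above (learning rate, `n_steps`, etc.) are assumed to be the stable-baselines3 defaults, and the evaluation/checkpoint callbacks are omitted for brevity:

```python
import panda_gym  # noqa: F401 -- registers the Panda* environments
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# 64 parallel environments, normalizing observations and rewards
env = make_vec_env("PandaReachDense-v3", n_envs=64)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

model = A2C("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=2_000_000)

# Save the policy weights and the normalization statistics together
model.save("a2c-PandaReachDense-v3")
env.save("vec_normalize.pkl")
```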

## 🤖 Model Architecture

The agent uses a **MultiInputPolicy** because the observation space is a dictionary containing:

- `observation`: Robot joint positions, velocities, and gripper state
- `desired_goal`: Target position coordinates (x, y, z)
- `achieved_goal`: Current end-effector position coordinates (x, y, z)

The goal is to minimize the distance between `achieved_goal` and `desired_goal`.
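
In the dense variant of the task, the per-step reward closely tracks the negative Euclidean distance between these two vectors. A small sketch of that relation (the exact reward is computed by panda-gym, so treat this as an illustration):

```python
import numpy as np
import gymnasium as gym
import panda_gym  # noqa: F401 -- registers the Panda* environments

env = gym.make("PandaReachDense-v3")
obs, info = env.reset(seed=0)

# Dense reward ~ -||achieved_goal - desired_goal||
distance = np.linalg.norm(obs["achieved_goal"] - obs["desired_goal"])
print(f"distance to goal: {distance:.3f}, expected reward approx {-distance:.3f}")
```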

## 📈 Performance Notes

- **Reward Range**: Episode returns typically range from -50 (far from target) to 0 (at target)
- **Success Criteria**: Achieving a mean reward above -3.5 consistently
- **Episode Length**: Usually 50 steps per episode
- **Convergence**: Expect improvement after 200k-500k steps

## 🎯 Tips for Reproduction

1. **Normalization is Critical**: Always use VecNormalize for robotic tasks
2. **MultiInputPolicy Required**: Dict observation spaces need special handling
3. **Sufficient Training**: 1M+ timesteps recommended for stable performance
4. **Evaluation**: Use `deterministic=True` for consistent evaluation results (see the comparison sketch below)
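
As an illustration of tip 4, the sketch below compares deterministic and stochastic action selection at evaluation time, reloading the model and normalization statistics with the file names used in the Usage section:

```python
import panda_gym  # noqa: F401 -- registers the Panda* environments
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False
env.norm_reward = False
model = A2C.load("a2c-PandaReachDense-v3", env=env)

# Deterministic evaluation is usually better and lower-variance
for det in (True, False):
    mean, std = evaluate_policy(model, env, n_eval_episodes=20, deterministic=det)
    print(f"deterministic={det}: {mean:.2f} +/- {std:.2f}")
```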