---
library_name: stable-baselines3
tags:
- PandaReachDense-v3
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
- robotics
- panda-gym
model-index:
- name: A2C
results:
- task:
type: reinforcement-learning
name: reinforcement-learning
dataset:
name: PandaReachDense-v3
type: PandaReachDense-v3
metrics:
- type: mean_reward
value: -0.20 +/- 0.09
name: mean_reward
verified: false
---
# A2C Agent playing PandaReachDense-v3

This is a trained model of an **A2C** agent playing **PandaReachDense-v3**
using the [stable-baselines3 library](https://github.com/DLR-RM/stable-baselines3)
and the [Deep Reinforcement Learning Course](https://huggingface.co/deep-rl-course/unit6).
This environment is part of the [Panda-Gym](https://github.com/qgallouedec/panda-gym) environments and includes robotic manipulation tasks where the robot arm needs to reach a target position.
## Evaluation Results
| Metric | Value |
|--------|-------|
| Mean Reward | -0.20 |
| Std Reward | 0.09 |
| **Score (mean - std)** | **-0.29** |
| Baseline Required | -3.5 |
| Evaluation Episodes | 20 |
| Status | **PASSED** |
| Model Source | Final Model |
## Usage
```python
import gymnasium as gym
import panda_gym
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize
# Load environment and normalization
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
# IMPORTANT: disable training updates and reward normalization at test time
env.training = False
env.norm_reward = False
# Load model
model = A2C.load("a2c-PandaReachDense-v3", env=env)
# Run inference
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```
## Training Configuration
- **Algorithm**: A2C (Advantage Actor-Critic)
- **Policy**: MultiInputPolicy (for Dict observation spaces)
- **Environment**: PandaReachDense-v3
- **Total Timesteps**: 2,000,000
- **Number of Parallel Envs**: 64
- **Normalization**: VecNormalize (observation + reward)
- **Observation Clipping**: 10.0
- **Evaluation Frequency**: Every 500,000 steps
- **Checkpoint Frequency**: Every 500,000 steps
## Model Architecture
The agent uses a **MultiInputPolicy** because the observation space is a dictionary containing:
- `observation`: Robot joint positions, velocities, and gripper state
- `desired_goal`: Target position coordinates (x, y, z)
- `achieved_goal`: Current end-effector position coordinates (x, y, z)
The goal is to minimize the distance between `achieved_goal` and `desired_goal`.
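In dense reach tasks like this one, the per-step reward is the negative Euclidean distance between the two goal fields. A minimal sketch, assuming that reward definition:

```python
import numpy as np


def dense_reward(achieved_goal: np.ndarray, desired_goal: np.ndarray) -> float:
    """Negative Euclidean distance: 0 at the target, more negative farther away."""
    return -float(np.linalg.norm(achieved_goal - desired_goal))


# End effector 5 cm from the target along x:
print(dense_reward(np.array([0.05, 0.0, 0.0]), np.array([0.0, 0.0, 0.0])))  # ≈ -0.05
```

This is why a mean reward near 0 (here, -0.20) indicates the arm consistently ends up close to the target.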
## Performance Notes
- **Reward Range**: Typically from -50 (far from target) to 0 (at target)
- **Success Criteria**: Achieving mean reward > -3.5 consistently
- **Episode Length**: Usually 50 steps per episode
- **Convergence**: Expect improvement after 200k-500k steps
## Tips for Reproduction
1. **Normalization is Critical**: Always use VecNormalize for robotic tasks
2. **MultiInputPolicy Required**: Dict observation spaces need special handling
3. **Sufficient Training**: 1M+ timesteps recommended for stable performance
4. **Evaluation**: Use deterministic=True for consistent evaluation results