---
library_name: stable-baselines3
tags:
- PandaReachDense-v3
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
- robotics
- panda-gym
model-index:
- name: A2C
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: PandaReachDense-v3
      type: PandaReachDense-v3
    metrics:
    - type: mean_reward
      value: -0.20 +/- 0.09
      name: mean_reward
      verified: false
---
# βœ… **A2C** Agent playing **PandaReachDense-v3**
This is a trained model of an **A2C** agent playing **PandaReachDense-v3**,
using the [stable-baselines3 library](https://github.com/DLR-RM/stable-baselines3)
and the [Deep Reinforcement Learning Course](https://huggingface.co/deep-rl-course/unit6).
This environment is part of the [Panda-Gym](https://github.com/qgallouedec/panda-gym) environments and includes robotic manipulation tasks where the robot arm needs to reach a target position.
## πŸ† Evaluation Results
| Metric | Value |
|--------|-------|
| Mean Reward | -0.20 |
| Std Reward | 0.09 |
| **Score (mean - std)** | **-0.29** |
| Baseline Required | -3.5 |
| Evaluation Episodes | 20 |
| Status | βœ… **PASSED** |
| Model Source | Final Model |
## πŸš€ Usage
```python
import gymnasium as gym
import panda_gym
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize
# Load environment and normalization statistics
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)

# ⚠️ CRITICAL: disable training mode and reward normalization at test time
env.training = False
env.norm_reward = False

# Load model
model = A2C.load("a2c-PandaReachDense-v3", env=env)

# Run inference (VecEnv API: step returns one entry per parallel env)
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done[0]:
        obs = env.reset()
```
## πŸ”§ Training Configuration
- **Algorithm**: A2C (Advantage Actor-Critic)
- **Policy**: MultiInputPolicy (for Dict observation spaces)
- **Environment**: PandaReachDense-v3
- **Total Timesteps**: 2,000,000
- **Number of Parallel Envs**: 64
- **Normalization**: VecNormalize (observation + reward)
- **Observation Clipping**: 10.0
- **Evaluation Frequency**: Every 500,000 steps
- **Checkpoint Frequency**: Every 500,000 steps
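The configuration above can be sketched as a minimal training script. This is an illustrative reconstruction, not the exact script used to train this checkpoint: hyperparameters beyond those listed (learning rate, n_steps, etc.) are assumed to be stable-baselines3 defaults, and the output file names are chosen to match the usage example.

```python
import panda_gym  # noqa: F401 -- importing registers the Panda environments
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# 64 parallel environments, as listed in the configuration
env = make_vec_env("PandaReachDense-v3", n_envs=64)

# Normalize observations and rewards; clip observations to +/-10
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

# MultiInputPolicy handles the Dict observation space
model = A2C("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=2_000_000)

# Save the policy AND the normalization statistics --
# inference cannot reproduce these results without vec_normalize.pkl
model.save("a2c-PandaReachDense-v3")
env.save("vec_normalize.pkl")
```

Saving `vec_normalize.pkl` is the step most often forgotten: the policy was trained on normalized observations, so loading it without the matching statistics degrades performance badly.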
## πŸ€– Model Architecture
The agent uses a **MultiInputPolicy** because the observation space is a dictionary containing:
- `observation`: Robot joint positions, velocities, and gripper state
- `desired_goal`: Target position coordinates (x, y, z)
- `achieved_goal`: Current end-effector position coordinates (x, y, z)
The goal is to minimize the distance between `achieved_goal` and `desired_goal`.
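The dense reward that drives this minimization is, up to panda-gym's exact implementation details, the negative Euclidean distance between the two goal vectors, which is why returns approach 0 as the arm converges on the target. A sketch of that shaping:

```python
import numpy as np

def dense_reach_reward(achieved_goal: np.ndarray, desired_goal: np.ndarray) -> float:
    """Negative Euclidean distance between end-effector and target position."""
    return -float(np.linalg.norm(achieved_goal - desired_goal))

# Far from the target: large negative reward
print(dense_reach_reward(np.array([0.0, 0.0, 0.0]), np.array([0.3, 0.0, 0.4])))  # -0.5
# At the target: reward reaches 0
print(dense_reach_reward(np.array([0.1, 0.2, 0.3]), np.array([0.1, 0.2, 0.3])))
```

Because every step is penalized by the remaining distance, the agent gets a useful gradient signal even before it ever touches the target, unlike the sparse variant of the task.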
## πŸ“ˆ Performance Notes
- **Reward Range**: Typically from -50 (far from target) to 0 (at target)
- **Success Criteria**: Achieving mean reward > -3.5 consistently
- **Episode Length**: Usually 50 steps per episode
- **Convergence**: Expect improvement after 200k-500k steps
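The score reported in the evaluation table is mean reward minus one standard deviation, checked against the -3.5 baseline. A quick sanity check with this model's numbers (the helper function is hypothetical, not part of stable-baselines3):

```python
def leaderboard_score(mean_reward: float, std_reward: float, baseline: float = -3.5):
    """Score = mean - std; the model passes if the score beats the baseline."""
    score = mean_reward - std_reward
    return score, score > baseline

score, passed = leaderboard_score(-0.20, 0.09)
print(f"score={score:.2f}, passed={passed}")  # score=-0.29, passed=True
```

Subtracting the standard deviation rewards policies that are both good on average and consistent across evaluation episodes.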
## 🎯 Tips for Reproduction
1. **Normalization is Critical**: Always use VecNormalize for robotic tasks
2. **MultiInputPolicy Required**: Dict observation spaces need special handling
3. **Sufficient Training**: 1M+ timesteps recommended for stable performance
4. **Evaluation**: Use deterministic=True for consistent evaluation results