---
library_name: stable-baselines3
tags:
- PandaReachDense-v3
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
- robotics
- panda-gym
model-index:
- name: A2C
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: PandaReachDense-v3
      type: PandaReachDense-v3
    metrics:
    - type: mean_reward
      value: -0.20 +/- 0.09
      name: mean_reward
      verified: false
---

# ✅ **A2C** Agent playing **PandaReachDense-v3**

This is a trained model of an **A2C** agent playing **PandaReachDense-v3** using the [stable-baselines3 library](https://github.com/DLR-RM/stable-baselines3) and the [Deep Reinforcement Learning Course](https://huggingface.co/deep-rl-course/unit6).

This environment is part of the [Panda-Gym](https://github.com/qgallouedec/panda-gym) suite of robotic manipulation tasks; in this task the robot arm must move its end-effector to a target position.

## 🏆 Evaluation Results

| Metric | Value |
|--------|-------|
| Mean Reward | -0.20 |
| Std Reward | 0.09 |
| **Score (mean - std)** | **-0.29** |
| Baseline Required | -3.5 |
| Evaluation Episodes | 20 |
| Status | ✅ **PASSED** |
| Model Source | Final Model |

## 🚀 Usage

```python
import gymnasium as gym
import panda_gym

from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# Rebuild the environment and load the saved normalization statistics
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)

# ⚠️ CRITICAL: disable training mode and reward normalization at test time,
# otherwise the running statistics keep updating and rewards are rescaled
env.training = False
env.norm_reward = False

# Load the trained model
model = A2C.load("a2c-PandaReachDense-v3", env=env)

# Run inference (VecEnv API: obs, reward, done, info are batched arrays)
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```

## 🔧 Training Configuration

Standard training run without detailed monitoring. Key settings (a hedged training sketch is provided at the end of this card):

- **Algorithm**: A2C (Advantage Actor-Critic)
- **Policy**: MultiInputPolicy (for Dict observation spaces)
- **Environment**: PandaReachDense-v3
- **Total Timesteps**: 2,000,000
- **Number of Parallel Envs**: 64
- **Normalization**: VecNormalize (observation + reward)
- **Observation Clipping**: 10.0
- **Evaluation Frequency**: Every 500,000 steps
- **Checkpoint Frequency**: Every 500,000 steps

## 🤖 Model Architecture

The agent uses a **MultiInputPolicy** because the observation space is a dictionary containing:

- `observation`: Robot joint positions, velocities, and gripper state
- `desired_goal`: Target position coordinates (x, y, z)
- `achieved_goal`: Current end-effector position coordinates (x, y, z)

The goal is to minimize the distance between `achieved_goal` and `desired_goal` (see the inspection snippet at the end of this card).

## 📈 Performance Notes

- **Reward Range**: Episode return typically runs from about -50 (far from the target for the whole episode) to 0 (at the target)
- **Success Criteria**: Consistently achieving a mean reward above -3.5
- **Episode Length**: Usually 50 steps per episode
- **Convergence**: Expect improvement after 200k-500k steps

## 🎯 Tips for Reproduction

1. **Normalization is Critical**: Always use VecNormalize for robotic tasks, and keep the statistics file (`vec_normalize.pkl`) with the model
2. **MultiInputPolicy Required**: Dict observation spaces need special handling
3. **Sufficient Training**: 1M+ timesteps recommended for stable performance
4. **Evaluation**: Use `deterministic=True` for consistent evaluation results (see the evaluation sketch below)
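## 🔍 Observation Space at a Glance

For readers unfamiliar with goal-conditioned Dict observations, the snippet below prints the structure that MultiInputPolicy consumes. This is a minimal sketch, separate from the training code; it assumes only that `gymnasium` and `panda_gym` are installed.

```python
import gymnasium as gym
import panda_gym  # noqa: F401 -- registers the PandaReachDense-v3 environment

env = gym.make("PandaReachDense-v3")
obs, info = env.reset(seed=0)

# The observation is a dict; MultiInputPolicy encodes each entry
# separately before concatenating the features
for key, value in obs.items():
    print(key, value.shape)
# expected keys: 'observation', 'achieved_goal', 'desired_goal'

env.close()
```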
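## 🏋️ Training Sketch

The exact training script is not published with this card. Below is a minimal sketch reconstructed from the configuration listed above (64 parallel envs, VecNormalize with observation clipping at 10.0, MultiInputPolicy, 2M timesteps), assuming SB3 defaults for everything not listed; the save paths mirror the Usage section, and the 500k-step evaluation/checkpoint callbacks are omitted for brevity.

```python
import panda_gym  # noqa: F401 -- registers the PandaReachDense-v3 environment

from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# 64 parallel environments, as listed in the training configuration
env = make_vec_env("PandaReachDense-v3", n_envs=64)

# Normalize observations and rewards; clip normalized observations at 10.0
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

# MultiInputPolicy handles the Dict observation space of panda-gym
model = A2C("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=2_000_000)

# Save the policy weights AND the normalization statistics;
# both are needed at inference time (see the Usage section)
model.save("a2c-PandaReachDense-v3")
env.save("vec_normalize.pkl")
```

Saving the VecNormalize statistics alongside the policy is the design point here: observations at inference time must be normalized with the same running mean/std used during training, or the policy sees inputs from a different distribution.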
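## 🧪 Evaluation Sketch

The numbers in the results table can be reproduced with SB3's `evaluate_policy` helper. A minimal sketch, assuming the files saved by the training sketch above; the 20 deterministic episodes match the table.

```python
import panda_gym  # noqa: F401 -- registers the PandaReachDense-v3 environment

from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

# Rebuild the eval environment with the saved normalization statistics
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False      # freeze the running mean/std
env.norm_reward = False   # report raw, un-normalized rewards

model = A2C.load("a2c-PandaReachDense-v3", env=env)

# 20 deterministic episodes, matching the results table
mean_reward, std_reward = evaluate_policy(
    model, env, n_eval_episodes=20, deterministic=True
)
print(f"mean_reward = {mean_reward:.2f} +/- {std_reward:.2f}")
print(f"score (mean - std) = {mean_reward - std_reward:.2f}, baseline: -3.5")
```

The score compared against the -3.5 baseline in the results table is `mean_reward - std_reward`, so a run passes only if it is both good on average and low-variance.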