βœ… A2C Agent playing PandaReachDense-v3

This is a trained model of an A2C agent playing PandaReachDense-v3, trained with the stable-baselines3 library as part of the Deep Reinforcement Learning Course.

This environment is part of the Panda-Gym suite of robotic manipulation tasks, in which a robot arm must reach a target position.

πŸ† Evaluation Results

| Metric | Value |
| --- | --- |
| Mean Reward | -0.20 |
| Std Reward | 0.09 |
| Score (mean - std) | -0.29 |
| Baseline Required | -3.5 |
| Evaluation Episodes | 20 |
| Status | βœ… PASSED |
| Model Source | Final Model |
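
These numbers can be reproduced with stable-baselines3's evaluate_policy. The sketch below assumes the model and normalization files from this repo (a2c-PandaReachDense-v3, vec_normalize.pkl) have been downloaded locally:

import panda_gym  # registers PandaReachDense-v3 with gymnasium
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

# Rebuild the evaluation environment with the saved normalization statistics
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False      # freeze running statistics during evaluation
env.norm_reward = False   # report raw (unnormalized) rewards

model = A2C.load("a2c-PandaReachDense-v3", env=env)

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20, deterministic=True)
print(f"mean={mean_reward:.2f} std={std_reward:.2f} score={mean_reward - std_reward:.2f}")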

Training Notes

Standard training run; no detailed per-step monitoring was recorded.

πŸš€ Usage

import gymnasium as gym
import panda_gym
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# Load environment and normalization 
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)

# ⚠️ CRITICAL: disable training mode and reward normalization at test time 
env.training = False
env.norm_reward = False

# Load model 
model = A2C.load("a2c-PandaReachDense-v3", env=env)

# Run inference (the VecEnv API returns batched arrays, one entry per env)
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = env.step(action)
    # SB3 vectorized envs reset automatically when an episode ends,
    # so obs already holds the first observation of the next episode.

πŸ”§ Training Configuration

  • Algorithm: A2C (Advantage Actor-Critic)
  • Policy: MultiInputPolicy (for Dict observation spaces)
  • Environment: PandaReachDense-v3
  • Total Timesteps: 2,000,000
  • Number of Parallel Envs: 64
  • Normalization: VecNormalize (observation + reward)
  • Observation Clipping: 10.0
  • Evaluation Frequency: Every 500,000 steps
  • Checkpoint Frequency: Every 500,000 steps
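
A minimal training sketch consistent with the configuration above; hyperparameters not listed here (learning rate, entropy coefficient, etc.) are left at stable-baselines3 defaults and are assumptions, not the exact training script:

import panda_gym  # registers the Panda-Gym environments
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import CheckpointCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# 64 parallel environments with observation + reward normalization
env = make_vec_env("PandaReachDense-v3", n_envs=64)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

model = A2C("MultiInputPolicy", env, verbose=1)

# save_freq is counted per environment, so divide the 500,000-step
# checkpoint interval by the number of parallel envs
checkpoint = CheckpointCallback(save_freq=500_000 // 64, save_path="./checkpoints/")

model.learn(total_timesteps=2_000_000, callback=checkpoint)
model.save("a2c-PandaReachDense-v3")
env.save("vec_normalize.pkl")  # persist normalization statistics for inference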

πŸ€– Model Architecture

The agent uses a MultiInputPolicy because the observation space is a dictionary containing:

  • observation: Robot joint positions, velocities, and gripper state
  • desired_goal: Target position coordinates (x, y, z)
  • achieved_goal: Current end-effector position coordinates (x, y, z)

The goal is to minimize the distance between achieved_goal and desired_goal.
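
You can inspect this Dict observation space directly; a quick sketch (the printed key names follow the panda-gym convention described above):

import gymnasium as gym
import panda_gym  # registers the environment

env = gym.make("PandaReachDense-v3")
obs, info = env.reset()
print(env.observation_space)  # Dict space with 'observation', 'achieved_goal', 'desired_goal'
# Both goals are (x, y, z) positions; the dense reward is based on the
# distance between them.
print(obs["achieved_goal"], obs["desired_goal"])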

πŸ“ˆ Performance Notes

  • Reward Range: episode returns typically run from about -50 (far from the target throughout) to 0 (at the target)
  • Success Criterion: consistently achieving a mean reward above the -3.5 baseline
  • Episode Length: Usually 50 steps per episode
  • Convergence: Expect improvement after 200k-500k steps

🎯 Tips for Reproduction

  1. Normalization is Critical: Always use VecNormalize for robotic tasks
  2. MultiInputPolicy Required: Dict observation spaces need special handling
  3. Sufficient Training: 1M+ timesteps recommended for stable performance
  4. Evaluation: Use deterministic=True for consistent evaluation results