# A2C Agent playing PandaReachDense-v3
This is a trained model of an A2C agent playing PandaReachDense-v3, built with the stable-baselines3 library as part of the Deep Reinforcement Learning Course.

The environment belongs to the Panda-Gym suite of robotic manipulation tasks, in which a robot arm must reach a target position.
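For orientation, the environment can be instantiated directly through Gymnasium once `panda_gym` is imported. A minimal sketch; the printed observation space matches the dictionary layout described under Model Architecture below:

```python
import gymnasium as gym
import panda_gym  # registers the Panda-Gym environments with Gymnasium

env = gym.make("PandaReachDense-v3")
obs, info = env.reset()
print(env.observation_space)  # Dict with 'observation', 'achieved_goal', 'desired_goal'
print(env.action_space)
```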
## Evaluation Results
| Metric | Value |
|---|---|
| Mean Reward | -0.20 |
| Std Reward | 0.09 |
| Score (mean - std) | -0.29 |
| Baseline Required | -3.5 |
| Evaluation Episodes | 20 |
| Status | PASSED |
| Model Source | Final Model |
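These numbers can be reproduced with SB3's `evaluate_policy` helper. A minimal sketch, assuming the file names used in the Usage section below and the same normalized evaluation environment:

```python
import panda_gym
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

# Rebuild the evaluation environment with the saved normalization statistics
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False     # freeze running statistics during evaluation
env.norm_reward = False  # report raw (unnormalized) rewards

model = A2C.load("a2c-PandaReachDense-v3", env=env)
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20, deterministic=True)

score = mean_reward - std_reward  # "Score (mean - std)" in the table above
print(f"mean={mean_reward:.2f} std={std_reward:.2f} score={score:.2f}")
print("PASSED" if score > -3.5 else "FAILED")  # baseline from the table above
```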
The evaluated checkpoint is the final model from a standard training run without detailed intermediate monitoring (see Training Configuration below).
## Usage
```python
import gymnasium as gym
import panda_gym
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# Load environment and normalization statistics
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)

# ⚠️ CRITICAL: disable training mode and reward normalization at test time
env.training = False
env.norm_reward = False

# Load model
model = A2C.load("a2c-PandaReachDense-v3", env=env)

# Run inference (VecEnv API: reset() returns obs only, step() returns 4 values)
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```
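The snippet above assumes `vec_normalize.pkl` holds the normalization statistics saved at the end of training. A minimal sketch of the training-side save, assuming the same file names:

```python
# At the end of training, with `env` the VecNormalize-wrapped training env:
model.save("a2c-PandaReachDense-v3")  # policy weights
env.save("vec_normalize.pkl")         # running observation/reward statistics
```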
## Training Configuration
- Algorithm: A2C (Advantage Actor-Critic)
- Policy: MultiInputPolicy (for Dict observation spaces)
- Environment: PandaReachDense-v3
- Total Timesteps: 2,000,000
- Number of Parallel Envs: 64
- Normalization: VecNormalize (observation + reward)
- Observation Clipping: 10.0
- Evaluation Frequency: Every 500,000 steps
- Checkpoint Frequency: Every 500,000 steps
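A minimal sketch reconstructing this configuration; anything not in the list above (learning rate and other A2C hyperparameters, save paths) is an assumption left at stable-baselines3 defaults:

```python
import panda_gym
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import CheckpointCallback, EvalCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

N_ENVS = 64

# 64 parallel envs with observation + reward normalization, observations clipped to ±10
env = make_vec_env("PandaReachDense-v3", n_envs=N_ENVS)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

# Separate eval env; EvalCallback syncs VecNormalize statistics from the training env
eval_env = make_vec_env("PandaReachDense-v3", n_envs=1)
eval_env = VecNormalize(eval_env, norm_obs=True, norm_reward=False, clip_obs=10.0, training=False)

# SB3 counts callback frequencies in per-env steps, so divide by N_ENVS
# to trigger every 500,000 total environment steps
callbacks = [
    EvalCallback(eval_env, eval_freq=500_000 // N_ENVS),
    CheckpointCallback(save_freq=500_000 // N_ENVS, save_path="./checkpoints/"),
]

model = A2C("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=2_000_000, callback=callbacks)

model.save("a2c-PandaReachDense-v3")
env.save("vec_normalize.pkl")
```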
## Model Architecture
The agent uses a MultiInputPolicy because the observation space is a dictionary containing:
- `observation`: robot joint positions, velocities, and gripper state
- `desired_goal`: target position coordinates (x, y, z)
- `achieved_goal`: current end-effector position coordinates (x, y, z)
The goal is to minimize the distance between achieved_goal and desired_goal.
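In the dense variant of the task, the per-step reward is the negative Euclidean distance between `achieved_goal` and `desired_goal`. A minimal sketch of that computation:

```python
import numpy as np

def dense_reach_reward(obs_dict):
    """Negative Euclidean distance between the end-effector and the target."""
    achieved = np.asarray(obs_dict["achieved_goal"])
    desired = np.asarray(obs_dict["desired_goal"])
    return -np.linalg.norm(achieved - desired)
```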
## Performance Notes
- Reward Range: episode returns typically range from about -50 (far from target throughout) to 0 (at target)
- Success Criteria: Achieving mean reward > -3.5 consistently
- Episode Length: Usually 50 steps per episode
- Convergence: Expect improvement after 200k-500k steps
## Tips for Reproduction
- Normalization is Critical: Always use VecNormalize for robotic tasks
- MultiInputPolicy Required: Dict observation spaces need special handling
- Sufficient Training: 1M+ timesteps recommended for stable performance
- Evaluation: Use deterministic=True for consistent evaluation results