---
library_name: stable-baselines3
tags:
- PandaReachDense-v3
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
- robotics
- panda-gym
model-index:
- name: A2C
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: PandaReachDense-v3
      type: PandaReachDense-v3
    metrics:
    - type: mean_reward
      value: -0.20 +/- 0.09
      name: mean_reward
      verified: false
---

# ✅ **A2C** Agent playing **PandaReachDense-v3**

This is a trained model of an **A2C** agent playing **PandaReachDense-v3** using the [stable-baselines3 library](https://github.com/DLR-RM/stable-baselines3) and the [Deep Reinforcement Learning Course](https://huggingface.co/deep-rl-course/unit6).

This environment is part of the [Panda-Gym](https://github.com/qgallouedec/panda-gym) suite of robotic manipulation tasks; in this task the robot arm must move its end-effector to a target position.

## 🏆 Evaluation Results

| Metric | Value |
|--------|-------|
| Mean Reward | -0.20 |
| Std Reward | 0.09 |
| **Score (mean - std)** | **-0.29** |
| Baseline Required | -3.5 |
| Evaluation Episodes | 20 |
| Status | ✅ **PASSED** |
| Model Source | Final Model |

## 🚀 Usage

```python
import gymnasium as gym
import panda_gym

from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# Rebuild the environment and load the saved normalization statistics
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)

# ⚠️ CRITICAL: disable training mode and reward normalization at test time,
# otherwise the running statistics keep updating and rewards are rescaled
env.training = False
env.norm_reward = False

# Load the trained model
model = A2C.load("a2c-PandaReachDense-v3", env=env)

# Run inference (VecEnv API: obs, reward, done, info are batched arrays)
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```

## 🔧 Training Configuration

Standard training run without detailed monitoring. Key settings (a hedged training sketch is provided at the end of this card):

- **Algorithm**: A2C (Advantage Actor-Critic)
- **Policy**: MultiInputPolicy (for Dict observation spaces)
- **Environment**: PandaReachDense-v3
- **Total Timesteps**: 2,000,000
- **Number of Parallel Envs**: 64
- **Normalization**: VecNormalize (observation + reward)
- **Observation Clipping**: 10.0
- **Evaluation Frequency**: Every 500,000 steps
- **Checkpoint Frequency**: Every 500,000 steps

## 🤖 Model Architecture

The agent uses a **MultiInputPolicy** because the observation space is a dictionary containing:

- `observation`: Robot joint positions, velocities, and gripper state
- `desired_goal`: Target position coordinates (x, y, z)
- `achieved_goal`: Current end-effector position coordinates (x, y, z)

The goal is to minimize the distance between `achieved_goal` and `desired_goal` (see the inspection snippet at the end of this card).

## 📈 Performance Notes

- **Reward Range**: Episode return typically runs from about -50 (far from the target for the whole episode) to 0 (at the target)
- **Success Criteria**: Consistently achieving a mean reward above -3.5
- **Episode Length**: Usually 50 steps per episode
- **Convergence**: Expect improvement after 200k-500k steps

## 🎯 Tips for Reproduction

1. **Normalization is Critical**: Always use VecNormalize for robotic tasks, and keep the statistics file (`vec_normalize.pkl`) with the model
2. **MultiInputPolicy Required**: Dict observation spaces need special handling
3. **Sufficient Training**: 1M+ timesteps recommended for stable performance
4. **Evaluation**: Use `deterministic=True` for consistent evaluation results (see the evaluation sketch below)
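## 🔍 Observation Space at a Glance

For readers unfamiliar with goal-conditioned Dict observations, the snippet below prints the structure that MultiInputPolicy consumes. This is a minimal sketch, separate from the training code; it assumes only that `gymnasium` and `panda_gym` are installed.

```python
import gymnasium as gym
import panda_gym  # noqa: F401 -- registers the PandaReachDense-v3 environment

env = gym.make("PandaReachDense-v3")
obs, info = env.reset(seed=0)

# The observation is a dict; MultiInputPolicy encodes each entry
# separately before concatenating the features
for key, value in obs.items():
    print(key, value.shape)
# expected keys: 'observation', 'achieved_goal', 'desired_goal'

env.close()
```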
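## 🏋️ Training Sketch

The exact training script is not published with this card. Below is a minimal sketch reconstructed from the configuration listed above (64 parallel envs, VecNormalize with observation clipping at 10.0, MultiInputPolicy, 2M timesteps), assuming SB3 defaults for everything not listed; the save paths mirror the Usage section, and the 500k-step evaluation/checkpoint callbacks are omitted for brevity.

```python
import panda_gym  # noqa: F401 -- registers the PandaReachDense-v3 environment

from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# 64 parallel environments, as listed in the training configuration
env = make_vec_env("PandaReachDense-v3", n_envs=64)

# Normalize observations and rewards; clip normalized observations at 10.0
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

# MultiInputPolicy handles the Dict observation space of panda-gym
model = A2C("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=2_000_000)

# Save the policy weights AND the normalization statistics;
# both are needed at inference time (see the Usage section)
model.save("a2c-PandaReachDense-v3")
env.save("vec_normalize.pkl")
```

Saving the VecNormalize statistics alongside the policy is the design point here: observations at inference time must be normalized with the same running mean/std used during training, or the policy sees inputs from a different distribution.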
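## 🧪 Evaluation Sketch

The numbers in the results table can be reproduced with SB3's `evaluate_policy` helper. A minimal sketch, assuming the files saved by the training sketch above; the 20 deterministic episodes match the table.

```python
import panda_gym  # noqa: F401 -- registers the PandaReachDense-v3 environment

from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

# Rebuild the eval environment with the saved normalization statistics
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False      # freeze the running mean/std
env.norm_reward = False   # report raw, un-normalized rewards

model = A2C.load("a2c-PandaReachDense-v3", env=env)

# 20 deterministic episodes, matching the results table
mean_reward, std_reward = evaluate_policy(
    model, env, n_eval_episodes=20, deterministic=True
)
print(f"mean_reward = {mean_reward:.2f} +/- {std_reward:.2f}")
print(f"score (mean - std) = {mean_reward - std_reward:.2f}, baseline: -3.5")
```

The score compared against the -3.5 baseline in the results table is `mean_reward - std_reward`, so a run passes only if it is both good on average and low-variance.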