---
library_name: stable-baselines3
tags:
- PandaReachDense-v3
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
- robotics
- panda-gym
model-index:
- name: A2C
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: PandaReachDense-v3
      type: PandaReachDense-v3
    metrics:
    - type: mean_reward
      value: -0.20 +/- 0.09
      name: mean_reward
      verified: false
---

# **A2C** Agent playing **PandaReachDense-v3**

This is a trained model of an **A2C** agent playing **PandaReachDense-v3**
using the [stable-baselines3 library](https://github.com/DLR-RM/stable-baselines3)
and the [Deep Reinforcement Learning Course](https://huggingface.co/deep-rl-course/unit6).

This environment is part of the [Panda-Gym](https://github.com/qgallouedec/panda-gym) suite of robotic manipulation tasks, in which the robot arm must move its end-effector to a target position.
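
For a quick look at the task itself, the environment can be instantiated directly with gymnasium. A minimal sketch, assuming `panda-gym` v3 and `gymnasium` are installed:

```python
import gymnasium as gym
import panda_gym  # noqa: F401 -- registers the Panda* environments

env = gym.make("PandaReachDense-v3")
obs, info = env.reset(seed=0)

# The observation is a dict with "observation", "achieved_goal", "desired_goal"
print(obs.keys())

# One random step; the dense reward approaches 0 as the gripper nears the goal
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print(reward)
env.close()
```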

## 📊 Evaluation Results

| Metric | Value |
|--------|-------|
| Mean Reward | -0.20 |
| Std Reward | 0.09 |
| **Score (mean - std)** | **-0.29** |
| Baseline Required | -3.5 |
| Evaluation Episodes | 20 |
| Status | ✅ **PASSED** |
| Model Source | Final Model |
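
These numbers can be reproduced with the `evaluate_policy` helper from stable-baselines3. A minimal sketch, assuming the `vec_normalize.pkl` and `a2c-PandaReachDense-v3` files from this repository are in the working directory, and that the pass criterion is `mean - std > -3.5` as in the table above:

```python
import panda_gym  # noqa: F401 -- registers the Panda* environments
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

# Rebuild the evaluation environment with the saved normalization statistics
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False     # freeze the running statistics
env.norm_reward = False  # report raw (unnormalized) rewards

model = A2C.load("a2c-PandaReachDense-v3", env=env)

mean_reward, std_reward = evaluate_policy(
    model, env, n_eval_episodes=20, deterministic=True
)
score = mean_reward - std_reward
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}, score={score:.2f}")
print("PASSED" if score > -3.5 else "FAILED")  # course baseline: -3.5
```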

## 🚀 Usage

```python
import panda_gym  # noqa: F401 -- registers the Panda* environments
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# Load environment and normalization statistics
env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)

# ⚠️ CRITICAL: disable training mode and reward normalization at test time
env.training = False
env.norm_reward = False

# Load model
model = A2C.load("a2c-PandaReachDense-v3", env=env)

# Run inference (VecEnv API: step() returns 4 values; done is an array)
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```

## 🔧 Training Configuration

- **Algorithm**: A2C (Advantage Actor-Critic)
- **Policy**: MultiInputPolicy (for Dict observation spaces)
- **Environment**: PandaReachDense-v3
- **Total Timesteps**: 2,000,000
- **Number of Parallel Envs**: 64
- **Normalization**: VecNormalize (observation + reward)
- **Observation Clipping**: 10.0
- **Evaluation Frequency**: Every 500,000 steps
- **Checkpoint Frequency**: Every 500,000 steps
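
Below is a minimal training sketch consistent with this configuration. Hyperparameters not listed above (learning rate, `n_steps`, etc.) are assumed to be the stable-baselines3 defaults, and the evaluation/checkpoint callbacks are omitted for brevity:

```python
import panda_gym  # noqa: F401 -- registers the Panda* environments
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# 64 parallel environments, normalizing observations and rewards
env = make_vec_env("PandaReachDense-v3", n_envs=64)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

model = A2C("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=2_000_000)

# Save the policy weights and the normalization statistics together
model.save("a2c-PandaReachDense-v3")
env.save("vec_normalize.pkl")
```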

## 🤖 Model Architecture

The agent uses a **MultiInputPolicy** because the observation space is a dictionary containing:

- `observation`: Robot joint positions, velocities, and gripper state
- `desired_goal`: Target position coordinates (x, y, z)
- `achieved_goal`: Current end-effector position coordinates (x, y, z)

The goal is to minimize the distance between `achieved_goal` and `desired_goal`.
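
In the dense variant of the task, the per-step reward closely tracks the negative Euclidean distance between these two vectors. A small sketch of that relation (the exact reward is computed by panda-gym, so treat this as an illustration):

```python
import numpy as np
import gymnasium as gym
import panda_gym  # noqa: F401 -- registers the Panda* environments

env = gym.make("PandaReachDense-v3")
obs, info = env.reset(seed=0)

# Dense reward ~ -||achieved_goal - desired_goal||
distance = np.linalg.norm(obs["achieved_goal"] - obs["desired_goal"])
print(f"distance to goal: {distance:.3f}, expected reward approx {-distance:.3f}")
```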

## 📈 Performance Notes

- **Reward Range**: Episode returns typically range from -50 (far from target) to 0 (at target)
- **Success Criteria**: Achieving a mean reward above -3.5 consistently
- **Episode Length**: Usually 50 steps per episode
- **Convergence**: Expect improvement after 200k-500k steps

## 🎯 Tips for Reproduction

1. **Normalization is Critical**: Always use VecNormalize for robotic tasks
2. **MultiInputPolicy Required**: Dict observation spaces need special handling
3. **Sufficient Training**: 1M+ timesteps recommended for stable performance
4. **Evaluation**: Use `deterministic=True` for consistent evaluation results (see the comparison sketch below)
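
As an illustration of tip 4, the sketch below compares deterministic and stochastic action selection at evaluation time, reloading the model and normalization statistics with the file names used in the Usage section:

```python
import panda_gym  # noqa: F401 -- registers the Panda* environments
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("PandaReachDense-v3", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False
env.norm_reward = False
model = A2C.load("a2c-PandaReachDense-v3", env=env)

# Deterministic evaluation is usually better and lower-variance
for det in (True, False):
    mean, std = evaluate_policy(model, env, n_eval_episodes=20, deterministic=det)
    print(f"deterministic={det}: {mean:.2f} +/- {std:.2f}")
```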