ImaghT committed · Commit ff11ce0 · verified · 1 Parent(s): ba6bff9

A2C PandaReach (final_model) - Mean: -0.20, Std: 0.09, Score: -0.29

Files changed (5)
  1. .gitattributes +1 -0
  2. README.md +114 -0
  3. a2c-PandaReachDense-v3.zip +3 -0
  4. replay.mp4 +3 -0
  5. vec_normalize.pkl +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ replay.mp4 filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,114 @@
+ ---
+ library_name: stable-baselines3
+ tags:
+ - PandaReachDense-v3
+ - deep-reinforcement-learning
+ - reinforcement-learning
+ - stable-baselines3
+ - robotics
+ - panda-gym
+ model-index:
+ - name: A2C
+   results:
+   - task:
+       type: reinforcement-learning
+       name: reinforcement-learning
+     dataset:
+       name: PandaReachDense-v3
+       type: PandaReachDense-v3
+     metrics:
+     - type: mean_reward
+       value: -0.20 +/- 0.09
+       name: mean_reward
+       verified: false
+ ---
+
+ # ✅ **A2C** Agent playing **PandaReachDense-v3**
+
+ This is a trained model of an **A2C** agent playing **PandaReachDense-v3**,
+ trained with the [stable-baselines3 library](https://github.com/DLR-RM/stable-baselines3)
+ as part of the [Deep Reinforcement Learning Course](https://huggingface.co/deep-rl-course/unit6).
+
+ This environment is part of the [Panda-Gym](https://github.com/qgallouedec/panda-gym) suite of robotic manipulation tasks, in which the robot arm must reach a target position.
+
+ ## 🏆 Evaluation Results
+
+ | Metric | Value |
+ |--------|-------|
+ | Mean Reward | -0.20 |
+ | Std Reward | 0.09 |
+ | **Score (mean - std)** | **-0.29** |
+ | Baseline Required | -3.5 |
+ | Evaluation Episodes | 20 |
+ | Status | ✅ **PASSED** |
+ | Model Source | Final Model |
+
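+ The score above is the mean reward minus one standard deviation over the 20 evaluation episodes. A minimal sketch to reproduce it (assuming the model and normalization files sit in the working directory):
+
+ ```python
+ import panda_gym  # noqa: F401 -- registers the Panda environments
+ from stable_baselines3 import A2C
+ from stable_baselines3.common.env_util import make_vec_env
+ from stable_baselines3.common.evaluation import evaluate_policy
+ from stable_baselines3.common.vec_env import VecNormalize
+
+ env = make_vec_env("PandaReachDense-v3", n_envs=1)
+ env = VecNormalize.load("vec_normalize.pkl", env)
+ env.training = False      # freeze the running normalization statistics
+ env.norm_reward = False   # report raw (unnormalized) rewards
+
+ model = A2C.load("a2c-PandaReachDense-v3", env=env)
+ mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
+ print(f"mean={mean_reward:.2f} std={std_reward:.2f} score={mean_reward - std_reward:.2f}")
+ ```
+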
+ ## 🚀 Usage
+
+ ```python
+ import panda_gym  # noqa: F401 -- registers the Panda environments
+ from stable_baselines3 import A2C
+ from stable_baselines3.common.env_util import make_vec_env
+ from stable_baselines3.common.vec_env import VecNormalize
+
+ # Load the environment and the saved normalization statistics
+ env = make_vec_env("PandaReachDense-v3", n_envs=1)
+ env = VecNormalize.load("vec_normalize.pkl", env)
+
+ # ⚠️ CRITICAL: disable training mode and reward normalization at test time
+ env.training = False
+ env.norm_reward = False
+
+ # Load the trained model
+ model = A2C.load("a2c-PandaReachDense-v3", env=env)
+
+ # Run inference
+ obs = env.reset()
+ for _ in range(1000):
+     action, _states = model.predict(obs, deterministic=True)
+     obs, reward, done, info = env.step(action)
+     # SB3 vectorized envs auto-reset on episode end, so with n_envs=1
+     # this explicit reset is optional
+     if done:
+         obs = env.reset()
+ ```
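+
+ Loading `vec_normalize.pkl` before the model matters: the policy was trained on normalized observations, so it must see inputs scaled by the same running statistics at inference time. Skipping this step typically degrades the reported reward substantially.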
+
+ ## 🔧 Training Configuration
+
+ Standard training without detailed monitoring. Key settings (a minimal sketch wiring them together follows the list):
+
+ - **Algorithm**: A2C (Advantage Actor-Critic)
+ - **Policy**: MultiInputPolicy (for Dict observation spaces)
+ - **Environment**: PandaReachDense-v3
+ - **Total Timesteps**: 2,000,000
+ - **Number of Parallel Envs**: 64
+ - **Normalization**: VecNormalize (observation + reward)
+ - **Observation Clipping**: 10.0
+ - **Evaluation Frequency**: Every 500,000 steps
+ - **Checkpoint Frequency**: Every 500,000 steps
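+
+ The following is a minimal reconstruction under the settings above, not the exact training script; the evaluation callback is omitted for brevity, and note that `save_freq` in SB3 is counted in per-env steps, so it is divided by the number of parallel envs here.
+
+ ```python
+ import panda_gym  # noqa: F401 -- registers the Panda environments
+ from stable_baselines3 import A2C
+ from stable_baselines3.common.callbacks import CheckpointCallback
+ from stable_baselines3.common.env_util import make_vec_env
+ from stable_baselines3.common.vec_env import VecNormalize
+
+ n_envs = 64
+ env = make_vec_env("PandaReachDense-v3", n_envs=n_envs)
+ env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)
+
+ # Checkpoint roughly every 500,000 total steps (save_freq is per-env steps)
+ checkpoint_cb = CheckpointCallback(save_freq=500_000 // n_envs, save_path="./checkpoints/")
+
+ model = A2C("MultiInputPolicy", env, verbose=1)
+ model.learn(total_timesteps=2_000_000, callback=checkpoint_cb)
+
+ model.save("a2c-PandaReachDense-v3")
+ env.save("vec_normalize.pkl")  # persist the normalization statistics
+ ```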
+
+ ## 🤖 Model Architecture
+
+ The agent uses a **MultiInputPolicy** because the observation space is a dictionary containing:
+ - `observation`: Robot joint positions, velocities, and gripper state
+ - `desired_goal`: Target position coordinates (x, y, z)
+ - `achieved_goal`: Current end-effector position coordinates (x, y, z)
+
+ The goal is to minimize the distance between `achieved_goal` and `desired_goal`.
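+
+ A quick way to inspect this Dict observation space (the shapes in the comment are what the Reach task typically reports; treat them as illustrative):
+
+ ```python
+ import gymnasium as gym
+ import panda_gym  # noqa: F401 -- registers the Panda environments
+
+ env = gym.make("PandaReachDense-v3")
+ obs, info = env.reset()
+ for key, value in obs.items():
+     print(key, value.shape)
+ # observation (6,)   achieved_goal (3,)   desired_goal (3,)
+ ```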
+
+ ## 📈 Performance Notes
+
+ - **Reward Range**: Typically from -50 (far from target) to 0 (at target); see the sketch below
+ - **Success Criteria**: Achieving mean reward > -3.5 consistently
+ - **Episode Length**: Usually 50 steps per episode
+ - **Convergence**: Expect improvement after 200k-500k steps
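+
+ The dense reward in this task is, to a good approximation, the negative Euclidean distance between `achieved_goal` and `desired_goal` at each step, which is where the range above comes from: roughly -1 per step while far from the target, over 50 steps, gives about -50 per episode. A minimal sketch of that per-step signal, assuming reward = -distance:
+
+ ```python
+ import numpy as np
+
+ def dense_reward(achieved_goal: np.ndarray, desired_goal: np.ndarray) -> float:
+     """Illustrative per-step reward: negative distance to the goal."""
+     return -float(np.linalg.norm(achieved_goal - desired_goal))
+
+ # At the target the reward approaches 0; ~1 m away it is about -1 per step.
+ print(dense_reward(np.array([0.1, 0.0, 0.2]), np.array([0.1, 0.0, 0.2])))  # -0.0
+ ```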
+
+ ## 🎯 Tips for Reproduction
+
+ 1. **Normalization is Critical**: Always use VecNormalize for robotic tasks
+ 2. **MultiInputPolicy Required**: Dict observation spaces need special handling
+ 3. **Sufficient Training**: 1M+ timesteps recommended for stable performance
+ 4. **Evaluation**: Use `deterministic=True` for consistent evaluation results
a2c-PandaReachDense-v3.zip ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3af8b2dc6c5a2cc373cce7449ac5c61eb40ee625e0ce324059ee85bd45fe56d6
+ size 144629
replay.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1b60af1dcecfda6f73c306989ef311b599176539afb12382d698f5cc28c4f327
+ size 342107
vec_normalize.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b1cad9f821a0e211e4133dc80fbe6f74778ed709fdcd355db2dc25bd9ff92495
+ size 6169