ImaghT committed · Commit ff11ce0 · verified · 1 Parent(s): ba6bff9

A2C PandaReach (final_model) - Mean: -0.20, Std: 0.09, Score: -0.29

Files changed (5)
  1. .gitattributes +1 -0
  2. README.md +114 -0
  3. a2c-PandaReachDense-v3.zip +3 -0
  4. replay.mp4 +3 -0
  5. vec_normalize.pkl +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ replay.mp4 filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,114 @@
+ ---
+ library_name: stable-baselines3
+ tags:
+ - PandaReachDense-v3
+ - deep-reinforcement-learning
+ - reinforcement-learning
+ - stable-baselines3
+ - robotics
+ - panda-gym
+ model-index:
+ - name: A2C
+   results:
+   - task:
+       type: reinforcement-learning
+       name: reinforcement-learning
+     dataset:
+       name: PandaReachDense-v3
+       type: PandaReachDense-v3
+     metrics:
+     - type: mean_reward
+       value: -0.20 +/- 0.09
+       name: mean_reward
+       verified: false
+ ---
+
+ # ✅ **A2C** Agent playing **PandaReachDense-v3**
+
+ This is a trained model of an **A2C** agent playing **PandaReachDense-v3**,
+ trained with the [stable-baselines3 library](https://github.com/DLR-RM/stable-baselines3)
+ as part of the [Deep Reinforcement Learning Course](https://huggingface.co/deep-rl-course/unit6).
+
+ This environment is part of the [Panda-Gym](https://github.com/qgallouedec/panda-gym) suite of robotic manipulation tasks, in which the robot arm must reach a target position.
+
+ ## 🏆 Evaluation Results
+
+ | Metric | Value |
+ |--------|-------|
+ | Mean Reward | -0.20 |
+ | Std Reward | 0.09 |
+ | **Score (mean - std)** | **-0.29** |
+ | Baseline Required | -3.5 |
+ | Evaluation Episodes | 20 |
+ | Status | ✅ **PASSED** |
+ | Model Source | Final Model |
+
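+ The score above is the mean reward minus one standard deviation over the 20 evaluation episodes. A minimal sketch to reproduce it (assuming the model and normalization files sit in the working directory):
+
+ ```python
+ import panda_gym  # noqa: F401 -- registers the Panda environments
+ from stable_baselines3 import A2C
+ from stable_baselines3.common.env_util import make_vec_env
+ from stable_baselines3.common.evaluation import evaluate_policy
+ from stable_baselines3.common.vec_env import VecNormalize
+
+ env = make_vec_env("PandaReachDense-v3", n_envs=1)
+ env = VecNormalize.load("vec_normalize.pkl", env)
+ env.training = False      # freeze the running normalization statistics
+ env.norm_reward = False   # report raw (unnormalized) rewards
+
+ model = A2C.load("a2c-PandaReachDense-v3", env=env)
+ mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
+ print(f"mean={mean_reward:.2f} std={std_reward:.2f} score={mean_reward - std_reward:.2f}")
+ ```
+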
+ ## 🚀 Usage
+
+ ```python
+ import panda_gym  # noqa: F401 -- registers the Panda environments
+ from stable_baselines3 import A2C
+ from stable_baselines3.common.env_util import make_vec_env
+ from stable_baselines3.common.vec_env import VecNormalize
+
+ # Load the environment and the saved normalization statistics
+ env = make_vec_env("PandaReachDense-v3", n_envs=1)
+ env = VecNormalize.load("vec_normalize.pkl", env)
+
+ # ⚠️ CRITICAL: disable training mode and reward normalization at test time
+ env.training = False
+ env.norm_reward = False
+
+ # Load the trained model
+ model = A2C.load("a2c-PandaReachDense-v3", env=env)
+
+ # Run inference
+ obs = env.reset()
+ for _ in range(1000):
+     action, _states = model.predict(obs, deterministic=True)
+     obs, reward, done, info = env.step(action)
+     # SB3 vectorized envs auto-reset on episode end, so with n_envs=1
+     # this explicit reset is optional
+     if done:
+         obs = env.reset()
+ ```
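+
+ Loading `vec_normalize.pkl` before the model matters: the policy was trained on normalized observations, so it must see inputs scaled by the same running statistics at inference time. Skipping this step typically degrades the reported reward substantially.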
+
+ ## 🔧 Training Configuration
+
+ Standard training without detailed monitoring. Key settings (a minimal sketch wiring them together follows the list):
+
+ - **Algorithm**: A2C (Advantage Actor-Critic)
+ - **Policy**: MultiInputPolicy (for Dict observation spaces)
+ - **Environment**: PandaReachDense-v3
+ - **Total Timesteps**: 2,000,000
+ - **Number of Parallel Envs**: 64
+ - **Normalization**: VecNormalize (observation + reward)
+ - **Observation Clipping**: 10.0
+ - **Evaluation Frequency**: Every 500,000 steps
+ - **Checkpoint Frequency**: Every 500,000 steps
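+
+ The following is a minimal reconstruction under the settings above, not the exact training script; the evaluation callback is omitted for brevity, and note that `save_freq` in SB3 is counted in per-env steps, so it is divided by the number of parallel envs here.
+
+ ```python
+ import panda_gym  # noqa: F401 -- registers the Panda environments
+ from stable_baselines3 import A2C
+ from stable_baselines3.common.callbacks import CheckpointCallback
+ from stable_baselines3.common.env_util import make_vec_env
+ from stable_baselines3.common.vec_env import VecNormalize
+
+ n_envs = 64
+ env = make_vec_env("PandaReachDense-v3", n_envs=n_envs)
+ env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)
+
+ # Checkpoint roughly every 500,000 total steps (save_freq is per-env steps)
+ checkpoint_cb = CheckpointCallback(save_freq=500_000 // n_envs, save_path="./checkpoints/")
+
+ model = A2C("MultiInputPolicy", env, verbose=1)
+ model.learn(total_timesteps=2_000_000, callback=checkpoint_cb)
+
+ model.save("a2c-PandaReachDense-v3")
+ env.save("vec_normalize.pkl")  # persist the normalization statistics
+ ```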
+
+ ## 🤖 Model Architecture
+
+ The agent uses a **MultiInputPolicy** because the observation space is a dictionary containing:
+ - `observation`: Robot joint positions, velocities, and gripper state
+ - `desired_goal`: Target position coordinates (x, y, z)
+ - `achieved_goal`: Current end-effector position coordinates (x, y, z)
+
+ The goal is to minimize the distance between `achieved_goal` and `desired_goal`.
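+
+ A quick way to inspect this Dict observation space (the shapes in the comment are what the Reach task typically reports; treat them as illustrative):
+
+ ```python
+ import gymnasium as gym
+ import panda_gym  # noqa: F401 -- registers the Panda environments
+
+ env = gym.make("PandaReachDense-v3")
+ obs, info = env.reset()
+ for key, value in obs.items():
+     print(key, value.shape)
+ # observation (6,)   achieved_goal (3,)   desired_goal (3,)
+ ```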
+
+ ## 📈 Performance Notes
+
+ - **Reward Range**: Typically from -50 (far from target) to 0 (at target); see the sketch below
+ - **Success Criteria**: Achieving mean reward > -3.5 consistently
+ - **Episode Length**: Usually 50 steps per episode
+ - **Convergence**: Expect improvement after 200k-500k steps
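+
+ The dense reward in this task is, to a good approximation, the negative Euclidean distance between `achieved_goal` and `desired_goal` at each step, which is where the range above comes from: roughly -1 per step while far from the target, over 50 steps, gives about -50 per episode. A minimal sketch of that per-step signal, assuming reward = -distance:
+
+ ```python
+ import numpy as np
+
+ def dense_reward(achieved_goal: np.ndarray, desired_goal: np.ndarray) -> float:
+     """Illustrative per-step reward: negative distance to the goal."""
+     return -float(np.linalg.norm(achieved_goal - desired_goal))
+
+ # At the target the reward approaches 0; ~1 m away it is about -1 per step.
+ print(dense_reward(np.array([0.1, 0.0, 0.2]), np.array([0.1, 0.0, 0.2])))  # -0.0
+ ```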
+
+ ## 🎯 Tips for Reproduction
+
+ 1. **Normalization is Critical**: Always use VecNormalize for robotic tasks
+ 2. **MultiInputPolicy Required**: Dict observation spaces need special handling
+ 3. **Sufficient Training**: 1M+ timesteps recommended for stable performance
+ 4. **Evaluation**: Use `deterministic=True` for consistent evaluation results
a2c-PandaReachDense-v3.zip ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3af8b2dc6c5a2cc373cce7449ac5c61eb40ee625e0ce324059ee85bd45fe56d6
+ size 144629
replay.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1b60af1dcecfda6f73c306989ef311b599176539afb12382d698f5cc28c4f327
+ size 342107
vec_normalize.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b1cad9f821a0e211e4133dc80fbe6f74778ed709fdcd355db2dc25bd9ff92495
+ size 6169