---
tags:
- reinforcement-learning
- robotics
- mujoco
- locomotion
- unitree
- go2
- quadruped
- sac
- stable-baselines3
- strands-robots
library_name: stable-baselines3
model-index:
- name: SAC-Unitree-Go2-MuJoCo
results:
- task:
type: reinforcement-learning
name: Quadruped Locomotion
dataset:
type: custom
name: MuJoCo LocomotionEnv
metrics:
- type: mean_reward
value: 4912
name: Best Mean Reward
- type: mean_distance
value: 21.0
name: Mean Forward Distance (m)
---
# SAC Unitree Go2 β€” MuJoCo Locomotion Policy
A **Soft Actor-Critic (SAC)** policy trained to make the Unitree Go2 quadruped **walk forward** in MuJoCo simulation.
Trained entirely on a MacBook (CPU, no GPU, no Isaac Gym) using [strands-robots](https://github.com/cagataycali/strands-gtc-nvidia).
## Results
| Metric | Value |
|--------|-------|
| Algorithm | SAC (Soft Actor-Critic) |
| Training steps | 1.74M |
| Training time | ~40 min (MacBook M-series, CPU) |
| Parallel envs | 8 |
| Network | MLP [256, 256] |
| Best reward | **4,912** |
| Mean distance | **21 meters** per episode |
| Forward velocity | ~1 m/s |
| Episode length | 1,000/1,000 (full episodes) |
## Demo Video
<video src="https://huggingface.co/cagataydev/sac-unitree-go2-mujoco/resolve/main/go2_walking.mp4" controls autoplay loop muted></video>
## Usage
```python
from stable_baselines3 import SAC

model = SAC.load("best/best_model")

# In a MuJoCo Go2 environment (Gymnasium API):
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```
## Reward Function
```
reward = forward_vel Γ— 5.0 # primary: move forward
+ alive_bonus Γ— 1.0 # stay upright
+ upright_reward Γ— 0.3 # orientation bonus
- ctrl_cost Γ— 0.001 # minimize energy
- lateral_penalty Γ— 0.3 # don't drift sideways
- smoothness Γ— 0.0001 # discourage jerky motion
```
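The weighted sum above can be sketched as a small function. This is a hypothetical reconstruction, not the exact code from strands-robots: the argument names (`forward_vel`, `up_z`, `lateral_vel`) and the choice of squared torques for the energy and smoothness terms are assumptions.

```python
import numpy as np

def compute_reward(forward_vel, ctrl, prev_ctrl, up_z, lateral_vel):
    """Hypothetical reconstruction of the reward terms listed above."""
    reward = forward_vel * 5.0   # primary: move forward
    reward += 1.0                # alive bonus, paid while the robot is upright
    reward += up_z * 0.3         # orientation bonus (torso z-axis alignment, assumed)
    reward -= 0.001 * np.sum(np.square(ctrl))                # energy cost
    reward -= 0.3 * abs(lateral_vel)                         # sideways drift penalty
    reward -= 0.0001 * np.sum(np.square(ctrl - prev_ctrl))   # jerky-motion penalty
    return reward
```

With zero torques, an upright torso, and 1 m/s forward velocity this yields 5.0 + 1.0 + 0.3 = 6.3 per step, which shows how heavily the forward term dominates the shaping.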
## Why SAC > PPO
With PPO (500K steps), the Go2 learned only to stand still: reward 615, distance 0.02 m. With SAC (1.74M steps), it walks 21 meters: reward 4,912. SAC's off-policy replay and entropy regularization explore continuous action spaces more effectively than PPO's on-policy updates.
## Files
- `best/best_model.zip` β€” Best checkpoint (highest eval reward)
- `checkpoints/` β€” All 100K-step checkpoints
- `logs/evaluations.npz` β€” Evaluation metrics over training
- `go2_walking.mp4` β€” Demo video
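`logs/evaluations.npz` follows the format written by Stable-Baselines3's `EvalCallback` (arrays keyed `timesteps`, `results`, `ep_lengths`). A small helper to pull the learning curve out of it, assuming that format:

```python
import numpy as np

def best_eval_reward(path="logs/evaluations.npz"):
    """Load SB3 EvalCallback logs; return eval timesteps, mean rewards, and the best mean."""
    data = np.load(path)
    mean_rewards = data["results"].mean(axis=1)  # average over eval episodes at each point
    return data["timesteps"], mean_rewards, mean_rewards.max()
```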
## Environment
- **Simulator**: MuJoCo (via the official `mujoco` Python bindings)
- **Robot**: Unitree Go2 (12 DOF) from MuJoCo Menagerie
- **Observation**: joint positions, velocities, torso orientation, height (37-dim)
- **Action**: joint torques (12-dim, continuous)
## License
Apache-2.0