SAC Unitree Go2: MuJoCo Locomotion Policy

A Soft Actor-Critic (SAC) policy trained to make the Unitree Go2 quadruped walk forward in MuJoCo simulation.

Trained entirely on a MacBook (CPU, no GPU, no Isaac Gym) using strands-robots.

Results

Metric | Value
Algorithm | SAC (Soft Actor-Critic)
Training steps | 1.74M
Training time | ~40 min (MacBook M-series, CPU)
Parallel envs | 8
Network | MLP [256, 256]
Best reward | 4,912
Mean distance | 21 m per episode
Forward velocity | ~1 m/s
Episode length | 1,000/1,000 steps (full episodes)
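
The settings in the table map onto a Stable-Baselines3 setup roughly as sketched below. This is an illustration under assumptions, not the training script used for this model: `make_env` is a hypothetical factory for the Go2 environment, `train` and `callback_save_freq` are names introduced here, and every hyperparameter not listed in this card is left at the SB3 default.

```python
# Values taken from this card: the results table and the 100K-step checkpoints.
TRAIN_CONFIG = {
    "n_envs": 8,
    "net_arch": [256, 256],
    "total_timesteps": 1_740_000,
    "checkpoint_every": 100_000,
}


def callback_save_freq(checkpoint_every: int, n_envs: int) -> int:
    """SB3 callbacks count vectorized steps, so divide by the number of envs."""
    return max(checkpoint_every // n_envs, 1)


def train(make_env, config=TRAIN_CONFIG):
    """Hypothetical training entry point; `make_env` must return a Gymnasium env."""
    # Imported lazily so the config helpers above work without SB3 installed.
    from stable_baselines3 import SAC
    from stable_baselines3.common.callbacks import CheckpointCallback
    from stable_baselines3.common.env_util import make_vec_env

    vec_env = make_vec_env(make_env, n_envs=config["n_envs"])
    model = SAC(
        "MlpPolicy",
        vec_env,
        policy_kwargs={"net_arch": config["net_arch"]},
        verbose=1,
    )
    checkpoints = CheckpointCallback(
        save_freq=callback_save_freq(config["checkpoint_every"], config["n_envs"]),
        save_path="checkpoints/",
    )
    model.learn(total_timesteps=config["total_timesteps"], callback=checkpoints)
    return model
```

With 8 parallel environments, one callback step advances 8 environment steps, which is why the checkpoint interval is divided by `n_envs`.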

Demo Video

Usage

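from stable_baselines3 import SAC

# Load the best checkpoint (SB3 adds the ".zip" extension if it is missing).
model = SAC.load("best/best_model")

# `env` must be a Gymnasium-style MuJoCo Go2 environment; it is not created here.
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    # Start a new episode if the robot fell or the time limit was reached.
    if terminated or truncated:
        obs, _ = env.reset()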
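
To check the reported numbers on your own setup, a rollout helper along these lines may be useful. It is a sketch that assumes only the Gymnasium five-tuple `step` API and SB3's `predict` signature; `run_episode` is a name introduced here for illustration.

```python
def run_episode(model, env, max_steps=1000):
    """Roll out one episode and return (total_reward, steps_taken)."""
    obs, _ = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += float(reward)
        steps += 1
        # Stop on a fall (terminated) or a time limit (truncated).
        if terminated or truncated:
            break
    return total_reward, steps
```

An episode that runs to completion returns `steps == max_steps`, which corresponds to the 1,000/1,000 episode length in the results.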

Reward Function

reward = forward_vel × 5.0       # primary: move forward
       + alive_bonus × 1.0       # stay upright
       + upright_reward × 0.3    # orientation bonus
       - ctrl_cost × 0.001       # minimize energy
       - lateral_penalty × 0.3   # don't drift sideways
       - smoothness × 0.0001     # discourage jerky motion
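
Written out as a function, the weighted sum above looks roughly like this. The weights are the ones listed; how each term is measured (for example, whether `ctrl_cost` is a sum of squared torques) is not specified in this card, so the sketch treats every term as a precomputed scalar. `WEIGHTS` and `compute_reward` are names introduced here.

```python
# Weights copied from the reward definition above; penalties carry a negative sign.
WEIGHTS = {
    "forward_vel": 5.0,
    "alive_bonus": 1.0,
    "upright_reward": 0.3,
    "ctrl_cost": -0.001,
    "lateral_penalty": -0.3,
    "smoothness": -0.0001,
}


def compute_reward(terms: dict) -> float:
    """Combine precomputed reward terms using the weights above.

    `terms` maps each name in WEIGHTS to a scalar; missing terms count as 0.
    """
    return sum(weight * terms.get(name, 0.0) for name, weight in WEIGHTS.items())
```

For example, `compute_reward({"forward_vel": 1.0, "alive_bonus": 1.0})` evaluates the first two terms only and ignores the rest.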

Why SAC > PPO

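PPO (500K steps): the Go2 learned to stand still. Reward = 615, distance = 0.02 m.
SAC (1.74M steps): the Go2 walks 21 meters. Reward = 4,912.

SAC's off-policy learning and entropy regularization encourage broader exploration in continuous action spaces, which likely helped it escape the stand-still local optimum that PPO settled into. The two runs used different step budgets, so this is a practical observation rather than a controlled comparison.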
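
For reference, the entropy regularization mentioned above comes from SAC's maximum-entropy objective, shown here in its standard form. The temperature $\alpha$ used for this run is not stated in this card.

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right]
```

The entropy term $\mathcal{H}$ rewards the policy for keeping its action distribution broad, which works against collapsing early onto a low-effort behavior such as standing still.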

Files

  • best/best_model.zip: Best checkpoint (highest eval reward)
  • checkpoints/: All 100K-step checkpoints
  • logs/evaluations.npz: Evaluation metrics over training
  • go2_walking.mp4: Demo video
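
The evaluation log can be inspected with a few lines of Python. This sketch assumes the file was written by Stable-Baselines3's `EvalCallback`, which stores `timesteps` and a `results` array holding the episode returns of each evaluation; if the file was produced some other way, the keys may differ. `best_evaluation` and `load_evaluations` are names introduced here.

```python
def best_evaluation(timesteps, results):
    """Return (timestep, mean_return) of the evaluation with the highest mean return.

    `results` holds one sequence of episode returns per evaluation.
    """
    means = [sum(map(float, row)) / len(row) for row in results]
    best = max(range(len(means)), key=means.__getitem__)
    return int(timesteps[best]), means[best]


def load_evaluations(path="logs/evaluations.npz"):
    """Load the arrays from disk (requires NumPy)."""
    import numpy as np  # imported lazily so best_evaluation works on plain lists

    data = np.load(path)
    return data["timesteps"], data["results"]
```

Calling `best_evaluation(*load_evaluations())` should point at the training step whose checkpoint was saved as best/best_model.zip, if the assumption about `EvalCallback` holds.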

Environment

  • Simulator: MuJoCo (via its Python bindings)
  • Robot: Unitree Go2 (12 DOF) from MuJoCo Menagerie
  • Observation: joint positions, velocities, torso orientation, height (37-dim)
  • Action: joint torques (12-dim, continuous)
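
The card does not spell out how the 37 observation dimensions break down. One decomposition that matches, offered as an assumption rather than a description of the actual environment, is the full MuJoCo state of a free-floating 12-joint robot: `qpos` (base position, base quaternion, joint angles) concatenated with `qvel` (base linear and angular velocity, joint velocities).

```python
N_JOINTS = 12  # Go2 actuated joints, from the list above

# Assumed layout: full MuJoCo state of a model with one free joint.
QPOS_DIM = 3 + 4 + N_JOINTS   # base position, base quaternion, joint angles
QVEL_DIM = 3 + 3 + N_JOINTS   # base linear velocity, base angular velocity, joint velocities

OBS_DIM = QPOS_DIM + QVEL_DIM
ACTION_DIM = N_JOINTS         # one torque per joint

print(OBS_DIM, ACTION_DIM)    # 37 12
```

If the real environment drops the base x/y position or adds other features, the count would come out differently, so check `env.observation_space.shape` before relying on this layout.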

License

Apache-2.0
