SAC Unitree G1: MuJoCo Locomotion Policy
A Soft Actor-Critic (SAC) policy trained for the Unitree G1 humanoid in MuJoCo simulation. It is currently learning to balance: it stays upright for ~4 seconds, then stumbles forward.
Trained entirely on a MacBook (CPU, no GPU, no Isaac Gym) using strands-robots.
Results
| Metric | Value |
|---|---|
| Algorithm | SAC (Soft Actor-Critic) |
| Training steps | 1.91M |
| Training time | ~60 min (MacBook M-series, CPU) |
| Parallel envs | 8 |
| Network | MLP [256, 256] |
| Best reward | 530 |
| Mean distance | 2.65 m |
| Episode length | |
| Status | Balancing + stumbling forward |
Demo Video
See g1_balancing.mp4 in this repository.
Why It's Hard
The G1 has 29 DOF versus the Go2's 12. Bipedal balance is fundamentally harder: the robot must coordinate hips, knees, ankles, and torso simultaneously while maintaining balance over a tiny support polygon.
With more training (~5-10M steps, ~3 hours), it should learn to walk.
Usage
```python
from stable_baselines3 import SAC

model = SAC.load("best/best_model")

obs, _ = env.reset()  # env: the G1 locomotion environment (see Environment below)
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```
Reward Function
```
reward = forward_vel      * 5.0      # primary: move forward
       + alive_bonus      * 1.0      # stay upright
       + upright_reward   * 0.3      # orientation bonus
       - ctrl_cost        * 0.001    # minimize energy
       - lateral_penalty  * 0.3      # don't drift sideways
       - smoothness       * 0.0001   # discourage jerky motion
```
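The weights above come from the training run; how each term is computed from the MuJoCo state is not shown. A minimal sketch of one plausible implementation follows. The fall threshold, the quaternion-based uprightness term, and the use of squared torques/action deltas for the cost terms are all assumptions, not the trained policy's verbatim code:

```python
import numpy as np

def locomotion_reward(forward_vel, torso_quat, torso_height,
                      action, prev_action, lateral_vel):
    """Sketch of the reward function above; term definitions are assumed."""
    # assumed fall threshold: torso must stay above 0.5 m to earn the bonus
    alive_bonus = 1.0 if torso_height > 0.5 else 0.0
    # assumed uprightness measure from the torso quaternion (w, x, y, z):
    # equals 1.0 when the torso is vertical, decays as it tilts
    upright_reward = max(0.0, 2.0 * torso_quat[0] ** 2 - 1.0)
    ctrl_cost = float(np.sum(np.square(action)))             # energy proxy
    lateral_penalty = abs(lateral_vel)                       # sideways drift
    smoothness = float(np.sum(np.square(action - prev_action)))  # jerkiness
    return (forward_vel * 5.0
            + alive_bonus * 1.0
            + upright_reward * 0.3
            - ctrl_cost * 0.001
            - lateral_penalty * 0.3
            - smoothness * 0.0001)
```

With this shaping, standing still at an upright pose yields a small positive reward (alive + upright bonuses), so the agent learns to balance before it learns to move.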
Files
- best/best_model.zip: best checkpoint
- checkpoints/: all 100K-step checkpoints
- logs/evaluations.npz: evaluation metrics
- g1_balancing.mp4: demo video
Environment
- Simulator: MuJoCo (via the official mujoco Python bindings)
- Robot: Unitree G1 (29 DOF) from MuJoCo Menagerie
- Observation: joint positions, velocities, torso orientation, height (87-dim)
- Action: joint torques (29-dim, continuous)
License
Apache-2.0