---
license: mit
library_name: rsl_rl
pipeline_tag: reinforcement-learning
tags:
- reinforcement-learning
- robotics
- humanoid
- unitree-g1
- motion-tracking
- locomotion
- mjlab
- mujoco-warp
- ppo
---
# BFM-Design · Proxy v2 (BeyondMimic-aligned): Unitree G1 terrain-aware motion-tracking teacher
> Stage-1 **proxy agent** of the BFM-Design pipeline: a privileged PPO motion-tracking
> policy for the Unitree G1 humanoid on rough terrain. Trained to imitate PFNN-generated
> reference motion across 40 terrain cells, it serves as the **teacher** for Stage-2 CVAE
> (BFM) DAgger distillation.
## TL;DR
| | |
|---|---|
| Task | `Bfm-Proxy-V2-PfnnRough-Unitree-G1` (mjlab) |
| Robot | Unitree G1, 29 actuated DoF |
| Obs | **1150-dim** privileged (proprio history + per-body state + motion goal + 21×21 heightmap) |
| Action | **29-dim** joint position targets (PD), with 0–2 step action delay |
| Policy net | MLP `[2048, 2048, 1024, 1024, 512, 512]`, PPO (rsl_rl) |
| Training data | `motion_bag_v5`: 14.09 M frames / 84.6 k clips / 40 terrain cells (8 sub_types × 5 levels) |
| Scale | 1024 envs × 13 500 iters, single H20 (~15 h) |
| Result | ep_len ~70–87, reward ~+1.5, fall-terminations ≈ 0 |
| Role | Stage-1 teacher → Stage-2 CVAE BFM distillation |
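The dataset figures in the table can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the sub_type names are taken from the evaluation table in this card, the 0-based level indexing is an assumption, and the per-cell clip count assumes clips are spread uniformly over cells, which this card does not state.

```python
# Figures quoted in the TL;DR table.
TOTAL_FRAMES = 14_090_000  # "14.09 M frames"
TOTAL_CLIPS = 84_600       # "84.6 k clips"

# Sub_type names as they appear in the evaluation table of this card.
SUB_TYPES = [
    "flat", "pyramid_stairs", "pyramid_stairs_inv", "hf_pyramid_slope",
    "hf_pyramid_slope_inv", "random_rough", "wave_terrain", "box_line",
]
LEVELS = range(5)  # assumption: difficulty levels are indexed 0..4

# One terrain cell per (sub_type, level) pair.
cells = [(sub_type, level) for sub_type in SUB_TYPES for level in LEVELS]

frames_per_clip = TOTAL_FRAMES / TOTAL_CLIPS
clips_per_cell = TOTAL_CLIPS / len(cells)  # only meaningful if clips are uniform over cells

print(f"terrain cells:        {len(cells)}")
print(f"mean frames per clip: {frames_per_clip:.1f}")
print(f"clips per cell:       {clips_per_cell:.0f} (uniform-split assumption)")
```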
## Why "v2": reward design
This is the **v2** reward recipe. v1 combined a heavy 10-penalty linear *reward curriculum* with
an all-14-body 0.5 m termination, which **collapsed** episode length (28 → 10) as the
penalties ramped: the regularizers fought limb tracking, and `bad_motion_body_pos` turned
the resulting limb excursions into early termination, while mean body-position error stayed
at ~0.16 m throughout.
v2 reverts to the validated **BeyondMimic / mjlab-stock** minimal-shaping recipe:
| | weight / setting |
|---|---|
| Tracking rewards | `motion_body_pos/ori/lin_vel/ang_vel` = 1.0 each; `motion_global_root_pos/ori` = 0.5 |
| Penalties (only 3) | `action_rate_l2` **−0.1**, `joint_limit` −10, `self_collisions` −10 |
| Reward curriculum | **none (empty)** |
| Termination | 3-way z-only: `anchor_pos_z` 0.25 + `anchor_ori` 0.8 + `ee_body_pos_z` 0.25 (4 EE: ankles+wrists) + `motion_clip_end` (time-out) |
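As a rough illustration of how the weights in this table combine, the sketch below sums weighted terms for one step. It is not the mjlab reward manager: the expanded term names (for example `motion_body_lin_vel`) are inferred from the shorthand in the table, the function `total_reward` is hypothetical, tracking terms are assumed to lie in [0, 1], and any time-step scaling the framework applies is ignored.

```python
# Weights copied from the table above; term names expanded from its shorthand.
REWARD_WEIGHTS = {
    # Tracking rewards
    "motion_body_pos": 1.0,
    "motion_body_ori": 1.0,
    "motion_body_lin_vel": 1.0,
    "motion_body_ang_vel": 1.0,
    "motion_global_root_pos": 0.5,
    "motion_global_root_ori": 0.5,
    # Penalties (the only three in v2)
    "action_rate_l2": -0.1,
    "joint_limit": -10.0,
    "self_collisions": -10.0,
}


def total_reward(terms):
    """Weighted sum of raw term values; terms missing from `terms` count as 0."""
    return sum(w * terms.get(name, 0.0) for name, w in REWARD_WEIGHTS.items())


# Perfect tracking (every tracking term at 1.0, assumed upper bound) with no penalties.
perfect = {name: 1.0 for name in REWARD_WEIGHTS if name.startswith("motion_")}
print(total_reward(perfect))

# Same tracking, but with a raw action-rate cost of 2.0.
print(total_reward({**perfect, "action_rate_l2": 2.0}))
```

With only three penalties and no curriculum, the tracking terms dominate the sum unless a joint limit or self-collision is hit.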
Privileged proxy observations (per-body state + motion goal + heightmap) and the sim2real
domain-randomization stack (link-mass / PD-gain / friction / CoM / push / torque-RFI /
action-delay) are kept, since they are orthogonal to the v1 collapse. Full rationale and data:
see `docs/proxy_reward_design.md` in the source repo.
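Of the randomizations listed, the 0–2 step action delay is the one whose range is stated in this card, so it makes a convenient example. The following is a minimal sketch of one way such a delay could be implemented; the class `ActionDelay` is hypothetical, and sampling the delay once per episode (rather than per step) is an assumption, not something the source confirms.

```python
import random
from collections import deque


class ActionDelay:
    """Returns the action issued `delay` control steps ago (hypothetical helper)."""

    def __init__(self, max_delay=2, seed=None):
        self.max_delay = max_delay
        self.rng = random.Random(seed)
        self.delay = 0
        # Holds the newest action plus up to `max_delay` older ones.
        self.history = deque(maxlen=max_delay + 1)

    def reset(self, default_action):
        """Start an episode: sample a delay in [0, max_delay] and pre-fill the buffer."""
        self.delay = self.rng.randint(0, self.max_delay)
        self.history.clear()
        for _ in range(self.max_delay + 1):
            self.history.append(default_action)

    def step(self, action):
        """Record the policy's new action and return the one that is actually applied."""
        self.history.append(action)  # newest action sits at the right end
        return self.history[-1 - self.delay]


buf = ActionDelay(max_delay=2, seed=0)
buf.reset(default_action="a0")
buf.delay = 2  # force the maximum delay for the demonstration
applied = [buf.step(a) for a in ["a1", "a2", "a3", "a4"]]
print(applied)
```

With a delay of 2, the applied action lags the policy output by two steps, and the default action fills the gap at the start of the episode.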
## Intended use & role
- **Primary**: teacher policy whose privileged rollouts are distilled into the deployable
Stage-2 CVAE BFM (masked unified control interface) via DAgger.
- **Secondary**: a PHP-style "specialist on a single mode" baseline for ablation.
- **Not** directly deployable on hardware: observations are **privileged** (full sim state +
heightmap), not the 25-step proprioceptive history the deployable BFM uses.
## Checkpoint contents
`model_13499.pt` (full rsl_rl checkpoint, ~241 MB). Keys:
`actor_state_dict`, `critic_state_dict`, `optimizer_state_dict`, `iter`, `infos`.
Use `actor_state_dict` for inference; the rest is for resuming training.
> ONNX export is **not** included (the run's auto-export failed to serialize; a clean export
> can be regenerated from the actor if a deployment graph is needed).
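The quoted checkpoint size can be sanity-checked from the network shape. The sketch below counts MLP parameters and estimates the file size under several assumptions that this card does not confirm: the critic uses the same 1150-dim input and hidden layers as the actor with a single output, the optimizer is Adam (two moment buffers per parameter), everything is stored as float32, and the learned action-noise parameter and file overhead are ignored.

```python
def mlp_param_count(dims):
    """Weights plus biases of a fully connected MLP with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(dims[:-1], dims[1:]))


HIDDEN = [2048, 2048, 1024, 1024, 512, 512]

actor_params = mlp_param_count([1150, *HIDDEN, 29])
critic_params = mlp_param_count([1150, *HIDDEN, 1])  # assumption: critic mirrors the actor

BYTES_PER_FLOAT32 = 4
COPIES = 3  # weights + Adam first moment + Adam second moment (assumption)

estimated_bytes = (actor_params + critic_params) * COPIES * BYTES_PER_FLOAT32
estimated_mib = estimated_bytes / 2**20

print(f"actor params:  {actor_params:,}")
print(f"critic params: {critic_params:,}")
print(f"estimated checkpoint size: {estimated_mib:.0f} MiB")
```

The estimate lands close to the ~241 MB quoted above, which is consistent with this layout but does not prove it. For inference only, the actor weights alone are roughly a sixth of the file.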
## How to load (inference)
```python
import torch
ckpt = torch.load("model_13499.pt", map_location="cpu")
actor_sd = ckpt["actor_state_dict"] # MLP 1150 -> [2048,2048,1024,1024,512,512] -> 29
# Rebuild via the source repo's runner cfg (bfm_design/tasks/proxy_v1_rl_cfg.py),
# register the task (import bfm_design.tasks) and load into the rsl_rl MLP actor.
```
Reproduce the exact env (obs/action layout, terrain) from the source repo at the pinned
commit; the rasterized terrain is included there (`assets/terrain/mjlab_terrain_rasterized_v3h.npz`).
## Evaluation
Per-terrain eval, **all 8 sub_types at level 4**, 16 envs × 12 s (`model_13499.pt`):
| terrain (lvl4) | track body err (m) | track joint err (rad) |
|---|---|---|
| flat | 0.033 | 0.70 |
| pyramid_stairs | 0.051 | 0.72 |
| pyramid_stairs_inv | 0.061 | 0.99 |
| hf_pyramid_slope | **0.073** | 0.88 |
| hf_pyramid_slope_inv | 0.058 | 0.99 |
| random_rough | 0.062 | 0.99 |
| wave_terrain | 0.062 | 0.89 |
| box_line | 0.041 | 0.77 |
**Body tracking error is 3.3–7.3 cm across all 8 lvl4 terrains**, roughly 4–9× lower than
the Phase-B v2 teacher (0.286 m) and 8–18× lower than the A3' student (0.59 m).
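For a single-number summary of the table, the per-terrain errors can be averaged directly. This sketch copies the values from the table above; the unweighted mean across terrains is an illustrative choice, not a metric reported by the source.

```python
# (body tracking error in m, joint tracking error in rad), copied from the lvl4 table.
EVAL_LVL4 = {
    "flat": (0.033, 0.70),
    "pyramid_stairs": (0.051, 0.72),
    "pyramid_stairs_inv": (0.061, 0.99),
    "hf_pyramid_slope": (0.073, 0.88),
    "hf_pyramid_slope_inv": (0.058, 0.99),
    "random_rough": (0.062, 0.99),
    "wave_terrain": (0.062, 0.89),
    "box_line": (0.041, 0.77),
}

body_errs = [body for body, _ in EVAL_LVL4.values()]
joint_errs = [joint for _, joint in EVAL_LVL4.values()]

mean_body = sum(body_errs) / len(body_errs)
mean_joint = sum(joint_errs) / len(joint_errs)

# Terrains with the largest and smallest body tracking error.
worst = max(EVAL_LVL4, key=lambda name: EVAL_LVL4[name][0])
best = min(EVAL_LVL4, key=lambda name: EVAL_LVL4[name][0])

print(f"mean body error:  {mean_body * 100:.1f} cm")
print(f"mean joint error: {mean_joint:.2f} rad")
print(f"best / worst terrain: {best} / {worst}")
```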
Training-curve summary (tensorboard): mean reward −2.7 → +1.5; mean ep_len 6 → ~70–87;
`ee_body_pos` terminations 159 → ~1. Visually signed off via reference-ghost rollouts
(`scripts/render_v2_ghost_8terrain.py` in the source repo).
> Note: a naive `survive_ratio` over a fixed window reads 0 because `motion_clip_end`
> (a successful clip completion / time-out) is counted as "done"; actual fall
> terminations are ≈ 0.
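The distinction in the note above can be made explicit by classifying each episode end by its termination term. This is a minimal sketch: the term names come from the termination row of the reward table, while the function `episode_stats` and its input format (one termination reason per environment, `None` if the episode never ended in the window) are hypothetical.

```python
FAILURE_TERMS = {"anchor_pos_z", "anchor_ori", "ee_body_pos_z"}  # falls / tracking loss
TIMEOUT_TERMS = {"motion_clip_end"}                               # clip finished normally


def episode_stats(reasons):
    """Summarize termination reasons; `None` means no termination fired in the window."""
    if not reasons:
        raise ValueError("need at least one episode")
    n = len(reasons)
    falls = sum(r in FAILURE_TERMS for r in reasons)
    completions = sum(r in TIMEOUT_TERMS for r in reasons)
    done = sum(r is not None for r in reasons)
    return {
        "naive_survive_ratio": 1.0 - done / n,  # treats every "done" as a failure
        "fall_rate": falls / n,                 # only genuine failure terminations
        "completion_rate": completions / n,     # clips tracked to the end
    }


# 16 environments that all track their clip to completion.
stats = episode_stats(["motion_clip_end"] * 16)
print(stats)
```

In this example the naive ratio is 0 even though no environment fell, which is the situation the note describes.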
## Limitations
- Privileged obs ⇒ teacher-only, not sim2real-ready as-is.
- The `ee_body_pos_z` 0.25 m termination is tight for rough terrain (slow ep_len cold start
over ~iter 0–1500, before the policy learns to keep feet/wrists within tolerance).
- Trained on PFNN-derived reference motion (walk/jog/crouch gaits); behaviors outside the
reference distribution are out of scope (handled later by BFM residual learning).
## Source, data, citation
- **License**: MIT (© 2026 Huiqiao Fu), consistent with [Robo-PFNN](https://github.com/tRNAoO/Robo-PFNN).
- **HF repo**: `tRNAoOO/<name>`, public + gated (contact-info gate), matching the Robo-PFNN weights repo.
- **Code (pinned)**: GitLab `hqfu/bfm-design` (internal). Reward design: `docs/proxy_reward_design.md`; data regen: `docs/data_regeneration.md`.
- **Base framework**: [mjlab](https://github.com/mujocolab/mjlab) v1.3.0 + MuJoCo-Warp + rsl_rl.
- **Reference motion**: [Robo-PFNN](https://github.com/tRNAoO/Robo-PFNN) (kinematic generator).
- **Reward recipe lineage**: BeyondMimic ([arXiv 2508.08241](https://arxiv.org/abs/2508.08241) / HybridRobotics/whole_body_tracking); PARC ([arXiv 2505.04002](https://arxiv.org/abs/2505.04002)); DeepMimic.
- **Target architecture**: BFM (arXiv 2509.13780), a CVAE + masked unified control interface.
_Motion-bag training data (24 GB) is not distributed; regenerate deterministically per `docs/data_regeneration.md`._