---
license: mit
library_name: rsl_rl
pipeline_tag: reinforcement-learning
tags:
- reinforcement-learning
- robotics
- humanoid
- unitree-g1
- motion-tracking
- locomotion
- mjlab
- mujoco-warp
- ppo
---

# BFM-Design · Proxy v2 (BeyondMimic-aligned): Unitree G1 terrain-aware motion-tracking teacher

> Stage-1 **proxy agent** of the BFM-Design pipeline: a privileged PPO motion-tracking
> policy for the Unitree G1 humanoid on rough terrain. Trained to imitate PFNN-generated
> reference motion across 40 terrain cells, it serves as the **teacher** for Stage-2 CVAE
> (BFM) DAgger distillation.

## TL;DR

| | |
|---|---|
| Task | `Bfm-Proxy-V2-PfnnRough-Unitree-G1` (mjlab) |
| Robot | Unitree G1, 29 actuated DoF |
| Obs | **1150-dim** privileged (proprio history + per-body state + motion goal + 21×21 heightmap) |
| Action | **29-dim** joint position targets (PD), with 0–2 step action delay |
| Policy net | MLP `[2048, 2048, 1024, 1024, 512, 512]`, PPO (rsl_rl) |
| Training data | `motion_bag_v5`: 14.09 M frames / 84.6 k clips / 40 terrain cells (8 sub_types × 5 levels) |
| Scale | 1024 envs × 13 500 iters, single H20 (~15 h) |
| Result | ep_len ~70–87, reward ~+1.5, fall-terminations ≈ 0 |
| Role | Stage-1 teacher → Stage-2 CVAE BFM distillation |

## Why "v2": reward design

This is the **v2** reward recipe. v1 combined a heavy 10-penalty linear *reward curriculum*
with an all-14-body 0.5 m termination, which **collapsed** episode length (28 → 10) as the
penalties ramped up: the regularizers fought limb tracking, and `bad_motion_body_pos`
converted the resulting limb excursions into early death, while mean body-position error
stayed at ~0.16 m throughout.

v2 reverts to the validated **BeyondMimic / mjlab-stock** minimal-shaping recipe:

| | weight / setting |
|---|---|
| Tracking rewards | `motion_body_pos/ori/lin_vel/ang_vel` = 1.0 each; `motion_global_root_pos/ori` = 0.5 |
| Penalties (only 3) | `action_rate_l2` **−0.1**, `joint_limit` −10, `self_collisions` −10 |
| Reward curriculum | **none (empty)** |
| Termination | 3-way z-only: `anchor_pos_z` 0.25 + `anchor_ori` 0.8 + `ee_body_pos_z` 0.25 (4 EE: ankles+wrists) + `motion_clip_end` (time-out) |

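As a reading aid, the table above can be mirrored in a plain Python dictionary layout. This is only a sketch: the real settings live in the mjlab task config in the source repo, whose classes are not shown here. The expansion of `motion_body_pos/ori/lin_vel/ang_vel` into four separate term names, the constant names, and the `total_reward` helper are all assumptions made for illustration.

```python
# Hypothetical plain-dict mirror of the v2 recipe; not the mjlab config classes.
TRACKING_REWARDS = {
    "motion_body_pos": 1.0,
    "motion_body_ori": 1.0,
    "motion_body_lin_vel": 1.0,
    "motion_body_ang_vel": 1.0,
    "motion_global_root_pos": 0.5,
    "motion_global_root_ori": 0.5,
}
PENALTIES = {
    "action_rate_l2": -0.1,
    "joint_limit": -10.0,
    "self_collisions": -10.0,
}
REWARD_CURRICULUM = {}  # v2 uses no reward curriculum
TERMINATIONS = {
    "anchor_pos_z": 0.25,
    "anchor_ori": 0.8,
    "ee_body_pos_z": 0.25,    # 4 end effectors: ankles + wrists
    "motion_clip_end": None,  # time-out, no threshold
}


def total_reward(term_values):
    """Weighted sum of raw per-term values (illustrative helper)."""
    weights = {**TRACKING_REWARDS, **PENALTIES}
    return sum(weights[name] * value for name, value in term_values.items())
```

Written this way, the contrast with v1 is easy to see: six tracking terms, three penalties, and an empty curriculum.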
Privileged proxy observations (per-body state + motion goal + heightmap) and the sim2real
domain-randomization stack (link-mass / PD-gain / friction / CoM / push / torque-RFI /
action-delay) are kept, since they are orthogonal to the v1 collapse. For the full rationale
and data, see `docs/proxy_reward_design.md` in the source repo.

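To make the action-delay randomization concrete, here is a minimal sketch of a 0–2 step delay buffer. It is not the mjlab implementation. The class name `ActionDelayBuffer`, the choice to draw one delay per episode, and the zero action applied while the buffer fills are all assumptions; the source repo may sample and pad differently.

```python
import random
from collections import deque


class ActionDelayBuffer:
    """Hypothetical helper: returns the action issued `delay` steps ago."""

    def __init__(self, action_dim, max_delay=2, rng=None):
        self.action_dim = action_dim
        self.max_delay = max_delay
        self.rng = rng or random.Random()
        self.reset()

    def reset(self, delay=None):
        # Assumption: one delay per episode, uniform over 0..max_delay.
        self.delay = self.rng.randint(0, self.max_delay) if delay is None else delay
        # Pre-fill with zero actions so the first `delay` steps have something to apply.
        zero = [0.0] * self.action_dim
        self.queue = deque([zero] * self.delay, maxlen=self.delay + 1)

    def step(self, action):
        # Push the newest policy output, pop the oldest one for the simulator.
        self.queue.append(list(action))
        return self.queue.popleft()


buf = ActionDelayBuffer(action_dim=29)
applied = buf.step([0.1] * 29)  # equals the input only when buf.delay == 0
```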
## Intended use & role

- **Primary**: teacher policy whose privileged rollouts are distilled into the deployable
  Stage-2 CVAE BFM (masked unified control interface) via DAgger.
- **Secondary**: a PHP-style "specialist on a single mode" baseline for ablation.
- **Not** directly deployable on hardware: observations are **privileged** (full sim state +
  heightmap), not the 25-step proprioceptive history the deployable BFM uses.

## Checkpoint contents

`model_13499.pt` is a full rsl_rl checkpoint (~241 MB) with the keys
`actor_state_dict`, `critic_state_dict`, `optimizer_state_dict`, `iter`, `infos`.
Use `actor_state_dict` for inference; the rest is for resuming training.

> ONNX export is **not** included (the run's auto-export failed to serialize; a clean export
> can be regenerated from the actor if a deployment graph is needed).

## How to load (inference)

```python
import torch

# Load the full rsl_rl checkpoint on CPU. If PyTorch >= 2.6 rejects the file,
# pass weights_only=False (only for checkpoints you trust).
ckpt = torch.load("model_13499.pt", map_location="cpu")
actor_sd = ckpt["actor_state_dict"]  # MLP 1150 -> [2048,2048,1024,1024,512,512] -> 29
# Rebuild via the source repo's runner cfg (bfm_design/tasks/proxy_v1_rl_cfg.py),
# register the task (import bfm_design.tasks) and load into the rsl_rl MLP actor.
```

Reproduce the exact env (obs/action layout, terrain) from the source repo at the pinned
commit; the rasterized terrain is included there (`assets/terrain/mjlab_terrain_rasterized_v3h.npz`).

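Before rebuilding the actor, it can help to check that the tensors in the checkpoint match the architecture stated above. The sketch below is pure Python and works on a name → shape mapping, so it does not depend on the actual key names inside `actor_state_dict`, which are not documented here. The helper names and the `actor.N.weight` keys in the example are hypothetical. It relies on `torch.nn.Linear` storing weights as `(out_features, in_features)`.

```python
def mlp_layer_sizes(shapes):
    """Return [in, hidden..., out] from an ordered name -> shape mapping.

    Only 2-D entries (Linear weight matrices) are used; biases and other
    1-D entries such as an action-std parameter are skipped.
    """
    weights = [s for s in shapes.values() if len(s) == 2]
    sizes = [weights[0][1]]
    for out_features, in_features in weights:
        if in_features != sizes[-1]:
            raise ValueError(f"layer chain broken: {in_features} != {sizes[-1]}")
        sizes.append(out_features)
    return sizes


def mlp_param_count(sizes):
    """Weights plus biases of a fully connected stack."""
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))


# Shapes implied by the model card; the key names are made up for the example.
expected = [1150, 2048, 2048, 1024, 1024, 512, 512, 29]
example_shapes = {}
for k, (i, o) in enumerate(zip(expected, expected[1:])):
    example_shapes[f"actor.{k}.weight"] = (o, i)
    example_shapes[f"actor.{k}.bias"] = (o,)

print(mlp_layer_sizes(example_shapes))
print(mlp_param_count(expected))
```

With the real checkpoint, the mapping would come from `{k: tuple(v.shape) for k, v in actor_sd.items()}`. If the inferred sizes differ from `expected`, the env or runner config is likely not at the pinned commit.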
## Evaluation

Per-terrain eval, **all 8 sub_types at level 4**, 16 envs × 12 s (`model_13499.pt`):

| terrain (lvl4) | track body err (m) | track joint err (rad) |
|---|---|---|
| flat | 0.033 | 0.70 |
| pyramid_stairs | 0.051 | 0.72 |
| pyramid_stairs_inv | 0.061 | 0.99 |
| hf_pyramid_slope | **0.073** | 0.88 |
| hf_pyramid_slope_inv | 0.058 | 0.99 |
| random_rough | 0.062 | 0.99 |
| wave_terrain | 0.062 | 0.89 |
| box_line | 0.041 | 0.77 |

**Body tracking error is 3.3–7.3 cm across all 8 lvl4 terrains**, roughly 4–9× lower than
the Phase-B v2 teacher (0.286 m) and 8–18× lower than the A3' student (0.59 m).

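The comparison with the earlier baselines is plain arithmetic on the table, sketched below. The per-terrain numbers and the two baseline errors are copied from this card; the variable and function names are illustrative, and the ratios assume the baselines were measured under a comparable protocol, which this card does not state.

```python
from statistics import mean

# Body tracking error (m) at level 4, copied from the table above.
BODY_ERR_M = {
    "flat": 0.033,
    "pyramid_stairs": 0.051,
    "pyramid_stairs_inv": 0.061,
    "hf_pyramid_slope": 0.073,
    "hf_pyramid_slope_inv": 0.058,
    "random_rough": 0.062,
    "wave_terrain": 0.062,
    "box_line": 0.041,
}
BASELINES_M = {"Phase-B v2 teacher": 0.286, "A3' student": 0.59}

best = min(BODY_ERR_M.values())
worst = max(BODY_ERR_M.values())


def improvement_range(baseline_err):
    """Ratio of baseline error to this policy's worst and best terrain."""
    return baseline_err / worst, baseline_err / best


print(f"range {best:.3f}-{worst:.3f} m, mean {mean(BODY_ERR_M.values()):.3f} m")
for name, err in BASELINES_M.items():
    lo, hi = improvement_range(err)
    print(f"{name}: {lo:.1f}x to {hi:.1f}x higher error")
```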
Training-curve summary (tensorboard): mean reward −2.7 → +1.5; mean ep_len 6 → ~70–87;
`ee_body_pos` terminations 159 → ~1. Visually signed off via reference-ghost rollouts
(`scripts/render_v2_ghost_8terrain.py` in the source repo).

> Note: a naive `survive_ratio` over a fixed window reads 0 because `motion_clip_end`
> (a successful clip completion / time-out) is counted as "done"; actual fall
> terminations are ≈ 0.

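The difference between the two survival readings can be sketched as follows. This is not the repo's eval code: the function name, the per-env list of termination reasons, and the toy inputs are assumptions chosen only to show why counting `motion_clip_end` as a failure drives the naive ratio to 0.

```python
# Termination reasons that mean "clip finished", not "robot fell".
TIMEOUT_REASONS = {"motion_clip_end"}


def survive_ratios(done_reasons):
    """Return (naive, fall_aware) survival ratios.

    `done_reasons` holds one entry per env: the termination reason string,
    or None if the env was still running at the end of the window.
    """
    n = len(done_reasons)
    # Naive: any "done" counts as a failure.
    naive = sum(r is None for r in done_reasons) / n
    # Fall-aware: time-outs count as survival, only real terminations fail.
    fall_aware = sum(r is None or r in TIMEOUT_REASONS for r in done_reasons) / n
    return naive, fall_aware


# Toy input (not eval data): every env finished its clip without falling.
print(survive_ratios(["motion_clip_end"] * 4))
```

For the toy input, the naive ratio is 0.0 while the fall-aware ratio is 1.0, which is the situation described in the note above.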
## Limitations

- Privileged obs: teacher-only, not sim2real-ready as-is.
- The 0.25 m `ee_body_pos_z` termination is tight for rough terrain: ep_len is slow to rise
  over roughly iters 0–1500, until the policy learns to keep feet and wrists within tolerance.
- Trained on PFNN-derived reference motion (walk/jog/crouch gaits); behaviors outside the
  reference distribution are out of scope (handled later by BFM residual learning).

## Source, data, citation

- **License**: MIT (© 2026 Huiqiao Fu), consistent with [Robo-PFNN](https://github.com/tRNAoO/Robo-PFNN).
- **HF repo**: `tRNAoOO/<name>`: public + gated (contact-info gate), matching the Robo-PFNN weights repo.
- **Code (pinned)**: GitLab `hqfu/bfm-design` (internal). Reward design: `docs/proxy_reward_design.md`; data regen: `docs/data_regeneration.md`.
- **Base framework**: [mjlab](https://github.com/mujocolab/mjlab) v1.3.0 + MuJoCo-Warp + rsl_rl.
- **Reference motion**: [Robo-PFNN](https://github.com/tRNAoO/Robo-PFNN) (kinematic generator).
- **Reward recipe lineage**: BeyondMimic ([arXiv 2508.08241](https://arxiv.org/abs/2508.08241) / HybridRobotics/whole_body_tracking); PARC ([arXiv 2505.04002](https://arxiv.org/abs/2505.04002)); DeepMimic.
- **Target architecture**: BFM (arXiv 2509.13780): CVAE + masked unified control interface.

_Motion-bag training data (24 GB) is not distributed; regenerate deterministically per `docs/data_regeneration.md`._