license: mit
library_name: rsl_rl
pipeline_tag: reinforcement-learning
tags:
  - reinforcement-learning
  - robotics
  - humanoid
  - unitree-g1
  - motion-tracking
  - locomotion
  - mjlab
  - mujoco-warp
  - ppo

BFM-Design · Proxy v2 (BeyondMimic-aligned) — Unitree G1 terrain-aware motion-tracking teacher

Stage-1 proxy agent of the BFM-Design pipeline: a privileged PPO motion-tracking policy for the Unitree G1 humanoid on rough terrain. Trained to imitate PFNN-generated reference motion across 40 terrain cells, it serves as the teacher for Stage-2 CVAE (BFM) DAgger distillation.

TL;DR

| | |
|---|---|
| Task | Bfm-Proxy-V2-PfnnRough-Unitree-G1 (mjlab) |
| Robot | Unitree G1, 29 actuated DoF |
| Obs | 1150-dim privileged (proprio history + per-body state + motion goal + 21×21 heightmap) |
| Action | 29-dim joint position targets (PD), with 0–2 step action delay |
| Policy net | MLP [2048, 2048, 1024, 1024, 512, 512], PPO (rsl_rl) |
| Training data | motion_bag_v5: 14.09 M frames / 84.6 k clips / 40 terrain cells (8 sub_types × 5 levels) |
| Scale | 1024 envs × 13 500 iters, single H20 (~15 h) |
| Result | ep_len ~70–87, reward ~+1.5, fall-terminations ≈ 0 |
| Role | Stage-1 teacher → Stage-2 CVAE BFM distillation |
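The "0–2 step action delay" in the table above can be pictured as a small ring buffer that hands the simulator an action issued a few control steps earlier. The sketch below is illustrative only: the class name `ActionDelayBuffer`, the per-episode sampling of the delay, and the zero pre-fill are assumptions, not the source repo's implementation.

```python
import random
from collections import deque


class ActionDelayBuffer:
    """Hypothetical helper: returns the action issued `delay` control steps ago."""

    def __init__(self, max_delay=2, action_dim=29):
        self.max_delay = max_delay
        self.action_dim = action_dim
        self.reset(delay=0)

    def reset(self, delay=None, rng=random):
        # Assumption: one delay in [0, max_delay] is drawn per episode.
        self.delay = rng.randint(0, self.max_delay) if delay is None else delay
        zero = [0.0] * self.action_dim
        # Pre-fill with zero actions so the first `delay` steps are defined.
        self.history = deque([zero] * (self.max_delay + 1),
                             maxlen=self.max_delay + 1)

    def step(self, action):
        self.history.append(list(action))
        # history[-1] is the newest action; history[-1 - delay] is `delay` steps old.
        return self.history[-1 - self.delay]


buf = ActionDelayBuffer(max_delay=2, action_dim=1)
buf.reset(delay=2)
applied = [buf.step([float(t)])[0] for t in range(1, 5)]
print(applied)  # [0.0, 0.0, 1.0, 2.0]
```

With a delay of 2, the first two applied actions are the zero pre-fill and every later one lags the policy output by two steps.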

Why "v2" — reward design

This checkpoint uses the v2 reward recipe. v1 combined a heavy linear reward curriculum over 10 penalties with an all-14-body 0.5 m termination. As the penalties ramped, episode length collapsed (28 → 10): the regularizers fought limb tracking, and bad_motion_body_pos converted the resulting limb excursions into early terminations. Mean body-position error stayed at ~0.16 m throughout.

v2 reverts to the validated BeyondMimic / mjlab-stock minimal-shaping recipe:

| | weight / setting |
|---|---|
| Tracking rewards | motion_body_pos/ori/lin_vel/ang_vel = 1.0 each; motion_global_root_pos/ori = 0.5 |
| Penalties (only 3) | action_rate_l2 −0.1, joint_limit −10, self_collisions −10 |
| Reward curriculum | none (empty) |
| Termination | 3-way z-only: anchor_pos_z 0.25 + anchor_ori 0.8 + ee_body_pos_z 0.25 (4 EE: ankles+wrists) + motion_clip_end (time-out) |
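The recipe in the table above can be restated as plain data plus two small functions. This is a minimal sketch, not the mjlab configuration itself. The weights and thresholds are copied from the table. The expanded term names (e.g. `motion_body_ori`), the helper names `total_reward` and `termination_reason`, and the use of scalar error inputs are assumptions for illustration. How each raw term is computed (kernel shape, units of `anchor_ori`) is not stated in this card and is left out.

```python
# Weights from the v2 table; expanded names are an assumed reading of
# "motion_body_pos/ori/lin_vel/ang_vel" and "motion_global_root_pos/ori".
REWARD_WEIGHTS = {
    "motion_body_pos": 1.0,
    "motion_body_ori": 1.0,
    "motion_body_lin_vel": 1.0,
    "motion_body_ang_vel": 1.0,
    "motion_global_root_pos": 0.5,
    "motion_global_root_ori": 0.5,
    "action_rate_l2": -0.1,
    "joint_limit": -10.0,
    "self_collisions": -10.0,
}

# Thresholds from the table; the unit of anchor_ori is not given in the card.
TERMINATION_THRESHOLDS = {
    "anchor_pos_z": 0.25,
    "anchor_ori": 0.8,
    "ee_body_pos_z": 0.25,
}


def total_reward(terms):
    """Weighted sum of raw per-step term values (missing terms count as 0)."""
    return sum(w * terms.get(name, 0.0) for name, w in REWARD_WEIGHTS.items())


def termination_reason(anchor_z_err, anchor_ori_err, ee_z_errs, clip_ended):
    """Return the name of the first triggered termination, or None."""
    if abs(anchor_z_err) > TERMINATION_THRESHOLDS["anchor_pos_z"]:
        return "anchor_pos_z"
    if abs(anchor_ori_err) > TERMINATION_THRESHOLDS["anchor_ori"]:
        return "anchor_ori"
    # Four end-effectors (ankles + wrists): the worst one decides.
    if max(abs(e) for e in ee_z_errs) > TERMINATION_THRESHOLDS["ee_body_pos_z"]:
        return "ee_body_pos_z"
    if clip_ended:
        return "motion_clip_end"  # time-out, not a failure
    return None


perfect = {name: 1.0 for name, w in REWARD_WEIGHTS.items() if w > 0}
print(total_reward(perfect))                               # 5.0
print(termination_reason(0.1, 0.2, [0.05, 0.3, 0.0, 0.0], False))  # ee_body_pos_z
```

If every tracking term is assumed to lie in [0, 1], the positive weights sum to 5.0, so the three penalties can only pull the per-step total below that ceiling.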

Privileged proxy observations (per-body state + motion goal + heightmap) and the sim2real domain-randomization stack (link-mass / PD-gain / friction / CoM / push / torque-RFI / action-delay) are kept — they are orthogonal to the v1 collapse. Full rationale + data: see docs/proxy_reward_design.md in the source repo.

Intended use & role

  • Primary: teacher policy whose privileged rollouts are distilled into the deployable Stage-2 CVAE BFM (masked unified control interface) via DAgger.
  • Secondary: a PHP-style "specialist on a single mode" baseline for ablation.
  • Not directly deployable on hardware: observations are privileged (full sim state + heightmap), not the 25-step proprioceptive history the deployable BFM uses.

Checkpoint contents

model_13499.pt (full rsl_rl checkpoint, ~241 MB) — keys: actor_state_dict, critic_state_dict, optimizer_state_dict, iter, infos. Use actor_state_dict for inference; the rest is for resuming training.
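As a rough sanity check on the checkpoint size, the parameter count of the stated MLP can be worked out by hand. The sketch below assumes float32 storage, a critic with the same 1150-dim input and hidden sizes but a single output, and an Adam-style optimizer holding two moment buffers per parameter. None of these are confirmed by this card, and small extras such as an action-std parameter are ignored.

```python
def mlp_param_count(in_dim, hidden, out_dim):
    """Weights plus biases of a fully connected MLP."""
    dims = [in_dim, *hidden, out_dim]
    return sum(a * b + b for a, b in zip(dims, dims[1:]))


HIDDEN = [2048, 2048, 1024, 1024, 512, 512]

actor_params = mlp_param_count(1150, HIDDEN, 29)
critic_params = mlp_param_count(1150, HIDDEN, 1)   # assumption: same trunk, scalar value
network_params = actor_params + critic_params
optimizer_values = 2 * network_params              # assumption: Adam, two moments per weight
total_bytes = 4 * (network_params + optimizer_values)  # assumption: float32

print(actor_params)                    # 10503709
print(critic_params)                   # 10489345
print(round(total_bytes / 1e6, 1))     # 251.9 (MB)
print(round(total_bytes / 2**20, 1))   # 240.2 (MiB)
```

Under these assumptions the estimate is about 252 MB (roughly 240 MiB), in the same range as the ~241 MB file size, while the actor alone would be about 42 MB.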

ONNX export is not included (the run's auto-export failed to serialize; a clean export can be regenerated from the actor if a deployment graph is needed).

How to load (inference)

import torch
ckpt = torch.load("model_13499.pt", map_location="cpu")
actor_sd = ckpt["actor_state_dict"]   # MLP 1150 -> [2048,2048,1024,1024,512,512] -> 29
# Rebuild via the source repo's runner cfg (bfm_design/tasks/proxy_v1_rl_cfg.py),
# register the task (import bfm_design.tasks) and load into the rsl_rl MLP actor.

Reproduce the exact env (obs/action layout, terrain) from the source repo at the pinned commit; the rasterized terrain is included there (assets/terrain/mjlab_terrain_rasterized_v3h.npz).

Evaluation

Per-terrain eval, all 8 sub_types at level 4, 16 envs × 12 s (model_13499.pt):

| terrain (lvl4) | track body err (m) | track joint err (rad) |
|---|---|---|
| flat | 0.033 | 0.70 |
| pyramid_stairs | 0.051 | 0.72 |
| pyramid_stairs_inv | 0.061 | 0.99 |
| hf_pyramid_slope | 0.073 | 0.88 |
| hf_pyramid_slope_inv | 0.058 | 0.99 |
| random_rough | 0.062 | 0.99 |
| wave_terrain | 0.062 | 0.89 |
| box_line | 0.041 | 0.77 |

Body tracking error is 3.3–7.3 cm across all 8 lvl4 terrains, roughly 4–9× lower than the Phase-B v2 teacher (0.286 m) and 8–18× lower than the A3' student (0.59 m).
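The comparison against the two baselines can be reproduced directly from the table. The snippet below only restates the numbers printed in this card. The variable names are illustrative, and the unweighted mean over terrains is a convenience summary, not a metric reported by the evaluation script.

```python
# Body tracking error (m) per level-4 terrain, copied from the table above.
BODY_ERR_M = {
    "flat": 0.033,
    "pyramid_stairs": 0.051,
    "pyramid_stairs_inv": 0.061,
    "hf_pyramid_slope": 0.073,
    "hf_pyramid_slope_inv": 0.058,
    "random_rough": 0.062,
    "wave_terrain": 0.062,
    "box_line": 0.041,
}
# Baseline errors (m) quoted in the text.
BASELINES_M = {"Phase-B v2 teacher": 0.286, "A3' student": 0.59}

best = min(BODY_ERR_M.values())
worst = max(BODY_ERR_M.values())
mean_err = sum(BODY_ERR_M.values()) / len(BODY_ERR_M)
print(f"range {best:.3f}-{worst:.3f} m, mean {mean_err:.4f} m")

for name, baseline in BASELINES_M.items():
    # Improvement factor against the worst and the best terrain.
    print(f"{name}: {baseline / worst:.1f}x to {baseline / best:.1f}x lower error")
```

The factor depends strongly on which terrain is used as the reference, which is why a range is more informative than a single "order of magnitude" figure.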

Training-curve summary (tensorboard): mean reward −2.7 → +1.5; mean ep_len 6 → ~70–87; ee_body_pos terminations 159 → ~1. Visually signed off via reference-ghost rollouts (scripts/render_v2_ghost_8terrain.py in the source repo).

Note: a naive survive_ratio over a fixed window reads 0 because motion_clip_end (a successful clip completion / time-out) is counted as "done"; actual fall terminations are ≈ 0.
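One way to avoid the misleading survive_ratio is to bucket episode endings by termination term and treat motion_clip_end as a time-out instead of a failure. The sketch below assumes each finished episode is logged with the name of the term that ended it. The function `summarize_endings` and that logging format are hypothetical, not part of mjlab or rsl_rl.

```python
from collections import Counter

# Terms that end an episode without indicating a fall.
TIMEOUT_TERMS = {"motion_clip_end"}


def summarize_endings(done_reasons):
    """Split episode endings into time-outs and falls.

    `done_reasons` is a list with one termination-term name per finished episode.
    """
    counts = Counter(done_reasons)
    episodes = sum(counts.values())
    timeouts = sum(counts[term] for term in TIMEOUT_TERMS)
    falls = episodes - timeouts
    return {
        "episodes": episodes,
        "timeouts": timeouts,
        "falls": falls,
        "fall_rate": falls / episodes if episodes else 0.0,
    }


# Illustrative log: 15 clips completed, 1 end-effector height violation.
log = ["motion_clip_end"] * 15 + ["ee_body_pos_z"]
print(summarize_endings(log))
```

In this toy log every episode is "done", so a naive survive ratio would be 0, while the fall rate is 1/16.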

Limitations

  • Privileged obs ⇒ teacher-only, not sim2real-ready as-is.
  • The 0.25 m ee_body_pos_z termination is tight for rough terrain: ep_len has a slow cold start (~iter 0–1500) until the policy learns to keep feet and wrists within tolerance.
  • Trained on PFNN-derived reference motion (walk/jog/crouch gaits); behaviors outside the reference distribution are out of scope (handled later by BFM residual learning).

Source, data, citation

  • License: MIT (© 2026 Huiqiao Fu), consistent with Robo-PFNN.
  • HF repo: tRNAoOO/<name> — public + gated (contact-info gate), matching the Robo-PFNN weights repo.
  • Code (pinned): GitLab hqfu/bfm-design (internal). Reward design: docs/proxy_reward_design.md; data regen: docs/data_regeneration.md.
  • Base framework: mjlab v1.3.0 + MuJoCo-Warp + rsl_rl.
  • Reference motion: Robo-PFNN (kinematic generator).
  • Reward recipe lineage: BeyondMimic (arXiv 2508.08241 / HybridRobotics/whole_body_tracking); PARC (arXiv 2505.04002); DeepMimic.
  • Target architecture: BFM (arXiv 2509.13780) — CVAE + masked unified control interface.

Motion-bag training data (24 GB) is not distributed; regenerate deterministically per docs/data_regeneration.md.