license: mit
library_name: rsl_rl
pipeline_tag: reinforcement-learning
tags:
  - reinforcement-learning
  - robotics
  - humanoid
  - unitree-g1
  - motion-tracking
  - locomotion
  - mjlab
  - mujoco-warp
  - ppo

BFM-Design · Proxy v2 (BeyondMimic-aligned) — Unitree G1 terrain-aware motion-tracking teacher

Stage-1 proxy agent of the BFM-Design pipeline: a privileged PPO motion-tracking policy for the Unitree G1 humanoid on rough terrain. Trained to imitate PFNN-generated reference motion across 40 terrain cells, it serves as the teacher for Stage-2 CVAE (BFM) DAgger distillation.

TL;DR


Task	`Bfm-Proxy-V2-PfnnRough-Unitree-G1` (mjlab)
Robot	Unitree G1, 29 actuated DoF
Obs	1150-dim privileged (proprio history + per-body state + motion goal + 21×21 heightmap)
Action	29-dim joint position targets (PD), with 0–2 step action delay
Policy net	MLP `[2048, 2048, 1024, 1024, 512, 512]`, PPO (rsl_rl)
Training data	`motion_bag_v5` — 14.09 M frames / 84.6 k clips / 40 terrain cells (8 sub_types × 5 levels)
Scale	1024 envs × 13 500 iters, single H20 (~15 h)
Result	ep_len ~70–87, reward ~+1.5, fall-terminations ≈ 0
Role	Stage-1 teacher → Stage-2 CVAE BFM distillation
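
The "0–2 step action delay" in the Action row can be pictured as a small ring buffer in front of the PD targets. The following is only a sketch: `DelayedActionBuffer` is a hypothetical helper, not a class from mjlab or rsl_rl, and it assumes the delay is sampled once per episode, uniformly from {0, 1, 2} control steps. The source repo may sample or apply the delay differently.

```python
import random
from collections import deque


class DelayedActionBuffer:
    """Hypothetical helper: returns the action issued `delay` control steps ago."""

    def __init__(self, max_delay=2, rng=None):
        self.max_delay = max_delay
        self.rng = rng or random.Random()
        self.delay = 0
        # Keep just enough history to look back max_delay steps.
        self.history = deque(maxlen=max_delay + 1)

    def reset(self, initial_action):
        # Assumption: one delay per episode, uniform over 0..max_delay.
        self.delay = self.rng.randint(0, self.max_delay)
        self.history.clear()
        # Pre-fill so the first steps have something to return.
        for _ in range(self.max_delay + 1):
            self.history.append(initial_action)

    def step(self, action):
        self.history.append(action)
        # history[-1] is the newest action; step back by `delay`.
        return self.history[-1 - self.delay]


buf = DelayedActionBuffer(max_delay=2)
buf.reset(initial_action=0.0)
applied = [buf.step(a) for a in (0.1, 0.2, 0.3)]  # lags the commands by buf.delay steps
```

With a delay of 0 the buffer is a pass-through; with a delay of 2 the actuators see the command from two control steps earlier.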

Why "v2" — reward design

This is the v2 reward recipe. v1 combined a heavy linear reward curriculum over 10 penalty terms with a 0.5 m position termination applied to all 14 bodies. As the penalties ramped up, episode length collapsed (28 → 10): the regularizers fought limb tracking, and the bad_motion_body_pos termination converted the resulting limb excursions into early death, while mean body-position error stayed at ~0.16 m throughout.

v2 reverts to the validated BeyondMimic / mjlab-stock minimal-shaping recipe:

	weight / setting
Tracking rewards	`motion_body_pos/ori/lin_vel/ang_vel` = 1.0 each; `motion_global_root_pos/ori` = 0.5
Penalties (only 3)	`action_rate_l2` −0.1, `joint_limit` −10, `self_collisions` −10
Reward curriculum	none (empty)
Termination	3-way z-only: `anchor_pos_z` 0.25 m + `anchor_ori` 0.8 + `ee_body_pos_z` 0.25 m (4 end effectors: ankles + wrists) + `motion_clip_end` (time-out)
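
As a rough illustration of the table above, the recipe can be written down as plain Python data plus a termination check. This is a sketch, not the mjlab configuration: the container names and `termination_reason` are hypothetical, the expanded term names (for example `motion_body_ori`) are a reading of the `pos/ori/lin_vel/ang_vel` shorthand, and the check assumes each termination fires when the absolute tracking error exceeds its threshold.

```python
# Hypothetical containers; weights and thresholds are copied from the table above.
REWARD_WEIGHTS = {
    "motion_body_pos": 1.0,
    "motion_body_ori": 1.0,
    "motion_body_lin_vel": 1.0,
    "motion_body_ang_vel": 1.0,
    "motion_global_root_pos": 0.5,
    "motion_global_root_ori": 0.5,
    # The only three penalties kept in v2.
    "action_rate_l2": -0.1,
    "joint_limit": -10.0,
    "self_collisions": -10.0,
}

TERMINATION_THRESHOLDS = {
    "anchor_pos_z": 0.25,
    "anchor_ori": 0.8,
    "ee_body_pos_z": 0.25,
}


def termination_reason(anchor_z_err, anchor_ori_err, ee_z_errs, clip_finished):
    """Return the name of the first triggered termination, or None.

    Assumption: a term fires when the absolute error exceeds its threshold.
    ee_z_errs holds one z error per end effector (2 ankles + 2 wrists).
    """
    t = TERMINATION_THRESHOLDS
    if abs(anchor_z_err) > t["anchor_pos_z"]:
        return "anchor_pos_z"
    if abs(anchor_ori_err) > t["anchor_ori"]:
        return "anchor_ori"
    if any(abs(e) > t["ee_body_pos_z"] for e in ee_z_errs):
        return "ee_body_pos_z"
    if clip_finished:
        return "motion_clip_end"  # time-out, not a failure
    return None
```

For example, a wrist that drifts 0.3 m in z while the anchor stays on track ends the episode through `ee_body_pos_z`, whereas reaching the end of the reference clip ends it through `motion_clip_end`.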

Privileged proxy observations (per-body state + motion goal + heightmap) and the sim2real domain-randomization stack (link-mass / PD-gain / friction / CoM / push / torque-RFI / action-delay) are kept — they are orthogonal to the v1 collapse. Full rationale + data: see docs/proxy_reward_design.md in the source repo.

Intended use & role

Primary: teacher policy whose privileged rollouts are distilled into the deployable Stage-2 CVAE BFM (masked unified control interface) via DAgger.
Secondary: a PHP-style "specialist on a single mode" baseline for ablation.
Not directly deployable on hardware: observations are privileged (full sim state + heightmap), not the 25-step proprioceptive history the deployable BFM uses.

Checkpoint contents

model_13499.pt (full rsl_rl checkpoint, ~241 MB) — keys: actor_state_dict, critic_state_dict, optimizer_state_dict, iter, infos. Use actor_state_dict for inference; the rest is for resuming training.

ONNX export is not included (the run's auto-export failed to serialize; a clean export can be regenerated from the actor if a deployment graph is needed).

How to load (inference)

import torch

# The file is a full rsl_rl training checkpoint, not a bare state dict.
ckpt = torch.load("model_13499.pt", map_location="cpu")
actor_sd = ckpt["actor_state_dict"]   # MLP 1150 -> [2048,2048,1024,1024,512,512] -> 29
# Rebuild via the source repo's runner cfg (bfm_design/tasks/proxy_v1_rl_cfg.py),
# register the task (import bfm_design.tasks) and load into the rsl_rl MLP actor.

Reproduce the exact env (obs/action layout, terrain) from the source repo at the pinned commit; the rasterized terrain is included there (assets/terrain/mjlab_terrain_rasterized_v3h.npz).
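
A quick sanity check after loading is to compare the actor's parameter count against the documented layout (1150 inputs, six hidden layers, 29 outputs). The sketch below assumes plain fully connected layers with biases and nothing else; `mlp_param_count` is a hypothetical helper, not part of rsl_rl.

```python
def mlp_param_count(in_dim, hidden, out_dim):
    """Weights plus biases of a fully connected MLP."""
    dims = [in_dim, *hidden, out_dim]
    return sum(i * o + o for i, o in zip(dims[:-1], dims[1:]))


ACTOR_PARAMS = mlp_param_count(1150, [2048, 2048, 1024, 1024, 512, 512], 29)
print(ACTOR_PARAMS)  # 10503709
```

With the checkpoint loaded, `sum(t.numel() for t in actor_sd.values())` should land close to this figure. A small difference is possible if the state dict also stores extra tensors, such as a learned action-noise parameter.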

Evaluation

Per-terrain eval, all 8 sub_types at level 4, 16 envs × 12 s (model_13499.pt):

terrain (lvl4)	track body err (m)	track joint err (rad)
flat	0.033	0.70
pyramid_stairs	0.051	0.72
pyramid_stairs_inv	0.061	0.99
hf_pyramid_slope	0.073	0.88
hf_pyramid_slope_inv	0.058	0.99
random_rough	0.062	0.99
wave_terrain	0.062	0.89
box_line	0.041	0.77

Body tracking error is 3.3–7.3 cm across all 8 lvl4 terrains, roughly 4–9× lower than the Phase-B v2 teacher (0.286 m) and 8–18× lower than the A3' student (0.59 m).
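
The comparison above can be reproduced from the table with a few lines of Python. The body-error values and the two baseline figures are copied from this card; the variable names are illustrative only.

```python
# Level-4 body tracking error per terrain, in metres (from the table above).
BODY_ERR_M = {
    "flat": 0.033,
    "pyramid_stairs": 0.051,
    "pyramid_stairs_inv": 0.061,
    "hf_pyramid_slope": 0.073,
    "hf_pyramid_slope_inv": 0.058,
    "random_rough": 0.062,
    "wave_terrain": 0.062,
    "box_line": 0.041,
}

# Baselines quoted in the text, in metres.
BASELINES_M = {"Phase-B v2 teacher": 0.286, "A3' student": 0.59}

best = min(BODY_ERR_M.values())
worst = max(BODY_ERR_M.values())
mean = sum(BODY_ERR_M.values()) / len(BODY_ERR_M)
hardest = max(BODY_ERR_M, key=BODY_ERR_M.get)

print(f"range {best * 100:.1f}-{worst * 100:.1f} cm, mean {mean * 100:.2f} cm")
print(f"hardest terrain: {hardest}")
for name, err in BASELINES_M.items():
    # Improvement factor against this model's worst and best terrain.
    print(f"{name}: {err / worst:.1f}x to {err / best:.1f}x higher error")
```

The mean over the eight terrains comes out at about 5.5 cm, with `hf_pyramid_slope` the hardest and `flat` the easiest.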

Training-curve summary (tensorboard): mean reward −2.7 → +1.5; mean ep_len 6 → ~70–87; ee_body_pos terminations 159 → ~1. Visually signed off via reference-ghost rollouts (scripts/render_v2_ghost_8terrain.py in the source repo).

Note: a naive survive_ratio over a fixed window reads 0 because motion_clip_end (a successful clip completion / time-out) is counted as "done"; actual fall terminations are ≈ 0.
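
One way to avoid that pitfall is to separate time-outs from failures when aggregating episode ends. The sketch below assumes each finished episode is logged with the name of the termination term that ended it; `fall_rate`, `naive_survive_ratio`, and the log format are hypothetical, not part of the source repo's tooling.

```python
from collections import Counter

# Terms that end an episode successfully (clip completed), per this card.
TIMEOUT_TERMS = {"motion_clip_end"}


def naive_survive_ratio(done_reasons, num_envs):
    """Fraction of envs with no 'done' at all in the window (the misleading metric)."""
    return max(num_envs - len(done_reasons), 0) / num_envs


def fall_rate(done_reasons):
    """Fraction of finished episodes that ended in a failure term."""
    counts = Counter(done_reasons)
    total = sum(counts.values())
    falls = sum(n for term, n in counts.items() if term not in TIMEOUT_TERMS)
    return falls / total if total else 0.0


# 16 envs, every episode ran to the end of its clip.
reasons = ["motion_clip_end"] * 16
print(naive_survive_ratio(reasons, num_envs=16))  # 0.0
print(fall_rate(reasons))                         # 0.0
```

Both metrics print 0.0 here, but they mean opposite things: the naive ratio says no environment "survived", while the fall rate says no episode actually failed.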

Limitations

Privileged obs ⇒ teacher-only, not sim2real-ready as-is.
The ee_body_pos_z 0.25 m termination is tight for rough terrain: ep_len has a slow cold start (roughly iters 0–1500) until the policy learns to keep feet and wrists within tolerance.
Trained on PFNN-derived reference motion (walk/jog/crouch gaits); behaviors outside the reference distribution are out of scope (handled later by BFM residual learning).

Source, data, citation

HF repo: tRNAoOO/<name> — public + gated (contact-info gate), matching the Robo-PFNN weights repo.
Code (pinned): GitLab hqfu/bfm-design (internal). Reward design: docs/proxy_reward_design.md; data regen: docs/data_regeneration.md.
Base framework: mjlab v1.3.0 + MuJoCo-Warp + rsl_rl.
Reference motion: Robo-PFNN (kinematic generator).
Reward recipe lineage: BeyondMimic (arXiv 2508.08241 / HybridRobotics/whole_body_tracking); PARC (arXiv 2505.04002); DeepMimic.
Target architecture: BFM (arXiv 2509.13780) — CVAE + masked unified control interface.

Motion-bag training data (24 GB) is not distributed; regenerate deterministically per docs/data_regeneration.md.