	---
	license: mit
	library_name: rsl_rl
	pipeline_tag: reinforcement-learning
	tags:
	- reinforcement-learning
	- robotics
	- humanoid
	- unitree-g1
	- motion-tracking
	- locomotion
	- mjlab
	- mujoco-warp
	- ppo
	---

	# BFM-Design · Proxy v2 (BeyondMimic-aligned) — Unitree G1 terrain-aware motion-tracking teacher

	> Stage-1 proxy agent of the BFM-Design pipeline: a privileged PPO motion-tracking
	> policy for the Unitree G1 humanoid on rough terrain. Trained to imitate PFNN-generated
	> reference motion across 40 terrain cells, it serves as the teacher for Stage-2 CVAE
	> (BFM) DAgger distillation.

	## TL;DR

	| Field | Value |
	|---|---|
	| Task | `Bfm-Proxy-V2-PfnnRough-Unitree-G1` (mjlab) |
	| Robot | Unitree G1, 29 actuated DoF |
	| Obs | 1150-dim privileged (proprio history + per-body state + motion goal + 21×21 heightmap) |
	| Action | 29-dim joint position targets (PD), with 0–2 step action delay |
	| Policy net | MLP `[2048, 2048, 1024, 1024, 512, 512]`, PPO (rsl_rl) |
	| Training data | `motion_bag_v5`: 14.09 M frames / 84.6 k clips / 40 terrain cells (8 sub_types × 5 levels) |
	| Scale | 1024 envs × 13 500 iters, single H20 (~15 h) |
	| Result | ep_len ~70–87, reward ~+1.5, fall-terminations ≈ 0 |
	| Role | Stage-1 teacher → Stage-2 CVAE BFM distillation |

	## Why "v2" — reward design

	This checkpoint uses the v2 reward recipe. v1 combined a heavy linear reward curriculum
	over 10 penalty terms with a 0.5 m termination applied to all 14 tracked bodies. As the
	penalties ramped up, episode length collapsed (28 → 10): the regularizers fought limb
	tracking, and `bad_motion_body_pos` turned the resulting limb excursions into early
	terminations, even though mean body-position error stayed at ~0.16 m throughout.

	v2 reverts to the validated BeyondMimic / mjlab-stock minimal-shaping recipe:

	| Component | Weight / setting |
	|---|---|
	| Tracking rewards | `motion_body_pos/ori/lin_vel/ang_vel` = 1.0 each; `motion_global_root_pos/ori` = 0.5 |
	| Penalties (only 3) | `action_rate_l2` −0.1, `joint_limit` −10, `self_collisions` −10 |
	| Reward curriculum | none (empty) |
	| Termination | 3-way z-only: `anchor_pos_z` 0.25 + `anchor_ori` 0.8 + `ee_body_pos_z` 0.25 (4 EE: ankles+wrists) + `motion_clip_end` (time-out) |
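
The recipe in the table above can be written down as a small plain-Python sketch. The dictionary layout and the `total_reward` / `should_terminate` helpers are hypothetical illustrations, not mjlab or rsl_rl APIs; only the term names, weights, and thresholds come from the table. The sketch assumes the shorthand `pos/ori/...` expands to one term per suffix, that 0.5 applies to each root term, and that the environment already supplies the raw per-term values and errors.

```python
# Weights copied from the table; the expanded term names are an assumption.
REWARD_WEIGHTS = {
    "motion_global_root_pos": 0.5,
    "motion_global_root_ori": 0.5,
    "motion_body_pos": 1.0,
    "motion_body_ori": 1.0,
    "motion_body_lin_vel": 1.0,
    "motion_body_ang_vel": 1.0,
    "action_rate_l2": -0.1,
    "joint_limit": -10.0,
    "self_collisions": -10.0,
}

# Error thresholds copied from the table. `motion_clip_end` is a time-out,
# not an error threshold, so it is left out here.
TERMINATION_THRESHOLDS = {
    "anchor_pos_z": 0.25,
    "anchor_ori": 0.8,
    "ee_body_pos_z": 0.25,  # assumed: worst error over the 4 end effectors
}


def total_reward(terms):
    """Weighted sum of raw per-term values (hypothetical helper)."""
    return sum(REWARD_WEIGHTS[name] * value for name, value in terms.items())


def should_terminate(errors):
    """True if any tracked error exceeds its threshold (hypothetical helper)."""
    return any(errors[name] > limit for name, limit in TERMINATION_THRESHOLDS.items())


# Toy values, not measurements from this run.
example_reward = total_reward({"motion_body_pos": 1.0, "action_rate_l2": 2.0})
example_done = should_terminate(
    {"anchor_pos_z": 0.10, "anchor_ori": 0.20, "ee_body_pos_z": 0.30}
)
print(example_reward, example_done)
```

With no curriculum, the weights stay fixed for the whole run, so a single static table like this is enough to describe the reward.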

	Privileged proxy observations (per-body state + motion goal + heightmap) and the sim2real
	domain-randomization stack (link-mass / PD-gain / friction / CoM / push / torque-RFI /
	action-delay) are kept, since they are orthogonal to the v1 collapse. For the full
	rationale and data, see `docs/proxy_reward_design.md` in the source repo.

	## Intended use & role

	- Primary: teacher policy whose privileged rollouts are distilled into the deployable
	Stage-2 CVAE BFM (masked unified control interface) via DAgger.
	- Secondary: a PHP-style "specialist on a single mode" baseline for ablation.
	- Not directly deployable on hardware: observations are privileged (full sim state +
	heightmap), not the 25-step proprioceptive history the deployable BFM uses.

	## Checkpoint contents

	`model_13499.pt` (full rsl_rl checkpoint, ~241 MB) — keys:
	`actor_state_dict`, `critic_state_dict`, `optimizer_state_dict`, `iter`, `infos`.
	Use `actor_state_dict` for inference; the rest is for resuming training.
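
As a rough sanity check on the file size, the actor's parameter count follows from the layer sizes stated in this card (1150 inputs, the six hidden widths, 29 outputs). The sketch below assumes plain fully connected layers with biases stored in fp32, and no extra tensors such as normalization statistics or an action-noise parameter; the actual rsl_rl actor may differ.

```python
def mlp_param_count(dims):
    """Weights plus biases of a fully connected MLP with layer widths `dims`."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(dims, dims[1:]))


actor_dims = [1150, 2048, 2048, 1024, 1024, 512, 512, 29]
actor_params = mlp_param_count(actor_dims)
actor_megabytes = actor_params * 4 / 1e6  # 4 bytes per fp32 parameter

print(actor_params)               # 10503709
print(round(actor_megabytes, 1))  # 42.0
```

Under these assumptions the actor accounts for roughly 42 MB of the ~241 MB file; the rest is the critic, optimizer state, and bookkeeping entries listed above.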

	> ONNX export is not included (the run's auto-export failed to serialize; a clean export
	> can be regenerated from the actor if a deployment graph is needed).

	## How to load (inference)

	```python
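	import torch

	# PyTorch >= 2.6 defaults to weights_only=True; if loading fails on non-tensor
	# entries (e.g. `infos`), pass weights_only=False only for a checkpoint you trust.
	ckpt = torch.load("model_13499.pt", map_location="cpu")
	actor_sd = ckpt["actor_state_dict"]  # MLP 1150 -> [2048,2048,1024,1024,512,512] -> 29
	# Rebuild via the source repo's runner cfg (bfm_design/tasks/proxy_v1_rl_cfg.py),
	# register the task (import bfm_design.tasks) and load into the rsl_rl MLP actor.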
	```

	Reproduce the exact env (obs/action layout, terrain) from the source repo at the pinned
	commit; the rasterized terrain is included there (`assets/terrain/mjlab_terrain_rasterized_v3h.npz`).
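
Before wiring the weights into the repo's actor class, it can help to confirm that the checkpoint matches the architecture stated here. The sketch below reads layer sizes from the 2-D weight tensors in a state dict. It assumes PyTorch `nn.Linear` conventions (weight shape `(out_features, in_features)`) and that the dict preserves layer order; it makes no assumption about rsl_rl's key names. `linear_layer_dims` is a hypothetical helper, and the stand-in dict only mimics tensor shapes so the example runs without the checkpoint.

```python
from types import SimpleNamespace


def linear_layer_dims(state_dict):
    """Return [in, hidden..., out] from the 2-D weights, in dict order.

    Works on any values exposing `.shape`, e.g. torch tensors.
    """
    shapes = [tuple(t.shape) for t in state_dict.values() if len(t.shape) == 2]
    if not shapes:
        return []
    # nn.Linear stores weights as (out_features, in_features).
    return [shapes[0][1]] + [out_features for out_features, _ in shapes]


# Stand-in for ckpt["actor_state_dict"]: shapes only, with made-up key names.
widths = [1150, 2048, 2048, 1024, 1024, 512, 512, 29]
fake_sd = {}
for i, (n_in, n_out) in enumerate(zip(widths, widths[1:])):
    fake_sd[f"layer{i}.weight"] = SimpleNamespace(shape=(n_out, n_in))
    fake_sd[f"layer{i}.bias"] = SimpleNamespace(shape=(n_out,))

print(linear_layer_dims(fake_sd))
# [1150, 2048, 2048, 1024, 1024, 512, 512, 29]
```

With the real checkpoint, `linear_layer_dims(ckpt["actor_state_dict"])` should return the same list if the actor is a plain MLP; extra entries would point to additional 2-D tensors in the state dict.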

	## Evaluation

	Per-terrain evaluation of `model_13499.pt` on all 8 sub_types at level 4 (16 envs × 12 s per terrain):

	| Terrain (level 4) | Body tracking error (m) | Joint tracking error (rad) |
	|---|---|---|
	| flat | 0.033 | 0.70 |
	| pyramid_stairs | 0.051 | 0.72 |
	| pyramid_stairs_inv | 0.061 | 0.99 |
	| hf_pyramid_slope | 0.073 | 0.88 |
	| hf_pyramid_slope_inv | 0.058 | 0.99 |
	| random_rough | 0.062 | 0.99 |
	| wave_terrain | 0.062 | 0.89 |
	| box_line | 0.041 | 0.77 |

	Body tracking error is 3.3–7.3 cm across all 8 level-4 terrains, roughly 4–9× lower than
	the Phase-B v2 teacher (0.286 m) and 8–18× lower than the A3' student (0.59 m).
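
The summary range and the comparison with the two baselines can be recomputed from the table. The snippet below only restates numbers given above; the baseline figures (0.286 m and 0.59 m) come from the preceding sentence, and their evaluation protocol is not described in this card.

```python
# Body tracking error per level-4 terrain, copied from the table (metres).
body_err_m = {
    "flat": 0.033,
    "pyramid_stairs": 0.051,
    "pyramid_stairs_inv": 0.061,
    "hf_pyramid_slope": 0.073,
    "hf_pyramid_slope_inv": 0.058,
    "random_rough": 0.062,
    "wave_terrain": 0.062,
    "box_line": 0.041,
}

best = min(body_err_m.values())
worst = max(body_err_m.values())
mean = sum(body_err_m.values()) / len(body_err_m)
print(f"range {best:.3f}-{worst:.3f} m, mean {mean:.3f} m")

# Ratio of each baseline error to the proxy's worst and best terrain.
baselines_m = {"Phase-B v2 teacher": 0.286, "A3' student": 0.59}
ratios = {name: (err / worst, err / best) for name, err in baselines_m.items()}
for name, (low, high) in ratios.items():
    print(f"{name}: {low:.1f}x to {high:.1f}x the proxy error")
```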

	Training-curve summary (tensorboard): mean reward −2.7 → +1.5; mean ep_len 6 → ~70–87;
	`ee_body_pos` terminations 159 → ~1. Visually signed off via reference-ghost rollouts
	(`scripts/render_v2_ghost_8terrain.py` in the source repo).

	> Note: a naive `survive_ratio` over a fixed window reads 0 because `motion_clip_end`
	> (a successful clip completion / time-out) is counted as "done"; actual fall
	> terminations are ≈ 0.
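
The distinction in the note can be made explicit when computing a survival metric. The sketch below assumes each finished episode is logged with the name of the termination term that ended it; that log format and the `fall_rate` helper are hypothetical, and only the term names come from this card.

```python
from collections import Counter

# Termination terms that indicate a failure rather than a completed clip.
FALL_TERMS = {"anchor_pos_z", "anchor_ori", "ee_body_pos_z"}


def fall_rate(done_reasons):
    """Fraction of finished episodes that ended in a fall-type termination."""
    counts = Counter(done_reasons)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    falls = sum(counts[term] for term in FALL_TERMS)
    return falls / total


# Made-up log: every episode ends with "done", but only one is a fall.
reasons = ["motion_clip_end"] * 99 + ["ee_body_pos_z"]

# Naive metric: an episode "survives" only if it never reports done.
naive_survive_ratio = sum(r is None for r in reasons) / len(reasons)

print(naive_survive_ratio)  # 0.0
print(fall_rate(reasons))   # 0.01
```

Counting only the fall-type terms separates clip completions from actual failures, which is the quantity the "fall terminations ≈ 0" figure refers to.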

	## Limitations

	- Privileged obs ⇒ teacher-only, not sim2real-ready as-is.
	- The 0.25 m `ee_body_pos_z` termination is tight for rough terrain: episode length is slow
	to rise over roughly the first 1500 iterations, until the policy learns to keep feet and
	wrists within tolerance.
	- Trained on PFNN-derived reference motion (walk/jog/crouch gaits); behaviors outside the
	reference distribution are out of scope (handled later by BFM residual learning).

	## Source, data, citation

	- License: MIT (© 2026 Huiqiao Fu), consistent with [Robo-PFNN](https://github.com/tRNAoO/Robo-PFNN).
	- HF repo: `tRNAoOO/<name>` — public + gated (contact-info gate), matching the Robo-PFNN weights repo.
	- Code (pinned): GitLab `hqfu/bfm-design` (internal). Reward design: `docs/proxy_reward_design.md`; data regen: `docs/data_regeneration.md`.
	- Base framework: [mjlab](https://github.com/mujocolab/mjlab) v1.3.0 + MuJoCo-Warp + rsl_rl.
	- Reference motion: [Robo-PFNN](https://github.com/tRNAoO/Robo-PFNN) (kinematic generator).
	- Reward recipe lineage: BeyondMimic ([arXiv 2508.08241](https://arxiv.org/abs/2508.08241) / HybridRobotics/whole_body_tracking); PARC ([arXiv 2505.04002](https://arxiv.org/abs/2505.04002)); DeepMimic.
	- Target architecture: BFM (arXiv 2509.13780) — CVAE + masked unified control interface.

	_Motion-bag training data (24 GB) is not distributed; regenerate deterministically per `docs/data_regeneration.md`._