	---
	library_name: drq
	tags:
	- reinforcement-learning
	- humanoid
	- mujoco
	- humanoid-bench
	- locomotion
	- unitree-h1
	- unitree-g1
	datasets:
	- carlosferrazza/humanoid-bench
	base_model: dmux/DR.Q
	license: mit
	---

	# HumanoidBench-DR.Q · Self-trained checkpoints

	_Self-trained DR.Q checkpoints that beat the public dmux/DR.Q baseline on HumanoidBench locomotion tasks._

	> 🛠 Training source: <https://github.com/vitorcen/humanoid-training>
	>
	> Full training scripts, patches, eval harness, and analysis docs are in the companion GitHub repo.

	DR.Q is a TD3-family off-policy RL algorithm with model-based representation learning (the encoder + policy needed at inference total ~13 MB).
	This repo hosts checkpoints trained from scratch on [HumanoidBench](https://github.com/carlosferrazza/humanoid-bench) that pass its locomotion bar.

	---

	## 📊 Performance

	| Task | success_rate | mean_return | N | vs. public baseline |
	|---|---|---|---|---|
	| `h1-walk-v0` | 90% | 801.05 | 10 ep × seed 0 | dmux/DR.Q seed 0: ~30% / mean ~530 |
	| `g1-walk-v0` | 70% | 710.52 | 10 ep × seed 0 | torque baseline: 0% / mean ~100 (7.1× higher mean return) |

	`success_bar = 700` (HumanoidBench locomotion threshold).
	_All numbers from deterministic eval with `action_repeat=2`. Raw JSONL in `eval/`._
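
The two headline columns can be recomputed from per-episode returns. The following is a minimal sketch, not the repo's eval harness. It assumes each JSONL row is a JSON object with a numeric `return` field (a hypothetical key; check the files in `eval/` for the actual schema) and that an episode counts as a success when its return is at least `success_bar`.

```python
import json

SUCCESS_BAR = 700.0  # HumanoidBench locomotion threshold quoted above


def summarize(jsonl_text, key="return", success_bar=SUCCESS_BAR):
    """Return (success_rate, mean_return) for per-episode rows in JSONL text.

    `key` is a hypothetical field name; rows without it (for example a
    summary row) are skipped.
    """
    returns = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if key in row:
            returns.append(float(row[key]))
    if not returns:
        raise ValueError("no per-episode rows found")
    successes = sum(r >= success_bar for r in returns)
    return successes / len(returns), sum(returns) / len(returns)


# Made-up returns for illustration only, not the published eval data.
demo = "\n".join(json.dumps({"return": r}) for r in [820.0, 650.0, 700.0, 910.0])
rate, mean = summarize(demo)
print(f"success_rate={rate:.0%} mean_return={mean:.2f}")
```

With only 10 episodes per task, each episode moves the success rate by 10 percentage points, so small differences between runs should be read with that granularity in mind.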

	---

	## 🎬 Demos

	### H1-walk-v0 (Unitree H1, 19 DoF)

	<video controls width="720" src="https://huggingface.co/wsagi/HumanoidBench-DR.Q/resolve/main/assets/drq-h1-walk.mp4"></video>

	### G1-walk-v0 (Unitree G1, 23 DoF with BlockedHands wrapper)

	<video controls width="720" src="https://huggingface.co/wsagi/HumanoidBench-DR.Q/resolve/main/assets/drq-g1-walk.mp4"></video>

	---

	## 📦 Repo layout

	```
	HumanoidBench-DR.Q/
	├── DRQ+HBench-h1-walk-v0+0/                 # H1-walk self-trained ckpt (76 MB)
	│   ├── encoder.pt policy.pt agent_var.npy   ← inference (~13 MB)
	│   ├── *_target.pt × 3                      ← Q-learning targets
	│   ├── *_optimizer.pt × 3                   ← Adam states (resume)
	│   ├── value.pt                             ← critic
	│   └── exp_var.npy                          ← exploration variance
	├── DRQ+HBench-g1-walk-v0+0/                 # G1-walk self-trained ckpt (76 MB)
	│   └── ... (same 11-file layout)
	├── eval/                                    # Final eval JSONL (per-episode + summary row)
	└── assets/                                  # MP4 demos
	```

	Inference needs only 3 files: `encoder.pt`, `policy.pt`, and `agent_var.npy` (~13 MB in total).
	The remaining 8 files are used for resuming training and as Q-learning targets.
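
To fetch just the inference files rather than the whole 76 MB folder, `snapshot_download` accepts a list of glob patterns in `allow_patterns`. The sketch below only builds that list. The helper name `inference_patterns` is hypothetical, and the `DRQ+HBench-<task>+<seed>` folder naming is inferred from the two folders in this repo, so treat it as an assumption for any other task or seed.

```python
INFERENCE_FILES = ("encoder.pt", "policy.pt", "agent_var.npy")


def inference_patterns(task, seed=0):
    """Build allow_patterns for the three inference files of one checkpoint.

    The folder naming "DRQ+HBench-<task>+<seed>" is inferred from the two
    folders listed above and is an assumption beyond those.
    """
    folder = f"DRQ+HBench-{task}+{seed}"
    return [f"{folder}/{name}" for name in INFERENCE_FILES]


patterns = inference_patterns("h1-walk-v0")
print(patterns)

# To download (requires network access and huggingface_hub):
# from huggingface_hub import snapshot_download
# snapshot_download(repo_id="wsagi/HumanoidBench-DR.Q", allow_patterns=patterns)
```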

	---

	## 🚀 Load & inference

	```python
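	# Minimal inference loader; see scripts/drq_viewer.py in the companion repo
	from pathlib import Path

	import numpy as np
	import torch
	from huggingface_hub import snapshot_download

	# Download only the H1-walk checkpoint folder
	ckpt_root = snapshot_download(
	    repo_id="wsagi/HumanoidBench-DR.Q",
	    allow_patterns="DRQ+HBench-h1-walk-v0+0/*",
	)
	ckpt_dir = Path(ckpt_root) / "DRQ+HBench-h1-walk-v0+0"

	# agent_var.npy holds a pickled Python object, hence allow_pickle=True and .item()
	var = np.load(ckpt_dir / "agent_var.npy", allow_pickle=True).item()

	# Encoder and Policy must be constructed first (see scripts/drq_viewer.py), then:
	# encoder = Encoder(obs_dim, ...); encoder.load_state_dict(torch.load(.../encoder.pt))
	# policy = Policy(...); policy.load_state_dict(torch.load(.../policy.pt))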
	```

	For the full loading path, see `scripts/drq_viewer.py` in the companion repo [vitorcen/humanoid-training](https://github.com/vitorcen/humanoid-training).

	```bash
	git clone --recursive https://github.com/vitorcen/humanoid-training
	cd humanoid-training
	bash patches/apply.sh # apply DR.Q + HumanoidBench local patches
	DISPLAY=:0 python scripts/drq_viewer.py --task h1-walk-v0 --seed 0
	DISPLAY=:0 python scripts/drq_viewer.py --task g1-walk-v0 --seed 0
	```

	---

	## ⚠️ Required patches for G1-walk

	G1 does not pass out of the box; it needs two layers of patches (details in [g1_training_strategies.html](https://github.com/vitorcen/humanoid-training/blob/main/docs/g1_training_strategies.html)):

	| Patch | Purpose |
	|---|---|
	| `patches/g1-pos-control.patch` | Switches G1 from its default torque control to PD position control (matching H1); about 4× better sample efficiency |
	| `patches/humanoid-bench-g1-blocked-hands.patch` | Extends `BlockedHandsLocoWrapper` to support G1 and masks the 14 finger dimensions (37D → 23D action), so exploration noise does not contaminate the encoder's dynamics loss |

	_The raw-torque G1 baseline trained for 1M steps and stayed at 0% success / mean return ~100. With both patches applied, G1 reaches 70% / mean return 711._
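
The two ideas behind the patches can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the patch code: the finger indices (here the last 14 slots), the value the fingers are held at, and the gains `kp` and `kd` are all placeholders, and the function names are hypothetical.

```python
import numpy as np

FULL_DIM = 37      # G1 action size before masking (from the table above)
FINGER_DIM = 14    # finger dimensions that get blocked
BODY_DIM = FULL_DIM - FINGER_DIM  # 23 dimensions left for the policy

# Hypothetical layout: fingers occupy the last 14 slots. The real index set
# is defined by the patch, not by this sketch.
FINGER_IDX = np.arange(BODY_DIM, FULL_DIM)
BODY_IDX = np.arange(BODY_DIM)


def expand_action(body_action, finger_value=0.0):
    """Map a 23D policy action to a 37D env action with fingers held fixed."""
    body_action = np.asarray(body_action, dtype=np.float64)
    if body_action.shape != (BODY_DIM,):
        raise ValueError(f"expected shape ({BODY_DIM},), got {body_action.shape}")
    full = np.full(FULL_DIM, finger_value, dtype=np.float64)
    full[BODY_IDX] = body_action
    return full


def pd_torque(q_target, q, qd, kp=50.0, kd=2.0):
    """PD position control: torque pulls q toward q_target and damps velocity.

    kp and kd are placeholder gains, not the values used in the patch.
    """
    return kp * (np.asarray(q_target) - np.asarray(q)) - kd * np.asarray(qd)


full = expand_action(np.ones(BODY_DIM))
print(full.shape, full[FINGER_IDX].sum())
print(pd_torque(q_target=[0.1], q=[0.0], qd=[0.5]))
```

With position control the policy outputs joint targets and the PD law turns them into torques, so the policy no longer has to learn the low-level torque response itself.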

	Root cause (as diagnosed with OpenCode deepseek-v4-pro): DR.Q's isotropic σ=0.2 exploration noise perturbs the fingers at nearly every step of the 37D action, so the encoder's dynamics loss is forced to model irrelevant 14-DoF finger dynamics, which leads to catastrophic forgetting around 250k steps.
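
A quick calculation shows how much of the exploration noise lands on the fingers. For isotropic Gaussian noise every dimension contributes σ² to the expected squared norm, so the finger share is simply 14/37 regardless of σ. The sketch below checks that with a Monte Carlo sample; it does not reproduce the diagnosis itself, and placing the fingers in the last 14 slots is an arbitrary choice that does not affect the result.

```python
import numpy as np

SIGMA = 0.2
FULL_DIM, FINGER_DIM = 37, 14

# Analytic share: each dimension contributes sigma**2 to the expected squared norm.
analytic_share = FINGER_DIM / FULL_DIM

# Monte Carlo check with a fixed seed; which 14 slots are fingers does not
# matter for isotropic noise, so the last 14 are used here.
rng = np.random.default_rng(0)
noise = rng.normal(0.0, SIGMA, size=(100_000, FULL_DIM))
finger_energy = (noise[:, -FINGER_DIM:] ** 2).sum()
empirical_share = finger_energy / (noise ** 2).sum()

print(f"analytic={analytic_share:.3f} empirical={empirical_share:.3f}")
```

Roughly 38% of the injected noise energy therefore goes to joints that play no role in walking, which is the share the BlockedHands patch removes.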

	---

	## 🔧 Training config

	| | H1-walk | G1-walk |
	|---|---|---|
	| Algorithm | DR.Q (TD3 + zs encoder) | DR.Q + PD control + BlockedHands |
	| Env steps | 500,000 | 500,000 |
	| Wall time | 6.6 h | 3.0 h |
	| GPU | RTX 4090 | RTX 4090 |
	| `action_repeat` | 2 | 2 |
	| `save_freq` | 50,000 | 50,000 |
	| Watcher | slice-based auto-eval + early-stop (LeIsaac-inspired) | same |
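
For readers unfamiliar with `action_repeat`, the sketch below shows the usual pattern: the agent picks one action and the environment is stepped with it several times. It assumes a Gymnasium-style 5-tuple `step` and summed rewards, which is a common convention but not confirmed for DR.Q; `ToyEnv` and `step_with_repeat` are hypothetical names used only for this illustration.

```python
class ToyEnv:
    """Stand-in env with a Gymnasium-style 5-tuple step; not HumanoidBench."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.horizon
        return self.t, 1.0, terminated, False, {}


def step_with_repeat(env, action, repeat=2):
    """Apply `action` up to `repeat` times, summing rewards; stop early on episode end."""
    total = 0.0
    for _ in range(repeat):
        obs, reward, terminated, truncated, info = env.step(action)
        total += reward
        if terminated or truncated:
            break
    return obs, total, terminated, truncated, info


env = ToyEnv(horizon=5)
agent_steps = 0
done = False
while not done:
    _, reward, terminated, truncated, _ = step_with_repeat(env, action=0.0, repeat=2)
    agent_steps += 1
    done = terminated or truncated
print(agent_steps, env.t)
```

In this toy run the agent makes 3 decisions for 5 simulator steps, because the last repeat is cut short when the episode ends.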

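	Training pipeline (three parallel processes; see the companion repo README for details):
	- A) DR.Q `main.py`: the main training run
	- B) `scripts/train_watcher.py`: aggregates progress live over 10 slices and early-stops based on a PROGRESS/UNDERFIT/OVERFIT/DEAD verdict
	- C) `scripts/ckpt_eval_loop.py`: whenever a new checkpoint appears, mirrors it to the HF cache and runs an N=3 deterministic eval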
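
The slice-based aggregation in process B can be approximated as follows. This is a rough sketch, not `train_watcher.py`: it only splits a chronological return log into 10 slices and averages each one. The plateau rule, its `window` and `min_gain` parameters, and both function names are hypothetical, and the repo's actual PROGRESS/UNDERFIT/OVERFIT/DEAD rules are not reproduced here.

```python
import numpy as np


def slice_means(returns, n_slices=10):
    """Split a chronological return log into n_slices chunks and average each."""
    returns = np.asarray(returns, dtype=np.float64)
    if returns.size < n_slices:
        raise ValueError("need at least one value per slice")
    return np.array([chunk.mean() for chunk in np.array_split(returns, n_slices)])


def has_plateaued(means, window=3, min_gain=10.0):
    """Hypothetical stop rule: the best of the last `window` slices improved by
    less than `min_gain` over the best earlier slice."""
    if len(means) <= window:
        return False
    return bool(means[-window:].max() - means[:-window].max() < min_gain)


# Synthetic log: returns ramp up for the first half, then stay flat at 700.
log = np.concatenate([np.linspace(0, 700, 50), np.full(50, 700.0)])
means = slice_means(log)
print(means.round(1), has_plateaued(means))
```

Comparing slice means rather than individual episodes smooths out per-episode noise before any stop decision is made.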

	---

	## 📚 Citations

	```bibtex
	@article{sferrazza2024humanoidbench,
	title={HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation},
	author={Sferrazza, Carmelo and Huang, Dun-Ming and Lin, Xingyu and Lee, Youngwoon and Abbeel, Pieter},
	journal={Robotics: Science and Systems},
	year={2024}
	}

	@article{yarats2022mastering,
	title={Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning},
	author={Yarats, Denis and Fergus, Rob and Lazaric, Alessandro and Pinto, Lerrel},
	journal={ICLR},
	year={2022}
	}
	```

	---

	## 📄 License

	MIT — same as base DR.Q and HumanoidBench.

	---

	_Companion repository_: [github.com/vitorcen/humanoid-training](https://github.com/vitorcen/humanoid-training) — full training scripts, patches, eval harness, and analysis docs.