---
library_name: drq
tags:
- reinforcement-learning
- humanoid
- mujoco
- humanoid-bench
- locomotion
- unitree-h1
- unitree-g1
datasets:
- carlosferrazza/humanoid-bench
base_model: dmux/DR.Q
license: mit
---

# HumanoidBench-DR.Q · Self-trained checkpoints

_Self-trained DR.Q checkpoints that **beat** the public dmux/DR.Q baseline on HumanoidBench locomotion tasks._

> 🛠 **Training source**: <https://github.com/vitorcen/humanoid-training>
> Full training scripts, patches, eval harness, and analysis docs are in the companion GitHub repo.

DR.Q is a TD3-family off-policy RL algorithm with model-based representation learning; the encoder and policy needed at inference total ~13 MB.
This repo hosts checkpoints **trained from scratch** on [HumanoidBench](https://github.com/carlosferrazza/humanoid-bench) that clear the benchmark's locomotion success bar.

---

## 📊 Performance

| Task | success_rate | mean_return | Eval episodes | Baseline comparison |
|---|---|---|---|---|
| **`h1-walk-v0`** | **90%** | **801.05** | 10 (seed 0) | dmux/DR.Q seed 0: ~30% / mean ~530 |
| **`g1-walk-v0`** | **70%** | **710.52** | 10 (seed 0) | torque baseline: 0% / mean ~100 (**7.1× improvement**) |

`success_bar = 700` is the HumanoidBench locomotion success threshold.
All numbers come from deterministic evaluation with `action_repeat=2`; the raw JSONL is in `eval/`.
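
As a minimal sketch of how the two table columns relate to the threshold, the snippet below recomputes `success_rate` and `mean_return` from per-episode returns. It assumes each JSONL row is an object with a numeric `return` field and that an episode succeeds when its return is at least `success_bar`; both the field name and the `>=` comparison are assumptions, not the confirmed schema of the files in `eval/`. The sample returns are made up for illustration and are not the published eval data.

```python
import json

SUCCESS_BAR = 700.0  # HumanoidBench locomotion threshold quoted above


def summarize(jsonl_text, success_bar=SUCCESS_BAR, key="return"):
    """Return (success_rate, mean_return) for per-episode JSONL rows.

    The `key` field name is hypothetical; adapt it to the real eval files.
    Rows without that field (for example a summary row) are skipped.
    """
    returns = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if key in row:
            returns.append(float(row[key]))
    if not returns:
        raise ValueError("no per-episode rows found")
    successes = sum(r >= success_bar for r in returns)
    return successes / len(returns), sum(returns) / len(returns)


# Made-up returns, for illustration only
sample = "\n".join(
    json.dumps({"episode": i, "return": r})
    for i, r in enumerate([820.0, 650.0, 790.0, 705.0])
)
rate, mean_return = summarize(sample)
print(f"success_rate={rate:.0%}  mean_return={mean_return:.2f}")
```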

---

## 🎬 Demos

### H1-walk-v0 (Unitree H1, 19 DoF)

<video controls width="720" src="https://huggingface.co/wsagi/HumanoidBench-DR.Q/resolve/main/assets/drq-h1-walk.mp4"></video>

### G1-walk-v0 (Unitree G1, 23 DoF with BlockedHands wrapper)

<video controls width="720" src="https://huggingface.co/wsagi/HumanoidBench-DR.Q/resolve/main/assets/drq-g1-walk.mp4"></video>

---

## 📦 Repo layout

```
HumanoidBench-DR.Q/
├── DRQ+HBench-h1-walk-v0+0/     # H1-walk self-trained ckpt (76 MB)
│   ├── encoder.pt  policy.pt  agent_var.npy   ← inference (~13 MB)
│   ├── *_target.pt × 3                        ← Q-learning targets
│   ├── *_optimizer.pt × 3                     ← Adam states (resume)
│   ├── value.pt                               ← critic
│   └── exp_var.npy                            ← exploration variance
├── DRQ+HBench-g1-walk-v0+0/     # G1-walk self-trained ckpt (76 MB)
│   └── ... (same 11-file layout)
├── eval/                        # Final eval JSONL (per-episode + summary row)
└── assets/                      # MP4 demos
```

**Inference needs only 3 files**: `encoder.pt`, `policy.pt`, and `agent_var.npy` (~13 MB in total).
The other 8 files hold the resume-training state and the Q-learning targets.
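
To fetch just those three files rather than a whole 76 MB run folder, one option is to pass explicit patterns to `snapshot_download`. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the folder naming `DRQ+HBench-<task>+<seed>` is inferred from the two folders in the layout above. The download helper imports `huggingface_hub` lazily because it needs network access; the pattern builder can be used on its own.

```python
INFERENCE_FILES = ("encoder.pt", "policy.pt", "agent_var.npy")


def run_dir(task, seed=0):
    # Folder naming inferred from the two run folders in the repo layout
    return f"DRQ+HBench-{task}+{seed}"


def inference_patterns(task, seed=0):
    """Patterns matching only the three inference files of one run."""
    return [f"{run_dir(task, seed)}/{name}" for name in INFERENCE_FILES]


def download_inference_files(task, seed=0, repo_id="wsagi/HumanoidBench-DR.Q"):
    """Fetch just the inference files and return the local run directory."""
    import os
    from huggingface_hub import snapshot_download  # lazy import: needs network

    root = snapshot_download(
        repo_id=repo_id,
        allow_patterns=inference_patterns(task, seed),
    )
    return os.path.join(root, run_dir(task, seed))


print(inference_patterns("h1-walk-v0"))
```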

---

## 🚀 Load & inference

```python
# Minimal inference loader; see scripts/drq_viewer.py in the companion repo
from pathlib import Path

import numpy as np
import torch
from huggingface_hub import snapshot_download

RUN = "DRQ+HBench-h1-walk-v0+0"

# Download only this run's folder; snapshot_download returns the snapshot root
ckpt_dir = Path(snapshot_download(
    repo_id="wsagi/HumanoidBench-DR.Q",
    allow_patterns=f"{RUN}/*",
)) / RUN

# Loaded as a 0-d object array, so .item() unwraps the stored Python object
var = np.load(ckpt_dir / "agent_var.npy", allow_pickle=True).item()

# Encoder and Policy come from the DR.Q codebase; constructor arguments are elided here
# encoder = Encoder(obs_dim, ...); encoder.load_state_dict(torch.load(ckpt_dir / "encoder.pt", map_location="cpu"))
# policy = Policy(...); policy.load_state_dict(torch.load(ckpt_dir / "policy.pt", map_location="cpu"))
```

For the full loading path, see `scripts/drq_viewer.py` in the companion repo [vitorcen/humanoid-training](https://github.com/vitorcen/humanoid-training).

```bash
git clone --recursive https://github.com/vitorcen/humanoid-training
cd humanoid-training
bash patches/apply.sh # apply DR.Q + HumanoidBench local patches
DISPLAY=:0 python scripts/drq_viewer.py --task h1-walk-v0 --seed 0
DISPLAY=:0 python scripts/drq_viewer.py --task g1-walk-v0 --seed 0
```

---

## ⚠️ Required patches for G1-walk

G1 does **not** pass out of the box; it needs two patches (details in [g1_training_strategies.html](https://github.com/vitorcen/humanoid-training/blob/main/docs/g1_training_strategies.html)):

| Patch | Effect |
|---|---|
| `patches/g1-pos-control.patch` | Switches G1 from its default torque control to **PD position control** (matching H1); about 4× better sample efficiency |
| `patches/humanoid-bench-g1-blocked-hands.patch` | Extends `BlockedHandsLocoWrapper` to support G1 and **masks the 14 finger dimensions** (37D → 23D action) so that finger noise does not pollute the encoder's dynamics loss |

The G1 raw-torque baseline trained for 1M steps and stayed at 0% success / mean return ~100. With both patches it reaches 70% success / mean return 711.
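
For readers unfamiliar with the distinction behind the first patch: under torque control the policy's action is applied as joint torque, whereas under PD position control the action is a target joint position and a proportional-derivative law converts it into torque. The NumPy sketch below shows that law in its textbook form. The gains `kp` and `kd` and the torque limit are hypothetical illustration values, not the ones used by the patch or by HumanoidBench.

```python
import numpy as np


def pd_torque(q_target, q, qd, kp=100.0, kd=5.0, torque_limit=50.0):
    """Torque from a PD position controller: kp * (q_target - q) - kd * qd.

    kp, kd and torque_limit are made-up values for illustration.
    """
    q_target, q, qd = (np.asarray(x, dtype=float) for x in (q_target, q, qd))
    tau = kp * (q_target - q) - kd * qd
    # Clip to the actuator range so a large position error cannot demand unbounded torque
    return np.clip(tau, -torque_limit, torque_limit)


# Joint at rest and short of its target: the torque pushes toward the target
push = pd_torque([0.1], [0.0], [0.0])
# Joint at its target but still moving: the torque only damps the velocity
damp = pd_torque([0.0], [0.0], [2.0])
# A very large error saturates at the torque limit
saturated = pd_torque([10.0], [0.0], [0.0])
print(push, damp, saturated)
```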

**Root cause (diagnosed with OpenCode deepseek-v4-pro)**: DR.Q's isotropic σ=0.2 exploration noise perturbs the fingers at almost every step in the 37D action space, so the encoder's dynamics loss is forced to model irrelevant 14-DoF finger dynamics, which leads to catastrophic forgetting around 250k steps.
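
The effect of masking can be sketched with plain NumPy. The code below is not the `BlockedHandsLocoWrapper` implementation. It assumes, purely for illustration, that the 14 finger dimensions are the last 14 entries of the 37-D action vector (the real index layout is not stated here) and that blocked dimensions are held at zero. With that setup, noise sampled in the 23-D space never reaches the fingers.

```python
import numpy as np

FULL_DIM, FINGER_DIM = 37, 14
BODY_DIM = FULL_DIM - FINGER_DIM  # 23 controllable dimensions

# Hypothetical layout: the last 14 indices are the fingers
FINGER_IDX = np.arange(BODY_DIM, FULL_DIM)
BODY_IDX = np.arange(BODY_DIM)


def expand_action(body_action):
    """Map a 23-D policy action to the 37-D env action, fingers held at 0."""
    body_action = np.asarray(body_action, dtype=float)
    if body_action.shape != (BODY_DIM,):
        raise ValueError(f"expected shape ({BODY_DIM},), got {body_action.shape}")
    full = np.zeros(FULL_DIM)
    full[BODY_IDX] = body_action
    return full


rng = np.random.default_rng(0)
sigma = 0.2  # isotropic exploration noise scale quoted above

# Without masking, noise lands on every dimension, fingers included
noisy_full = rng.normal(0.0, sigma, size=FULL_DIM)
# With masking, noise is sampled in the 23-D space and the fingers stay at 0
noisy_masked = expand_action(rng.normal(0.0, sigma, size=BODY_DIM))

print("finger noise std, unmasked:", noisy_full[FINGER_IDX].std())
print("finger noise std, masked:  ", noisy_masked[FINGER_IDX].std())
```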

---

## 🔧 Training config

| Setting | H1-walk | G1-walk |
|---|---|---|
| Algorithm | DR.Q (TD3 + zs encoder) | DR.Q + PD control + BlockedHands |
| Env steps | 500,000 | 500,000 |
| Wall time | 6.6 h | 3.0 h |
| GPU | RTX 4090 | RTX 4090 |
| `action_repeat` | 2 | 2 |
| `save_freq` | 50,000 | 50,000 |
| Watcher | slice-based auto-eval + early-stop (LeIsaac-inspired) | same |

**Training pipeline** (three parallel processes; see the companion repo README for details):
- A) DR.Q `main.py`: the main training run
- B) `scripts/train_watcher.py`: aggregates metrics live over 10 slices and early-stops on PROGRESS/UNDERFIT/OVERFIT/DEAD verdicts
- C) `scripts/ckpt_eval_loop.py`: mirrors each new checkpoint to the HF cache and runs an N=3 deterministic eval
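
A slice-based watcher of the kind described in B) can be sketched as follows. This is an illustrative guess at the general shape, not the logic of `scripts/train_watcher.py`: the function names, the thresholds, and the rules that map slice means to the PROGRESS/UNDERFIT/OVERFIT/DEAD labels are all assumptions made up for this example.

```python
from statistics import mean


def slice_means(returns, n_slices=10):
    """Split a return history into n_slices contiguous chunks and average each."""
    if len(returns) < n_slices:
        raise ValueError("need at least one value per slice")
    bounds = [i * len(returns) // n_slices for i in range(n_slices + 1)]
    return [mean(returns[a:b]) for a, b in zip(bounds, bounds[1:])]


def verdict(means, success_bar=700.0, dead_level=50.0, drop_frac=0.3):
    """Hypothetical labelling rules, checked in order."""
    best, last = max(means), means[-1]
    if best < dead_level:
        return "DEAD"        # never got off the ground
    if last < (1.0 - drop_frac) * best:
        return "OVERFIT"     # collapsed well below its own peak
    if last > means[0] and last >= 0.5 * success_bar:
        return "PROGRESS"    # improving and at least halfway to the bar
    return "UNDERFIT"        # alive but stalled at a low level


# Synthetic return histories, for illustration only
improving = [float(x) for x in range(0, 1000, 10)]   # 100 steadily rising values
collapsed = improving[:60] + [100.0] * 40            # rises, then collapses
print(verdict(slice_means(improving)), verdict(slice_means(collapsed)))
```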

---

## 📚 Citations

```bibtex
@article{sferrazza2024humanoidbench,
  title={HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation},
  author={Sferrazza, Carmelo and Huang, Dun-Ming and Lin, Xingyu and Lee, Youngwoon and Abbeel, Pieter},
  journal={Robotics: Science and Systems},
  year={2024}
}

@article{yarats2022mastering,
  title={Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning},
  author={Yarats, Denis and Fergus, Rob and Lazaric, Alessandro and Pinto, Lerrel},
  journal={ICLR},
  year={2022}
}
```

---

## 📄 License

MIT, the same as base DR.Q and HumanoidBench.

---

_Companion repository_: [github.com/vitorcen/humanoid-training](https://github.com/vitorcen/humanoid-training), with full training scripts, patches, eval harness, and analysis docs.