 ---
 license: mit
+tags:
+  - reinforcement-learning
+  - continuous-control
+  - model-based-representation
+  - mujoco
+  - deepmind-control-suite
+  - humanoidbench
+  - pytorch
+  - td3
+  - representation-learning
+library_name: pytorch
+pipeline_tag: reinforcement-learning
 ---
+# DR.Q: Debiased Model-based Representations for Sample-efficient Continuous Control
+[![Paper](https://img.shields.io/badge/Paper-ICML2026-purple)](https://openreview.net/forum?id=ZP1p8k106p)
+[![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/dmksjfl/DR.Q)
+[![License](https://img.shields.io/badge/License-MIT-green)](https://github.com/dmksjfl/DR.Q/blob/master/LICENSE)
+Official pretrained model weights for **DR.Q**, presented at the **Forty-third International Conference on Machine Learning (ICML 2026)**.
+> **Authors:** Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye
+---
+## Model Description
+DR.Q is a **model-free reinforcement learning** algorithm that achieves strong sample efficiency in continuous control by learning *debiased* model-based representations. The key insight is that naively training model-based representations can introduce representation bias that hurts downstream policy learning. DR.Q mitigates two sources of bias:
+1. **Representation bias**: mitigated by adding an InfoNCE loss alongside the MSE loss
+2. **Sampling bias**: mitigated by introducing faded prioritized experience replay (Faded PER)
+DR.Q builds upon and substantially extends the [MR.Q codebase](https://github.com/facebookresearch/MRQ) (Facebook Research).
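
To make the representation-bias fix concrete, here is a minimal PyTorch sketch of an InfoNCE term combined with an MSE term on predicted latent states. It is an illustration only, not the DR.Q implementation: the function names, the `temperature` and `nce_weight` values, the cosine-similarity logits, and the stop-gradient on the target are all assumptions.

```python
import torch
import torch.nn.functional as F


def info_nce(pred: torch.Tensor, target: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over a batch: row i of `pred` should match row i of `target`."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.T / temperature  # (B, B) scaled cosine similarities
    labels = torch.arange(pred.shape[0], device=pred.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)


def representation_loss(pred_next: torch.Tensor, target_next: torch.Tensor,
                        nce_weight: float = 1.0) -> torch.Tensor:
    """MSE plus a weighted InfoNCE term (the weighting is a hypothetical choice)."""
    target_next = target_next.detach()  # assumption: no gradient through the target
    mse = F.mse_loss(pred_next, target_next)
    return mse + nce_weight * info_nce(pred_next, target_next)


# Toy check on random latents (batch of 8, dimension 16).
torch.manual_seed(0)
target = torch.randn(8, 16)
pred = (target + 0.01 * torch.randn(8, 16)).requires_grad_()
loss = representation_loss(pred, target)
loss.backward()
print(f"loss = {loss.item():.4f}")
```

The MSE term only pulls each prediction toward its own target, while the InfoNCE term also pushes it away from the other targets in the batch. The actual loss definition is in the paper and the official code.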
+## How to Get Started
+### Installation
+```bash
+git clone https://github.com/dmksjfl/DR.Q
+cd DR.Q
+pip install -r requirements.txt
+```
+### Training
+```bash
+# Gym / MuJoCo (1M steps)
+python main.py --env Gym-HalfCheetah-v4
+python main.py --env Gym-Humanoid-v4
+# DeepMind Control Suite — proprioceptive (500K steps)
+python main.py --env Dmc-cheetah-run
+python main.py --env Dmc-quadruped-walk
+# DeepMind Control Suite — pixel observations
+python main.py --env Dmc-visual-dog-run
+python main.py --env Dmc-visual-walker-walk
+# HumanoidBench (requires separate installation)
+python main.py --env HBench-h1-run-v0
+```
+### Loading Pretrained Weights
+Pretrained model weights for all reported tasks are hosted in this Hugging Face repository.
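
One possible way to fetch the files is `snapshot_download` from the `huggingface_hub` package. The sketch below assumes the repository id is `dmux/DR.Q` (inferred, not stated in the text above) and makes no assumption about checkpoint file names or layout. The helper names `fetch_weights` and `list_files` are hypothetical.

```python
from pathlib import Path

REPO_ID = "dmux/DR.Q"  # assumption: repository id for this model card; replace if it differs


def fetch_weights(local_dir: str = "drq_weights") -> Path:
    """Download every file in the repository and return the local folder."""
    from huggingface_hub import snapshot_download  # requires: pip install huggingface_hub
    return Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))


def list_files(root: Path) -> list[str]:
    """List downloaded files, skipping hidden folders such as the hub's local cache."""
    found = []
    for path in Path(root).rglob("*"):
        rel = path.relative_to(root)
        if path.is_file() and not any(part.startswith(".") for part in rel.parts):
            found.append(rel.as_posix())
    return sorted(found)


# Usage (needs network access):
# root = fetch_weights()
# print("\n".join(list_files(root)))
```

How a checkpoint is then loaded into an agent depends on the code in the GitHub repository, so consult it for the expected format.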
+---
+## Training Details
+### Evaluated Benchmark Suites
+| Suite | Obs. Type | Steps | Tasks |
+|---|---|---|---|
+| Gym MuJoCo (Gymnasium) | Proprioceptive | 1M | 5 tasks |
+| DeepMind Control (DMC) — Easy | Proprioceptive | 500K | 21 tasks |
+| DeepMind Control (DMC) — Hard | Proprioceptive | 500K | 7 tasks |
+| DeepMind Control (DMC) — Visual | Pixel (84×84) | 500K | 12 tasks |
+| HumanoidBench (w/o hands) | Proprioceptive | 500K | 14 tasks |
+| HumanoidBench (w/ hands) | Proprioceptive | 500K | 14 tasks |
+### Training Infrastructure
+- **Framework:** PyTorch ≥ 2.3.0
+- **Python:** 3.11 (compatible with 3.9–3.12)
+- **Hardware:** CUDA GPU (CPU also supported)
+- **Seeds:** Results averaged over 10 random seeds with 95% bootstrap confidence intervals
+---
+## Evaluation Results
+All results report the **final average return** at the end of training. Aggregate metrics (IQM, Median, Mean) are computed over the task-specific normalized score. Values in [brackets] denote **95% bootstrap confidence intervals**.
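
For readers unfamiliar with these aggregates, the following NumPy sketch shows one way to compute an interquartile mean (IQM) and a 95% percentile-bootstrap interval from a seeds-by-tasks matrix of normalized scores. It is an assumption-laden illustration: the card does not say which bootstrap variant was used, the resampling scheme here (whole seeds, with replacement) is a simple choice, and the demo scores are synthetic, not DR.Q results.

```python
import numpy as np


def iqm(scores) -> float:
    """Interquartile mean: average of the middle 50% of all values."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of values trimmed from each tail
    return float(flat[cut:flat.size - cut].mean())


def bootstrap_ci(scores, stat=iqm, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling seeds (rows) with replacement."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    n_seeds = scores.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_seeds, size=n_seeds)
        stats[b] = stat(scores[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic normalized scores: 10 seeds x 5 tasks (made-up numbers).
rng = np.random.default_rng(42)
demo = rng.normal(loc=1.0, scale=0.2, size=(10, 5))
point = iqm(demo)
low, high = bootstrap_ci(demo)
print(f"IQM = {point:.3f} [{low:.3f}, {high:.3f}]")
```

Passing `stat=np.median` or `stat=np.mean` gives the corresponding intervals for the other two aggregates.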
+---
+### Gym MuJoCo Tasks (1M environment steps)
+Full comparison against domain-specific and general model-free / model-based RL algorithms. Aggregate metrics are computed over the TD3-normalized score.
+| Task | TD7 | TDMPC2 | MR.Q | FoG | SimbaV2 | **DR.Q** |
+|---|---|---|---|---|---|---|
+| Ant-v4 | **8509** [8168, 8844] | 4751 [2988, 6145] | 6901 [6261, 7482] | 6761 [6161, 7360] | 7429 [7209, 7649] | 8138 [7764, 8511] |
+| HalfCheetah-v4 | **17433** [17301, 17559] | 15078 [14065, 15932] | 12939 [11663, 13762] | 11709 [9928, 13491] | 12022 [11640, 12404] | 14775 [14638, 14912] |
+| Hopper-v4 | 3511 [3236, 3736] | 2081 [1197, 2921] | 2692 [2131, 3309] | 1822 [1316, 2327] | **4054** [3929, 4179] | 2504 [1931, 3077] |
+| Humanoid-v4 | 7428 [7304, 7553] | 6071 [5770, 6333] | 10223 [9929, 10498] | 6737 [6319, 7155] | 10546 [10195, 10897] | **11239** [11052, 11426] |
+| Walker2d-v4 | 6096 [5621, 6547] | 3008 [1706, 4321] | 6039 [5644, 6386] | 5124 [4719, 5529] | **6938** [6691, 7185] | 6422 [5123, 7721] |
+| **IQM** | 1.540 [1.500, 1.580] | 1.050 [0.890, 1.190] | 1.499 [1.361, 1.650] | 1.242 [1.117, 1.349] | 1.637 [1.470, 1.791] | **1.691** [1.473, 1.879] |
+| **Median** | 1.550 [1.450, 1.630] | 1.180 [0.830, 1.220] | 1.488 [1.340, 1.623] | 1.261 [1.080, 1.344] | **1.616** [1.490, 1.744] | 1.564 [1.416, 1.806] |
+| **Mean** | 1.570 [1.540, 1.600] | 1.040 [0.920, 1.150] | 1.465 [1.346, 1.585] | 1.196 [1.082, 1.307] | **1.617** [1.513, 1.718] | 1.608 [1.449, 1.759] |
+---
+### DMC-Easy Tasks (500K steps / 1M env steps with action repeat 2)
+Aggregate metrics reported in units of 1k.
+| Task | MR.Q | SimBa | SimbaV2 | FoG | **DR.Q** |
+|---|---|---|---|---|---|
+| acrobot-swingup | 567 [523, 616] | 431 [379, 482] | 436 [391, 482] | 414 [344, 485] | **569** [519, 619] |
+| ball-in-cup-catch | 981 [979, 984] | 981 [978, 983] | 982 [980, 984] | **983** [981, 985] | 980 [979, 982] |
+| cartpole-balance | **999** [999, 1000] | 998 [998, 999] | 999 [999, 999] | 997 [996, 999] | **999** [999, 1000] |
+| cartpole-balance-sparse | **1000** [1000, 1000] | 991 [973, 1008] | 967 [904, 1030] | **1000** [1000, 1000] | 987 [963, 1012] |
+| cartpole-swingup | 866 [866, 866] | 876 [871, 881] | 880 [876, 883] | **881** [880, 882] | 867 [866, 867] |
+| cartpole-swingup-sparse | 798 [780, 818] | 825 [795, 854] | **848** [848, 849] | 840 [829, 850] | 805 [791, 818] |
+| cheetah-run | 877 [849, 905] | **920** [918, 922] | 821 [642, 913] | 838 [732, 944] | 911 [905, 918] |
+| finger-spin | 937 [917, 956] | 849 [758, 939] | 891 [810, 972] | **987** [986, 989] | 949 [917, 980] |
+| finger-turn-easy | 953 [931, 974] | 935 [903, 968] | 953 [925, 980] | 949 [920, 977] | **956** [932, 980] |
+| finger-turn-hard | 950 [910, 974] | 915 [859, 972] | **951** [925, 977] | 921 [863, 978] | 949 [923, 975] |
+| fish-swim | 792 [773, 810] | 823 [799, 846] | **826** [806, 846] | 744 [701, 786] | 808 [788, 828] |
+| hopper-hop | 251 [195, 301] | **385** [322, 449] | 290 [233, 348] | 335 [326, 345] | 384 [317, 451] |
+| hopper-stand | 951 [948, 955] | 929 [900, 957] | 944 [926, 962] | **956** [953, 959] | 954 [949, 959] |
+| pendulum-swingup | 748 [597, 829] | 737 [575, 899] | 827 [805, 849] | **838** [810, 866] | 835 [819, 852] |
+| quadruped-run | 947 [940, 954] | 928 [916, 939] | 935 [928, 943] | 918 [906, 929] | **953** [949, 957] |
+| quadruped-walk | 963 [959, 967] | 957 [951, 963] | 962 [955, 969] | 963 [960, 966] | **969** [964, 973] |
+| reacher-easy | **983** [983, 985] | **983** [981, 986] | **983** [979, 986] | 980 [971, 990] | 975 [958, 993] |
+| reacher-hard | **977** [975, 980] | 966 [947, 984] | 967 [946, 987] | 965 [944, 986] | 976 [973, 979] |
+| walker-run | 793 [765, 815] | 796 [792, 801] | 817 [812, 821] | **851** [848, 853] | 809 [775, 844] |
+| walker-stand | 988 [987, 990] | 985 [982, 989] | 987 [984, 990] | 987 [985, 989] | **991** [989, 992] |
+| walker-walk | 978 [978, 980] | 975 [972, 978] | 976 [974, 978] | 978 [977, 980] | **979** [976, 982] |
+| **IQM** | 0.936 [0.917, 0.952] | 0.922 [0.905, 0.938] | 0.933 [0.918, 0.948] | 0.935 [0.919, 0.951] | **0.937** [0.920, 0.951] |
+| **Median** | 0.876 [0.847, 0.905] | 0.870 [0.841, 0.896] | 0.875 [0.847, 0.905] | 0.874 [0.845, 0.904] | **0.885** [0.863, 0.912] |
+| **Mean** | 0.874 [0.848, 0.898] | 0.864 [0.840, 0.887] | 0.874 [0.849, 0.897] | 0.873 [0.847, 0.897] | **0.886** [0.865, 0.906] |
+---
+### DMC-Hard Tasks (500K steps / 1M env steps with action repeat 2)
+Aggregate metrics reported in units of 1k.
+| Task | TDMPC2 | MR.Q | SimBa | SimbaV2 | FoG | **DR.Q** |
+|---|---|---|---|---|---|---|
+| dog-run | 265 [166, 342] | 569 [547, 595] | 544 [525, 564] | 562 [516, 608] | 613 [577, 648] | **721** [684, 758] |
+| dog-stand | 506 [266, 715] | 967 [960, 975] | 960 [951, 969] | **981** [977, 985] | 976 [969, 982] | 972 [963, 982] |
+| dog-trot | 407 [265, 530] | 877 [845, 898] | 824 [773, 876] | 861 [772, 950] | 901 [892, 911] | **925** [914, 936] |
+| dog-walk | 486 [240, 704] | 916 [908, 924] | 916 [905, 928] | 935 [927, 944] | 921 [909, 933] | **950** [942, 958] |
+| humanoid-run | 181 [121, 231] | 200 [170, 236] | 181 [171, 191] | 194 [182, 207] | 292 [268, 317] | **465** [444, 485] |
+| humanoid-stand | 658 [506, 745] | 868 [822, 903] | 846 [801, 890] | 916 [886, 945] | 931 [921, 941] | **938** [932, 944] |
+| humanoid-walk | 754 [725, 791] | 662 [610, 724] | 668 [608, 728] | 651 [590, 713] | 878 [839, 917] | **925** [918, 932] |
+| **IQM** | 0.464 [0.305, 0.632] | 0.796 [0.724, 0.860] | 0.773 [0.713, 0.830] | 0.808 [0.726, 0.879] | 0.880 [0.818, 0.914] | **0.917** [0.871, 0.936] |
+| **Median** | 0.486 [0.265, 0.658] | 0.722 [0.654, 0.797] | 0.706 [0.647, 0.772] | 0.729 [0.655, 0.808] | 0.788 [0.724, 0.855] | **0.844** [0.796, 0.893] |
+| **Mean** | 0.465 [0.329, 0.606] | 0.723 [0.660, 0.781] | 0.706 [0.656, 0.755] | 0.729 [0.664, 0.791] | 0.787 [0.730, 0.840] | **0.842** [0.800, 0.881] |
+---
+### DMC Visual Tasks (500K steps / 1M env steps with action repeat 2)
+Pixel-based observations at 84×84 resolution. Aggregate metrics computed over the success normalized score.
+| Task | DrQ-v2 | PPO | TDMPC2 | DreamerV3 | MR.Q | **DR.Q** |
+|---|---|---|---|---|---|---|
+| acrobot-swingup | 168 [127, 219] | 2 [1, 4] | 197 [179, 217] | 121 [106, 145] | 287 [254, 316] | **324** [283, 365] |
+| dog-run | 10 [9, 12] | 11 [9, 14] | 14 [10, 18] | 9 [6, 14] | 60 [44, 80] | **118** [104, 132] |
+| dog-stand | 43 [37, 49] | 51 [48, 56] | 117 [72, 148] | 61 [30, 92] | 216 [201, 232] | **700** [660, 740] |
+| dog-trot | 14 [11, 18] | 13 [12, 15] | 20 [14, 25] | 14 [13, 16] | 65 [55, 79] | **113** [98, 128] |
+| dog-walk | 22 [18, 29] | 16 [14, 18] | 22 [17, 28] | 11 [11, 12] | 77 [71, 83] | **201** [146, 256] |
+| hopper-hop | 224 [170, 278] | 0 [0, 0] | 187 [119, 238] | 205 [125, 287] | 270 [230, 315] | **330** [283, 377] |
+| hopper-stand | 917 [903, 931] | 1 [0, 2] | 582 [321, 794] | 888 [875, 900] | 852 [703, 930] | **937** [930, 944] |
+| humanoid-run | 1 [1, 1] | 1 [1, 1] | 0 [1, 1] | 1 [1, 1] | 1 [1, 2] | **1** [1, 1] |
+| quadruped-run | 459 [412, 507] | 118 [98, 139] | 262 [184, 330] | 328 [255, 397] | 498 [476, 522] | **655** [573, 737] |
+| quadruped-walk | 750 [699, 796] | 149 [113, 184] | 246 [179, 310] | 316 [260, 379] | 833 [797, 867] | **927** [914, 941] |
+| reacher-hard | 705 [580, 831] | 10 [0, 30] | 911 [867, 946] | 338 [227, 461] | **965** [945, 977] | 954 [930, 979] |
+| walker-run | 546 [475, 612] | 39 [35, 44] | 665 [566, 719] | 669 [615, 708] | 615 [571, 655] | **746** [713, 778] |
+| **IQM** | 0.241 [0.214, 0.271] | 0.016 [0.013, 0.018] | 0.154 [0.113, 0.224] | 0.168 [0.152, 0.184] | 0.322 [0.239, 0.423] | **0.494** [0.395, 0.604] |
+| **Median** | 0.191 [0.172, 0.211] | 0.013 [0.012, 0.013] | 0.295 [0.198, 0.339] | 0.134 [0.124, 0.198] | 0.398 [0.320, 0.466] | **0.500** [0.427, 0.576] |
+| **Mean** | 0.321 [0.303, 0.340] | 0.034 [0.031, 0.037] | 0.269 [0.214, 0.326] | 0.247 [0.231, 0.262] | 0.395 [0.335, 0.457] | **0.501** [0.439, 0.564] |
+---
+### HumanoidBench — Without Dexterous Hands (500K steps / 1M env steps with action repeat 2)
+Aggregate metrics computed over the success normalized score.
+| Task | SimBa | SimbaV2 | MR.Q | FoG | **DR.Q** |
+|---|---|---|---|---|---|
+| h1-pole-v0 | 716 [667, 765] | 791 [785, 797] | 578 [534, 623] | **893** [846, 940] | 887 [853, 921] |
+| h1-slide-v0 | 277 [252, 303] | 487 [404, 571] | 303 [270, 337] | **674** [562, 785] | 355 [324, 386] |
+| h1-stair-v0 | 269 [153, 385] | **493** [467, 518] | 235 [213, 257] | 466 [383, 548] | 401 [328, 475] |
+| h1-balance-hard-v0 | 75 [71, 80] | **143** [128, 157] | 69 [67, 72] | 81 [71, 91] | 92 [87, 97] |
+| h1-balance-simple-v0 | 337 [193, 482] | **723** [651, 795] | 135 [110, 160] | 616 [536, 696] | 205 [166, 244] |
+| h1-sit-hard-v0 | 512 [354, 670] | 679 [548, 811] | 553 [421, 686] | 770 [738, 802] | **843** [747, 939] |
+| h1-sit-simple-v0 | 833 [814, 853] | 875 [870, 880] | 850 [819, 882] | 828 [800, 856] | **931** [924, 938] |
+| h1-maze-v0 | 354 [342, 366] | 313 [287, 340] | 344 [340, 347] | 331 [310, 353] | **354** [349, 359] |
+| h1-crawl-v0 | 923 [904, 942] | 946 [933, 959] | 932 [919, 945] | 971 [969, 973] | **973** [972, 974] |
+| h1-hurdle-v0 | 175 [150, 201] | 202 [167, 236] | 131 [108, 155] | 114 [100, 129] | **344** [245, 443] |
+| h1-reach-v0 | 3874 [3220, 4527] | 3850 [3272, 4427] | 4902 [4390, 5414] | 2434 [2083, 2785] | **8101** [7640, 8563] |
+| h1-run-v0 | 232 [185, 279] | 415 [307, 524] | 278 [192, 364] | 749 [666, 832] | **820** [815, 824] |
+| h1-stand-v0 | 772 [701, 843] | 814 [770, 857] | 800 [754, 846] | 671 [516, 825] | **856** [815, 897] |
+| h1-walk-v0 | 550 [391, 709] | 845 [840, 850] | 716 [657, 775] | **866** [859, 872] | 850 [830, 869] |
+| **IQM** | 0.521 [0.413, 0.633] | 0.799 [0.686, 0.908] | 0.519 [0.417, 0.630] | 0.846 [0.713, 0.969] | **0.864** [0.735, 0.976] |
+| **Median** | 0.598 [0.514, 0.692] | 0.781 [0.693, 0.865] | 0.602 [0.516, 0.687] | 0.794 [0.705, 0.899] | **0.823** [0.733, 0.920] |
+| **Mean** | 0.606 [0.536, 0.678] | 0.776 [0.705, 0.849] | 0.604 [0.531, 0.677] | 0.802 [0.721, 0.883] | **0.825** [0.748, 0.902] |
+---
+### HumanoidBench — With Dexterous Hands (500K steps / 1M env steps with action repeat 2)
+Aggregate metrics computed over the success normalized score.
+| Task | DreamerV3 | TDMPC2 | SimBa | SimbaV2 | MR.Q | FoG | **DR.Q** |
+|---|---|---|---|---|---|---|---|
+| h1hand-door-v0 | 10 [7, 13] | 134 [23, 246] | 206 [169, 244] | 310 [302, 318] | 293 [280, 305] | 244 [227, 261] | **320** [308, 333] |
+| h1hand-slide-v0 | 21 [19, 23] | 79 [68, 90] | 67 [55, 79] | 136 [97, 175] | 146 [131, 161] | 201 [173, 228] | **285** [258, 312] |
+| h1hand-stair-v0 | 16 [8, 25] | 43 [35, 51] | 61 [44, 78] | 120 [89, 151] | 127 [104, 150] | 135 [126, 144] | **288** [193, 382] |
+| h1hand-bookshelf-simple-v0 | 45 [41, 50] | 97 [59, 134] | 487 [315, 660] | **838** [834, 843] | 691 [599, 783] | 610 [523, 697] | 709 [572, 846] |
+| h1hand-bookshelf-hard-v0 | 27 [24, 30] | 34 [19, 50] | 490 [447, 533] | 496 [417, 575] | 332 [240, 425] | **577** [548, 605] | 349 [262, 435] |
+| h1hand-sit-simple-v0 | 48 [42, 54] | 607 [268, 947] | 643 [580, 705] | 927 [904, 951] | 653 [568, 737] | 631 [528, 735] | **942** [926, 958] |
+| h1hand-sit-hard-v0 | 15 [11, 20] | 139 [86, 193] | 649 [500, 797] | 724 [609, 838] | 487 [353, 621] | 179 [128, 229] | **891** [841, 941] |
+| h1hand-basketball-v0 | 13 [12, 13] | 47 [21, 73] | 54 [25, 83] | 56 [34, 78] | 53 [34, 72] | **182** [131, 232] | 75 [45, 105] |
+| h1hand-pole-v0 | 48 [36, 60] | 99 [87, 111] | 224 [195, 254] | **493** [426, 559] | 237 [202, 273] | 257 [237, 277] | 424 [299, 549] |
+| h1hand-crawl-v0 | 256 [244, 268] | **897** [858, 935] | 779 [748, 809] | 640 [549, 732] | 807 [783, 831] | 794 [721, 866] | 526 [477, 574] |
+| h1hand-reach-v0 | 864 [578, 1150] | 3610 [2912, 4309] | 3185 [2664, 3707] | 3223 [2703, 3744] | 4101 [3540, 4662] | 2877 [2487, 3267] | **4950** [4280, 5619] |
+| h1hand-run-v0 | 6 [4, 8] | 29 [27, 30] | 31 [24, 37] | 30 [22, 38] | 35 [29, 41] | 22 [19, 25] | **129** [77, 181] |
+| h1hand-stand-v0 | 41 [38, 44] | 193 [147, 238] | 127 [72, 181] | 103 [81, 126] | 300 [194, 405] | 79 [66, 91] | **491** [344, 638] |
+| h1hand-walk-v0 | 19 [12, 27] | 234 [125, 343] | 94 [79, 109] | 64 [52, 76] | 95 [77, 112] | 75 [63, 87] | **512** [371, 652] |
+| **IQM** | 0.019 [0.013, 0.026] | 0.150 [0.091, 0.224] | 0.219 [0.179, 0.267] | 0.298 [0.241, 0.374] | 0.286 [0.245, 0.333] | 0.254 [0.222, 0.285] | **0.452** [0.400, 0.512] |
+| **Median** | 0.021 [0.010, 0.030] | 0.298 [0.147, 0.433] | 0.356 [0.269, 0.413] | 0.420 [0.338, 0.491] | 0.388 [0.313, 0.449] | 0.342 [0.268, 0.395] | **0.529** [0.455, 0.607] |
+| **Mean** | 0.020 [0.011, 0.028] | 0.282 [0.169, 0.413] | 0.345 [0.286, 0.406] | 0.417 [0.356, 0.482] | 0.385 [0.329, 0.443] | 0.336 [0.285, 0.393] | **0.534** [0.473, 0.595] |
+---
+## Citation
+```bibtex
+@inproceedings{lyu2026debiased,
+  title={Debiased Model-based Representations for Sample-efficient Continuous Control},
+  author={Jiafei Lyu and Zichuan Lin and Scott Fujimoto and Kai Yang and Yangkun Chen and Saiyong Yang and Zongqing Lu and Deheng Ye},
+  booktitle={Forty-third International Conference on Machine Learning},
+  year={2026},
+  url={https://openreview.net/forum?id=ZP1p8k106p}
+}
+```
+---
+## Acknowledgements
+DR.Q builds upon the [MR.Q codebase](https://github.com/facebookresearch/MRQ) by Facebook Research. We thank the authors of TD7, TDMPC2, MR.Q, FoG, SimBa, SimbaV2, DrQ-v2, DreamerV3, and PPO for their open-source implementations used as baselines.