Add Hugging Face paper link and framework diagram
#1
by nielsr HF Staff - opened

README.md CHANGED

@@ -1,22 +1,23 @@
  ---
- license: mit
- tags:
- - reinforcement-learning
- - continuous-control
- - model-based-representation
- - mujoco
- - deepmind-control-suite
- - humanoidbench
- - pytorch
- - td3
- - representation-learning
  library_name: pytorch
  pipeline_tag: reinforcement-learning
  ---

  # DR.Q: Debiased Model-based Representations for Sample-efficient Continuous Control

  [](https://openreview.net/forum?id=ZP1p8k106p)
  [](https://github.com/dmksjfl/DR.Q)
  [](https://github.com/dmksjfl/DR.Q/blob/master/LICENSE)

@@ -24,6 +25,11 @@ Official pretrained model weights for **DR.Q**, presented at the **Forty-third I

  > **Authors:** Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye


  ## Model Description

@@ -65,7 +71,7 @@ python main.py --env HBench-h1-run-v0

  ### Loading Pretrained Weights

- Pretrained model weights for all reported tasks are hosted here on HuggingFace

  ## Training Details

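The loading step the README describes could look like the following sketch. The checkpoint filename, the stand-in `Actor` architecture, and the observation/action dimensions are assumptions for illustration; the real DR.Q checkpoint layout may differ.

```python
# Minimal sketch of loading a DR.Q-style PyTorch checkpoint.
# The filename "drq_hb_h1_run.pt" and the Actor architecture are hypothetical.
import torch
import torch.nn as nn


class Actor(nn.Module):
    # Stand-in TD3-style policy network; the real DR.Q actor may differ.
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


# Save a state dict locally as a stand-in for the file hosted on the Hub;
# in practice the file would first be fetched, e.g. with
# huggingface_hub.hf_hub_download(repo_id=..., filename=...).
actor = Actor(obs_dim=51, act_dim=19)
torch.save(actor.state_dict(), "drq_hb_h1_run.pt")

# Restore the weights into a freshly constructed network and run inference.
restored = Actor(obs_dim=51, act_dim=19)
restored.load_state_dict(torch.load("drq_hb_h1_run.pt", map_location="cpu"))
restored.eval()
with torch.no_grad():
    action = restored(torch.zeros(1, 51))
print(action.shape)  # torch.Size([1, 19])
```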
@@ -92,11 +98,8 @@ Pretrained model weights for all reported tasks are hosted here on HuggingFace

  All results report the **final average return** at the end of training. Aggregate metrics (IQM, Median, Mean) are computed over the task-specific normalized score. Values in [brackets] denote **95% bootstrap confidence intervals**.

-
  ### Gym MuJoCo Tasks (1M environment steps)

- Full comparison against domain-specific and general model-free / model-based RL algorithms. Aggregate metrics are computed over the TD3-normalized score.
-
  | Task | TD7 | TDMPC2 | MR.Q | FoG | SimbaV2 | **DR.Q** |
  |---|---|---|---|---|---|---|
  | Ant-v4 | 8509 [8168, 8844] | 4751 [2988, 6145] | 6901 [6261, 7482] | 6761 [6161, 7360] | 7429 [7209, 7649] | **8138** [7764, 8511] |
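The aggregate rows in these tables (IQM, Median, Mean with 95% bootstrap confidence intervals over a normalized score) can be sketched as below. The trimming convention, resample count, and the toy per-task returns are assumptions, not the paper's exact recipe.

```python
# Sketch of the aggregate metrics: TD3-normalized score, interquartile mean
# (IQM), and a 95% percentile-bootstrap confidence interval.
import numpy as np


def iqm(x: np.ndarray) -> float:
    # Interquartile mean: average of the middle 50% of values.
    lo, hi = np.percentile(x, [25, 75])
    return float(x[(x >= lo) & (x <= hi)].mean())


def bootstrap_ci(x: np.ndarray, stat=iqm, n_boot: int = 2000, seed: int = 0):
    # 95% percentile-bootstrap CI: resample with replacement, take percentiles.
    rng = np.random.default_rng(seed)
    draws = [stat(rng.choice(x, size=x.size, replace=True)) for _ in range(n_boot)]
    return tuple(np.percentile(draws, [2.5, 97.5]))


# Toy per-task final returns and TD3 reference returns (illustrative only).
returns = np.array([8138.0, 5200.0, 9800.0, 3600.0, 11200.0])
td3_returns = np.array([8509.0, 4800.0, 9500.0, 3900.0, 10100.0])
normalized = returns / td3_returns  # TD3-normalized score per task

point = iqm(normalized)
low, high = bootstrap_ci(normalized)
print(f"IQM {point:.3f} [{low:.3f}, {high:.3f}]")
```

With per-seed rather than per-task scores the same functions would apply; only the array being bootstrapped changes.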
@@ -108,129 +111,7 @@ Full comparison against domain-specific and general model-free / model-based RL
  | **Median** | 1.550 [1.450, 1.630] | 1.180 [0.830, 1.220] | 1.488 [1.340, 1.623] | 1.261 [1.080, 1.344] | 1.616 [1.490, 1.744] | **1.564** [1.416, 1.806] |
  | **Mean** | 1.570 [1.540, 1.600] | 1.040 [0.920, 1.150] | 1.465 [1.346, 1.585] | 1.196 [1.082, 1.307] | 1.617 [1.513, 1.718] | **1.608** [1.449, 1.759] |

-
- ### DMC-Easy Tasks (500K steps / 1M env steps with action repeat 2)
-
- Aggregate metrics reported in units of 1k.
-
- | Task | MR.Q | Simba | SimbaV2 | FoG | **DR.Q** |
- |---|---|---|---|---|
- | acrobot-swingup | 567 [523, 616] | 431 [379, 482] | 436 [391, 482] | 414 [344, 485] | **569** [519, 619] |
- | ball-in-cup-catch | 981 [979, 984] | 981 [978, 983] | 982 [980, 984] | **983** [981, 985] | 980 [979, 982] |
- | cartpole-balance | **999** [999, 1000] | 998 [998, 999] | 999 [999, 999] | 997 [996, 999] | **999** [999, 1000] |
- | cartpole-balance-sparse | **1000** [1000, 1000] | 991 [973, 1008] | 967 [904, 1030] | **1000** [1000, 1000] | 987 [963, 1012] |
- | cartpole-swingup | 866 [866, 866] | 876 [871, 881] | 880 [876, 883] | **881** [880, 882] | 867 [866, 867] |
- | cartpole-swingup-sparse | 798 [780, 818] | 825 [795, 854] | **848** [848, 849] | 840 [829, 850] | 805 [791, 818] |
- | cheetah-run | 877 [849, 905] | **920** [918, 922] | 821 [642, 913] | 838 [732, 944] | 911 [905, 918] |
- | finger-spin | 937 [917, 956] | 849 [758, 939] | 891 [810, 972] | **987** [986, 989] | 949 [917, 980] |
- | finger-turn-easy | 953 [931, 974] | 935 [903, 968] | 953 [925, 980] | 949 [920, 977] | **956** [932, 980] |
- | finger-turn-hard | 950 [910, 974] | 915 [859, 972] | **951** [925, 977] | 921 [863, 978] | 949 [923, 975] |
- | fish-swim | 792 [773, 810] | 823 [799, 846] | **826** [806, 846] | 744 [701, 786] | 808 [788, 828] |
- | hopper-hop | 251 [195, 301] | **385** [322, 449] | 290 [233, 348] | 335 [326, 345] | 384 [317, 451] |
- | hopper-stand | 951 [948, 955] | 929 [900, 957] | 944 [926, 962] | **956** [953, 959] | 954 [949, 959] |
- | pendulum-swingup | 748 [597, 829] | 737 [575, 899] | 827 [805, 849] | 838 [810, 866] | **835** [819, 852] |
- | quadruped-run | 947 [940, 954] | 928 [916, 939] | 935 [928, 943] | 918 [906, 929] | **953** [949, 957] |
- | quadruped-walk | 963 [959, 967] | 957 [951, 963] | 962 [955, 969] | 963 [960, 966] | **969** [964, 973] |
- | reacher-easy | **983** [983, 985] | **983** [981, 986] | **983** [979, 986] | 980 [971, 990] | 975 [958, 993] |
- | reacher-hard | **977** [975, 980] | 966 [947, 984] | 967 [946, 987] | 965 [944, 986] | 976 [973, 979] |
- | walker-run | 793 [765, 815] | 796 [792, 801] | 817 [812, 821] | **851** [848, 853] | 809 [775, 844] |
- | walker-stand | 988 [987, 990] | 985 [982, 989] | 987 [984, 990] | 987 [985, 989] | **991** [989, 992] |
- | walker-walk | 978 [978, 980] | 975 [972, 978] | 976 [974, 978] | 978 [977, 980] | **979** [976, 982] |
- | **IQM** | 0.936 [0.917, 0.952] | 0.922 [0.905, 0.938] | 0.933 [0.918, 0.948] | 0.935 [0.919, 0.951] | **0.937** [0.920, 0.951] |
- | **Median** | 0.876 [0.847, 0.905] | 0.870 [0.841, 0.896] | 0.875 [0.847, 0.905] | 0.874 [0.845, 0.904] | **0.885** [0.863, 0.912] |
- | **Mean** | 0.874 [0.848, 0.898] | 0.864 [0.840, 0.887] | 0.874 [0.849, 0.897] | 0.873 [0.847, 0.897] | **0.886** [0.865, 0.906] |
-
-
- ### DMC-Hard Tasks (500K steps / 1M env steps with action repeat 2)
-
- Aggregate metrics reported in units of 1k.
-
- | Task | TDMPC2 | MR.Q | Simba | SimbaV2 | FoG | **DR.Q** |
- |---|---|---|---|---|---|---|
- | dog-run | 265 [166, 342] | 569 [547, 595] | 544 [525, 564] | 562 [516, 608] | 613 [577, 648] | **721** [684, 758] |
- | dog-stand | 506 [266, 715] | 967 [960, 975] | 960 [951, 969] | **981** [977, 985] | 976 [969, 982] | 972 [963, 982] |
- | dog-trot | 407 [265, 530] | 877 [845, 898] | 824 [773, 876] | 861 [772, 950] | 901 [892, 911] | **925** [914, 936] |
- | dog-walk | 486 [240, 704] | 916 [908, 924] | 916 [905, 928] | 935 [927, 944] | 921 [909, 933] | **950** [942, 958] |
- | humanoid-run | 181 [121, 231] | 200 [170, 236] | 181 [171, 191] | 194 [182, 207] | 292 [268, 317] | **465** [444, 485] |
- | humanoid-stand | 658 [506, 745] | 868 [822, 903] | 846 [801, 890] | 916 [886, 945] | 931 [921, 941] | **938** [932, 944] |
- | humanoid-walk | 754 [725, 791] | 662 [610, 724] | 668 [608, 728] | 651 [590, 713] | 878 [839, 917] | **925** [918, 932] |
- | **IQM** | 0.464 [0.305, 0.632] | 0.796 [0.724, 0.860] | 0.773 [0.713, 0.830] | 0.808 [0.726, 0.879] | 0.880 [0.818, 0.914] | **0.917** [0.871, 0.936] |
- | **Median** | 0.486 [0.265, 0.658] | 0.722 [0.654, 0.797] | 0.706 [0.647, 0.772] | 0.729 [0.655, 0.808] | 0.788 [0.724, 0.855] | **0.844** [0.796, 0.893] |
- | **Mean** | 0.465 [0.329, 0.606] | 0.723 [0.660, 0.781] | 0.706 [0.656, 0.755] | 0.729 [0.664, 0.791] | 0.787 [0.730, 0.840] | **0.842** [0.800, 0.881] |
-
-
- ### DMC Visual Tasks (500K steps / 1M env steps with action repeat 2)
-
- Pixel-based observations at 84×84 resolution. Aggregate metrics computed over the success normalized score.
-
- | Task | DrQ-v2 | PPO | TDMPC2 | DreamerV3 | MR.Q | **DR.Q** |
- |---|---|---|---|---|---|---|
- | acrobot-swingup | 168 [127, 219] | 2 [1, 4] | 197 [179, 217] | 121 [106, 145] | 287 [254, 316] | **324** [283, 365] |
- | dog-run | 10 [9, 12] | 11 [9, 14] | 14 [10, 18] | 9 [6, 14] | 60 [44, 80] | **118** [104, 132] |
- | dog-stand | 43 [37, 49] | 51 [48, 56] | 117 [72, 148] | 61 [30, 92] | 216 [201, 232] | **700** [660, 740] |
- | dog-trot | 14 [11, 18] | 13 [12, 15] | 20 [14, 25] | 14 [13, 16] | 65 [55, 79] | **113** [98, 128] |
- | dog-walk | 22 [18, 29] | 16 [14, 18] | 22 [17, 28] | 11 [11, 12] | 77 [71, 83] | **201** [146, 256] |
- | hopper-hop | 224 [170, 278] | 0 [0, 0] | 187 [119, 238] | 205 [125, 287] | 270 [230, 315] | **330** [283, 377] |
- | hopper-stand | 917 [903, 931] | 1 [0, 2] | 582 [321, 794] | 888 [875, 900] | 852 [703, 930] | **937** [930, 944] |
- | humanoid-run | 1 [1, 1] | 1 [1, 1] | 0 [1, 1] | 1 [1, 1] | 1 [1, 2] | **1** [1, 1] |
- | quadruped-run | 459 [412, 507] | 118 [98, 139] | 262 [184, 330] | 328 [255, 397] | 498 [476, 522] | **655** [573, 737] |
- | quadruped-walk | 750 [699, 796] | 149 [113, 184] | 246 [179, 310] | 316 [260, 379] | 833 [797, 867] | **927** [914, 941] |
- | reacher-hard | 705 [580, 831] | 10 [0, 30] | **911** [867, 946] | 338 [227, 461] | 965 [945, 977] | 954 [930, 979] |
- | walker-run | 546 [475, 612] | 39 [35, 44] | 665 [566, 719] | 669 [615, 708] | 615 [571, 655] | **746** [713, 778] |
- | **IQM** | 0.241 [0.214, 0.271] | 0.016 [0.013, 0.018] | 0.154 [0.113, 0.224] | 0.168 [0.152, 0.184] | 0.322 [0.239, 0.423] | **0.494** [0.395, 0.604] |
- | **Median** | 0.191 [0.172, 0.211] | 0.013 [0.012, 0.013] | 0.295 [0.198, 0.339] | 0.134 [0.124, 0.198] | 0.398 [0.320, 0.466] | **0.500** [0.427, 0.576] |
- | **Mean** | 0.321 [0.303, 0.340] | 0.034 [0.031, 0.037] | 0.269 [0.214, 0.326] | 0.247 [0.231, 0.262] | 0.395 [0.335, 0.457] | **0.501** [0.439, 0.564] |
-
-
- ### HumanoidBench — Without Dexterous Hands (500K steps / 1M env steps with action repeat 2)
-
- Aggregate metrics computed over the success normalized score.
-
- | Task | Simba | SimbaV2 | MR.Q | FoG | **DR.Q** |
- |---|---|---|---|---|
- | h1-pole-v0 | 716 [667, 765] | 791 [785, 797] | 578 [534, 623] | **893** [846, 940] | 887 [853, 921] |
- | h1-slide-v0 | 277 [252, 303] | 487 [404, 571] | 303 [270, 337] | **674** [562, 785] | 355 [324, 386] |
- | h1-stair-v0 | 269 [153, 385] | **493** [467, 518] | 235 [213, 257] | 466 [383, 548] | 401 [328, 475] |
- | h1-balance-hard-v0 | 75 [71, 80] | 143 [128, 157] | 69 [67, 72] | 81 [71, 91] | **92** [87, 97] |
- | h1-balance-simple-v0 | 337 [193, 482] | **723** [651, 795] | 135 [110, 160] | 616 [536, 696] | 205 [166, 244] |
- | h1-sit-hard-v0 | 512 [354, 670] | 679 [548, 811] | 553 [421, 686] | 770 [738, 802] | **843** [747, 939] |
- | h1-sit-simple-v0 | 833 [814, 853] | 875 [870, 880] | 850 [819, 882] | 828 [800, 856] | **931** [924, 938] |
- | h1-maze-v0 | 354 [342, 366] | 313 [287, 340] | 344 [340, 347] | 331 [310, 353] | **354** [349, 359] |
- | h1-crawl-v0 | 923 [904, 942] | 946 [933, 959] | 932 [919, 945] | 971 [969, 973] | **973** [972, 974] |
- | h1-hurdle-v0 | 175 [150, 201] | 202 [167, 236] | 131 [108, 155] | 114 [100, 129] | **344** [245, 443] |
- | h1-reach-v0 | 3874 [3220, 4527] | 3850 [3272, 4427] | 4902 [4390, 5414] | 2434 [2083, 2785] | **8101** [7640, 8563] |
- | h1-run-v0 | 232 [185, 279] | 415 [307, 524] | 278 [192, 364] | 749 [666, 832] | **820** [815, 824] |
- | h1-stand-v0 | 772 [701, 843] | 814 [770, 857] | 800 [754, 846] | 671 [516, 825] | **856** [815, 897] |
- | h1-walk-v0 | 550 [391, 709] | 845 [840, 850] | 716 [657, 775] | **866** [859, 872] | 850 [830, 869] |
- | **IQM** | 0.521 [0.413, 0.633] | 0.799 [0.686, 0.908] | 0.519 [0.417, 0.630] | 0.846 [0.713, 0.969] | **0.864** [0.735, 0.976] |
- | **Median** | 0.598 [0.514, 0.692] | 0.781 [0.693, 0.865] | 0.602 [0.516, 0.687] | 0.794 [0.705, 0.899] | **0.823** [0.733, 0.920] |
- | **Mean** | 0.606 [0.536, 0.678] | 0.776 [0.705, 0.849] | 0.604 [0.531, 0.677] | 0.802 [0.721, 0.883] | **0.825** [0.748, 0.902] |
-
-
- ### HumanoidBench — With Dexterous Hands (500K steps / 1M env steps with action repeat 2)
-
- Aggregate metrics computed over the success normalized score.
-
- | Task | DreamerV3 | TDMPC2 | SimBa | SimbaV2 | MR.Q | FoG | **DR.Q** |
- |---|---|---|---|---|---|---|---|
- | h1hand-door-v0 | 10 [7, 13] | 134 [23, 246] | 206 [169, 244] | 310 [302, 318] | 293 [280, 305] | 244 [227, 261] | **320** [308, 333] |
- | h1hand-slide-v0 | 21 [19, 23] | 79 [68, 90] | 67 [55, 79] | 136 [97, 175] | 146 [131, 161] | 201 [173, 228] | **285** [258, 312] |
- | h1hand-stair-v0 | 16 [8, 25] | 43 [35, 51] | 61 [44, 78] | 120 [89, 151] | 127 [104, 150] | **135** [126, 144] | 288 [193, 382] |
- | h1hand-bookshelf-simple-v0 | 45 [41, 50] | 97 [59, 134] | 487 [315, 660] | **838** [834, 843] | 691 [599, 783] | 610 [523, 697] | 709 [572, 846] |
- | h1hand-bookshelf-hard-v0 | 27 [24, 30] | 34 [19, 50] | 490 [447, 533] | 496 [417, 575] | 332 [240, 425] | **577** [548, 605] | 349 [262, 435] |
- | h1hand-sit-simple-v0 | 48 [42, 54] | 607 [268, 947] | 643 [580, 705] | 927 [904, 951] | 653 [568, 737] | 631 [528, 735] | **942** [926, 958] |
- | h1hand-sit-hard-v0 | 15 [11, 20] | 139 [86, 193] | 649 [500, 797] | 724 [609, 838] | 487 [353, 621] | 179 [128, 229] | **891** [841, 941] |
- | h1hand-basketball-v0 | 13 [12, 13] | 47 [21, 73] | 54 [25, 83] | 56 [34, 78] | 53 [34, 72] | **182** [131, 232] | 75 [45, 105] |
- | h1hand-pole-v0 | 48 [36, 60] | 99 [87, 111] | 224 [195, 254] | **493** [426, 559] | 237 [202, 273] | 257 [237, 277] | 424 [299, 549] |
- | h1hand-crawl-v0 | 256 [244, 268] | **897** [858, 935] | 779 [748, 809] | 640 [549, 732] | 807 [783, 831] | 794 [721, 866] | 526 [477, 574] |
- | h1hand-reach-v0 | 864 [578, 1150] | 3610 [2912, 4309] | 3185 [2664, 3707] | 3223 [2703, 3744] | 4101 [3540, 4662] | 2877 [2487, 3267] | **4950** [4280, 5619] |
- | h1hand-run-v0 | 6 [4, 8] | 29 [27, 30] | 31 [24, 37] | 30 [22, 38] | **35** [29, 41] | 22 [19, 25] | 129 [77, 181] |
- | h1hand-stand-v0 | 41 [38, 44] | 193 [147, 238] | 127 [72, 181] | 103 [81, 126] | 300 [194, 405] | 79 [66, 91] | **491** [344, 638] |
- | h1hand-walk-v0 | 19 [12, 27] | 234 [125, 343] | 94 [79, 109] | 64 [52, 76] | 95 [77, 112] | 75 [63, 87] | **512** [371, 652] |
- | **IQM** | 0.019 [0.013, 0.026] | 0.150 [0.091, 0.224] | 0.219 [0.179, 0.267] | 0.298 [0.241, 0.374] | 0.286 [0.245, 0.333] | 0.254 [0.222, 0.285] | **0.452** [0.400, 0.512] |
- | **Median** | 0.021 [0.010, 0.030] | 0.298 [0.147, 0.433] | 0.356 [0.269, 0.413] | 0.420 [0.338, 0.491] | 0.388 [0.313, 0.449] | 0.342 [0.268, 0.395] | **0.529** [0.455, 0.607] |
- | **Mean** | 0.020 [0.011, 0.028] | 0.282 [0.169, 0.413] | 0.345 [0.286, 0.406] | 0.417 [0.356, 0.482] | 0.385 [0.329, 0.443] | 0.336 [0.285, 0.393] | **0.534** [0.473, 0.595] |
-

  ## Citation

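The "500K steps / 1M env steps with action repeat 2" convention in the headings above means each policy action is applied twice to the environment, so 500K agent steps consume 1M environment steps. A minimal sketch of an action-repeat wrapper, using a toy stand-in environment rather than the actual DeepMind Control interface:

```python
# Toy environment: counts raw environment steps; not the real DMC API.
class ToyEnv:
    def __init__(self):
        self.env_steps = 0

    def step(self, action):
        self.env_steps += 1
        obs, reward, done = None, 1.0, False
        return obs, reward, done


class ActionRepeat:
    # Apply the same action `repeat` times, summing rewards across frames.
    def __init__(self, env, repeat: int = 2):
        self.env, self.repeat = env, repeat

    def step(self, action):
        total = 0.0
        for _ in range(self.repeat):
            obs, reward, done = self.env.step(action)
            total += reward
            if done:
                break
        return obs, total, done


env = ActionRepeat(ToyEnv(), repeat=2)
for _ in range(500):  # 500 agent steps ...
    env.step(0.0)
print(env.env.env_steps)  # ... consume 1000 environment steps
```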
@@ -244,8 +125,6 @@ Aggregate metrics computed over the success normalized score.
  }
  ```

-
  ## Acknowledgements

- DR.Q builds upon the [MR.Q codebase](https://github.com/facebookresearch/MRQ) by Facebook Research. We thank the authors of TD7, TDMPC2, MR.Q, FoG, SimBa, SimbaV2, DrQ-v2, DreamerV3, and PPO for their open-source implementations used as baselines.
-
  ---
  library_name: pytorch
+ license: mit
  pipeline_tag: reinforcement-learning
+ tags:
+ - reinforcement-learning
+ - continuous-control
+ - model-based-representation
+ - mujoco
+ - deepmind-control-suite
+ - humanoidbench
+ - pytorch
+ - td3
+ - representation-learning
  ---

  # DR.Q: Debiased Model-based Representations for Sample-efficient Continuous Control

  [](https://openreview.net/forum?id=ZP1p8k106p)
+ [](https://huggingface.co/papers/2605.11711)
  [](https://github.com/dmksjfl/DR.Q)
  [](https://github.com/dmksjfl/DR.Q/blob/master/LICENSE)


  > **Authors:** Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye

+ ## 🔍 Overview
+
+ The framework of DR.Q is shown below:
+
+ 

  ## Model Description


  ### Loading Pretrained Weights

+ Pretrained model weights for all reported tasks are hosted here on HuggingFace.

  ## Training Details


  All results report the **final average return** at the end of training. Aggregate metrics (IQM, Median, Mean) are computed over the task-specific normalized score. Values in [brackets] denote **95% bootstrap confidence intervals**.

  ### Gym MuJoCo Tasks (1M environment steps)

  | Task | TD7 | TDMPC2 | MR.Q | FoG | SimbaV2 | **DR.Q** |
  |---|---|---|---|---|---|---|
  | Ant-v4 | 8509 [8168, 8844] | 4751 [2988, 6145] | 6901 [6261, 7482] | 6761 [6161, 7360] | 7429 [7209, 7649] | **8138** [7764, 8511] |
  | **Median** | 1.550 [1.450, 1.630] | 1.180 [0.830, 1.220] | 1.488 [1.340, 1.623] | 1.261 [1.080, 1.344] | 1.616 [1.490, 1.744] | **1.564** [1.416, 1.806] |
  | **Mean** | 1.570 [1.540, 1.600] | 1.040 [0.920, 1.150] | 1.465 [1.346, 1.585] | 1.196 [1.082, 1.307] | 1.617 [1.513, 1.718] | **1.608** [1.449, 1.759] |

+ *(Additional benchmark results for DMC and HumanoidBench are available in the paper).*

  ## Citation

  }
  ```

  ## Acknowledgements

+ DR.Q builds upon the [MR.Q codebase](https://github.com/facebookresearch/MRQ) by Facebook Research. We thank the authors of TD7, TDMPC2, MR.Q, FoG, SimBa, SimbaV2, DrQ-v2, DreamerV3, and PPO for their open-source implementations used as baselines.