---
license: mit
tags:
- reinforcement-learning
- continuous-control
- model-based-representation
- mujoco
- deepmind-control-suite
- humanoidbench
- pytorch
- td3
- representation-learning
library_name: pytorch
pipeline_tag: reinforcement-learning
---

# DR.Q: Debiased Model-based Representations for Sample-efficient Continuous Control

[![Paper](https://img.shields.io/badge/Paper-ICML2026-purple)](https://openreview.net/forum?id=ZP1p8k106p)
[![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/dmksjfl/DR.Q)
[![License](https://img.shields.io/badge/License-MIT-green)](https://github.com/dmksjfl/DR.Q/blob/master/LICENSE)

Official pretrained model weights for **DR.Q**, presented at the **Forty-third International Conference on Machine Learning (ICML 2026)**.

> **Authors:** Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye

---

## Model Description

DR.Q is a **model-free reinforcement learning** algorithm that achieves strong sample efficiency in continuous control by learning *debiased* model-based representations. The key insight is that naively training model-based representations can introduce bias that hurts downstream policy learning. DR.Q addresses two sources of bias:

1. **Representation bias:** mitigated by adding an InfoNCE loss alongside the MSE loss
2. **Sampling bias:** mitigated by faded prioritized experience replay (Faded PER)

DR.Q builds upon and substantially extends the [MR.Q codebase](https://github.com/facebookresearch/MRQ) (Facebook Research).
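The two debiasing components above can be illustrated with a small, dependency-free sketch. It is not the DR.Q implementation and rests on assumptions of our own: latents are plain Python vectors, similarity is cosine similarity scaled by a temperature `tau`, the InfoNCE term is added to the MSE term with a weight `lam`, and the positive for `preds[i]` is `targets[i]` while every other target in the batch acts as a negative. On the sampling side, `per_probabilities` is standard proportional prioritized replay, P(i) = p_i^α / Σ_k p_k^α. This card does not say how Faded PER fades priorities, so `fade_priorities` is a purely hypothetical placeholder rule.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between one predicted and one target latent."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)


def info_nce(preds: List[Vector], targets: List[Vector], tau: float = 0.1) -> float:
    """InfoNCE: preds[i] must pick out targets[i] among all targets in the batch."""
    total = 0.0
    for i, p in enumerate(preds):
        logits = [cosine(p, t) / tau for t in targets]
        m = max(logits)  # shift by the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]  # equals -log softmax(logits)[i]
    return total / len(preds)


def representation_loss(preds, targets, lam: float = 1.0, tau: float = 0.1) -> float:
    """Assumed combination (hypothetical weighting): batch-mean MSE plus lam * InfoNCE."""
    mse_term = sum(mse(p, t) for p, t in zip(preds, targets)) / len(preds)
    return mse_term + lam * info_nce(preds, targets, tau)


def per_probabilities(priorities: List[float], alpha: float = 0.6) -> List[float]:
    """Proportional PER: P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def fade_priorities(priorities: List[float], fade: float = 0.9) -> List[float]:
    """HYPOTHETICAL fading rule, not taken from the paper: shrink every priority
    toward their mean so that old, large priorities lose influence."""
    mean = sum(priorities) / len(priorities)
    return [mean + fade * (p - mean) for p in priorities]


def sample_indices(priorities: List[float], k: int, rng: random.Random) -> List[int]:
    """Draw k buffer indices (with replacement) according to the PER probabilities."""
    return rng.choices(range(len(priorities)), weights=per_probabilities(priorities), k=k)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    # Correct pairings should give a much lower loss than swapped ones.
    print(representation_loss(aligned, aligned), representation_loss(swapped, aligned))
    print(sample_indices(fade_priorities([1.0, 2.0, 8.0]), k=5, rng=random.Random(0)))
```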

## How to Get Started

### Installation

```bash
git clone https://github.com/dmksjfl/DR.Q
cd DR.Q
pip install -r requirements.txt
```

### Training

```bash
# Gym / MuJoCo (1M steps)
python main.py --env Gym-HalfCheetah-v4
python main.py --env Gym-Humanoid-v4

# DeepMind Control Suite — proprioceptive (500K steps)
python main.py --env Dmc-cheetah-run
python main.py --env Dmc-quadruped-walk

# DeepMind Control Suite — pixel observations
python main.py --env Dmc-visual-dog-run
python main.py --env Dmc-visual-walker-walk

# HumanoidBench (requires separate installation)
python main.py --env HBench-h1-run-v0
```

### Loading Pretrained Weights

Pretrained model weights for all reported tasks are hosted here on Hugging Face.

---

## Training Details

### Evaluated Benchmark Suites

| Suite | Obs. Type | Steps | Tasks |
|---|---|---|---|
| Gym MuJoCo (Gymnasium) | Proprioceptive | 1M | 5 tasks |
| DeepMind Control (DMC) — Easy | Proprioceptive | 500K | 21 tasks |
| DeepMind Control (DMC) — Hard | Proprioceptive | 500K | 7 tasks |
| DeepMind Control (DMC) — Visual | Pixel (84×84) | 500K | 12 tasks |
| HumanoidBench (w/o hands) | Proprioceptive | 500K | 14 tasks |
| HumanoidBench (w/ hands) | Proprioceptive | 500K | 14 tasks |

### Training Infrastructure

- **Framework:** PyTorch ≥ 2.3.0
- **Python:** 3.11 (compatible with 3.9–3.12)
- **Hardware:** CUDA GPU (CPU also supported)
- **Seeds:** Results are averaged over 10 random seeds and reported with 95% bootstrap confidence intervals

---

## Evaluation Results

All results report the **average return at the end of training**. Aggregate metrics (IQM, median, mean) are computed over task-specific normalized scores. Values in [brackets] denote **95% bootstrap confidence intervals**.
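The aggregate metrics can be sketched in plain Python. IQM (interquartile mean) is the mean of the middle 50% of scores, after discarding the lowest and highest quarter. This sketch assumes that scores are pooled over all (task, seed) runs, in the style of the `rliable` library, and that confidence intervals come from a simple percentile bootstrap over runs; the authors' exact resampling scheme (for example, stratified by task) and number of bootstrap replicates are not stated in this card, so `n_boot` and `seed` are illustrative choices.

```python
import random
import statistics
from typing import Callable, Dict, Sequence, Tuple

Stat = Callable[[Sequence[float]], float]


def iqm(scores: Sequence[float]) -> float:
    """Interquartile mean: drop the lowest and highest quarter, average the rest."""
    xs = sorted(scores)
    cut = len(xs) // 4  # number of runs trimmed from each end
    return statistics.fmean(xs[cut:len(xs) - cut])


def bootstrap_ci(scores: Sequence[float], stat: Stat, n_boot: int = 2000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap: resample runs with replacement and recompute stat."""
    rng = random.Random(seed)
    n = len(scores)
    reps = sorted(stat([scores[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot))
    tail = (1.0 - level) / 2.0
    lo = reps[int(tail * n_boot)]
    hi = reps[min(n_boot - 1, int((1.0 - tail) * n_boot))]
    return lo, hi


def aggregate(scores: Sequence[float]) -> Dict[str, Tuple[float, Tuple[float, float]]]:
    """Point estimate and CI for each aggregate metric reported in the tables."""
    metrics: Dict[str, Stat] = {"IQM": iqm, "Median": statistics.median, "Mean": statistics.fmean}
    return {name: (fn(scores), bootstrap_ci(scores, fn)) for name, fn in metrics.items()}
```

In this sketch `scores` would hold one normalized score per (task, seed) run for a single method on a single suite. Reproducing the exact numbers in the tables would also require matching the authors' resampling scheme.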

---

### Gym MuJoCo Tasks (1M environment steps)

Full comparison against domain-specific and general model-free / model-based RL algorithms. Aggregate metrics are computed over the TD3-normalized score.

| Task | TD7 | TDMPC2 | MR.Q | FoG | SimbaV2 | **DR.Q** |
|---|---|---|---|---|---|---|
| Ant-v4 | **8509** [8168, 8844] | 4751 [2988, 6145] | 6901 [6261, 7482] | 6761 [6161, 7360] | 7429 [7209, 7649] | 8138 [7764, 8511] |
| HalfCheetah-v4 | **17433** [17301, 17559] | 15078 [14065, 15932] | 12939 [11663, 13762] | 11709 [9928, 13491] | 12022 [11640, 12404] | 14775 [14638, 14912] |
| Hopper-v4 | 3511 [3236, 3736] | 2081 [1197, 2921] | 2692 [2131, 3309] | 1822 [1316, 2327] | **4054** [3929, 4179] | 2504 [1931, 3077] |
| Humanoid-v4 | 7428 [7304, 7553] | 6071 [5770, 6333] | 10223 [9929, 10498] | 6737 [6319, 7155] | 10546 [10195, 10897] | **11239** [11052, 11426] |
| Walker2d-v4 | 6096 [5621, 6547] | 3008 [1706, 4321] | 6039 [5644, 6386] | 5124 [4719, 5529] | **6938** [6691, 7185] | 6422 [5123, 7721] |
| **IQM** | 1.540 [1.500, 1.580] | 1.050 [0.890, 1.190] | 1.499 [1.361, 1.650] | 1.242 [1.117, 1.349] | 1.637 [1.470, 1.791] | **1.691** [1.473, 1.879] |
| **Median** | 1.550 [1.450, 1.630] | 1.180 [0.830, 1.220] | 1.488 [1.340, 1.623] | 1.261 [1.080, 1.344] | **1.616** [1.490, 1.744] | 1.564 [1.416, 1.806] |
| **Mean** | 1.570 [1.540, 1.600] | 1.040 [0.920, 1.150] | 1.465 [1.346, 1.585] | 1.196 [1.082, 1.307] | **1.617** [1.513, 1.718] | 1.608 [1.449, 1.759] |
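This card does not define the TD3-normalized score. A common convention, which this sketch assumes, rescales each task so that a random policy scores 0 and the reference algorithm (here TD3) scores 1. The per-task random and TD3 reference returns are inputs the caller must supply; none are given in this card, and the function names are hypothetical.

```python
from typing import Dict


def reference_normalize(score: float, random_score: float, reference_score: float) -> float:
    """Assumed convention: random-policy return -> 0, reference (e.g. TD3) return -> 1."""
    if reference_score == random_score:
        raise ValueError("reference and random scores must differ")
    return (score - random_score) / (reference_score - random_score)


def normalize_suite(
    returns: Dict[str, float],
    random_scores: Dict[str, float],
    reference_scores: Dict[str, float],
) -> Dict[str, float]:
    """Normalize each task's return with its own task-specific reference values."""
    return {
        task: reference_normalize(r, random_scores[task], reference_scores[task])
        for task, r in returns.items()
    }
```

Under this assumed convention, a value above 1 means a return above the TD3 reference on that task, and a return halfway between random and TD3 maps to 0.5.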

---

### DMC-Easy Tasks (500K steps / 1M env steps with action repeat 2)

Aggregate metrics are reported in units of 1k (i.e., return divided by 1,000).

| Task | MR.Q | Simba | SimbaV2 | FoG | **DR.Q** |
|---|---|---|---|---|---|
| acrobot-swingup | 567 [523, 616] | 431 [379, 482] | 436 [391, 482] | 414 [344, 485] | **569** [519, 619] |
| ball-in-cup-catch | 981 [979, 984] | 981 [978, 983] | 982 [980, 984] | **983** [981, 985] | 980 [979, 982] |
| cartpole-balance | **999** [999, 1000] | 998 [998, 999] | **999** [999, 999] | 997 [996, 999] | **999** [999, 1000] |
| cartpole-balance-sparse | **1000** [1000, 1000] | 991 [973, 1008] | 967 [904, 1030] | **1000** [1000, 1000] | 987 [963, 1012] |
| cartpole-swingup | 866 [866, 866] | 876 [871, 881] | 880 [876, 883] | **881** [880, 882] | 867 [866, 867] |
| cartpole-swingup-sparse | 798 [780, 818] | 825 [795, 854] | **848** [848, 849] | 840 [829, 850] | 805 [791, 818] |
| cheetah-run | 877 [849, 905] | **920** [918, 922] | 821 [642, 913] | 838 [732, 944] | 911 [905, 918] |
| finger-spin | 937 [917, 956] | 849 [758, 939] | 891 [810, 972] | **987** [986, 989] | 949 [917, 980] |
| finger-turn-easy | 953 [931, 974] | 935 [903, 968] | 953 [925, 980] | 949 [920, 977] | **956** [932, 980] |
| finger-turn-hard | 950 [910, 974] | 915 [859, 972] | **951** [925, 977] | 921 [863, 978] | 949 [923, 975] |
| fish-swim | 792 [773, 810] | 823 [799, 846] | **826** [806, 846] | 744 [701, 786] | 808 [788, 828] |
| hopper-hop | 251 [195, 301] | **385** [322, 449] | 290 [233, 348] | 335 [326, 345] | 384 [317, 451] |
| hopper-stand | 951 [948, 955] | 929 [900, 957] | 944 [926, 962] | **956** [953, 959] | 954 [949, 959] |
| pendulum-swingup | 748 [597, 829] | 737 [575, 899] | 827 [805, 849] | **838** [810, 866] | 835 [819, 852] |
| quadruped-run | 947 [940, 954] | 928 [916, 939] | 935 [928, 943] | 918 [906, 929] | **953** [949, 957] |
| quadruped-walk | 963 [959, 967] | 957 [951, 963] | 962 [955, 969] | 963 [960, 966] | **969** [964, 973] |
| reacher-easy | **983** [983, 985] | **983** [981, 986] | **983** [979, 986] | 980 [971, 990] | 975 [958, 993] |
| reacher-hard | **977** [975, 980] | 966 [947, 984] | 967 [946, 987] | 965 [944, 986] | 976 [973, 979] |
| walker-run | 793 [765, 815] | 796 [792, 801] | 817 [812, 821] | **851** [848, 853] | 809 [775, 844] |
| walker-stand | 988 [987, 990] | 985 [982, 989] | 987 [984, 990] | 987 [985, 989] | **991** [989, 992] |
| walker-walk | 978 [978, 980] | 975 [972, 978] | 976 [974, 978] | 978 [977, 980] | **979** [976, 982] |
| **IQM** | 0.936 [0.917, 0.952] | 0.922 [0.905, 0.938] | 0.933 [0.918, 0.948] | 0.935 [0.919, 0.951] | **0.937** [0.920, 0.951] |
| **Median** | 0.876 [0.847, 0.905] | 0.870 [0.841, 0.896] | 0.875 [0.847, 0.905] | 0.874 [0.845, 0.904] | **0.885** [0.863, 0.912] |
| **Mean** | 0.874 [0.848, 0.898] | 0.864 [0.840, 0.887] | 0.874 [0.849, 0.897] | 0.873 [0.847, 0.897] | **0.886** [0.865, 0.906] |
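As a quick consistency check of the "units of 1k" convention, the Mean row can be recomputed from the per-task DR.Q point estimates in the table: their average divided by 1,000 reproduces 0.886. The same shortcut does not reproduce the Median row (the median of the 21 task averages is 0.953, versus 0.885 reported), which suggests, though this card does not say so, that the median and IQM are computed over individual runs rather than over task averages.

```python
import statistics

# DR.Q point estimates copied from the DMC-Easy table above (raw episode returns).
DRQ_DMC_EASY = {
    "acrobot-swingup": 569, "ball-in-cup-catch": 980, "cartpole-balance": 999,
    "cartpole-balance-sparse": 987, "cartpole-swingup": 867, "cartpole-swingup-sparse": 805,
    "cheetah-run": 911, "finger-spin": 949, "finger-turn-easy": 956,
    "finger-turn-hard": 949, "fish-swim": 808, "hopper-hop": 384,
    "hopper-stand": 954, "pendulum-swingup": 835, "quadruped-run": 953,
    "quadruped-walk": 969, "reacher-easy": 975, "reacher-hard": 976,
    "walker-run": 809, "walker-stand": 991, "walker-walk": 979,
}

# "Units of 1k": divide raw returns by 1,000.
mean_1k = statistics.fmean(DRQ_DMC_EASY.values()) / 1000.0
median_of_task_means_1k = statistics.median(DRQ_DMC_EASY.values()) / 1000.0

print(f"mean over tasks: {mean_1k:.3f}")  # 0.886, matching the Mean row
print(f"median of task means: {median_of_task_means_1k:.3f}")  # 0.953, not the reported 0.885
```

Because the mean is linear, averaging task means gives the same result as averaging all runs when every task has the same number of seeds; the median and IQM have no such shortcut.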

---

### DMC-Hard Tasks (500K steps / 1M env steps with action repeat 2)

Aggregate metrics are reported in units of 1k (i.e., return divided by 1,000).

| Task | TDMPC2 | MR.Q | Simba | SimbaV2 | FoG | **DR.Q** |
|---|---|---|---|---|---|---|
| dog-run | 265 [166, 342] | 569 [547, 595] | 544 [525, 564] | 562 [516, 608] | 613 [577, 648] | **721** [684, 758] |
| dog-stand | 506 [266, 715] | 967 [960, 975] | 960 [951, 969] | **981** [977, 985] | 976 [969, 982] | 972 [963, 982] |
| dog-trot | 407 [265, 530] | 877 [845, 898] | 824 [773, 876] | 861 [772, 950] | 901 [892, 911] | **925** [914, 936] |
| dog-walk | 486 [240, 704] | 916 [908, 924] | 916 [905, 928] | 935 [927, 944] | 921 [909, 933] | **950** [942, 958] |
| humanoid-run | 181 [121, 231] | 200 [170, 236] | 181 [171, 191] | 194 [182, 207] | 292 [268, 317] | **465** [444, 485] |
| humanoid-stand | 658 [506, 745] | 868 [822, 903] | 846 [801, 890] | 916 [886, 945] | 931 [921, 941] | **938** [932, 944] |
| humanoid-walk | 754 [725, 791] | 662 [610, 724] | 668 [608, 728] | 651 [590, 713] | 878 [839, 917] | **925** [918, 932] |
| **IQM** | 0.464 [0.305, 0.632] | 0.796 [0.724, 0.860] | 0.773 [0.713, 0.830] | 0.808 [0.726, 0.879] | 0.880 [0.818, 0.914] | **0.917** [0.871, 0.936] |
| **Median** | 0.486 [0.265, 0.658] | 0.722 [0.654, 0.797] | 0.706 [0.647, 0.772] | 0.729 [0.655, 0.808] | 0.788 [0.724, 0.855] | **0.844** [0.796, 0.893] |
| **Mean** | 0.465 [0.329, 0.606] | 0.723 [0.660, 0.781] | 0.706 [0.656, 0.755] | 0.729 [0.664, 0.791] | 0.787 [0.730, 0.840] | **0.842** [0.800, 0.881] |

---

### DMC Visual Tasks (500K steps / 1M env steps with action repeat 2)

Pixel-based observations at 84×84 resolution. Aggregate metrics are reported in units of 1k (i.e., return divided by 1,000).

| Task | DrQ-v2 | PPO | TDMPC2 | DreamerV3 | MR.Q | **DR.Q** |
|---|---|---|---|---|---|---|
| acrobot-swingup | 168 [127, 219] | 2 [1, 4] | 197 [179, 217] | 121 [106, 145] | 287 [254, 316] | **324** [283, 365] |
| dog-run | 10 [9, 12] | 11 [9, 14] | 14 [10, 18] | 9 [6, 14] | 60 [44, 80] | **118** [104, 132] |
| dog-stand | 43 [37, 49] | 51 [48, 56] | 117 [72, 148] | 61 [30, 92] | 216 [201, 232] | **700** [660, 740] |
| dog-trot | 14 [11, 18] | 13 [12, 15] | 20 [14, 25] | 14 [13, 16] | 65 [55, 79] | **113** [98, 128] |
| dog-walk | 22 [18, 29] | 16 [14, 18] | 22 [17, 28] | 11 [11, 12] | 77 [71, 83] | **201** [146, 256] |
| hopper-hop | 224 [170, 278] | 0 [0, 0] | 187 [119, 238] | 205 [125, 287] | 270 [230, 315] | **330** [283, 377] |
| hopper-stand | 917 [903, 931] | 1 [0, 2] | 582 [321, 794] | 888 [875, 900] | 852 [703, 930] | **937** [930, 944] |
| humanoid-run | 1 [1, 1] | 1 [1, 1] | 0 [1, 1] | 1 [1, 1] | 1 [1, 2] | **1** [1, 1] |
| quadruped-run | 459 [412, 507] | 118 [98, 139] | 262 [184, 330] | 328 [255, 397] | 498 [476, 522] | **655** [573, 737] |
| quadruped-walk | 750 [699, 796] | 149 [113, 184] | 246 [179, 310] | 316 [260, 379] | 833 [797, 867] | **927** [914, 941] |
| reacher-hard | 705 [580, 831] | 10 [0, 30] | 911 [867, 946] | 338 [227, 461] | **965** [945, 977] | 954 [930, 979] |
| walker-run | 546 [475, 612] | 39 [35, 44] | 665 [566, 719] | 669 [615, 708] | 615 [571, 655] | **746** [713, 778] |
| **IQM** | 0.241 [0.214, 0.271] | 0.016 [0.013, 0.018] | 0.154 [0.113, 0.224] | 0.168 [0.152, 0.184] | 0.322 [0.239, 0.423] | **0.494** [0.395, 0.604] |
| **Median** | 0.191 [0.172, 0.211] | 0.013 [0.012, 0.013] | 0.295 [0.198, 0.339] | 0.134 [0.124, 0.198] | 0.398 [0.320, 0.466] | **0.500** [0.427, 0.576] |
| **Mean** | 0.321 [0.303, 0.340] | 0.034 [0.031, 0.037] | 0.269 [0.214, 0.326] | 0.247 [0.231, 0.262] | 0.395 [0.335, 0.457] | **0.501** [0.439, 0.564] |

---
### HumanoidBench — Without Dexterous Hands (500K steps / 1M env steps with action repeat 2)

Aggregate metrics are computed over the success-normalized score.

| Task | Simba | SimbaV2 | MR.Q | FoG | **DR.Q** |
|---|---|---|---|---|---|
| h1-pole-v0 | 716 [667, 765] | 791 [785, 797] | 578 [534, 623] | **893** [846, 940] | 887 [853, 921] |
| h1-slide-v0 | 277 [252, 303] | 487 [404, 571] | 303 [270, 337] | **674** [562, 785] | 355 [324, 386] |
| h1-stair-v0 | 269 [153, 385] | **493** [467, 518] | 235 [213, 257] | 466 [383, 548] | 401 [328, 475] |
| h1-balance-hard-v0 | 75 [71, 80] | **143** [128, 157] | 69 [67, 72] | 81 [71, 91] | 92 [87, 97] |
| h1-balance-simple-v0 | 337 [193, 482] | **723** [651, 795] | 135 [110, 160] | 616 [536, 696] | 205 [166, 244] |
| h1-sit-hard-v0 | 512 [354, 670] | 679 [548, 811] | 553 [421, 686] | 770 [738, 802] | **843** [747, 939] |
| h1-sit-simple-v0 | 833 [814, 853] | 875 [870, 880] | 850 [819, 882] | 828 [800, 856] | **931** [924, 938] |
| h1-maze-v0 | **354** [342, 366] | 313 [287, 340] | 344 [340, 347] | 331 [310, 353] | **354** [349, 359] |
| h1-crawl-v0 | 923 [904, 942] | 946 [933, 959] | 932 [919, 945] | 971 [969, 973] | **973** [972, 974] |
| h1-hurdle-v0 | 175 [150, 201] | 202 [167, 236] | 131 [108, 155] | 114 [100, 129] | **344** [245, 443] |
| h1-reach-v0 | 3874 [3220, 4527] | 3850 [3272, 4427] | 4902 [4390, 5414] | 2434 [2083, 2785] | **8101** [7640, 8563] |
| h1-run-v0 | 232 [185, 279] | 415 [307, 524] | 278 [192, 364] | 749 [666, 832] | **820** [815, 824] |
| h1-stand-v0 | 772 [701, 843] | 814 [770, 857] | 800 [754, 846] | 671 [516, 825] | **856** [815, 897] |
| h1-walk-v0 | 550 [391, 709] | 845 [840, 850] | 716 [657, 775] | **866** [859, 872] | 850 [830, 869] |
| **IQM** | 0.521 [0.413, 0.633] | 0.799 [0.686, 0.908] | 0.519 [0.417, 0.630] | 0.846 [0.713, 0.969] | **0.864** [0.735, 0.976] |
| **Median** | 0.598 [0.514, 0.692] | 0.781 [0.693, 0.865] | 0.602 [0.516, 0.687] | 0.794 [0.705, 0.899] | **0.823** [0.733, 0.920] |
| **Mean** | 0.606 [0.536, 0.678] | 0.776 [0.705, 0.849] | 0.604 [0.531, 0.677] | 0.802 [0.721, 0.883] | **0.825** [0.748, 0.902] |

---

### HumanoidBench — With Dexterous Hands (500K steps / 1M env steps with action repeat 2)

Aggregate metrics are computed over the success-normalized score.

| Task | DreamerV3 | TDMPC2 | Simba | SimbaV2 | MR.Q | FoG | **DR.Q** |
|---|---|---|---|---|---|---|---|
| h1hand-door-v0 | 10 [7, 13] | 134 [23, 246] | 206 [169, 244] | 310 [302, 318] | 293 [280, 305] | 244 [227, 261] | **320** [308, 333] |
| h1hand-slide-v0 | 21 [19, 23] | 79 [68, 90] | 67 [55, 79] | 136 [97, 175] | 146 [131, 161] | 201 [173, 228] | **285** [258, 312] |
| h1hand-stair-v0 | 16 [8, 25] | 43 [35, 51] | 61 [44, 78] | 120 [89, 151] | 127 [104, 150] | 135 [126, 144] | **288** [193, 382] |
| h1hand-bookshelf-simple-v0 | 45 [41, 50] | 97 [59, 134] | 487 [315, 660] | **838** [834, 843] | 691 [599, 783] | 610 [523, 697] | 709 [572, 846] |
| h1hand-bookshelf-hard-v0 | 27 [24, 30] | 34 [19, 50] | 490 [447, 533] | 496 [417, 575] | 332 [240, 425] | **577** [548, 605] | 349 [262, 435] |
| h1hand-sit-simple-v0 | 48 [42, 54] | 607 [268, 947] | 643 [580, 705] | 927 [904, 951] | 653 [568, 737] | 631 [528, 735] | **942** [926, 958] |
| h1hand-sit-hard-v0 | 15 [11, 20] | 139 [86, 193] | 649 [500, 797] | 724 [609, 838] | 487 [353, 621] | 179 [128, 229] | **891** [841, 941] |
| h1hand-basketball-v0 | 13 [12, 13] | 47 [21, 73] | 54 [25, 83] | 56 [34, 78] | 53 [34, 72] | **182** [131, 232] | 75 [45, 105] |
| h1hand-pole-v0 | 48 [36, 60] | 99 [87, 111] | 224 [195, 254] | **493** [426, 559] | 237 [202, 273] | 257 [237, 277] | 424 [299, 549] |
| h1hand-crawl-v0 | 256 [244, 268] | **897** [858, 935] | 779 [748, 809] | 640 [549, 732] | 807 [783, 831] | 794 [721, 866] | 526 [477, 574] |
| h1hand-reach-v0 | 864 [578, 1150] | 3610 [2912, 4309] | 3185 [2664, 3707] | 3223 [2703, 3744] | 4101 [3540, 4662] | 2877 [2487, 3267] | **4950** [4280, 5619] |
| h1hand-run-v0 | 6 [4, 8] | 29 [27, 30] | 31 [24, 37] | 30 [22, 38] | 35 [29, 41] | 22 [19, 25] | **129** [77, 181] |
| h1hand-stand-v0 | 41 [38, 44] | 193 [147, 238] | 127 [72, 181] | 103 [81, 126] | 300 [194, 405] | 79 [66, 91] | **491** [344, 638] |
| h1hand-walk-v0 | 19 [12, 27] | 234 [125, 343] | 94 [79, 109] | 64 [52, 76] | 95 [77, 112] | 75 [63, 87] | **512** [371, 652] |
| **IQM** | 0.019 [0.013, 0.026] | 0.150 [0.091, 0.224] | 0.219 [0.179, 0.267] | 0.298 [0.241, 0.374] | 0.286 [0.245, 0.333] | 0.254 [0.222, 0.285] | **0.452** [0.400, 0.512] |
| **Median** | 0.021 [0.010, 0.030] | 0.298 [0.147, 0.433] | 0.356 [0.269, 0.413] | 0.420 [0.338, 0.491] | 0.388 [0.313, 0.449] | 0.342 [0.268, 0.395] | **0.529** [0.455, 0.607] |
| **Mean** | 0.020 [0.011, 0.028] | 0.282 [0.169, 0.413] | 0.345 [0.286, 0.406] | 0.417 [0.356, 0.482] | 0.385 [0.329, 0.443] | 0.336 [0.285, 0.393] | **0.534** [0.473, 0.595] |

---
## Citation

```bibtex
@inproceedings{lyu2026debiased,
  title={Debiased Model-based Representations for Sample-efficient Continuous Control},
  author={Jiafei Lyu and Zichuan Lin and Scott Fujimoto and Kai Yang and Yangkun Chen and Saiyong Yang and Zongqing Lu and Deheng Ye},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=ZP1p8k106p}
}
```

---

## Acknowledgements

DR.Q builds upon the [MR.Q codebase](https://github.com/facebookresearch/MRQ) by Facebook Research. We thank the authors of TD7, TDMPC2, MR.Q, FoG, SimBa, SimbaV2, DrQ-v2, DreamerV3, and PPO for their open-source implementations used as baselines.