DR.Q: Debiased Model-based Representations for Sample-efficient Continuous Control
Official pretrained model weights for DR.Q, presented at the Forty-third International Conference on Machine Learning (ICML 2026).
Authors: Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye
Model Description
DR.Q is a model-free reinforcement learning algorithm that achieves strong sample efficiency in continuous control by learning debiased model-based representations. The key insight is that naively training model-based representations can introduce representation bias that hurts downstream policy learning. DR.Q debiases two sources of bias:
- Representation bias: mitigated by adding an InfoNCE loss alongside the MSE loss
- Sampling bias: mitigated by introducing faded prioritized experience replay (Faded PER)
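As a rough illustration of the first point, an MSE dynamics loss regularized by an in-batch InfoNCE term might look like the sketch below. This is only a sketch of the idea, not the official DR.Q implementation; the function name, temperature, and weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def debiased_representation_loss(pred_latent, target_latent,
                                 temperature=0.1, infonce_weight=1.0):
    """Illustrative combination of an MSE prediction loss with InfoNCE.

    NOTE: hypothetical sketch; hyperparameters and naming are assumptions,
    not the official DR.Q code.
    """
    # Standard model-based MSE between predicted and target latents.
    mse = F.mse_loss(pred_latent, target_latent)

    # InfoNCE: each predicted latent should match its own target latent
    # against the other latents in the batch (in-batch negatives).
    z = F.normalize(pred_latent, dim=-1)
    t = F.normalize(target_latent, dim=-1)
    logits = z @ t.T / temperature                      # (B, B) similarities
    labels = torch.arange(z.shape[0], device=z.device)  # positives on diagonal
    infonce = F.cross_entropy(logits, labels)

    return mse + infonce_weight * infonce
```

The InfoNCE term keeps latents discriminative across the batch, which is one common way to counteract the representation collapse that a pure MSE objective can induce.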
DR.Q builds upon and substantially extends the MR.Q codebase (Facebook Research).
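The faded prioritized replay mentioned above can be pictured as a PER buffer whose priority exponent is annealed toward zero, so sampling gradually becomes uniform. This is an illustrative guess at the mechanism based on its name; the class, schedule, and hyperparameters below are hypothetical, not the official DR.Q implementation.

```python
import numpy as np

class FadedPER:
    """Minimal sketch of prioritized replay whose prioritization fades.

    NOTE: hypothetical; the linear fade schedule and all names/defaults
    are assumptions for illustration only.
    """

    def __init__(self, capacity, alpha0=0.6, fade_steps=100_000):
        self.capacity = capacity
        self.alpha0 = alpha0          # initial priority exponent
        self.fade_steps = fade_steps  # steps until sampling is uniform
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0
        self.steps = 0

    def add(self, transition, priority=1.0):
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Linearly fade the exponent toward 0; alpha = 0 is uniform sampling.
        alpha = self.alpha0 * max(0.0, 1.0 - self.steps / self.fade_steps)
        self.steps += 1
        p = self.priorities[:len(self.data)] ** alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return idx, [self.data[i] for i in idx]
```

Fading the exponent lets early training exploit informative transitions while avoiding the long-run sampling bias that a fixed priority exponent introduces.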
How to Get Started
Installation
```bash
git clone https://github.com/dmksjfl/DR.Q
cd DR.Q
pip install -r requirements.txt
```
Training
```bash
# Gym / MuJoCo (1M steps)
python main.py --env Gym-HalfCheetah-v4
python main.py --env Gym-Humanoid-v4

# DeepMind Control Suite - proprioceptive (500K steps)
python main.py --env Dmc-cheetah-run
python main.py --env Dmc-quadruped-walk

# DeepMind Control Suite - pixel observations
python main.py --env Dmc-visual-dog-run
python main.py --env Dmc-visual-walker-walk

# HumanoidBench (requires separate installation)
python main.py --env HBench-h1-run-v0
```
Loading Pretrained Weights
Pretrained model weights for all reported tasks are hosted in this HuggingFace repository.
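Checkpoints are restored through the repository's own scripts; as a generic illustration, the usual PyTorch state-dict round trip looks like the sketch below. The file name and the placeholder MLP are hypothetical, not the actual DR.Q architecture or checkpoint layout.

```python
import torch
import torch.nn as nn

# Placeholder network: NOT the DR.Q architecture, just an example module.
def make_policy():
    return nn.Sequential(nn.Linear(17, 256), nn.ReLU(), nn.Linear(256, 6))

policy = make_policy()
# Stand-in for a downloaded checkpoint file (name is hypothetical).
torch.save(policy.state_dict(), "drq_policy.pt")

# Restore into a freshly constructed network of the same shape.
restored = make_policy()
restored.load_state_dict(torch.load("drq_policy.pt", map_location="cpu"))
restored.eval()  # switch to inference mode for evaluation
```

`map_location="cpu"` makes GPU-trained checkpoints loadable on CPU-only machines.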
Training Details
Evaluated Benchmark Suites
| Suite | Obs. Type | Steps | Tasks |
|---|---|---|---|
| Gym MuJoCo (Gymnasium) | Proprioceptive | 1M | 5 tasks |
| DeepMind Control (DMC) – Easy | Proprioceptive | 500K | 21 tasks |
| DeepMind Control (DMC) – Hard | Proprioceptive | 500K | 7 tasks |
| DeepMind Control (DMC) – Visual | Pixel (84×84) | 500K | 12 tasks |
| HumanoidBench (w/o hands) | Proprioceptive | 500K | 14 tasks |
| HumanoidBench (w/ hands) | Proprioceptive | 500K | 14 tasks |
Training Infrastructure
- Framework: PyTorch ≥ 2.3.0
- Python: 3.11 (compatible with 3.9–3.12)
- Hardware: CUDA GPU (CPU also supported)
- Seeds: Results averaged over 10 random seeds with 95% bootstrap confidence intervals
Evaluation Results
All results report the final average return at the end of training. Aggregate metrics (IQM, Median, Mean) are computed over the task-specific normalized score. Values in [brackets] denote 95% bootstrap confidence intervals.
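The aggregate statistics can be reproduced with a short NumPy sketch: a percentile bootstrap over per-task normalized scores. The helper names, `n_boot`, and the seed below are illustrative, not the exact evaluation code.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: mean of the middle 50% of sorted scores."""
    s = np.sort(np.asarray(scores).ravel())
    n = len(s)
    return s[n // 4 : n - n // 4].mean()

def bootstrap_ci(scores, stat=iqm, n_boot=2000, ci=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores).ravel()
    boots = [stat(rng.choice(scores, size=len(scores), replace=True))
             for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return lo, hi
```

For example, `iqm([0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5])` averages only the middle four values, discarding the top and bottom quartiles.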
Gym MuJoCo Tasks (1M environment steps)
Full comparison against domain-specific and general model-free / model-based RL algorithms. Aggregate metrics are computed over the TD3-normalized score.
| Task | TD7 | TDMPC2 | MR.Q | FoG | SimbaV2 | DR.Q |
|---|---|---|---|---|---|---|
| Ant-v4 | 8509 [8168, 8844] | 4751 [2988, 6145] | 6901 [6261, 7482] | 6761 [6161, 7360] | 7429 [7209, 7649] | 8138 [7764, 8511] |
| HalfCheetah-v4 | 17433 [17301, 17559] | 15078 [14065, 15932] | 12939 [11663, 13762] | 11709 [9928, 13491] | 12022 [11640, 12404] | 14775 [14638, 14912] |
| Hopper-v4 | 3511 [3236, 3736] | 2081 [1197, 2921] | 2692 [2131, 3309] | 1822 [1316, 2327] | 4054 [3929, 4179] | 2504 [1931, 3077] |
| Humanoid-v4 | 7428 [7304, 7553] | 6071 [5770, 6333] | 10223 [9929, 10498] | 6737 [6319, 7155] | 10546 [10195, 10897] | 11239 [11052, 11426] |
| Walker2d-v4 | 6096 [5621, 6547] | 3008 [1706, 4321] | 6039 [5644, 6386] | 5124 [4719, 5529] | 6938 [6691, 7185] | 6422 [5123, 7721] |
| IQM | 1.540 [1.500, 1.580] | 1.050 [0.890, 1.190] | 1.499 [1.361, 1.650] | 1.242 [1.117, 1.349] | 1.637 [1.470, 1.791] | 1.691 [1.473, 1.879] |
| Median | 1.550 [1.450, 1.630] | 1.180 [0.830, 1.220] | 1.488 [1.340, 1.623] | 1.261 [1.080, 1.344] | 1.616 [1.490, 1.744] | 1.564 [1.416, 1.806] |
| Mean | 1.570 [1.540, 1.600] | 1.040 [0.920, 1.150] | 1.465 [1.346, 1.585] | 1.196 [1.082, 1.307] | 1.617 [1.513, 1.718] | 1.608 [1.449, 1.759] |
DMC-Easy Tasks (500K steps / 1M env steps with action repeat 2)
Aggregate metrics reported in units of 1k.
| Task | MR.Q | Simba | SimbaV2 | FoG | DR.Q |
|---|---|---|---|---|---|
| acrobot-swingup | 567 [523, 616] | 431 [379, 482] | 436 [391, 482] | 414 [344, 485] | 569 [519, 619] |
| ball-in-cup-catch | 981 [979, 984] | 981 [978, 983] | 982 [980, 984] | 983 [981, 985] | 980 [979, 982] |
| cartpole-balance | 999 [999, 1000] | 998 [998, 999] | 999 [999, 999] | 997 [996, 999] | 999 [999, 1000] |
| cartpole-balance-sparse | 1000 [1000, 1000] | 991 [973, 1008] | 967 [904, 1030] | 1000 [1000, 1000] | 987 [963, 1012] |
| cartpole-swingup | 866 [866, 866] | 876 [871, 881] | 880 [876, 883] | 881 [880, 882] | 867 [866, 867] |
| cartpole-swingup-sparse | 798 [780, 818] | 825 [795, 854] | 848 [848, 849] | 840 [829, 850] | 805 [791, 818] |
| cheetah-run | 877 [849, 905] | 920 [918, 922] | 821 [642, 913] | 838 [732, 944] | 911 [905, 918] |
| finger-spin | 937 [917, 956] | 849 [758, 939] | 891 [810, 972] | 987 [986, 989] | 949 [917, 980] |
| finger-turn-easy | 953 [931, 974] | 935 [903, 968] | 953 [925, 980] | 949 [920, 977] | 956 [932, 980] |
| finger-turn-hard | 950 [910, 974] | 915 [859, 972] | 951 [925, 977] | 921 [863, 978] | 949 [923, 975] |
| fish-swim | 792 [773, 810] | 823 [799, 846] | 826 [806, 846] | 744 [701, 786] | 808 [788, 828] |
| hopper-hop | 251 [195, 301] | 385 [322, 449] | 290 [233, 348] | 335 [326, 345] | 384 [317, 451] |
| hopper-stand | 951 [948, 955] | 929 [900, 957] | 944 [926, 962] | 956 [953, 959] | 954 [949, 959] |
| pendulum-swingup | 748 [597, 829] | 737 [575, 899] | 827 [805, 849] | 838 [810, 866] | 835 [819, 852] |
| quadruped-run | 947 [940, 954] | 928 [916, 939] | 935 [928, 943] | 918 [906, 929] | 953 [949, 957] |
| quadruped-walk | 963 [959, 967] | 957 [951, 963] | 962 [955, 969] | 963 [960, 966] | 969 [964, 973] |
| reacher-easy | 983 [983, 985] | 983 [981, 986] | 983 [979, 986] | 980 [971, 990] | 975 [958, 993] |
| reacher-hard | 977 [975, 980] | 966 [947, 984] | 967 [946, 987] | 965 [944, 986] | 976 [973, 979] |
| walker-run | 793 [765, 815] | 796 [792, 801] | 817 [812, 821] | 851 [848, 853] | 809 [775, 844] |
| walker-stand | 988 [987, 990] | 985 [982, 989] | 987 [984, 990] | 987 [985, 989] | 991 [989, 992] |
| walker-walk | 978 [978, 980] | 975 [972, 978] | 976 [974, 978] | 978 [977, 980] | 979 [976, 982] |
| IQM | 0.936 [0.917, 0.952] | 0.922 [0.905, 0.938] | 0.933 [0.918, 0.948] | 0.935 [0.919, 0.951] | 0.937 [0.920, 0.951] |
| Median | 0.876 [0.847, 0.905] | 0.870 [0.841, 0.896] | 0.875 [0.847, 0.905] | 0.874 [0.845, 0.904] | 0.885 [0.863, 0.912] |
| Mean | 0.874 [0.848, 0.898] | 0.864 [0.840, 0.887] | 0.874 [0.849, 0.897] | 0.873 [0.847, 0.897] | 0.886 [0.865, 0.906] |
DMC-Hard Tasks (500K steps / 1M env steps with action repeat 2)
Aggregate metrics reported in units of 1k.
| Task | TDMPC2 | MR.Q | Simba | SimbaV2 | FoG | DR.Q |
|---|---|---|---|---|---|---|
| dog-run | 265 [166, 342] | 569 [547, 595] | 544 [525, 564] | 562 [516, 608] | 613 [577, 648] | 721 [684, 758] |
| dog-stand | 506 [266, 715] | 967 [960, 975] | 960 [951, 969] | 981 [977, 985] | 976 [969, 982] | 972 [963, 982] |
| dog-trot | 407 [265, 530] | 877 [845, 898] | 824 [773, 876] | 861 [772, 950] | 901 [892, 911] | 925 [914, 936] |
| dog-walk | 486 [240, 704] | 916 [908, 924] | 916 [905, 928] | 935 [927, 944] | 921 [909, 933] | 950 [942, 958] |
| humanoid-run | 181 [121, 231] | 200 [170, 236] | 181 [171, 191] | 194 [182, 207] | 292 [268, 317] | 465 [444, 485] |
| humanoid-stand | 658 [506, 745] | 868 [822, 903] | 846 [801, 890] | 916 [886, 945] | 931 [921, 941] | 938 [932, 944] |
| humanoid-walk | 754 [725, 791] | 662 [610, 724] | 668 [608, 728] | 651 [590, 713] | 878 [839, 917] | 925 [918, 932] |
| IQM | 0.464 [0.305, 0.632] | 0.796 [0.724, 0.860] | 0.773 [0.713, 0.830] | 0.808 [0.726, 0.879] | 0.880 [0.818, 0.914] | 0.917 [0.871, 0.936] |
| Median | 0.486 [0.265, 0.658] | 0.722 [0.654, 0.797] | 0.706 [0.647, 0.772] | 0.729 [0.655, 0.808] | 0.788 [0.724, 0.855] | 0.844 [0.796, 0.893] |
| Mean | 0.465 [0.329, 0.606] | 0.723 [0.660, 0.781] | 0.706 [0.656, 0.755] | 0.729 [0.664, 0.791] | 0.787 [0.730, 0.840] | 0.842 [0.800, 0.881] |
DMC Visual Tasks (500K steps / 1M env steps with action repeat 2)
Pixel-based observations at 84×84 resolution. Aggregate metrics computed over the success normalized score.
| Task | DrQ-v2 | PPO | TDMPC2 | DreamerV3 | MR.Q | DR.Q |
|---|---|---|---|---|---|---|
| acrobot-swingup | 168 [127, 219] | 2 [1, 4] | 197 [179, 217] | 121 [106, 145] | 287 [254, 316] | 324 [283, 365] |
| dog-run | 10 [9, 12] | 11 [9, 14] | 14 [10, 18] | 9 [6, 14] | 60 [44, 80] | 118 [104, 132] |
| dog-stand | 43 [37, 49] | 51 [48, 56] | 117 [72, 148] | 61 [30, 92] | 216 [201, 232] | 700 [660, 740] |
| dog-trot | 14 [11, 18] | 13 [12, 15] | 20 [14, 25] | 14 [13, 16] | 65 [55, 79] | 113 [98, 128] |
| dog-walk | 22 [18, 29] | 16 [14, 18] | 22 [17, 28] | 11 [11, 12] | 77 [71, 83] | 201 [146, 256] |
| hopper-hop | 224 [170, 278] | 0 [0, 0] | 187 [119, 238] | 205 [125, 287] | 270 [230, 315] | 330 [283, 377] |
| hopper-stand | 917 [903, 931] | 1 [0, 2] | 582 [321, 794] | 888 [875, 900] | 852 [703, 930] | 937 [930, 944] |
| humanoid-run | 1 [1, 1] | 1 [1, 1] | 0 [1, 1] | 1 [1, 1] | 1 [1, 2] | 1 [1, 1] |
| quadruped-run | 459 [412, 507] | 118 [98, 139] | 262 [184, 330] | 328 [255, 397] | 498 [476, 522] | 655 [573, 737] |
| quadruped-walk | 750 [699, 796] | 149 [113, 184] | 246 [179, 310] | 316 [260, 379] | 833 [797, 867] | 927 [914, 941] |
| reacher-hard | 705 [580, 831] | 10 [0, 30] | 911 [867, 946] | 338 [227, 461] | 965 [945, 977] | 954 [930, 979] |
| walker-run | 546 [475, 612] | 39 [35, 44] | 665 [566, 719] | 669 [615, 708] | 615 [571, 655] | 746 [713, 778] |
| IQM | 0.241 [0.214, 0.271] | 0.016 [0.013, 0.018] | 0.154 [0.113, 0.224] | 0.168 [0.152, 0.184] | 0.322 [0.239, 0.423] | 0.494 [0.395, 0.604] |
| Median | 0.191 [0.172, 0.211] | 0.013 [0.012, 0.013] | 0.295 [0.198, 0.339] | 0.134 [0.124, 0.198] | 0.398 [0.320, 0.466] | 0.500 [0.427, 0.576] |
| Mean | 0.321 [0.303, 0.340] | 0.034 [0.031, 0.037] | 0.269 [0.214, 0.326] | 0.247 [0.231, 0.262] | 0.395 [0.335, 0.457] | 0.501 [0.439, 0.564] |
HumanoidBench – Without Dexterous Hands (500K steps / 1M env steps with action repeat 2)
Aggregate metrics computed over the success normalized score.
| Task | Simba | SimbaV2 | MR.Q | FoG | DR.Q |
|---|---|---|---|---|---|
| h1-pole-v0 | 716 [667, 765] | 791 [785, 797] | 578 [534, 623] | 893 [846, 940] | 887 [853, 921] |
| h1-slide-v0 | 277 [252, 303] | 487 [404, 571] | 303 [270, 337] | 674 [562, 785] | 355 [324, 386] |
| h1-stair-v0 | 269 [153, 385] | 493 [467, 518] | 235 [213, 257] | 466 [383, 548] | 401 [328, 475] |
| h1-balance-hard-v0 | 75 [71, 80] | 143 [128, 157] | 69 [67, 72] | 81 [71, 91] | 92 [87, 97] |
| h1-balance-simple-v0 | 337 [193, 482] | 723 [651, 795] | 135 [110, 160] | 616 [536, 696] | 205 [166, 244] |
| h1-sit-hard-v0 | 512 [354, 670] | 679 [548, 811] | 553 [421, 686] | 770 [738, 802] | 843 [747, 939] |
| h1-sit-simple-v0 | 833 [814, 853] | 875 [870, 880] | 850 [819, 882] | 828 [800, 856] | 931 [924, 938] |
| h1-maze-v0 | 354 [342, 366] | 313 [287, 340] | 344 [340, 347] | 331 [310, 353] | 354 [349, 359] |
| h1-crawl-v0 | 923 [904, 942] | 946 [933, 959] | 932 [919, 945] | 971 [969, 973] | 973 [972, 974] |
| h1-hurdle-v0 | 175 [150, 201] | 202 [167, 236] | 131 [108, 155] | 114 [100, 129] | 344 [245, 443] |
| h1-reach-v0 | 3874 [3220, 4527] | 3850 [3272, 4427] | 4902 [4390, 5414] | 2434 [2083, 2785] | 8101 [7640, 8563] |
| h1-run-v0 | 232 [185, 279] | 415 [307, 524] | 278 [192, 364] | 749 [666, 832] | 820 [815, 824] |
| h1-stand-v0 | 772 [701, 843] | 814 [770, 857] | 800 [754, 846] | 671 [516, 825] | 856 [815, 897] |
| h1-walk-v0 | 550 [391, 709] | 845 [840, 850] | 716 [657, 775] | 866 [859, 872] | 850 [830, 869] |
| IQM | 0.521 [0.413, 0.633] | 0.799 [0.686, 0.908] | 0.519 [0.417, 0.630] | 0.846 [0.713, 0.969] | 0.864 [0.735, 0.976] |
| Median | 0.598 [0.514, 0.692] | 0.781 [0.693, 0.865] | 0.602 [0.516, 0.687] | 0.794 [0.705, 0.899] | 0.823 [0.733, 0.920] |
| Mean | 0.606 [0.536, 0.678] | 0.776 [0.705, 0.849] | 0.604 [0.531, 0.677] | 0.802 [0.721, 0.883] | 0.825 [0.748, 0.902] |
HumanoidBench – With Dexterous Hands (500K steps / 1M env steps with action repeat 2)
Aggregate metrics computed over the success normalized score.
| Task | DreamerV3 | TDMPC2 | SimBa | SimbaV2 | MR.Q | FoG | DR.Q |
|---|---|---|---|---|---|---|---|
| h1hand-door-v0 | 10 [7, 13] | 134 [23, 246] | 206 [169, 244] | 310 [302, 318] | 293 [280, 305] | 244 [227, 261] | 320 [308, 333] |
| h1hand-slide-v0 | 21 [19, 23] | 79 [68, 90] | 67 [55, 79] | 136 [97, 175] | 146 [131, 161] | 201 [173, 228] | 285 [258, 312] |
| h1hand-stair-v0 | 16 [8, 25] | 43 [35, 51] | 61 [44, 78] | 120 [89, 151] | 127 [104, 150] | 135 [126, 144] | 288 [193, 382] |
| h1hand-bookshelf-simple-v0 | 45 [41, 50] | 97 [59, 134] | 487 [315, 660] | 838 [834, 843] | 691 [599, 783] | 610 [523, 697] | 709 [572, 846] |
| h1hand-bookshelf-hard-v0 | 27 [24, 30] | 34 [19, 50] | 490 [447, 533] | 496 [417, 575] | 332 [240, 425] | 577 [548, 605] | 349 [262, 435] |
| h1hand-sit-simple-v0 | 48 [42, 54] | 607 [268, 947] | 643 [580, 705] | 927 [904, 951] | 653 [568, 737] | 631 [528, 735] | 942 [926, 958] |
| h1hand-sit-hard-v0 | 15 [11, 20] | 139 [86, 193] | 649 [500, 797] | 724 [609, 838] | 487 [353, 621] | 179 [128, 229] | 891 [841, 941] |
| h1hand-basketball-v0 | 13 [12, 13] | 47 [21, 73] | 54 [25, 83] | 56 [34, 78] | 53 [34, 72] | 182 [131, 232] | 75 [45, 105] |
| h1hand-pole-v0 | 48 [36, 60] | 99 [87, 111] | 224 [195, 254] | 493 [426, 559] | 237 [202, 273] | 257 [237, 277] | 424 [299, 549] |
| h1hand-crawl-v0 | 256 [244, 268] | 897 [858, 935] | 779 [748, 809] | 640 [549, 732] | 807 [783, 831] | 794 [721, 866] | 526 [477, 574] |
| h1hand-reach-v0 | 864 [578, 1150] | 3610 [2912, 4309] | 3185 [2664, 3707] | 3223 [2703, 3744] | 4101 [3540, 4662] | 2877 [2487, 3267] | 4950 [4280, 5619] |
| h1hand-run-v0 | 6 [4, 8] | 29 [27, 30] | 31 [24, 37] | 30 [22, 38] | 35 [29, 41] | 22 [19, 25] | 129 [77, 181] |
| h1hand-stand-v0 | 41 [38, 44] | 193 [147, 238] | 127 [72, 181] | 103 [81, 126] | 300 [194, 405] | 79 [66, 91] | 491 [344, 638] |
| h1hand-walk-v0 | 19 [12, 27] | 234 [125, 343] | 94 [79, 109] | 64 [52, 76] | 95 [77, 112] | 75 [63, 87] | 512 [371, 652] |
| IQM | 0.019 [0.013, 0.026] | 0.150 [0.091, 0.224] | 0.219 [0.179, 0.267] | 0.298 [0.241, 0.374] | 0.286 [0.245, 0.333] | 0.254 [0.222, 0.285] | 0.452 [0.400, 0.512] |
| Median | 0.021 [0.010, 0.030] | 0.298 [0.147, 0.433] | 0.356 [0.269, 0.413] | 0.420 [0.338, 0.491] | 0.388 [0.313, 0.449] | 0.342 [0.268, 0.395] | 0.529 [0.455, 0.607] |
| Mean | 0.020 [0.011, 0.028] | 0.282 [0.169, 0.413] | 0.345 [0.286, 0.406] | 0.417 [0.356, 0.482] | 0.385 [0.329, 0.443] | 0.336 [0.285, 0.393] | 0.534 [0.473, 0.595] |
Citation
```bibtex
@inproceedings{lyu2026debiased,
  title={Debiased Model-based Representations for Sample-efficient Continuous Control},
  author={Jiafei Lyu and Zichuan Lin and Scott Fujimoto and Kai Yang and Yangkun Chen and Saiyong Yang and Zongqing Lu and Deheng Ye},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=ZP1p8k106p}
}
```
Acknowledgements
DR.Q builds upon the MR.Q codebase by Facebook Research. We thank the authors of TD7, TDMPC2, MR.Q, FoG, SimBa, SimbaV2, DrQ-v2, DreamerV3, and PPO for their open-source implementations used as baselines.