DR.Q: Debiased Model-based Representations for Sample-efficient Continuous Control


Official pretrained model weights for DR.Q, presented at the Forty-third International Conference on Machine Learning (ICML 2026).

Authors: Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye

Model Description

DR.Q is a model-free reinforcement learning algorithm that achieves strong sample efficiency in continuous control by learning debiased model-based representations. The key insight is that naively training model-based representations can introduce representation bias that hurts downstream policy learning. DR.Q debiases two sources of bias:

  1. Representation bias, mitigated by adding an InfoNCE loss alongside the MSE loss
  2. Sampling bias, mitigated by introducing faded prioritized experience replay (Faded PER)
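As a concrete illustration of the first fix, the sketch below combines an MSE prediction loss with an in-batch InfoNCE term over latents. This is a minimal NumPy sketch under assumed conventions (cosine similarities, in-batch negatives on the diagonal, a weight `alpha`); the function names and loss weighting are illustrative, not the paper's exact implementation.

```python
import numpy as np

def info_nce(z_pred, z_target, temperature=0.1):
    """InfoNCE over a batch: each predicted latent should match its own
    target latent (diagonal positive) against the other targets (negatives)."""
    # L2-normalize so dot products are cosine similarities.
    z_pred = z_pred / np.linalg.norm(z_pred, axis=1, keepdims=True)
    z_target = z_target / np.linalg.norm(z_target, axis=1, keepdims=True)
    logits = z_pred @ z_target.T / temperature        # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # positives on the diagonal

def representation_loss(z_pred, z_target, alpha=1.0):
    """MSE dynamics-prediction loss plus an InfoNCE term weighted by alpha."""
    mse = np.mean((z_pred - z_target) ** 2)
    return mse + alpha * info_nce(z_pred, z_target)
```

The contrastive term keeps latents from collapsing to whatever minimizes the MSE alone, which is one plausible reading of how the representation bias is countered.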

DR.Q builds upon and substantially extends the MR.Q codebase (Facebook Research).
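The second fix can be illustrated in the same spirit. The sketch below is an assumed reading of "faded" PER: a standard proportional prioritized buffer whose priority exponent is annealed toward zero, so sampling fades from prioritized to uniform over training. The class name and the `alpha0` / `fade_steps` parameters are illustrative; consult the paper for the actual scheme.

```python
import numpy as np

class FadedPER:
    """Proportional prioritized replay whose priority exponent decays to zero,
    so sampling fades from prioritized toward uniform (illustrative sketch)."""

    def __init__(self, capacity, alpha0=0.6, fade_steps=100_000):
        self.capacity = capacity
        self.alpha0 = alpha0          # initial priority exponent
        self.fade_steps = fade_steps  # steps until sampling is fully uniform
        self.data, self.priorities = [], []
        self.step = 0

    def add(self, transition, priority=1.0):
        if len(self.data) >= self.capacity:   # drop the oldest transition
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size, rng):
        # Anneal ("fade") the exponent: alpha -> 0 gives uniform sampling.
        alpha = self.alpha0 * max(0.0, 1.0 - self.step / self.fade_steps)
        self.step += 1
        p = np.asarray(self.priorities) ** alpha
        p /= p.sum()
        idx = rng.choice(len(self.data), size=batch_size, p=p)
        return [self.data[i] for i in idx]
```

Early in training high-priority transitions dominate; once `fade_steps` is reached, every stored transition is equally likely, removing the sampling bias that fixed-exponent PER keeps forever.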

How to Get Started

Installation

git clone https://github.com/dmksjfl/DR.Q
cd DR.Q
pip install -r requirements.txt

Training

# Gym / MuJoCo (1M steps)
python main.py --env Gym-HalfCheetah-v4
python main.py --env Gym-Humanoid-v4

# DeepMind Control Suite - proprioceptive (500K steps)
python main.py --env Dmc-cheetah-run
python main.py --env Dmc-quadruped-walk

# DeepMind Control Suite - pixel observations
python main.py --env Dmc-visual-dog-run
python main.py --env Dmc-visual-walker-walk

# HumanoidBench (requires separate installation)
python main.py --env HBench-h1-run-v0

Loading Pretrained Weights

Pretrained model weights for all reported tasks are hosted on the Hugging Face Hub.
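A typical loading pattern, assuming the checkpoints are stored as `torch`-loadable files on the Hub. The repository id and file name below are placeholders, not confirmed names; check the model page for the actual ones.

```python
def load_drq_checkpoint(repo_id, filename, device="cpu"):
    """Download a checkpoint from the Hugging Face Hub and load it with torch.

    `repo_id` and `filename` are placeholders -- consult the model page for
    the real repository id and per-task file names.
    """
    from huggingface_hub import hf_hub_download  # imported lazily
    import torch

    path = hf_hub_download(repo_id=repo_id, filename=filename)
    return torch.load(path, map_location=device)

# Example with a hypothetical file name:
# state_dict = load_drq_checkpoint("dmksjfl/DR.Q", "Gym-HalfCheetah-v4.pt")
```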

Training Details

Evaluated Benchmark Suites

| Suite | Obs. Type | Steps | Tasks |
| --- | --- | --- | --- |
| Gym MuJoCo (Gymnasium) | Proprioceptive | 1M | 5 |
| DeepMind Control (DMC) - Easy | Proprioceptive | 500K | 21 |
| DeepMind Control (DMC) - Hard | Proprioceptive | 500K | 7 |
| DeepMind Control (DMC) - Visual | Pixel (84×84) | 500K | 12 |
| HumanoidBench (w/o hands) | Proprioceptive | 500K | 14 |
| HumanoidBench (w/ hands) | Proprioceptive | 500K | 14 |

Training Infrastructure

  • Framework: PyTorch ≥ 2.3.0
  • Python: 3.11 (compatible with 3.9–3.12)
  • Hardware: CUDA GPU (CPU also supported)
  • Seeds: Results averaged over 10 random seeds with 95% bootstrap confidence intervals
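The seed aggregation above can be sketched as a standard percentile bootstrap over per-seed final returns. This is a generic sketch, not the repository's evaluation code; `n_boot` and the percentile method are assumptions.

```python
import numpy as np

def bootstrap_ci(values, n_boot=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-seed
    returns: resample with replacement, recompute the mean, take percentiles."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = np.array([
        rng.choice(values, size=values.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return values.mean(), lo, hi
```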

Evaluation Results

All results report the final average return at the end of training. Aggregate metrics (IQM, Median, Mean) are computed over the task-specific normalized score. Values in [brackets] denote 95% bootstrap confidence intervals.
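For reference, the IQM aggregate used throughout is the mean of the middle 50% of normalized scores, which is more robust to outlier runs than the plain mean. A minimal sketch (generic, not the repository's evaluation code):

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: discard the bottom and top quartiles of the
    normalized scores, then average what remains."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    n = s.size
    trimmed = s[n // 4 : n - n // 4]  # middle 50% of the sorted scores
    return trimmed.mean()
```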

Gym MuJoCo Tasks (1M environment steps)

Full comparison against domain-specific and general model-free / model-based RL algorithms. Aggregate metrics are computed over the TD3-normalized score.

| Task | TD7 | TDMPC2 | MR.Q | FoG | SimbaV2 | DR.Q |
| --- | --- | --- | --- | --- | --- | --- |
| Ant-v4 | 8509 [8168, 8844] | 4751 [2988, 6145] | 6901 [6261, 7482] | 6761 [6161, 7360] | 7429 [7209, 7649] | 8138 [7764, 8511] |
| HalfCheetah-v4 | 17433 [17301, 17559] | 15078 [14065, 15932] | 12939 [11663, 13762] | 11709 [9928, 13491] | 12022 [11640, 12404] | 14775 [14638, 14912] |
| Hopper-v4 | 3511 [3236, 3736] | 2081 [1197, 2921] | 2692 [2131, 3309] | 1822 [1316, 2327] | 4054 [3929, 4179] | 2504 [1931, 3077] |
| Humanoid-v4 | 7428 [7304, 7553] | 6071 [5770, 6333] | 10223 [9929, 10498] | 6737 [6319, 7155] | 10546 [10195, 10897] | 11239 [11052, 11426] |
| Walker2d-v4 | 6096 [5621, 6547] | 3008 [1706, 4321] | 6039 [5644, 6386] | 5124 [4719, 5529] | 6938 [6691, 7185] | 6422 [5123, 7721] |
| IQM | 1.540 [1.500, 1.580] | 1.050 [0.890, 1.190] | 1.499 [1.361, 1.650] | 1.242 [1.117, 1.349] | 1.637 [1.470, 1.791] | 1.691 [1.473, 1.879] |
| Median | 1.550 [1.450, 1.630] | 1.180 [0.830, 1.220] | 1.488 [1.340, 1.623] | 1.261 [1.080, 1.344] | 1.616 [1.490, 1.744] | 1.564 [1.416, 1.806] |
| Mean | 1.570 [1.540, 1.600] | 1.040 [0.920, 1.150] | 1.465 [1.346, 1.585] | 1.196 [1.082, 1.307] | 1.617 [1.513, 1.718] | 1.608 [1.449, 1.759] |

DMC-Easy Tasks (500K steps / 1M env steps with action repeat 2)

Aggregate metrics are reported in units of 1k (returns divided by 1000).

| Task | MR.Q | Simba | SimbaV2 | FoG | DR.Q |
| --- | --- | --- | --- | --- | --- |
| acrobot-swingup | 567 [523, 616] | 431 [379, 482] | 436 [391, 482] | 414 [344, 485] | 569 [519, 619] |
| ball-in-cup-catch | 981 [979, 984] | 981 [978, 983] | 982 [980, 984] | 983 [981, 985] | 980 [979, 982] |
| cartpole-balance | 999 [999, 1000] | 998 [998, 999] | 999 [999, 999] | 997 [996, 999] | 999 [999, 1000] |
| cartpole-balance-sparse | 1000 [1000, 1000] | 991 [973, 1008] | 967 [904, 1030] | 1000 [1000, 1000] | 987 [963, 1012] |
| cartpole-swingup | 866 [866, 866] | 876 [871, 881] | 880 [876, 883] | 881 [880, 882] | 867 [866, 867] |
| cartpole-swingup-sparse | 798 [780, 818] | 825 [795, 854] | 848 [848, 849] | 840 [829, 850] | 805 [791, 818] |
| cheetah-run | 877 [849, 905] | 920 [918, 922] | 821 [642, 913] | 838 [732, 944] | 911 [905, 918] |
| finger-spin | 937 [917, 956] | 849 [758, 939] | 891 [810, 972] | 987 [986, 989] | 949 [917, 980] |
| finger-turn-easy | 953 [931, 974] | 935 [903, 968] | 953 [925, 980] | 949 [920, 977] | 956 [932, 980] |
| finger-turn-hard | 950 [910, 974] | 915 [859, 972] | 951 [925, 977] | 921 [863, 978] | 949 [923, 975] |
| fish-swim | 792 [773, 810] | 823 [799, 846] | 826 [806, 846] | 744 [701, 786] | 808 [788, 828] |
| hopper-hop | 251 [195, 301] | 385 [322, 449] | 290 [233, 348] | 335 [326, 345] | 384 [317, 451] |
| hopper-stand | 951 [948, 955] | 929 [900, 957] | 944 [926, 962] | 956 [953, 959] | 954 [949, 959] |
| pendulum-swingup | 748 [597, 829] | 737 [575, 899] | 827 [805, 849] | 838 [810, 866] | 835 [819, 852] |
| quadruped-run | 947 [940, 954] | 928 [916, 939] | 935 [928, 943] | 918 [906, 929] | 953 [949, 957] |
| quadruped-walk | 963 [959, 967] | 957 [951, 963] | 962 [955, 969] | 963 [960, 966] | 969 [964, 973] |
| reacher-easy | 983 [983, 985] | 983 [981, 986] | 983 [979, 986] | 980 [971, 990] | 975 [958, 993] |
| reacher-hard | 977 [975, 980] | 966 [947, 984] | 967 [946, 987] | 965 [944, 986] | 976 [973, 979] |
| walker-run | 793 [765, 815] | 796 [792, 801] | 817 [812, 821] | 851 [848, 853] | 809 [775, 844] |
| walker-stand | 988 [987, 990] | 985 [982, 989] | 987 [984, 990] | 987 [985, 989] | 991 [989, 992] |
| walker-walk | 978 [978, 980] | 975 [972, 978] | 976 [974, 978] | 978 [977, 980] | 979 [976, 982] |
| IQM | 0.936 [0.917, 0.952] | 0.922 [0.905, 0.938] | 0.933 [0.918, 0.948] | 0.935 [0.919, 0.951] | 0.937 [0.920, 0.951] |
| Median | 0.876 [0.847, 0.905] | 0.870 [0.841, 0.896] | 0.875 [0.847, 0.905] | 0.874 [0.845, 0.904] | 0.885 [0.863, 0.912] |
| Mean | 0.874 [0.848, 0.898] | 0.864 [0.840, 0.887] | 0.874 [0.849, 0.897] | 0.873 [0.847, 0.897] | 0.886 [0.865, 0.906] |

DMC-Hard Tasks (500K steps / 1M env steps with action repeat 2)

Aggregate metrics are reported in units of 1k (returns divided by 1000).

| Task | TDMPC2 | MR.Q | Simba | SimbaV2 | FoG | DR.Q |
| --- | --- | --- | --- | --- | --- | --- |
| dog-run | 265 [166, 342] | 569 [547, 595] | 544 [525, 564] | 562 [516, 608] | 613 [577, 648] | 721 [684, 758] |
| dog-stand | 506 [266, 715] | 967 [960, 975] | 960 [951, 969] | 981 [977, 985] | 976 [969, 982] | 972 [963, 982] |
| dog-trot | 407 [265, 530] | 877 [845, 898] | 824 [773, 876] | 861 [772, 950] | 901 [892, 911] | 925 [914, 936] |
| dog-walk | 486 [240, 704] | 916 [908, 924] | 916 [905, 928] | 935 [927, 944] | 921 [909, 933] | 950 [942, 958] |
| humanoid-run | 181 [121, 231] | 200 [170, 236] | 181 [171, 191] | 194 [182, 207] | 292 [268, 317] | 465 [444, 485] |
| humanoid-stand | 658 [506, 745] | 868 [822, 903] | 846 [801, 890] | 916 [886, 945] | 931 [921, 941] | 938 [932, 944] |
| humanoid-walk | 754 [725, 791] | 662 [610, 724] | 668 [608, 728] | 651 [590, 713] | 878 [839, 917] | 925 [918, 932] |
| IQM | 0.464 [0.305, 0.632] | 0.796 [0.724, 0.860] | 0.773 [0.713, 0.830] | 0.808 [0.726, 0.879] | 0.880 [0.818, 0.914] | 0.917 [0.871, 0.936] |
| Median | 0.486 [0.265, 0.658] | 0.722 [0.654, 0.797] | 0.706 [0.647, 0.772] | 0.729 [0.655, 0.808] | 0.788 [0.724, 0.855] | 0.844 [0.796, 0.893] |
| Mean | 0.465 [0.329, 0.606] | 0.723 [0.660, 0.781] | 0.706 [0.656, 0.755] | 0.729 [0.664, 0.791] | 0.787 [0.730, 0.840] | 0.842 [0.800, 0.881] |

DMC Visual Tasks (500K steps / 1M env steps with action repeat 2)

Pixel-based observations at 84×84 resolution. Aggregate metrics are computed over the success-normalized score.

| Task | DrQ-v2 | PPO | TDMPC2 | DreamerV3 | MR.Q | DR.Q |
| --- | --- | --- | --- | --- | --- | --- |
| acrobot-swingup | 168 [127, 219] | 2 [1, 4] | 197 [179, 217] | 121 [106, 145] | 287 [254, 316] | 324 [283, 365] |
| dog-run | 10 [9, 12] | 11 [9, 14] | 14 [10, 18] | 9 [6, 14] | 60 [44, 80] | 118 [104, 132] |
| dog-stand | 43 [37, 49] | 51 [48, 56] | 117 [72, 148] | 61 [30, 92] | 216 [201, 232] | 700 [660, 740] |
| dog-trot | 14 [11, 18] | 13 [12, 15] | 20 [14, 25] | 14 [13, 16] | 65 [55, 79] | 113 [98, 128] |
| dog-walk | 22 [18, 29] | 16 [14, 18] | 22 [17, 28] | 11 [11, 12] | 77 [71, 83] | 201 [146, 256] |
| hopper-hop | 224 [170, 278] | 0 [0, 0] | 187 [119, 238] | 205 [125, 287] | 270 [230, 315] | 330 [283, 377] |
| hopper-stand | 917 [903, 931] | 1 [0, 2] | 582 [321, 794] | 888 [875, 900] | 852 [703, 930] | 937 [930, 944] |
| humanoid-run | 1 [1, 1] | 1 [1, 1] | 0 [1, 1] | 1 [1, 1] | 1 [1, 2] | 1 [1, 1] |
| quadruped-run | 459 [412, 507] | 118 [98, 139] | 262 [184, 330] | 328 [255, 397] | 498 [476, 522] | 655 [573, 737] |
| quadruped-walk | 750 [699, 796] | 149 [113, 184] | 246 [179, 310] | 316 [260, 379] | 833 [797, 867] | 927 [914, 941] |
| reacher-hard | 705 [580, 831] | 10 [0, 30] | 911 [867, 946] | 338 [227, 461] | 965 [945, 977] | 954 [930, 979] |
| walker-run | 546 [475, 612] | 39 [35, 44] | 665 [566, 719] | 669 [615, 708] | 615 [571, 655] | 746 [713, 778] |
| IQM | 0.241 [0.214, 0.271] | 0.016 [0.013, 0.018] | 0.154 [0.113, 0.224] | 0.168 [0.152, 0.184] | 0.322 [0.239, 0.423] | 0.494 [0.395, 0.604] |
| Median | 0.191 [0.172, 0.211] | 0.013 [0.012, 0.013] | 0.295 [0.198, 0.339] | 0.134 [0.124, 0.198] | 0.398 [0.320, 0.466] | 0.500 [0.427, 0.576] |
| Mean | 0.321 [0.303, 0.340] | 0.034 [0.031, 0.037] | 0.269 [0.214, 0.326] | 0.247 [0.231, 0.262] | 0.395 [0.335, 0.457] | 0.501 [0.439, 0.564] |

HumanoidBench - Without Dexterous Hands (500K steps / 1M env steps with action repeat 2)

Aggregate metrics are computed over the success-normalized score.

| Task | Simba | SimbaV2 | MR.Q | FoG | DR.Q |
| --- | --- | --- | --- | --- | --- |
| h1-pole-v0 | 716 [667, 765] | 791 [785, 797] | 578 [534, 623] | 893 [846, 940] | 887 [853, 921] |
| h1-slide-v0 | 277 [252, 303] | 487 [404, 571] | 303 [270, 337] | 674 [562, 785] | 355 [324, 386] |
| h1-stair-v0 | 269 [153, 385] | 493 [467, 518] | 235 [213, 257] | 466 [383, 548] | 401 [328, 475] |
| h1-balance-hard-v0 | 75 [71, 80] | 143 [128, 157] | 69 [67, 72] | 81 [71, 91] | 92 [87, 97] |
| h1-balance-simple-v0 | 337 [193, 482] | 723 [651, 795] | 135 [110, 160] | 616 [536, 696] | 205 [166, 244] |
| h1-sit-hard-v0 | 512 [354, 670] | 679 [548, 811] | 553 [421, 686] | 770 [738, 802] | 843 [747, 939] |
| h1-sit-simple-v0 | 833 [814, 853] | 875 [870, 880] | 850 [819, 882] | 828 [800, 856] | 931 [924, 938] |
| h1-maze-v0 | 354 [342, 366] | 313 [287, 340] | 344 [340, 347] | 331 [310, 353] | 354 [349, 359] |
| h1-crawl-v0 | 923 [904, 942] | 946 [933, 959] | 932 [919, 945] | 971 [969, 973] | 973 [972, 974] |
| h1-hurdle-v0 | 175 [150, 201] | 202 [167, 236] | 131 [108, 155] | 114 [100, 129] | 344 [245, 443] |
| h1-reach-v0 | 3874 [3220, 4527] | 3850 [3272, 4427] | 4902 [4390, 5414] | 2434 [2083, 2785] | 8101 [7640, 8563] |
| h1-run-v0 | 232 [185, 279] | 415 [307, 524] | 278 [192, 364] | 749 [666, 832] | 820 [815, 824] |
| h1-stand-v0 | 772 [701, 843] | 814 [770, 857] | 800 [754, 846] | 671 [516, 825] | 856 [815, 897] |
| h1-walk-v0 | 550 [391, 709] | 845 [840, 850] | 716 [657, 775] | 866 [859, 872] | 850 [830, 869] |
| IQM | 0.521 [0.413, 0.633] | 0.799 [0.686, 0.908] | 0.519 [0.417, 0.630] | 0.846 [0.713, 0.969] | 0.864 [0.735, 0.976] |
| Median | 0.598 [0.514, 0.692] | 0.781 [0.693, 0.865] | 0.602 [0.516, 0.687] | 0.794 [0.705, 0.899] | 0.823 [0.733, 0.920] |
| Mean | 0.606 [0.536, 0.678] | 0.776 [0.705, 0.849] | 0.604 [0.531, 0.677] | 0.802 [0.721, 0.883] | 0.825 [0.748, 0.902] |

HumanoidBench - With Dexterous Hands (500K steps / 1M env steps with action repeat 2)

Aggregate metrics are computed over the success-normalized score.

| Task | DreamerV3 | TDMPC2 | SimBa | SimbaV2 | MR.Q | FoG | DR.Q |
| --- | --- | --- | --- | --- | --- | --- | --- |
| h1hand-door-v0 | 10 [7, 13] | 134 [23, 246] | 206 [169, 244] | 310 [302, 318] | 293 [280, 305] | 244 [227, 261] | 320 [308, 333] |
| h1hand-slide-v0 | 21 [19, 23] | 79 [68, 90] | 67 [55, 79] | 136 [97, 175] | 146 [131, 161] | 201 [173, 228] | 285 [258, 312] |
| h1hand-stair-v0 | 16 [8, 25] | 43 [35, 51] | 61 [44, 78] | 120 [89, 151] | 127 [104, 150] | 135 [126, 144] | 288 [193, 382] |
| h1hand-bookshelf-simple-v0 | 45 [41, 50] | 97 [59, 134] | 487 [315, 660] | 838 [834, 843] | 691 [599, 783] | 610 [523, 697] | 709 [572, 846] |
| h1hand-bookshelf-hard-v0 | 27 [24, 30] | 34 [19, 50] | 490 [447, 533] | 496 [417, 575] | 332 [240, 425] | 577 [548, 605] | 349 [262, 435] |
| h1hand-sit-simple-v0 | 48 [42, 54] | 607 [268, 947] | 643 [580, 705] | 927 [904, 951] | 653 [568, 737] | 631 [528, 735] | 942 [926, 958] |
| h1hand-sit-hard-v0 | 15 [11, 20] | 139 [86, 193] | 649 [500, 797] | 724 [609, 838] | 487 [353, 621] | 179 [128, 229] | 891 [841, 941] |
| h1hand-basketball-v0 | 13 [12, 13] | 47 [21, 73] | 54 [25, 83] | 56 [34, 78] | 53 [34, 72] | 182 [131, 232] | 75 [45, 105] |
| h1hand-pole-v0 | 48 [36, 60] | 99 [87, 111] | 224 [195, 254] | 493 [426, 559] | 237 [202, 273] | 257 [237, 277] | 424 [299, 549] |
| h1hand-crawl-v0 | 256 [244, 268] | 897 [858, 935] | 779 [748, 809] | 640 [549, 732] | 807 [783, 831] | 794 [721, 866] | 526 [477, 574] |
| h1hand-reach-v0 | 864 [578, 1150] | 3610 [2912, 4309] | 3185 [2664, 3707] | 3223 [2703, 3744] | 4101 [3540, 4662] | 2877 [2487, 3267] | 4950 [4280, 5619] |
| h1hand-run-v0 | 6 [4, 8] | 29 [27, 30] | 31 [24, 37] | 30 [22, 38] | 35 [29, 41] | 22 [19, 25] | 129 [77, 181] |
| h1hand-stand-v0 | 41 [38, 44] | 193 [147, 238] | 127 [72, 181] | 103 [81, 126] | 300 [194, 405] | 79 [66, 91] | 491 [344, 638] |
| h1hand-walk-v0 | 19 [12, 27] | 234 [125, 343] | 94 [79, 109] | 64 [52, 76] | 95 [77, 112] | 75 [63, 87] | 512 [371, 652] |
| IQM | 0.019 [0.013, 0.026] | 0.150 [0.091, 0.224] | 0.219 [0.179, 0.267] | 0.298 [0.241, 0.374] | 0.286 [0.245, 0.333] | 0.254 [0.222, 0.285] | 0.452 [0.400, 0.512] |
| Median | 0.021 [0.010, 0.030] | 0.298 [0.147, 0.433] | 0.356 [0.269, 0.413] | 0.420 [0.338, 0.491] | 0.388 [0.313, 0.449] | 0.342 [0.268, 0.395] | 0.529 [0.455, 0.607] |
| Mean | 0.020 [0.011, 0.028] | 0.282 [0.169, 0.413] | 0.345 [0.286, 0.406] | 0.417 [0.356, 0.482] | 0.385 [0.329, 0.443] | 0.336 [0.285, 0.393] | 0.534 [0.473, 0.595] |

Citation

@inproceedings{lyu2026debiased,
  title={Debiased Model-based Representations for Sample-efficient Continuous Control},
  author={Jiafei Lyu and Zichuan Lin and Scott Fujimoto and Kai Yang and Yangkun Chen and Saiyong Yang and Zongqing Lu and Deheng Ye},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=ZP1p8k106p}
}

Acknowledgements

DR.Q builds upon the MR.Q codebase by Facebook Research. We thank the authors of TD7, TDMPC2, MR.Q, FoG, SimBa, SimbaV2, DrQ-v2, DreamerV3, and PPO for their open-source implementations used as baselines.
