DR.Q: Debiased Model-based Representations for Sample-efficient Continuous Control


Official pretrained model weights for DR.Q, presented at the Forty-third International Conference on Machine Learning (ICML 2026).

Authors: Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye

Model Description

DR.Q is a model-free reinforcement learning algorithm that achieves strong sample efficiency in continuous control by learning debiased model-based representations. The key insight is that naively training model-based representations can introduce representation bias that hurts downstream policy learning. DR.Q debiases two sources of bias:

  1. Representation bias, mitigated by adding an InfoNCE loss alongside the MSE loss
  2. Sampling bias, mitigated by introducing faded prioritized experience replay (Faded PER)
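As a concrete illustration of the first fix, the sketch below combines an MSE prediction loss with an in-batch InfoNCE term over latents. This is a minimal NumPy sketch under assumed conventions (cosine similarities, in-batch negatives on the diagonal, a weight `alpha`); the function names and loss weighting are illustrative, not the paper's exact implementation.

```python
import numpy as np

def info_nce(z_pred, z_target, temperature=0.1):
    """InfoNCE over a batch: each predicted latent should match its own
    target latent (diagonal positive) against the other targets (negatives)."""
    # L2-normalize so dot products are cosine similarities.
    z_pred = z_pred / np.linalg.norm(z_pred, axis=1, keepdims=True)
    z_target = z_target / np.linalg.norm(z_target, axis=1, keepdims=True)
    logits = z_pred @ z_target.T / temperature        # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # positives on the diagonal

def representation_loss(z_pred, z_target, alpha=1.0):
    """MSE dynamics-prediction loss plus an InfoNCE term weighted by alpha."""
    mse = np.mean((z_pred - z_target) ** 2)
    return mse + alpha * info_nce(z_pred, z_target)
```

The contrastive term keeps latents from collapsing to whatever minimizes the MSE alone, which is one plausible reading of how the representation bias is countered.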

DR.Q builds upon and substantially extends the MR.Q codebase (Facebook Research).
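The second fix can be illustrated in the same spirit. The sketch below is an assumed reading of "faded" PER: a standard proportional prioritized buffer whose priority exponent is annealed toward zero, so sampling fades from prioritized to uniform over training. The class name and the `alpha0` / `fade_steps` parameters are illustrative; consult the paper for the actual scheme.

```python
import numpy as np

class FadedPER:
    """Proportional prioritized replay whose priority exponent decays to zero,
    so sampling fades from prioritized toward uniform (illustrative sketch)."""

    def __init__(self, capacity, alpha0=0.6, fade_steps=100_000):
        self.capacity = capacity
        self.alpha0 = alpha0          # initial priority exponent
        self.fade_steps = fade_steps  # steps until sampling is fully uniform
        self.data, self.priorities = [], []
        self.step = 0

    def add(self, transition, priority=1.0):
        if len(self.data) >= self.capacity:   # drop the oldest transition
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size, rng):
        # Anneal ("fade") the exponent: alpha -> 0 gives uniform sampling.
        alpha = self.alpha0 * max(0.0, 1.0 - self.step / self.fade_steps)
        self.step += 1
        p = np.asarray(self.priorities) ** alpha
        p /= p.sum()
        idx = rng.choice(len(self.data), size=batch_size, p=p)
        return [self.data[i] for i in idx]
```

Early in training high-priority transitions dominate; once `fade_steps` is reached, every stored transition is equally likely, removing the sampling bias that fixed-exponent PER keeps forever.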

How to Get Started

Installation

git clone https://github.com/dmksjfl/DR.Q
cd DR.Q
pip install -r requirements.txt

Training

# Gym / MuJoCo (1M steps)
python main.py --env Gym-HalfCheetah-v4
python main.py --env Gym-Humanoid-v4

# DeepMind Control Suite - proprioceptive (500K steps)
python main.py --env Dmc-cheetah-run
python main.py --env Dmc-quadruped-walk

# DeepMind Control Suite - pixel observations
python main.py --env Dmc-visual-dog-run
python main.py --env Dmc-visual-walker-walk

# HumanoidBench (requires separate installation)
python main.py --env HBench-h1-run-v0

Loading Pretrained Weights

Pretrained model weights for all reported tasks are hosted on the Hugging Face Hub.
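A typical loading pattern, assuming the checkpoints are stored as `torch`-loadable files on the Hub. The repository id and file name below are placeholders, not confirmed names; check the model page for the actual ones.

```python
def load_drq_checkpoint(repo_id, filename, device="cpu"):
    """Download a checkpoint from the Hugging Face Hub and load it with torch.

    `repo_id` and `filename` are placeholders -- consult the model page for
    the real repository id and per-task file names.
    """
    from huggingface_hub import hf_hub_download  # imported lazily
    import torch

    path = hf_hub_download(repo_id=repo_id, filename=filename)
    return torch.load(path, map_location=device)

# Example with a hypothetical file name:
# state_dict = load_drq_checkpoint("dmksjfl/DR.Q", "Gym-HalfCheetah-v4.pt")
```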

Training Details

Evaluated Benchmark Suites

| Suite | Obs. Type | Steps | Tasks |
| --- | --- | --- | --- |
| Gym MuJoCo (Gymnasium) | Proprioceptive | 1M | 5 |
| DeepMind Control (DMC) - Easy | Proprioceptive | 500K | 21 |
| DeepMind Control (DMC) - Hard | Proprioceptive | 500K | 7 |
| DeepMind Control (DMC) - Visual | Pixel (84×84) | 500K | 12 |
| HumanoidBench (w/o hands) | Proprioceptive | 500K | 14 |
| HumanoidBench (w/ hands) | Proprioceptive | 500K | 14 |

Training Infrastructure

  • Framework: PyTorch ≥ 2.3.0
  • Python: 3.11 (compatible with 3.9–3.12)
  • Hardware: CUDA GPU (CPU also supported)
  • Seeds: Results averaged over 10 random seeds with 95% bootstrap confidence intervals
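The seed aggregation above can be sketched as a standard percentile bootstrap over per-seed final returns. This is a generic sketch, not the repository's evaluation code; `n_boot` and the percentile method are assumptions.

```python
import numpy as np

def bootstrap_ci(values, n_boot=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-seed
    returns: resample with replacement, recompute the mean, take percentiles."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = np.array([
        rng.choice(values, size=values.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return values.mean(), lo, hi
```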

Evaluation Results

All results report the final average return at the end of training. Aggregate metrics (IQM, Median, Mean) are computed over the task-specific normalized score. Values in [brackets] denote 95% bootstrap confidence intervals.
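For reference, the IQM aggregate used throughout is the mean of the middle 50% of normalized scores, which is more robust to outlier runs than the plain mean. A minimal sketch (generic, not the repository's evaluation code):

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: discard the bottom and top quartiles of the
    normalized scores, then average what remains."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    n = s.size
    trimmed = s[n // 4 : n - n // 4]  # middle 50% of the sorted scores
    return trimmed.mean()
```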

Gym MuJoCo Tasks (1M environment steps)

Full comparison against domain-specific and general model-free / model-based RL algorithms. Aggregate metrics are computed over the TD3-normalized score.

| Task | TD7 | TDMPC2 | MR.Q | FoG | SimbaV2 | DR.Q |
| --- | --- | --- | --- | --- | --- | --- |
| Ant-v4 | 8509 [8168, 8844] | 4751 [2988, 6145] | 6901 [6261, 7482] | 6761 [6161, 7360] | 7429 [7209, 7649] | 8138 [7764, 8511] |
| HalfCheetah-v4 | 17433 [17301, 17559] | 15078 [14065, 15932] | 12939 [11663, 13762] | 11709 [9928, 13491] | 12022 [11640, 12404] | 14775 [14638, 14912] |
| Hopper-v4 | 3511 [3236, 3736] | 2081 [1197, 2921] | 2692 [2131, 3309] | 1822 [1316, 2327] | 4054 [3929, 4179] | 2504 [1931, 3077] |
| Humanoid-v4 | 7428 [7304, 7553] | 6071 [5770, 6333] | 10223 [9929, 10498] | 6737 [6319, 7155] | 10546 [10195, 10897] | 11239 [11052, 11426] |
| Walker2d-v4 | 6096 [5621, 6547] | 3008 [1706, 4321] | 6039 [5644, 6386] | 5124 [4719, 5529] | 6938 [6691, 7185] | 6422 [5123, 7721] |
| IQM | 1.540 [1.500, 1.580] | 1.050 [0.890, 1.190] | 1.499 [1.361, 1.650] | 1.242 [1.117, 1.349] | 1.637 [1.470, 1.791] | 1.691 [1.473, 1.879] |
| Median | 1.550 [1.450, 1.630] | 1.180 [0.830, 1.220] | 1.488 [1.340, 1.623] | 1.261 [1.080, 1.344] | 1.616 [1.490, 1.744] | 1.564 [1.416, 1.806] |
| Mean | 1.570 [1.540, 1.600] | 1.040 [0.920, 1.150] | 1.465 [1.346, 1.585] | 1.196 [1.082, 1.307] | 1.617 [1.513, 1.718] | 1.608 [1.449, 1.759] |

DMC-Easy Tasks (500K steps / 1M env steps with action repeat 2)

Aggregate metrics are reported in units of 1k (returns divided by 1000).

| Task | MR.Q | Simba | SimbaV2 | FoG | DR.Q |
| --- | --- | --- | --- | --- | --- |
| acrobot-swingup | 567 [523, 616] | 431 [379, 482] | 436 [391, 482] | 414 [344, 485] | 569 [519, 619] |
| ball-in-cup-catch | 981 [979, 984] | 981 [978, 983] | 982 [980, 984] | 983 [981, 985] | 980 [979, 982] |
| cartpole-balance | 999 [999, 1000] | 998 [998, 999] | 999 [999, 999] | 997 [996, 999] | 999 [999, 1000] |
| cartpole-balance-sparse | 1000 [1000, 1000] | 991 [973, 1008] | 967 [904, 1030] | 1000 [1000, 1000] | 987 [963, 1012] |
| cartpole-swingup | 866 [866, 866] | 876 [871, 881] | 880 [876, 883] | 881 [880, 882] | 867 [866, 867] |
| cartpole-swingup-sparse | 798 [780, 818] | 825 [795, 854] | 848 [848, 849] | 840 [829, 850] | 805 [791, 818] |
| cheetah-run | 877 [849, 905] | 920 [918, 922] | 821 [642, 913] | 838 [732, 944] | 911 [905, 918] |
| finger-spin | 937 [917, 956] | 849 [758, 939] | 891 [810, 972] | 987 [986, 989] | 949 [917, 980] |
| finger-turn-easy | 953 [931, 974] | 935 [903, 968] | 953 [925, 980] | 949 [920, 977] | 956 [932, 980] |
| finger-turn-hard | 950 [910, 974] | 915 [859, 972] | 951 [925, 977] | 921 [863, 978] | 949 [923, 975] |
| fish-swim | 792 [773, 810] | 823 [799, 846] | 826 [806, 846] | 744 [701, 786] | 808 [788, 828] |
| hopper-hop | 251 [195, 301] | 385 [322, 449] | 290 [233, 348] | 335 [326, 345] | 384 [317, 451] |
| hopper-stand | 951 [948, 955] | 929 [900, 957] | 944 [926, 962] | 956 [953, 959] | 954 [949, 959] |
| pendulum-swingup | 748 [597, 829] | 737 [575, 899] | 827 [805, 849] | 838 [810, 866] | 835 [819, 852] |
| quadruped-run | 947 [940, 954] | 928 [916, 939] | 935 [928, 943] | 918 [906, 929] | 953 [949, 957] |
| quadruped-walk | 963 [959, 967] | 957 [951, 963] | 962 [955, 969] | 963 [960, 966] | 969 [964, 973] |
| reacher-easy | 983 [983, 985] | 983 [981, 986] | 983 [979, 986] | 980 [971, 990] | 975 [958, 993] |
| reacher-hard | 977 [975, 980] | 966 [947, 984] | 967 [946, 987] | 965 [944, 986] | 976 [973, 979] |
| walker-run | 793 [765, 815] | 796 [792, 801] | 817 [812, 821] | 851 [848, 853] | 809 [775, 844] |
| walker-stand | 988 [987, 990] | 985 [982, 989] | 987 [984, 990] | 987 [985, 989] | 991 [989, 992] |
| walker-walk | 978 [978, 980] | 975 [972, 978] | 976 [974, 978] | 978 [977, 980] | 979 [976, 982] |
| IQM | 0.936 [0.917, 0.952] | 0.922 [0.905, 0.938] | 0.933 [0.918, 0.948] | 0.935 [0.919, 0.951] | 0.937 [0.920, 0.951] |
| Median | 0.876 [0.847, 0.905] | 0.870 [0.841, 0.896] | 0.875 [0.847, 0.905] | 0.874 [0.845, 0.904] | 0.885 [0.863, 0.912] |
| Mean | 0.874 [0.848, 0.898] | 0.864 [0.840, 0.887] | 0.874 [0.849, 0.897] | 0.873 [0.847, 0.897] | 0.886 [0.865, 0.906] |

DMC-Hard Tasks (500K steps / 1M env steps with action repeat 2)

Aggregate metrics are reported in units of 1k (returns divided by 1000).

| Task | TDMPC2 | MR.Q | Simba | SimbaV2 | FoG | DR.Q |
| --- | --- | --- | --- | --- | --- | --- |
| dog-run | 265 [166, 342] | 569 [547, 595] | 544 [525, 564] | 562 [516, 608] | 613 [577, 648] | 721 [684, 758] |
| dog-stand | 506 [266, 715] | 967 [960, 975] | 960 [951, 969] | 981 [977, 985] | 976 [969, 982] | 972 [963, 982] |
| dog-trot | 407 [265, 530] | 877 [845, 898] | 824 [773, 876] | 861 [772, 950] | 901 [892, 911] | 925 [914, 936] |
| dog-walk | 486 [240, 704] | 916 [908, 924] | 916 [905, 928] | 935 [927, 944] | 921 [909, 933] | 950 [942, 958] |
| humanoid-run | 181 [121, 231] | 200 [170, 236] | 181 [171, 191] | 194 [182, 207] | 292 [268, 317] | 465 [444, 485] |
| humanoid-stand | 658 [506, 745] | 868 [822, 903] | 846 [801, 890] | 916 [886, 945] | 931 [921, 941] | 938 [932, 944] |
| humanoid-walk | 754 [725, 791] | 662 [610, 724] | 668 [608, 728] | 651 [590, 713] | 878 [839, 917] | 925 [918, 932] |
| IQM | 0.464 [0.305, 0.632] | 0.796 [0.724, 0.860] | 0.773 [0.713, 0.830] | 0.808 [0.726, 0.879] | 0.880 [0.818, 0.914] | 0.917 [0.871, 0.936] |
| Median | 0.486 [0.265, 0.658] | 0.722 [0.654, 0.797] | 0.706 [0.647, 0.772] | 0.729 [0.655, 0.808] | 0.788 [0.724, 0.855] | 0.844 [0.796, 0.893] |
| Mean | 0.465 [0.329, 0.606] | 0.723 [0.660, 0.781] | 0.706 [0.656, 0.755] | 0.729 [0.664, 0.791] | 0.787 [0.730, 0.840] | 0.842 [0.800, 0.881] |

DMC Visual Tasks (500K steps / 1M env steps with action repeat 2)

Pixel-based observations at 84×84 resolution. Aggregate metrics are computed over the success-normalized score.

| Task | DrQ-v2 | PPO | TDMPC2 | DreamerV3 | MR.Q | DR.Q |
| --- | --- | --- | --- | --- | --- | --- |
| acrobot-swingup | 168 [127, 219] | 2 [1, 4] | 197 [179, 217] | 121 [106, 145] | 287 [254, 316] | 324 [283, 365] |
| dog-run | 10 [9, 12] | 11 [9, 14] | 14 [10, 18] | 9 [6, 14] | 60 [44, 80] | 118 [104, 132] |
| dog-stand | 43 [37, 49] | 51 [48, 56] | 117 [72, 148] | 61 [30, 92] | 216 [201, 232] | 700 [660, 740] |
| dog-trot | 14 [11, 18] | 13 [12, 15] | 20 [14, 25] | 14 [13, 16] | 65 [55, 79] | 113 [98, 128] |
| dog-walk | 22 [18, 29] | 16 [14, 18] | 22 [17, 28] | 11 [11, 12] | 77 [71, 83] | 201 [146, 256] |
| hopper-hop | 224 [170, 278] | 0 [0, 0] | 187 [119, 238] | 205 [125, 287] | 270 [230, 315] | 330 [283, 377] |
| hopper-stand | 917 [903, 931] | 1 [0, 2] | 582 [321, 794] | 888 [875, 900] | 852 [703, 930] | 937 [930, 944] |
| humanoid-run | 1 [1, 1] | 1 [1, 1] | 0 [1, 1] | 1 [1, 1] | 1 [1, 2] | 1 [1, 1] |
| quadruped-run | 459 [412, 507] | 118 [98, 139] | 262 [184, 330] | 328 [255, 397] | 498 [476, 522] | 655 [573, 737] |
| quadruped-walk | 750 [699, 796] | 149 [113, 184] | 246 [179, 310] | 316 [260, 379] | 833 [797, 867] | 927 [914, 941] |
| reacher-hard | 705 [580, 831] | 10 [0, 30] | 911 [867, 946] | 338 [227, 461] | 965 [945, 977] | 954 [930, 979] |
| walker-run | 546 [475, 612] | 39 [35, 44] | 665 [566, 719] | 669 [615, 708] | 615 [571, 655] | 746 [713, 778] |
| IQM | 0.241 [0.214, 0.271] | 0.016 [0.013, 0.018] | 0.154 [0.113, 0.224] | 0.168 [0.152, 0.184] | 0.322 [0.239, 0.423] | 0.494 [0.395, 0.604] |
| Median | 0.191 [0.172, 0.211] | 0.013 [0.012, 0.013] | 0.295 [0.198, 0.339] | 0.134 [0.124, 0.198] | 0.398 [0.320, 0.466] | 0.500 [0.427, 0.576] |
| Mean | 0.321 [0.303, 0.340] | 0.034 [0.031, 0.037] | 0.269 [0.214, 0.326] | 0.247 [0.231, 0.262] | 0.395 [0.335, 0.457] | 0.501 [0.439, 0.564] |

HumanoidBench - Without Dexterous Hands (500K steps / 1M env steps with action repeat 2)

Aggregate metrics are computed over the success-normalized score.

| Task | Simba | SimbaV2 | MR.Q | FoG | DR.Q |
| --- | --- | --- | --- | --- | --- |
| h1-pole-v0 | 716 [667, 765] | 791 [785, 797] | 578 [534, 623] | 893 [846, 940] | 887 [853, 921] |
| h1-slide-v0 | 277 [252, 303] | 487 [404, 571] | 303 [270, 337] | 674 [562, 785] | 355 [324, 386] |
| h1-stair-v0 | 269 [153, 385] | 493 [467, 518] | 235 [213, 257] | 466 [383, 548] | 401 [328, 475] |
| h1-balance-hard-v0 | 75 [71, 80] | 143 [128, 157] | 69 [67, 72] | 81 [71, 91] | 92 [87, 97] |
| h1-balance-simple-v0 | 337 [193, 482] | 723 [651, 795] | 135 [110, 160] | 616 [536, 696] | 205 [166, 244] |
| h1-sit-hard-v0 | 512 [354, 670] | 679 [548, 811] | 553 [421, 686] | 770 [738, 802] | 843 [747, 939] |
| h1-sit-simple-v0 | 833 [814, 853] | 875 [870, 880] | 850 [819, 882] | 828 [800, 856] | 931 [924, 938] |
| h1-maze-v0 | 354 [342, 366] | 313 [287, 340] | 344 [340, 347] | 331 [310, 353] | 354 [349, 359] |
| h1-crawl-v0 | 923 [904, 942] | 946 [933, 959] | 932 [919, 945] | 971 [969, 973] | 973 [972, 974] |
| h1-hurdle-v0 | 175 [150, 201] | 202 [167, 236] | 131 [108, 155] | 114 [100, 129] | 344 [245, 443] |
| h1-reach-v0 | 3874 [3220, 4527] | 3850 [3272, 4427] | 4902 [4390, 5414] | 2434 [2083, 2785] | 8101 [7640, 8563] |
| h1-run-v0 | 232 [185, 279] | 415 [307, 524] | 278 [192, 364] | 749 [666, 832] | 820 [815, 824] |
| h1-stand-v0 | 772 [701, 843] | 814 [770, 857] | 800 [754, 846] | 671 [516, 825] | 856 [815, 897] |
| h1-walk-v0 | 550 [391, 709] | 845 [840, 850] | 716 [657, 775] | 866 [859, 872] | 850 [830, 869] |
| IQM | 0.521 [0.413, 0.633] | 0.799 [0.686, 0.908] | 0.519 [0.417, 0.630] | 0.846 [0.713, 0.969] | 0.864 [0.735, 0.976] |
| Median | 0.598 [0.514, 0.692] | 0.781 [0.693, 0.865] | 0.602 [0.516, 0.687] | 0.794 [0.705, 0.899] | 0.823 [0.733, 0.920] |
| Mean | 0.606 [0.536, 0.678] | 0.776 [0.705, 0.849] | 0.604 [0.531, 0.677] | 0.802 [0.721, 0.883] | 0.825 [0.748, 0.902] |

HumanoidBench - With Dexterous Hands (500K steps / 1M env steps with action repeat 2)

Aggregate metrics are computed over the success-normalized score.

| Task | DreamerV3 | TDMPC2 | SimBa | SimbaV2 | MR.Q | FoG | DR.Q |
| --- | --- | --- | --- | --- | --- | --- | --- |
| h1hand-door-v0 | 10 [7, 13] | 134 [23, 246] | 206 [169, 244] | 310 [302, 318] | 293 [280, 305] | 244 [227, 261] | 320 [308, 333] |
| h1hand-slide-v0 | 21 [19, 23] | 79 [68, 90] | 67 [55, 79] | 136 [97, 175] | 146 [131, 161] | 201 [173, 228] | 285 [258, 312] |
| h1hand-stair-v0 | 16 [8, 25] | 43 [35, 51] | 61 [44, 78] | 120 [89, 151] | 127 [104, 150] | 135 [126, 144] | 288 [193, 382] |
| h1hand-bookshelf-simple-v0 | 45 [41, 50] | 97 [59, 134] | 487 [315, 660] | 838 [834, 843] | 691 [599, 783] | 610 [523, 697] | 709 [572, 846] |
| h1hand-bookshelf-hard-v0 | 27 [24, 30] | 34 [19, 50] | 490 [447, 533] | 496 [417, 575] | 332 [240, 425] | 577 [548, 605] | 349 [262, 435] |
| h1hand-sit-simple-v0 | 48 [42, 54] | 607 [268, 947] | 643 [580, 705] | 927 [904, 951] | 653 [568, 737] | 631 [528, 735] | 942 [926, 958] |
| h1hand-sit-hard-v0 | 15 [11, 20] | 139 [86, 193] | 649 [500, 797] | 724 [609, 838] | 487 [353, 621] | 179 [128, 229] | 891 [841, 941] |
| h1hand-basketball-v0 | 13 [12, 13] | 47 [21, 73] | 54 [25, 83] | 56 [34, 78] | 53 [34, 72] | 182 [131, 232] | 75 [45, 105] |
| h1hand-pole-v0 | 48 [36, 60] | 99 [87, 111] | 224 [195, 254] | 493 [426, 559] | 237 [202, 273] | 257 [237, 277] | 424 [299, 549] |
| h1hand-crawl-v0 | 256 [244, 268] | 897 [858, 935] | 779 [748, 809] | 640 [549, 732] | 807 [783, 831] | 794 [721, 866] | 526 [477, 574] |
| h1hand-reach-v0 | 864 [578, 1150] | 3610 [2912, 4309] | 3185 [2664, 3707] | 3223 [2703, 3744] | 4101 [3540, 4662] | 2877 [2487, 3267] | 4950 [4280, 5619] |
| h1hand-run-v0 | 6 [4, 8] | 29 [27, 30] | 31 [24, 37] | 30 [22, 38] | 35 [29, 41] | 22 [19, 25] | 129 [77, 181] |
| h1hand-stand-v0 | 41 [38, 44] | 193 [147, 238] | 127 [72, 181] | 103 [81, 126] | 300 [194, 405] | 79 [66, 91] | 491 [344, 638] |
| h1hand-walk-v0 | 19 [12, 27] | 234 [125, 343] | 94 [79, 109] | 64 [52, 76] | 95 [77, 112] | 75 [63, 87] | 512 [371, 652] |
| IQM | 0.019 [0.013, 0.026] | 0.150 [0.091, 0.224] | 0.219 [0.179, 0.267] | 0.298 [0.241, 0.374] | 0.286 [0.245, 0.333] | 0.254 [0.222, 0.285] | 0.452 [0.400, 0.512] |
| Median | 0.021 [0.010, 0.030] | 0.298 [0.147, 0.433] | 0.356 [0.269, 0.413] | 0.420 [0.338, 0.491] | 0.388 [0.313, 0.449] | 0.342 [0.268, 0.395] | 0.529 [0.455, 0.607] |
| Mean | 0.020 [0.011, 0.028] | 0.282 [0.169, 0.413] | 0.345 [0.286, 0.406] | 0.417 [0.356, 0.482] | 0.385 [0.329, 0.443] | 0.336 [0.285, 0.393] | 0.534 [0.473, 0.595] |

Citation

@inproceedings{lyu2026debiased,
  title={Debiased Model-based Representations for Sample-efficient Continuous Control},
  author={Jiafei Lyu and Zichuan Lin and Scott Fujimoto and Kai Yang and Yangkun Chen and Saiyong Yang and Zongqing Lu and Deheng Ye},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=ZP1p8k106p}
}

Acknowledgements

DR.Q builds upon the MR.Q codebase by Facebook Research. We thank the authors of TD7, TDMPC2, MR.Q, FoG, SimBa, SimbaV2, DrQ-v2, DreamerV3, and PPO for their open-source implementations used as baselines.
