UD7
Provable Generalization of Clipped Double Q-Learning for Variance Reduction and Sample Efficiency
PyTorch Implementation
This repository contains a PyTorch implementation of UD7, the algorithm from the paper:
Provable Generalization of Clipped Double Q-Learning for Variance Reduction and Sample Efficiency
Jangwon Kim, Jiseok Jeong, Soohee Han
Neurocomputing, Volume 673, 7 April 2026, 132772
Paper Link
https://www.sciencedirect.com/science/article/abs/pii/S0925231226001694
UD7 is an off-policy actor-critic algorithm that builds on a TD7-style training pipeline, while replacing the critic target formulation with UBOC.
1) Background: Clipped Double Q-Learning (CDQ)
Clipped double Q-learning is a widely used bias correction in actor-critic methods (e.g., TD3). It maintains two critics $Q_{\theta_1}, Q_{\theta_2}$ and uses the minimum of the two in the TD target:

$$y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \tilde{a})$$
Strengths (why CDQ is popular)
- Effective overestimation control: taking a minimum is conservative, often preventing exploding Q-values.
- Robust baseline behavior: works well across many continuous-control tasks.
Limitations
- High variance: when critics are poorly learned early on, the min operator can yield high-variance TD targets, destabilizing TD learning and reducing sample efficiency.
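As a minimal sketch of the CDQ rule above (not the repo's API; `cdq_target` is an illustrative helper, and the batched tensor version in TD3/TD7 implementations works the same way elementwise):

```python
import numpy as np

def cdq_target(reward, not_done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q TD target: elementwise min of the two target critics."""
    return reward + gamma * not_done * np.minimum(q1_next, q2_next)
```

Taking the minimum is conservative by construction: the target can never exceed either critic's own estimate.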
UBOC is motivated by a concrete question:
Can we obtain the same expected target value as CDQ, but with smaller variance?
2) UBOC: Uncertainty-Based Overestimation Correction
UBOC views the critic outputs as a distribution of Q estimates (because function approximation is noisy).
Instead of using $\min(Q_1, Q_2)$, UBOC uses $N$ critics to estimate:
- a mean $m$,
- an (unbiased) standard deviation $s$,

and then forms a corrected value:

$$Q_{\text{UBOC}} = m - \beta s$$

where $\beta > 0$ controls conservativeness.
2.1 Expectation equivalence to clipped double-Q
Under the assumption that critic estimates behave like i.i.d. samples from a normal distribution $\mathcal{N}(\mu, \sigma^2)$, we can derive:

$$\mathbb{E}\left[\min(Q_1, Q_2)\right] = \mu - \frac{\sigma}{\sqrt{\pi}}$$

This is the key insight:
- choosing $\beta = \frac{1}{\sqrt{\pi}}$ makes the corrected estimate $m - \beta s$ match CDQ in expectation.
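The closed-form expectation of the minimum of two i.i.d. normals can be sanity-checked with a quick Monte Carlo simulation (a standalone check, not part of the training code; sample count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 1.0, 2_000_000

# Two i.i.d. critic estimates per "state"
q1 = rng.normal(mu, sigma, n)
q2 = rng.normal(mu, sigma, n)

emp_min = np.minimum(q1, q2).mean()   # empirical E[min(Q1, Q2)]
theory = mu - sigma / np.sqrt(np.pi)  # closed form: mu - sigma / sqrt(pi)
```

With two million samples, `emp_min` agrees with `theory` (about $-0.5642$ for the standard normal) to well within Monte Carlo error.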
2.2 Variance reduction
We can further prove that the estimator

$$m - \frac{s}{\sqrt{\pi}}$$

has strictly smaller variance than the CDQ minimum-based target, and the variance gap is strictly positive for all $N \ge 2$.

As $N \to \infty$, the maximum achievable variance reduction is upper-bounded by:

$$\left(1 - \frac{1}{\pi}\right)\sigma^2$$
This means that:
- UBOC does not just "bias-correct"; it also reduces noise in the TD targets.
- This is especially important early in training, where noisy targets can derail learning.
2.3 UBOC TD target
Using $N$ target critics $Q_{\theta'_1}, \dots, Q_{\theta'_N}$, compute:

Mean:

$$m = \frac{1}{N} \sum_{i=1}^{N} Q_{\theta'_i}(s', \tilde{a})$$

Unbiased variance (approximation):

$$s^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left( Q_{\theta'_i}(s', \tilde{a}) - m \right)^2$$

Then the UBOC target is:

$$y = r + \gamma \left( m - \frac{s}{\sqrt{\pi}} \right)$$

where the next action $\tilde{a}$ can be computed with target policy smoothing.
This gives a dynamic bias correction driven by critic uncertainty.
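Put together, the target computation is compact. A minimal numpy sketch (the function name and signature are illustrative, not the repo's API; a PyTorch version would swap `np` for `torch` operations on the stacked critic outputs):

```python
import numpy as np

def uboc_target(reward, not_done, q_next, gamma=0.99, beta=1.0 / np.sqrt(np.pi)):
    """UBOC TD target from an ensemble of N target-critic estimates.

    q_next: array of shape (batch, N) holding Q_{theta'_i}(s', a~) per critic.
    """
    m = q_next.mean(axis=1)         # ensemble mean
    s = q_next.std(axis=1, ddof=1)  # std from the unbiased variance estimate
    return reward + gamma * not_done * (m - beta * s)
```

When the critics agree ($s \approx 0$) the correction vanishes and the target is just the mean; when they disagree, the target is pushed down in proportion to the ensemble's uncertainty.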
3) UD7: TD7 + UBOC Targets
UD7 integrates UBOC into a TD7-style pipeline and emphasizes strong sample efficiency.
- UD7 keeps the TD7 backbone for practical stability and efficiency.
- The main difference from TD7 is the critic training target: UD7 uses UBOC targets and a multi-critic ensemble (commonly N=5).
If you already have a TD7 baseline, UD7 is best viewed as:
"swap the target rule + use N critics, then keep the rest of the training recipe."
4) Performance
5) Computational Overhead
Runtime figure (tested on RTX 3090 Ti + Intel i7-12700):
Citation
@article{kim2026provable,
title={Provable generalization of clipped double Q-learning for variance reduction and sample efficiency},
author={Kim, Jangwon and Jeong, Jiseok and Han, Soohee},
journal={Neurocomputing},
volume={673},
pages={132772},
year={2026},
publisher={Elsevier}
}