	---
	license: mit
	language:
	- en
	pipeline_tag: reinforcement-learning
	tags:
	- ud7
	- uboc
	- rl
	- reinforcement-learning
	- pytorch
	- off-policy
	---

	<div align="center">
	<h1>UD7</h1>
	<h3>Provable Generalization of Clipped Double Q-Learning for Variance Reduction and Sample Efficiency</h3>

	<a href="https://www.python.org/">
	<img src="https://img.shields.io/badge/Python-3.7+-blue?logo=python&style=flat-square" alt="Python Badge"/>
	</a>

	<a href="https://pytorch.org/">
	<img src="https://img.shields.io/badge/PyTorch-1.8+-EE4C2C?logo=pytorch&style=flat-square" alt="PyTorch Badge"/>
	</a>

	<a href="https://www.sciencedirect.com/journal/neurocomputing">
	<img src="https://img.shields.io/badge/Neurocomputing-Published-success?style=flat-square" alt="Neurocomputing Badge"/>
	</a>

	<a href="https://www.elsevier.com/">
	<img src="https://img.shields.io/badge/Elsevier-Journal-orange?style=flat-square" alt="Elsevier Badge"/>
	</a>
	</div>

	---

	## PyTorch Implementation

	This repository contains a PyTorch implementation of UD7 from the paper:

	> Provable Generalization of Clipped Double Q-Learning for Variance Reduction and Sample Efficiency
	> Jangwon Kim, Jiseok Jeong, Soohee Han
	> Neurocomputing, Volume 673, 7 April 2026, 132772

	### Paper Link
	https://www.sciencedirect.com/science/article/abs/pii/S0925231226001694

	---

	UD7 is an off-policy actor-critic algorithm that builds on a TD7-style training pipeline but replaces the critic target with UBOC (Uncertainty-Based Overestimation Correction).

	---

	## 1) Background: Clipped Double Q-Learning (CDQ)

	Clipped double Q-learning is a widely used bias-correction technique in actor-critic methods (e.g., TD3). It maintains two critics and uses the minimum of their target values in the TD target:

	$$
	y_{\text{CDQ}}(s_t,a_t)=r_t+\gamma \min_{i\in\{1,2\}} \bar Q_i(s_{t+1}, a_{t+1})
	$$
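
As a minimal sketch of the target above, the function below evaluates $y_{\text{CDQ}}$ for a single transition in plain Python. The function name and the scalar inputs are illustrative and are not taken from the UD7 code; terminal-state handling is left out for brevity.

```python
def cdq_target(reward, gamma, q1_next, q2_next):
    """Clipped double-Q target: r + gamma * min(Q1, Q2) at the next state-action pair."""
    return reward + gamma * min(q1_next, q2_next)


# Example: the two target critics disagree, so the smaller estimate is used.
y = cdq_target(reward=1.0, gamma=0.99, q1_next=10.0, q2_next=8.0)
print(y)
```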

	### Strengths (why CDQ is popular)
	- Effective overestimation control: taking a minimum is conservative, often preventing exploding Q-values.
	- Robust baseline behavior: works well across many continuous-control tasks.

	### Limitations
	- High variance: when critics are poorly learned early on, the min operator can yield high-variance TD targets, destabilizing TD learning and reducing sample efficiency.

	UBOC is motivated by a concrete question:
	> Can we obtain the same expected target value as CDQ, but with smaller variance?

	---

	## 2) UBOC: Uncertainty-Based Overestimation Correction

	UBOC views the critic outputs as a distribution of Q estimates (because function approximation is noisy).
	Instead of using `min(Q1, Q2)`, UBOC uses $N$ critics to estimate:
	- a mean $m$,
	- an (unbiased) standard deviation $\hat{s}$,

	and then forms a corrected value:

	$$
	Q_{\text{corrected}} = m - x\cdot \hat s
	$$

	where $x>0$ controls conservativeness.
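
A minimal sketch of the corrected value for one state-action pair, using only the Python standard library. The helper name `corrected_q` and the sample numbers are hypothetical, and `statistics.stdev` (the square root of the unbiased sample variance) stands in for $\hat s$ here.

```python
import statistics


def corrected_q(q_values, x):
    """Return m - x * s_hat for a list of N >= 2 critic estimates."""
    m = statistics.mean(q_values)
    # Square root of the unbiased sample variance (divides by N - 1).
    s_hat = statistics.stdev(q_values)
    return m - x * s_hat


q_values = [9.0, 10.0, 11.0, 10.5, 9.5]  # illustrative outputs of N = 5 critics
print(corrected_q(q_values, x=0.0))  # x = 0 gives the plain ensemble mean
print(corrected_q(q_values, x=1.0))  # larger x gives a more conservative value
```

The more the critics disagree, the larger $\hat s$ becomes and the further the corrected value is pushed below the mean.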

	### 2.1 Expectation equivalence to clipped double-Q

	Under the assumption that critic estimates behave like i.i.d. samples from a normal distribution, we can derive:

	$$
	\mathbb{E}\left[\min(Q_A, Q_B)\right]=\mathbb{E}\left[m - \frac{\hat s}{\sqrt{\pi}}\right]
	$$

	This is the key insight: choosing $x=1/\sqrt{\pi}$ makes the corrected estimate match CDQ in expectation.
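
A short sketch of where the constant $1/\sqrt{\pi}$ comes from, assuming $Q_A$ and $Q_B$ are i.i.d. $\mathcal{N}(\mu,\sigma^2)$ and that $\hat s$ is unbiased for $\sigma$ (the symbols $\mu$ and $\sigma$ are introduced here for illustration). The minimum of two numbers can be written through their mean and absolute difference:

$$
\min(Q_A,Q_B)=\frac{Q_A+Q_B}{2}-\frac{|Q_A-Q_B|}{2},\qquad Q_A-Q_B\sim\mathcal{N}(0,\,2\sigma^2)
$$

Since $\mathbb{E}|Z|=\tau\sqrt{2/\pi}$ for $Z\sim\mathcal{N}(0,\tau^2)$, taking $\tau=\sqrt{2}\,\sigma$ gives:

$$
\mathbb{E}\,|Q_A-Q_B|=\frac{2\sigma}{\sqrt{\pi}}
\quad\Longrightarrow\quad
\mathbb{E}\left[\min(Q_A,Q_B)\right]=\mu-\frac{\sigma}{\sqrt{\pi}}=\mathbb{E}\left[m-\frac{\hat s}{\sqrt{\pi}}\right]
$$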

	### 2.2 Variance reduction

	We can further prove that the estimator

	$$
	m - \frac{\hat s}{\sqrt{\pi}}
	$$

	has strictly smaller variance than the CDQ minimum-based target, and the variance gap is strictly positive for all $N\ge 2$.

	As $N\to\infty$, the variance reduction approaches its upper bound:

	$$
	\sigma^2\left(1-\frac{1}{\pi}\right)
	$$

	This means that:
	- UBOC does not only “bias-correct”; it reduces noise in TD targets.
	- This is especially important early in training, where noisy targets can derail learning.
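
The variance claim can be checked empirically with a small Monte Carlo sketch. It assumes i.i.d. normal critic estimates, as in Section 2.1, and uses the square root of the unbiased sample variance as $\hat s$. The function name, $N=5$, the trial count, and the seed are arbitrary choices for illustration, not settings from the paper.

```python
import math
import random
import statistics


def simulate(num_critics=5, trials=20000, mu=0.0, sigma=1.0, seed=0):
    """Monte Carlo comparison of the CDQ and UBOC next-state value estimates."""
    rng = random.Random(seed)
    cdq, uboc = [], []
    for _ in range(trials):
        # CDQ: minimum of two i.i.d. normal critic estimates.
        cdq.append(min(rng.gauss(mu, sigma), rng.gauss(mu, sigma)))
        # UBOC: ensemble mean minus sample std / sqrt(pi) over N i.i.d. estimates.
        qs = [rng.gauss(mu, sigma) for _ in range(num_critics)]
        m = sum(qs) / num_critics
        v_hat = sum((q - m) ** 2 for q in qs) / (num_critics - 1)
        uboc.append(m - math.sqrt(v_hat / math.pi))
    return (statistics.mean(cdq), statistics.variance(cdq),
            statistics.mean(uboc), statistics.variance(uboc))


cdq_mean, cdq_var, uboc_mean, uboc_var = simulate()
print(f"CDQ : mean={cdq_mean:.3f}, var={cdq_var:.3f}")
print(f"UBOC: mean={uboc_mean:.3f}, var={uboc_var:.3f}")
```

With $\sigma=1$, the CDQ variance should come out near $1-1/\pi\approx 0.68$, and the UBOC variance should be clearly smaller. The two means agree only approximately in this sketch, because the square root of the unbiased variance slightly underestimates $\sigma$ for small $N$.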

	### 2.3 UBOC TD target

	Using $N$ target critics $Q_1,\dots,Q_N$, compute:

	Mean

	$$
	m(s,a) = \frac{1}{N}\sum_{i=1}^N Q_i(s,a)
	$$

	Unbiased variance (Approximation)

	$$
	\hat v(s,a)=\frac{1}{N-1}\sum_{i=1}^N \left( Q_i(s,a)-m(s,a)\right)^2
	$$

	Then the UBOC target is:

	$$
	y_{\text{UBOC}}(s_t,a_t)=r_t + \gamma\left(m(s_{t+1},a_{t+1}) - \sqrt{\frac{\hat v(s_{t+1},a_{t+1})}{\pi}}\right)
	$$

	where $a_{t+1}$ can be computed with target policy smoothing.

	This gives a dynamic bias correction driven by critic uncertainty.
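
The three formulas above can be sketched in PyTorch as follows. This is an illustrative sketch, not the repository's actual code: the function name, the `(N, batch, 1)` tensor layout, and the `not_done` mask for terminal transitions are assumptions. In a training loop the target would normally be computed under `torch.no_grad()`.

```python
import math

import torch


def uboc_target(reward, not_done, q_next, gamma=0.99):
    """UBOC TD target (sketch).

    reward, not_done: tensors of shape (batch, 1); not_done is an assumed
        mask that is 0 for terminal transitions and 1 otherwise.
    q_next: target-critic values at (s_{t+1}, a_{t+1}),
        shape (N, batch, 1) with N >= 2.
    """
    m = q_next.mean(dim=0)     # ensemble mean over the N critics
    v_hat = q_next.var(dim=0)  # default uses Bessel's correction (divides by N - 1)
    corrected = m - torch.sqrt(v_hat / math.pi)
    return reward + gamma * not_done * corrected


# Toy batch of 2 transitions with N = 2 target critics.
q_next = torch.tensor([[[1.0], [2.0]],
                       [[3.0], [2.0]]])
reward = torch.tensor([[0.0], [1.0]])
not_done = torch.ones(2, 1)
y = uboc_target(reward, not_done, q_next, gamma=0.5)
print(y)
```

In the toy batch, the critics disagree on the first transition, so its target is pulled below the mean. They agree on the second, so no correction is applied there.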

	---

	## 3) UD7: TD7 + UBOC Targets

	UD7 integrates UBOC into a TD7-style pipeline and emphasizes strong sample efficiency.

	- UD7 keeps the TD7 training pipeline for practical stability and efficiency.
	- The main difference from TD7 is the critic training target: UD7 uses UBOC targets and a multi-critic ensemble (commonly N=5).

	> If you already have a TD7 baseline, UD7 is best viewed as:
	> “swap the target rule + use N critics, then keep the rest of the training recipe.”
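
To illustrate the "use N critics" part, here is a deliberately simplified ensemble of independent Q-networks in PyTorch. The class name, layer sizes, and plain MLP architecture are hypothetical and do not reproduce the TD7 critic; the sketch only shows how $N$ critic outputs can be stacked so that a mean and variance can be taken over the first dimension.

```python
import torch
import torch.nn as nn


class CriticEnsemble(nn.Module):
    """N independent Q-networks evaluated on the same (state, action) batch."""

    def __init__(self, state_dim, action_dim, num_critics=5, hidden_dim=256):
        super().__init__()
        self.critics = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )
            for _ in range(num_critics)
        ])

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=1)
        # Stack per-critic outputs into shape (N, batch, 1).
        return torch.stack([critic(sa) for critic in self.critics], dim=0)


ensemble = CriticEnsemble(state_dim=3, action_dim=2)
q = ensemble(torch.zeros(4, 3), torch.zeros(4, 2))
print(q.shape)
```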

	---

	## 4) Performance

	<div align="center">
	<img src="figures/performance.png" alt="Fig. 1 — Performance comparison on MuJoCo benchmarks" width="800"/>
	</div>

	---

	## 5) Computational Overhead

	Runtime comparison (measured on an RTX 3090 Ti GPU and an Intel i7-12700 CPU):

	<div align="center">
	<img src="figures/runtime.png" alt="Fig. 2 — Runtime comparison" width="300"/>
	</div>

	---

	## Citation
	```
	@article{kim2026provable,
	title={Provable generalization of clipped double Q-learning for variance reduction and sample efficiency},
	author={Kim, Jangwon and Jeong, Jiseok and Han, Soohee},
	journal={Neurocomputing},
	volume={673},
	pages={132772},
	year={2026},
	publisher={Elsevier}
	}
	```