jangwon-kim-cocel's picture
Update README.md
130a127 verified
---
license: mit
language:
- en
pipeline_tag: reinforcement-learning
tags:
- rl
- bayesian
- policy
- distillation
- offline
- offline-rl
- pruning
- bpd
- BPD
---
<div align="center">
<h1>Bayesian Policy Distillation</h1>
<h3>Towards Lightweight and Fast Neural Policy Networks</h3>
<a href="https://www.python.org/">
<img src="https://img.shields.io/badge/Python-3.7+-blue?logo=python&style=flat-square" alt="Python Badge"/>
</a>
&nbsp;&nbsp;
<a href="https://pytorch.org/">
<img src="https://img.shields.io/badge/PyTorch-1.8+-EE4C2C?logo=pytorch&style=flat-square" alt="PyTorch Badge"/>
</a>
&nbsp;&nbsp;
<a href="https://doi.org/10.1016/j.engappai.2025.113539">
<img src="https://img.shields.io/badge/EAAI%202026-Published-success?style=flat-square" alt="EAAI Badge"/>
</a>
&nbsp;&nbsp;
<a href="https://www.elsevier.com/">
<img src="https://img.shields.io/badge/Elsevier-Journal-orange?style=flat-square" alt="Elsevier Badge"/>
</a>
<br/><br/>
<img src="./gif_for_readme.gif" width="550px"/>
</div>
---
## Engineering Applications of Artificial Intelligence (EAAI 2026)
### PyTorch Implementation
This repository contains a PyTorch implementation of **Bayesian Policy Distillation (BPD)** of the paper:
> **Bayesian policy distillation: Towards lightweight and fast neural policy networks**
> Jangwon Kim, Yoonsu Jang, Jonghyeok Park, Yoonhee Gil, Soohee Han
> *Engineering Applications of Artificial Intelligence*, Volume 166, 2026
## 📄 Paper Link
> **DOI:** https://doi.org/10.1016/j.engappai.2025.113539
> **Journal:** Engineering Applications of Artificial Intelligence
---
## Bayesian Policy Distillation
BPD achieves extreme policy compression through offline reinforcement learning by:
1. **Bayesian Neural Networks**: Uncertainty-driven dynamic weight pruning
2. **Sparse Variational Dropout**: Automatic sparsity induction via KL regularization
3. **Offline RL Framework**: Value optimization + behavior cloning
$$
\mathcal{L}_{BPD}(\theta, \alpha) = -\lambda Q_{\psi_1}(s, \pi_\omega(s)) + \frac{|\mathcal{D}|}{M}\sum_{m=1}^{M}(\pi_{\omega_m}(s_m) - a_m)^2 + \eta \cdot D_{KL}(q(\omega|\theta,\alpha) \| p(\omega))
$$
**Key Results:**
- **~98% compression** (1.5-2.5% sparsity) while maintaining performance
- **4.5× faster inference** on embedded systems
- Successfully deployed on real inverted pendulum with **78% inference time reduction**
---
## Quick Start
### Basic Training
```bash
python main.py --env-name Hopper-v3 --level expert --random-seed 1
```
### Custom Configuration
```bash
python main.py \
--env-name Walker2d-v3 \
--level medium \
--student-hidden-dims "(128, 128)" \
--alpha-threshold 2 \
--nu 4 \
--h 0.5
```
### Available Environments
- `Hopper-v3`, `Walker2d-v3`, `HalfCheetah-v3`, `Ant-v3`
### Teacher Policy Levels
- `expert`: High-performance teacher policy
- `medium`: Moderate-performance teacher policy
---
## Key Hyperparameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--student-hidden-dims` | (128, 128) | Student network hidden layer sizes |
| `--alpha-threshold` | 2 | Pruning threshold for log(α) (higher = less compression) |
| `--nu` | 4 | KL weight annealing speed |
| `--h` | 0.5 | Q-value loss coefficient |
| `--batch-size` | 256 | Mini-batch size |
| `--max-teaching-count` | 1000000 | Total training iterations |
| `--eval-freq` | 5000 | Evaluation frequency |
**Adjusting Compression:**
- `--alpha-threshold 3-4`: Conservative pruning
- `--alpha-threshold 2`: Balanced [default]
- `--alpha-threshold 1`: Aggressive pruning
---
## Results
### MuJoCo Benchmark (Expert Teacher)
| Environment | Teacher | BPD (Ours) | Sparsity | Compression |
|------------|---------|------------|----------|-------------|
| Ant-v3 | 5364 | 5455 | 2.40% | **41.7×** |
| Walker2d-v3 | 5357 | 4817 | 1.68% | **59.5×** |
| Hopper-v3 | 3583 | 3134 | 1.35% | **74.1×** |
| HalfCheetah-v3 | 11432 | 10355 | 2.21% | **45.2×** |
### Real Hardware (Inverted Pendulum)
- **Inference**: 1.36ms → 0.30ms (**4.5× faster**)
- **Memory**: 290.82KB → 4.43KB (**98.5% reduction**)
- **Parameters**: 72,705 → 1,108 (**65.6× compression**)
---
## Citation
```bibtex
@article{kim2026bayesian,
title={Bayesian policy distillation: Towards lightweight and fast neural policy networks},
author={Kim, Jangwon and Jang, Yoonsu and Park, Jonghyeok and Gil, Yoonhee and Han, Soohee},
journal={Engineering Applications of Artificial Intelligence},
volume={166},
pages={113539},
year={2026},
publisher={Elsevier}
}
```