---
license: mit
language:
- en
pipeline_tag: reinforcement-learning
tags:
- rl
- bayesian
- policy
- distillation
- offline
- offline-rl
- pruning
- bpd
---
<div align="center">
<h1>Bayesian Policy Distillation</h1>
<h3>Towards Lightweight and Fast Neural Policy Networks</h3>
<a href="https://www.python.org/">
<img src="https://img.shields.io/badge/Python-3.7+-blue?logo=python&style=flat-square" alt="Python Badge"/>
</a>
<a href="https://pytorch.org/">
<img src="https://img.shields.io/badge/PyTorch-1.8+-EE4C2C?logo=pytorch&style=flat-square" alt="PyTorch Badge"/>
</a>
<a href="https://doi.org/10.1016/j.engappai.2025.113539">
<img src="https://img.shields.io/badge/EAAI%202026-Published-success?style=flat-square" alt="EAAI Badge"/>
</a>
<a href="https://www.elsevier.com/">
<img src="https://img.shields.io/badge/Elsevier-Journal-orange?style=flat-square" alt="Elsevier Badge"/>
</a>
<br/><br/>
<img src="./gif_for_readme.gif" width="550px"/>
</div>
---
## Engineering Applications of Artificial Intelligence (EAAI 2026)
### PyTorch Implementation
This repository contains a PyTorch implementation of **Bayesian Policy Distillation (BPD)** from the paper:
> **Bayesian policy distillation: Towards lightweight and fast neural policy networks**
> Jangwon Kim, Yoonsu Jang, Jonghyeok Park, Yoonhee Gil, Soohee Han
> *Engineering Applications of Artificial Intelligence*, Volume 166, 2026
## 📄 Paper Link
> **DOI:** https://doi.org/10.1016/j.engappai.2025.113539
> **Journal:** Engineering Applications of Artificial Intelligence
---
## Bayesian Policy Distillation
BPD achieves extreme policy compression through offline reinforcement learning by combining:
1. **Bayesian Neural Networks**: Uncertainty-driven dynamic weight pruning
2. **Sparse Variational Dropout**: Automatic sparsity induction via KL regularization
3. **Offline RL Framework**: Value optimization + behavior cloning
$$
\mathcal{L}_{BPD}(\theta, \alpha) = -\lambda Q_{\psi_1}(s, \pi_\omega(s)) + \frac{|\mathcal{D}|}{M}\sum_{m=1}^{M}(\pi_{\omega_m}(s_m) - a_m)^2 + \eta \cdot D_{KL}(q(\omega|\theta,\alpha) \| p(\omega))
$$
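As a rough sketch (not the repository's actual code), the three terms of this objective can be combined as below. The inputs `q_value`, `pred_actions`, `target_actions`, and `kl_term` are hypothetical tensors produced elsewhere in the training loop, and `lam` plays the role of λ (the `--h` Q-value coefficient from the hyperparameter table):

```python
import torch

def bpd_loss(q_value, pred_actions, target_actions, kl_term,
             lam=0.5, eta=1.0, dataset_size=1_000_000, batch_size=256):
    # -λ Q_ψ(s, π_ω(s)): push the student toward high-value actions
    q_term = -lam * q_value.mean()
    # (|D|/M) Σ_m (π_ω(s_m) - a_m)²: behavior cloning toward the teacher's actions
    bc_term = (dataset_size / batch_size) * (
        (pred_actions - target_actions) ** 2).sum(dim=-1).mean()
    # η · KL(q(ω|θ,α) ‖ p(ω)): sparsity-inducing variational regularizer
    return q_term + bc_term + eta * kl_term
```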
**Key Results:**
- **~98% compression** (only 1.5–2.5% of weights retained) while maintaining performance
- **4.5× faster inference** on embedded systems
- Successfully deployed on real inverted pendulum with **78% inference time reduction**
---
## Quick Start
### Basic Training
```bash
python main.py --env-name Hopper-v3 --level expert --random-seed 1
```
### Custom Configuration
```bash
python main.py \
--env-name Walker2d-v3 \
--level medium \
--student-hidden-dims "(128, 128)" \
--alpha-threshold 2 \
--nu 4 \
--h 0.5
```
### Available Environments
- `Hopper-v3`, `Walker2d-v3`, `HalfCheetah-v3`, `Ant-v3`
### Teacher Policy Levels
- `expert`: High-performance teacher policy
- `medium`: Moderate-performance teacher policy
---
## Key Hyperparameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--student-hidden-dims` | (128, 128) | Student network hidden layer sizes |
| `--alpha-threshold` | 2 | Pruning threshold for log(α) (higher = less compression) |
| `--nu` | 4 | KL weight annealing speed |
| `--h` | 0.5 | Q-value loss coefficient |
| `--batch-size` | 256 | Mini-batch size |
| `--max-teaching-count` | 1000000 | Total training iterations |
| `--eval-freq` | 5000 | Evaluation frequency |
**Adjusting Compression:**
- `--alpha-threshold 3-4`: Conservative pruning
- `--alpha-threshold 2`: Balanced [default]
- `--alpha-threshold 1`: Aggressive pruning
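The thresholding rule behind `--alpha-threshold` can be sketched as follows (assumed semantics from sparse variational dropout, not the repository's exact code): weights whose dropout `log(α)` exceeds the threshold are dominated by noise and get pruned, so a higher threshold prunes fewer weights.

```python
import torch

def prune_mask(log_alpha: torch.Tensor, threshold: float = 2.0) -> torch.Tensor:
    # In sparse variational dropout, a large log(α) means a weight's
    # multiplicative noise dominates its mean, so the weight carries
    # little information and can safely be zeroed out.
    return (log_alpha < threshold).float()

log_alpha = torch.tensor([[-1.0, 3.0], [0.5, 5.0]])
mask = prune_mask(log_alpha)   # keeps the two entries below 2.0
kept = mask.mean().item()      # fraction of weights retained (here 0.5)
```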
---
## Results
### MuJoCo Benchmark (Average Return, Expert Teacher)
| Environment | Teacher | BPD (Ours) | Sparsity | Compression |
|------------|---------|------------|----------|-------------|
| Ant-v3 | 5364 | 5455 | 2.40% | **41.7×** |
| Walker2d-v3 | 5357 | 4817 | 1.68% | **59.5×** |
| Hopper-v3 | 3583 | 3134 | 1.35% | **74.1×** |
| HalfCheetah-v3 | 11432 | 10355 | 2.21% | **45.2×** |
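A quick sanity check on the table: since sparsity is the fraction of weights retained, the compression factor is simply its reciprocal.

```python
def compression_ratio(sparsity_pct: float) -> float:
    # sparsity = fraction of weights kept, so compression = 1 / sparsity
    return 1.0 / (sparsity_pct / 100.0)

print(round(compression_ratio(2.40), 1))  # → 41.7 (Ant-v3)
print(round(compression_ratio(1.35), 1))  # → 74.1 (Hopper-v3)
```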
### Real Hardware (Inverted Pendulum)
- **Inference**: 1.36ms → 0.30ms (**4.5× faster**)
- **Memory**: 290.82KB → 4.43KB (**98.5% reduction**)
- **Parameters**: 72,705 → 1,108 (**65.6× compression**)
---
## Citation
```bibtex
@article{kim2026bayesian,
title={Bayesian policy distillation: Towards lightweight and fast neural policy networks},
author={Kim, Jangwon and Jang, Yoonsu and Park, Jonghyeok and Gil, Yoonhee and Han, Soohee},
journal={Engineering Applications of Artificial Intelligence},
volume={166},
pages={113539},
year={2026},
publisher={Elsevier}
}
```