# Engineering Applications of Artificial Intelligence (EAAI 2026)

## PyTorch Implementation

This repository contains a PyTorch implementation of Bayesian Policy Distillation (BPD) from the paper:
**Bayesian policy distillation: Towards lightweight and fast neural policy networks**
Jangwon Kim, Yoonsu Jang, Jonghyeok Park, Yoonhee Gil, Soohee Han
*Engineering Applications of Artificial Intelligence*, Volume 166, 2026
**Paper Link**
DOI: https://doi.org/10.1016/j.engappai.2025.113539
Journal: Engineering Applications of Artificial Intelligence
## Bayesian Policy Distillation
BPD achieves extreme policy compression through offline reinforcement learning by:
- Bayesian Neural Networks: Uncertainty-driven dynamic weight pruning
- Sparse Variational Dropout: Automatic sparsity induction via KL regularization
- Offline RL Framework: Value optimization + behavior cloning
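The sparse variational dropout mechanism above can be sketched as a linear layer whose weights carry a learned per-weight dropout rate α; weights whose log(α) exceeds a threshold are pruned at inference. This is an illustrative sketch following the standard Molchanov-style formulation, not the repository's actual layer; the class and parameter names are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDLinear(nn.Module):
    """Illustrative linear layer with sparse variational dropout."""

    def __init__(self, in_features, out_features, alpha_threshold=2.0):
        super().__init__()
        self.theta = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.alpha_threshold = alpha_threshold

    @property
    def log_alpha(self):
        # log(alpha) = log(sigma^2) - log(theta^2)
        return self.log_sigma2 - torch.log(self.theta ** 2 + 1e-8)

    def forward(self, x):
        if self.training:
            # Local reparameterization: sample activations, not weights.
            mean = F.linear(x, self.theta, self.bias)
            var = F.linear(x ** 2, torch.exp(self.log_sigma2)) + 1e-8
            return mean + var.sqrt() * torch.randn_like(mean)
        # At inference, prune weights whose log(alpha) exceeds the threshold.
        mask = (self.log_alpha < self.alpha_threshold).float()
        return F.linear(x, self.theta * mask, self.bias)

    def kl(self):
        # Standard approximation of the KL term that induces sparsity.
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * F.softplus(-la) - k1
        return -neg_kl.sum()
```

During training, the KL term is added (with an annealed weight) to the offline RL and behavior-cloning losses, driving most log(α) values above the pruning threshold.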
**Key Results:**
- ~98% compression (only 1.5-2.5% of weights retained) while maintaining performance
- 4.5× faster inference on embedded systems
- Successfully deployed on a real inverted pendulum with a 78% inference-time reduction
## Quick Start

### Basic Training

```bash
python main.py --env-name Hopper-v3 --level expert --random-seed 1
```
### Custom Configuration

```bash
python main.py \
    --env-name Walker2d-v3 \
    --level medium \
    --student-hidden-dims "(128, 128)" \
    --alpha-threshold 2 \
    --nu 4 \
    --h 0.5
```
### Available Environments

`Hopper-v3`, `Walker2d-v3`, `HalfCheetah-v3`, `Ant-v3`
### Teacher Policy Levels

- `expert`: High-performance teacher policy
- `medium`: Moderate-performance teacher policy
### Key Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| `--student-hidden-dims` | `(128, 128)` | Student network hidden layer sizes |
| `--alpha-threshold` | `2` | Pruning threshold for log(α) (higher = less compression) |
| `--nu` | `4` | KL weight annealing speed |
| `--h` | `0.5` | Q-value loss coefficient |
| `--batch-size` | `256` | Mini-batch size |
| `--max-teaching-count` | `1000000` | Total training iterations |
| `--eval-freq` | `5000` | Evaluation frequency |
**Adjusting Compression:**
- `--alpha-threshold 3-4`: Conservative pruning
- `--alpha-threshold 2`: Balanced (default)
- `--alpha-threshold 1`: Aggressive pruning
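The threshold acts on each weight's log(α): a weight is pruned when its log(α) exceeds the threshold, so a lower threshold prunes more. A minimal sketch on synthetic (untrained) log-α values, with a hypothetical helper name:

```python
import torch

def kept_fraction(log_alpha: torch.Tensor, threshold: float) -> float:
    """Fraction of weights kept: a weight is pruned when log(alpha) exceeds the threshold."""
    return (log_alpha < threshold).float().mean().item()

# Synthetic log-alpha values spread over a wide range, for illustration only.
log_alpha = torch.linspace(-8.0, 8.0, steps=1000)
for t in (1.0, 2.0, 4.0):
    print(f"threshold={t}: kept {kept_fraction(log_alpha, t):.1%}")
```

On real trained networks the log(α) distribution is strongly bimodal, which is why moving the threshold between 1 and 4 trades a few percent of retained weights against robustness of the pruning decision.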
## Results

### MuJoCo Benchmark (Expert Teacher)
| Environment | Teacher | BPD (Ours) | Sparsity | Compression |
|---|---|---|---|---|
| Ant-v3 | 5364 | 5455 | 2.40% | 41.7× |
| Walker2d-v3 | 5357 | 4817 | 1.68% | 59.5× |
| Hopper-v3 | 3583 | 3134 | 1.35% | 74.1× |
| HalfCheetah-v3 | 11432 | 10355 | 2.21% | 45.2× |
### Real Hardware (Inverted Pendulum)

- Inference: 1.36 ms → 0.30 ms (4.5× faster)
- Memory: 290.82 KB → 4.43 KB (98.5% reduction)
- Parameters: 72,705 → 1,108 (65.6× compression)
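The three hardware figures above are consistent with each other; a quick arithmetic check using only the reported numbers:

```python
# Reproduce the hardware-deployment arithmetic from the reported numbers.
teacher_params, student_params = 72_705, 1_108
compression = teacher_params / student_params    # parameter-count compression ratio
teacher_kb, student_kb = 290.82, 4.43
memory_reduction = 1 - student_kb / teacher_kb   # fraction of memory saved
speedup = 1.36 / 0.30                            # inference-time speedup (ms / ms)
print(f"{compression:.1f}x params, {memory_reduction:.1%} memory, {speedup:.1f}x faster")
# → 65.6x params, 98.5% memory, 4.5x faster
```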
## Citation

```bibtex
@article{kim2026bayesian,
  title={Bayesian policy distillation: Towards lightweight and fast neural policy networks},
  author={Kim, Jangwon and Jang, Yoonsu and Park, Jonghyeok and Gil, Yoonhee and Han, Soohee},
  journal={Engineering Applications of Artificial Intelligence},
  volume={166},
  pages={113539},
  year={2026},
  publisher={Elsevier}
}
```