Bayesian Policy Distillation

Towards Lightweight and Fast Neural Policy Networks

Engineering Applications of Artificial Intelligence (EAAI 2026)

PyTorch Implementation

This repository contains a PyTorch implementation of Bayesian Policy Distillation (BPD), introduced in the paper:

Bayesian policy distillation: Towards lightweight and fast neural policy networks
Jangwon Kim, Yoonsu Jang, Jonghyeok Park, Yoonhee Gil, Soohee Han
Engineering Applications of Artificial Intelligence, Volume 166, 2026

📄 Paper Link

DOI: https://doi.org/10.1016/j.engappai.2025.113539
Journal: Engineering Applications of Artificial Intelligence


Bayesian Policy Distillation

BPD achieves extreme policy compression through offline reinforcement learning by combining:

  1. Bayesian Neural Networks: Uncertainty-driven dynamic weight pruning
  2. Sparse Variational Dropout: Automatic sparsity induction via KL regularization
  3. Offline RL Framework: Value optimization + behavior cloning

$$
\mathcal{L}_{BPD}(\theta, \alpha) = -\lambda Q_{\psi_1}(s, \pi_\omega(s)) + \frac{|\mathcal{D}|}{M}\sum_{m=1}^{M}(\pi_{\omega_m}(s_m) - a_m)^2 + \eta \cdot D_{KL}(q(\omega|\theta,\alpha) \| p(\omega))
$$
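
The three terms of the loss (value term, behavior cloning term, KL term) can be evaluated numerically with a small pure-Python sketch. This is not the repository's code. It assumes scalar actions, a single batch-averaged Q-value, and the KL approximation for sparse variational dropout from Molchanov et al. (2017); the function names and arguments (`bpd_loss`, `kl_per_weight`, `lam`, `eta`) are hypothetical.

```python
import math

# Constants of the KL approximation in Molchanov et al. (2017).
# Assumption: the paper's KL term may use a different approximation.
K1, K2, K3 = 0.63576, 1.8732, 1.48695


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def kl_per_weight(log_alpha):
    """Approximate KL(q || p) for one weight, given its log(alpha)."""
    neg_kl = (K1 * sigmoid(K2 + K3 * log_alpha)
              - 0.5 * math.log1p(math.exp(-log_alpha))
              - K1)
    return -neg_kl


def bpd_loss(q_value, predicted, targets, dataset_size, log_alphas, lam, eta):
    """Value term + scaled behavior cloning term + weighted KL term."""
    m = len(predicted)  # mini-batch size M
    # (|D| / M) * sum of squared errors between student and teacher actions
    bc = (dataset_size / m) * sum((p - a) ** 2 for p, a in zip(predicted, targets))
    # KL summed over all weights of the student network
    kl = sum(kl_per_weight(la) for la in log_alphas)
    return -lam * q_value + bc + eta * kl


loss = bpd_loss(q_value=2.0, predicted=[1.0, 0.0], targets=[0.0, 0.0],
                dataset_size=10, log_alphas=[50.0], lam=1.0, eta=1.0)
print(round(loss, 6))
```

In this approximation a weight with very large log(α) contributes almost nothing to the KL term, while a weight with small log(α) is penalized, which is what pushes uninformative weights toward being prunable.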

Key Results:

  • ~98% compression (1.5-2.5% sparsity, where sparsity is the share of weights kept) while maintaining performance
  • 4.5× faster inference on embedded systems
  • Successfully deployed on a real inverted pendulum with a 78% reduction in inference time

Quick Start

Basic Training

python main.py --env-name Hopper-v3 --level expert --random-seed 1

Custom Configuration

python main.py \
    --env-name Walker2d-v3 \
    --level medium \
    --student-hidden-dims "(128, 128)" \
    --alpha-threshold 2 \
    --nu 4 \
    --h 0.5
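
The quoted tuple passed to --student-hidden-dims has to be turned into a Python tuple somewhere. A minimal sketch of one way to do this with the standard library follows. It is an assumption about how such a CLI could be written, not the parser in main.py; `build_parser` is a hypothetical name, and the defaults are copied from the hyperparameter table in this README.

```python
import argparse
import ast


def build_parser():
    """Hypothetical parser mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(description="BPD-style CLI sketch")
    parser.add_argument("--env-name", default="Hopper-v3")
    parser.add_argument("--level", default="expert", choices=["expert", "medium"])
    # ast.literal_eval safely turns the string "(128, 128)" into a tuple
    parser.add_argument("--student-hidden-dims", type=ast.literal_eval,
                        default=(128, 128))
    parser.add_argument("--alpha-threshold", type=float, default=2.0)
    parser.add_argument("--nu", type=float, default=4.0)
    parser.add_argument("--h", type=float, default=0.5)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args([
    "--env-name", "Walker2d-v3",
    "--level", "medium",
    "--student-hidden-dims", "(128, 128)",
    "--alpha-threshold", "2",
    "--nu", "4",
    "--h", "0.5",
])
print(args.student_hidden_dims)
```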

Available Environments

  • Hopper-v3, Walker2d-v3, HalfCheetah-v3, Ant-v3

Teacher Policy Levels

  • expert: High-performance teacher policy
  • medium: Moderate-performance teacher policy

Key Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| --student-hidden-dims | (128, 128) | Student network hidden layer sizes |
| --alpha-threshold | 2 | Pruning threshold for log(α) (higher = less compression) |
| --nu | 4 | KL weight annealing speed |
| --h | 0.5 | Q-value loss coefficient |
| --batch-size | 256 | Mini-batch size |
| --max-teaching-count | 1000000 | Total training iterations |
| --eval-freq | 5000 | Evaluation frequency |

Adjusting Compression:

  • --alpha-threshold 3-4: Conservative pruning
  • --alpha-threshold 2: Balanced [default]
  • --alpha-threshold 1: Aggressive pruning
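
The effect of the threshold can be illustrated with a small sketch. It assumes the usual sparse variational dropout parameterization, log α = log σ² − log θ² (Molchanov et al., 2017), and prunes every weight whose log α exceeds the threshold. The names `log_alpha` and `prune` and the example values are illustrative and not taken from the repository.

```python
import math


def log_alpha(theta, log_sigma2):
    """log(alpha) = log(sigma^2) - log(theta^2); epsilon avoids log(0)."""
    return log_sigma2 - math.log(theta * theta + 1e-8)


def prune(thetas, log_sigma2s, threshold):
    """Zero out weights whose log(alpha) is above the threshold."""
    return [t if log_alpha(t, ls) <= threshold else 0.0
            for t, ls in zip(thetas, log_sigma2s)]


# Three weights with the same noise variance but shrinking magnitude,
# so log(alpha) grows from about -2.0 to about 2.6 to about 7.2.
thetas = [1.0, 0.1, 0.01]
log_sigma2s = [-2.0, -2.0, -2.0]

conservative = prune(thetas, log_sigma2s, threshold=3.0)
balanced = prune(thetas, log_sigma2s, threshold=2.0)
print(conservative, balanced)
```

A higher threshold lets more weights survive, which matches the "higher = less compression" behavior described for --alpha-threshold.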

Results

MuJoCo Benchmark (Expert Teacher)

| Environment | Teacher | BPD (Ours) | Sparsity | Compression |
|---|---|---|---|---|
| Ant-v3 | 5364 | 5455 | 2.40% | 41.7× |
| Walker2d-v3 | 5357 | 4817 | 1.68% | 59.5× |
| Hopper-v3 | 3583 | 3134 | 1.35% | 74.1× |
| HalfCheetah-v3 | 11432 | 10355 | 2.21% | 45.2× |
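
The Sparsity and Compression columns are two views of the same quantity: if sparsity is read as the percentage of weights that remain, the compression factor is 100 divided by that percentage. The short check below reproduces the reported factors under that reading; `compression_ratio` is a hypothetical helper name.

```python
def compression_ratio(sparsity_percent):
    """Compression factor implied by keeping sparsity_percent % of the weights."""
    return 100.0 / sparsity_percent


# Sparsity values (in percent) copied from the results table.
reported_sparsity = {
    "Ant-v3": 2.40,
    "Walker2d-v3": 1.68,
    "Hopper-v3": 1.35,
    "HalfCheetah-v3": 2.21,
}

ratios = {env: compression_ratio(s) for env, s in reported_sparsity.items()}
for env, ratio in ratios.items():
    print(f"{env}: {ratio:.1f}x")
```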

Real Hardware (Inverted Pendulum)

  • Inference: 1.36 ms → 0.30 ms (4.5× faster)
  • Memory: 290.82 KB → 4.43 KB (98.5% reduction)
  • Parameters: 72,705 → 1,108 (65.6× compression)
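
The derived figures in parentheses follow from the raw before/after measurements. A short sketch of the arithmetic, using hypothetical helper names `factor` and `reduction_percent`:

```python
def factor(before, after):
    """How many times smaller (or faster) the 'after' value is."""
    return before / after


def reduction_percent(before, after):
    """Relative reduction from 'before' to 'after', in percent."""
    return 100.0 * (1.0 - after / before)


inference_speedup = factor(1.36, 0.30)                  # milliseconds
inference_reduction = reduction_percent(1.36, 0.30)     # milliseconds
memory_reduction = reduction_percent(290.82, 4.43)      # kilobytes
parameter_compression = factor(72705, 1108)             # parameter counts

print(f"{inference_speedup:.1f}x faster, {inference_reduction:.0f}% less time")
print(f"{memory_reduction:.1f}% less memory, {parameter_compression:.1f}x fewer parameters")
```

The 4.5× speedup and the 78% reduction in inference time quoted in the key results are the same measurement expressed as a ratio and as a percentage.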

Citation

@article{kim2026bayesian,
  title={Bayesian policy distillation: Towards lightweight and fast neural policy networks},
  author={Kim, Jangwon and Jang, Yoonsu and Park, Jonghyeok and Gil, Yoonhee and Han, Soohee},
  journal={Engineering Applications of Artificial Intelligence},
  volume={166},
  pages={113539},
  year={2026},
  publisher={Elsevier}
}