---
license: mit
language:
  - en
pipeline_tag: reinforcement-learning
tags:
  - rl
  - bayesian
  - policy
  - distillation
  - offline
  - offline-rl
  - pruning
  - bpd
---

# Bayesian Policy Distillation

*Towards Lightweight and Fast Neural Policy Networks*



**Engineering Applications of Artificial Intelligence (EAAI 2026)**

## PyTorch Implementation

This repository contains a PyTorch implementation of Bayesian Policy Distillation (BPD) from the paper:

> **Bayesian policy distillation: Towards lightweight and fast neural policy networks**
> Jangwon Kim, Yoonsu Jang, Jonghyeok Park, Yoonhee Gil, Soohee Han
> *Engineering Applications of Artificial Intelligence*, Volume 166, 2026

## 📄 Paper Link

- DOI: https://doi.org/10.1016/j.engappai.2025.113539
- Journal: *Engineering Applications of Artificial Intelligence*


## Bayesian Policy Distillation

BPD achieves extreme policy compression through offline reinforcement learning by combining:

1. **Bayesian Neural Networks**: uncertainty-driven dynamic weight pruning
2. **Sparse Variational Dropout**: automatic sparsity induction via KL regularization
3. **Offline RL Framework**: value optimization combined with behavior cloning

$$
\mathcal{L}_{BPD}(\theta, \alpha) = -\lambda\, Q_{\psi_1}\bigl(s, \pi_\omega(s)\bigr) + \frac{|\mathcal{D}|}{M}\sum_{m=1}^{M}\bigl(\pi_{\omega_m}(s_m) - a_m\bigr)^2 + \eta \cdot D_{KL}\bigl(q(\omega \mid \theta, \alpha) \,\|\, p(\omega)\bigr)
$$
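The three terms of the objective can be sketched as follows. This is an illustrative sketch only, not the repository's actual API: the function name, argument names, and the batch-mean over the Q-term are assumptions.

```python
import torch

def bpd_loss(q_value, student_actions, teacher_actions, kl_div,
             dataset_size, lam=0.5, eta=1.0):
    """Sketch of the BPD objective: a value-maximization term,
    a behavior-cloning term scaled by |D|/M, and the KL sparsity
    regularizer from sparse variational dropout."""
    M = student_actions.shape[0]                       # mini-batch size
    value_term = -lam * q_value.mean()                 # -lambda * Q(s, pi(s))
    bc_term = (dataset_size / M) * ((student_actions - teacher_actions) ** 2).sum()
    return value_term + bc_term + eta * kl_div
```

Minimizing the first term pushes the student toward high-value actions, the second keeps it close to the teacher's behavior on the offline dataset, and the third drives redundant weights toward zero.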

**Key Results:**

- **~98% compression** (only 1.5–2.5% of weights retained) with performance on par with the teacher
- **4.5× faster inference** on embedded systems
- Successfully deployed on a real inverted pendulum with a **78% inference-time reduction**

## Quick Start

### Basic Training

```bash
python main.py --env-name Hopper-v3 --level expert --random-seed 1
```

### Custom Configuration

```bash
python main.py \
    --env-name Walker2d-v3 \
    --level medium \
    --student-hidden-dims "(128, 128)" \
    --alpha-threshold 2 \
    --nu 4 \
    --h 0.5
```

### Available Environments

- `Hopper-v3`, `Walker2d-v3`, `HalfCheetah-v3`, `Ant-v3`

### Teacher Policy Levels

- `expert`: high-performance teacher policy
- `medium`: moderate-performance teacher policy

## Key Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| `--student-hidden-dims` | `(128, 128)` | Student network hidden-layer sizes |
| `--alpha-threshold` | `2` | Pruning threshold on log(α) (higher = less compression) |
| `--nu` | `4` | KL-weight annealing speed |
| `--h` | `0.5` | Q-value loss coefficient |
| `--batch-size` | `256` | Mini-batch size |
| `--max-teaching-count` | `1000000` | Total training iterations |
| `--eval-freq` | `5000` | Evaluation frequency |

**Adjusting Compression:**

- `--alpha-threshold 3`–`4`: conservative pruning
- `--alpha-threshold 2`: balanced (default)
- `--alpha-threshold 1`: aggressive pruning
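In sparse variational dropout, a weight is pruned when its dropout parameter log α = log(σ² / θ²) grows large, meaning the weight is dominated by noise. A minimal sketch of how the threshold could act, with assumed parameter names (`log_sigma2`, `theta`; the repository's internals may differ):

```python
import torch

def prune_mask(log_sigma2, theta, alpha_threshold=2.0):
    """Keep a weight only if its log alpha = log(sigma^2 / theta^2)
    stays below the threshold; large log alpha means the weight is
    effectively noise and can be dropped."""
    log_alpha = log_sigma2 - 2.0 * torch.log(torch.abs(theta) + 1e-8)
    return (log_alpha < alpha_threshold).float()   # 1 = keep, 0 = prune
```

A higher `alpha_threshold` keeps more weights (conservative pruning); a lower one prunes more aggressively, matching the settings listed above.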

## Results

### MuJoCo Benchmark (Expert Teacher)

| Environment | Teacher | BPD (Ours) | Sparsity | Compression |
|---|---|---|---|---|
| Ant-v3 | 5364 | 5455 | 2.40% | 41.7× |
| Walker2d-v3 | 5357 | 4817 | 1.68% | 59.5× |
| Hopper-v3 | 3583 | 3134 | 1.35% | 74.1× |
| HalfCheetah-v3 | 11432 | 10355 | 2.21% | 45.2× |

### Real Hardware (Inverted Pendulum)

- **Inference:** 1.36 ms → 0.30 ms (4.5× faster)
- **Memory:** 290.82 KB → 4.43 KB (98.5% reduction)
- **Parameters:** 72,705 → 1,108 (65.6× compression)
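The reported ratios follow directly from the raw counts; a quick arithmetic check:

```python
# Derive the reported ratios from the raw inverted-pendulum numbers.
params_teacher, params_student = 72_705, 1_108
compression = params_teacher / params_student            # ~65.6x
weights_kept = params_student / params_teacher * 100     # ~1.5% of weights remain

mem_teacher_kb, mem_student_kb = 290.82, 4.43
mem_reduction = (1 - mem_student_kb / mem_teacher_kb) * 100  # ~98.5%

print(f"{compression:.1f}x compression, {weights_kept:.2f}% of weights kept, "
      f"{mem_reduction:.1f}% memory reduction")
```

Note that the ~1.5% of parameters retained here matches the low end of the 1.5–2.5% sparsity range reported for the MuJoCo benchmarks.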

## Citation

```bibtex
@article{kim2026bayesian,
  title={Bayesian policy distillation: Towards lightweight and fast neural policy networks},
  author={Kim, Jangwon and Jang, Yoonsu and Park, Jonghyeok and Gil, Yoonhee and Han, Soohee},
  journal={Engineering Applications of Artificial Intelligence},
  volume={166},
  pages={113539},
  year={2026},
  publisher={Elsevier}
}
```