	---
	license: mit
	language:
	- en
	tags:
	- DLLM
	- diffusion-language-model
	- on-policy-distillation
	- post-training
	library_name: transformers
	pipeline_tag: text-generation
	base_model: Qwen/Qwen3-8B
	datasets:
	- divelab/opdlm_train_data
	arxiv: 2606.06712
	---
	# OPDLM-8B

	OPDLM-8B is a block diffusion language model (DLM) obtained by converting an
	autoregressive language model (ARLM), Qwen3-8B, into a DLM through post-training with
	on-policy distillation. arXiv report: [arxiv.org/abs/2606.06712](https://arxiv.org/abs/2606.06712)

	## Highlights
	- Converted, not pretrained from scratch: built from a strong ARLM, reusing its prior.
	- Training-efficient: ~0.066B tokens of conversion vs. ~50B tokens for from-scratch DLM training (same base ARLM).
	- Inference-efficient: parallel token decoding via block diffusion.
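
To put the training-efficiency figures in perspective, the ratio between the two token budgets can be worked out directly. This is plain arithmetic on the approximate numbers quoted in this card; the variable names are illustrative and the results inherit the rounding of the inputs.

```python
# Approximate token budgets quoted in the Highlights list (both are rounded figures).
conversion_tokens = 0.066e9   # ~0.066B tokens used for the ARLM-to-DLM conversion
scratch_tokens = 50e9         # ~50B tokens quoted for from-scratch DLM training

ratio = scratch_tokens / conversion_tokens      # how many times smaller the conversion budget is
fraction = conversion_tokens / scratch_tokens   # conversion budget as a share of the baseline

print(f"conversion budget: {conversion_tokens / 1e6:.0f}M tokens")
print(f"reduction factor:  ~{ratio:.0f}x")
print(f"share of baseline: ~{fraction:.2%}")
```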

	## Model Details
	- Developed by: DIVE Lab, Texas A&M University
	- Base model: [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
	- Model type: Block diffusion language model (decoder-based)
	- Block size: 4
	- Parameters: ~8B
	- Language: English
	- License: MIT

	## Training
	- Method: On-policy distillation from a frozen ARLM teacher into a block DLM student.
	- Conversion budget: ~0.066B tokens
	- Data: [opdlm_train_data](https://huggingface.co/datasets/divelab/opdlm_train_data)
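
As a rough illustration of what on-policy distillation means in general, the toy sketch below samples a continuation from the student and averages the per-token reverse KL divergence between student and frozen teacher along that sample. It is a generic formulation under stated assumptions, not the objective from the report: the reverse-KL loss, the next-token interface, and the names `on_policy_distill_loss`, `teacher_logits`, and `uniform_student_logits` are all hypothetical, and the sketch ignores the block diffusion structure of the actual student.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at a single token position."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(student_probs, teacher_probs) if p > 0.0)

def on_policy_distill_loss(student_logits_fn, teacher_logits_fn, prompt, num_new_tokens, rng):
    """Sample from the student, score each step against the frozen teacher."""
    tokens = list(prompt)
    losses = []
    for _ in range(num_new_tokens):
        p_student = softmax(student_logits_fn(tokens))
        p_teacher = softmax(teacher_logits_fn(tokens))  # teacher is only queried, never updated
        losses.append(reverse_kl(p_student, p_teacher))
        # On-policy: the next token comes from the student's own distribution.
        next_token = rng.choices(range(len(p_student)), weights=p_student, k=1)[0]
        tokens.append(next_token)
    return sum(losses) / len(losses), tokens

VOCAB = 5  # toy vocabulary size (assumption for this example)

def teacher_logits(tokens):
    # Hypothetical frozen teacher: prefers token (last + 1) mod VOCAB.
    target = (tokens[-1] + 1) % VOCAB
    return [2.0 if i == target else 0.0 for i in range(VOCAB)]

def uniform_student_logits(tokens):
    # Hypothetical untrained student: uniform over the vocabulary.
    return [0.0] * VOCAB

rng = random.Random(0)
loss, sample = on_policy_distill_loss(uniform_student_logits, teacher_logits, [0], 8, rng)
matched_loss, _ = on_policy_distill_loss(teacher_logits, teacher_logits, [0], 8, rng)
print(f"untrained student loss: {loss:.4f}")
print(f"matched student loss:   {matched_loss:.4f}")
```

In a real training loop the loss would be differentiated with respect to the student's parameters; the toy only shows that the loss is positive when the student disagrees with the teacher and vanishes when the two distributions coincide.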

	## Evaluation
	| Benchmark    | Score |
	|--------------|-------|
	| MMLU         | 70.9  |
	| MMLU-Pro     | 53.7  |
	| GPQA-Diamond | 36.1  |
	| IFEval       | 50.1  |
	| GSM8K        | 87.1  |
	| MATH500      | 71.2  |
	| AIME-24      | 14.7  |
	| AIME-25      | 12.4  |
	| HumanEval    | 59.8  |
	| MBPP         | 48.7  |

	Decoding for the scores above: static (one token per step)
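
One way to read the static decoding setting together with the block size of 4 is sketched below: blocks are generated left to right, and within each block a fixed number of masked positions is committed per step. This is a toy schedule, not the model's actual decoder; `block_static_decode`, `toy_predict`, the `MASK` sentinel, and the confidence-based ordering are assumptions made for illustration.

```python
MASK = -1  # hypothetical sentinel for a not-yet-decoded position

def block_static_decode(predict_fn, prompt, num_new_tokens, block_size=4, tokens_per_step=1):
    """Decode block by block; commit `tokens_per_step` masked positions per step."""
    seq = list(prompt)
    steps = 0
    while len(seq) - len(prompt) < num_new_tokens:
        remaining = num_new_tokens - (len(seq) - len(prompt))
        start = len(seq)
        seq.extend([MASK] * min(block_size, remaining))  # open a new fully masked block
        while MASK in seq[start:]:
            # predict_fn returns {position: (token, confidence)} for the masked positions.
            proposals = predict_fn(seq, start)
            ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
            for pos, (tok, _conf) in ranked[:tokens_per_step]:
                seq[pos] = tok
            steps += 1
    return seq, steps

def toy_predict(seq, block_start):
    # Stand-in for the model: proposes token == position index, with higher
    # confidence for positions closer to the start of the current block.
    return {pos: (pos, 1.0 / (1 + pos - block_start))
            for pos in range(block_start, len(seq)) if seq[pos] == MASK}

seq, steps = block_static_decode(toy_predict, prompt=[0, 1], num_new_tokens=8)
print(seq, steps)
```

With `tokens_per_step=1` the number of steps equals the number of generated tokens, so the schedule gives no parallel speedup; raising `tokens_per_step` trades steps for tokens committed in parallel within a block.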

	## Citation
	```bibtex
	@misc{su2026dataefficientautoregressivetodiffusionlanguagemodels,
	title={Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation},
	author={Xingyu Su and Jacob Helwig and Shubham Parashar and Atharv Chagi and Lakshmi Jotsna and Degui Zhi and James Caverlee and Dileep Kalathil and Shuiwang Ji},
	year={2026},
	eprint={2606.06712},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2606.06712},
	}
	```