FS-DFM-1.3B-ESPO-mu8
FS-DFM 1.3B fine-tuned with ESPO (ELBO-based Sequence-level Policy Optimization) at mu=8. This is the first RL method reported to improve FS-DFM over its SFT baseline: 87.1% nonzero-reward rate and 0.198 average reward on the 124 test tasks, a +18.6pp gain over SFT. In the paper's experiments, only ELBO-based methods generalize to DFM (discrete flow matching) architectures.
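The actual training code lives in the linked repository; the sketch below only illustrates, under stated assumptions, how an ELBO-based sequence-level policy-gradient update can be wired up in PyTorch. Per-sequence log-likelihoods are replaced by Monte Carlo ELBO estimates (averaged over mu samples, mu=8 here), and advantages are normalized within each group of G rollouts. `estimate_elbo` assumes a masked discrete-diffusion corruption process and a model that maps a corrupted token sequence to per-position logits; none of the function names or hyperparameters are taken from the paper.

```python
# Sketch only: assumes a masked-diffusion ELBO estimator and a model that maps
# a corrupted token sequence [L] to logits [L, vocab]. Prompt conditioning is
# omitted for brevity. This is NOT the paper's implementation.
import torch
import torch.nn.functional as F

def estimate_elbo(model, tokens, mask_id):
    """One Monte Carlo ELBO sample: draw a noise level t ~ U(0, 1), mask each
    token independently with probability t, and score the model's log-probs at
    the masked positions, reweighted by 1/t (standard masked-diffusion ELBO)."""
    t = torch.rand(()).clamp(min=1e-3)
    mask = torch.rand(tokens.shape, device=tokens.device) < t
    corrupted = tokens.masked_fill(mask, mask_id)
    logp = F.log_softmax(model(corrupted), dim=-1)             # [L, vocab]
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum() / t

def espo_loss(model, rollouts, rewards, mask_id, mu=8):
    """ESPO-style surrogate: REINFORCE on ELBO log-prob proxies with
    group-relative advantages. rollouts: G token tensors sampled for one
    prompt; rewards: [G] scalar task rewards for those rollouts."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # normalize within group
    losses = []
    for g, seq in enumerate(rollouts):
        elbo = torch.stack(
            [estimate_elbo(model, seq, mask_id) for _ in range(mu)]
        ).mean()                                               # average mu MC samples
        losses.append(-adv[g] * elbo)
    return torch.stack(losses).mean()
```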
Paper
Concentrate or Collapse: When Reinforcement Learning Meets Diffusion Language Models for Web Planning
- Author: Muhammad Enrizky Brillian
- Institution: University of Toronto Scarborough
- Code: https://github.com/billy-enrizky/openbrowser-ai
Training Details
- Dataset: FormFactory (992 train / 124 val / 124 test tasks, 25 form types, 8 domains)
- Infrastructure: NVIDIA L40S (ReFusion) / A10G (FS-DFM) on Modal.com
- Framework: PyTorch + PEFT (LoRA/QLoRA)
- Training prompts: 50 (sequence-level), with G=4 rollouts per prompt (see the wiring sketch below)
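A hedged sketch of how these pieces could fit together: LoRA adapters via PEFT on a 4-bit-quantized base model (QLoRA), then the 50-prompt loop with G=4 rollouts per prompt feeding the `espo_loss` sketched above. The checkpoint path, target modules, sampler, reward function, and all hyperparameters are placeholders, not values from the paper; consult the linked repository for the real setup.

```python
# Illustrative wiring only; `sample_plan`, `reward_fn`, `train_prompts`, and
# MASK_ID are hypothetical placeholders, as are the model path and LoRA values.
import torch
from transformers import AutoModel, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(                        # QLoRA: 4-bit NF4 base weights
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModel.from_pretrained(
    "path/to/fs-dfm-1.3b",                       # hypothetical checkpoint path
    quantization_config=bnb,
    trust_remote_code=True,                      # custom DFM architecture
)
model = get_peft_model(base, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,      # illustrative LoRA values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

G = 4                                            # rollouts per prompt
for prompt in train_prompts:                     # 50 sequence-level prompts
    rollouts = [sample_plan(model, prompt) for _ in range(G)]         # hypothetical sampler
    rewards = torch.tensor([reward_fn(prompt, r) for r in rollouts])  # scalar task rewards
    loss = espo_loss(model, rollouts, rewards, mask_id=MASK_ID, mu=8)
    opt.zero_grad(); loss.backward(); opt.step()
```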
Citation
@article{brillian2026flowgrpo,
  title={Concentrate or Collapse: When Reinforcement Learning Meets Diffusion Language Models for Web Planning},
  author={Brillian, Muhammad Enrizky},
  year={2026}
}