tags:
  - discrete-flow-matching
  - web-action-planning
  - formfactory
  - reinforcement-learning
  - openbrowser
  - sequence-level-rl
license: apache-2.0

ReFusion-8B-MDPO

ReFusion 8B trained with MDPO (Masked Diffusion Policy Optimization). This checkpoint is the best result in the paper: 91.9% nonzero rate and 0.445 average reward on the 124 test tasks (+31.4pp over SFT). Training uses temporal advantage decomposition with mu=1.
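
The two headline metrics can be computed from per-task rewards roughly as follows. This is a minimal sketch, not the paper's evaluation code: it assumes "nonzero rate" means the fraction of test tasks whose reward is strictly greater than zero and that "average reward" is the plain mean over tasks. The function name `summarize_rewards` and the example rewards are hypothetical.

```python
from typing import Dict, Sequence


def summarize_rewards(rewards: Sequence[float]) -> Dict[str, float]:
    """Summarize per-task rewards.

    Assumption (not confirmed by the model card): nonzero rate is the
    fraction of tasks with reward > 0, and average reward is the mean.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    n = len(rewards)
    nonzero = sum(1 for r in rewards if r > 0)
    return {
        "num_tasks": float(n),
        "nonzero_rate": nonzero / n,
        "average_reward": sum(rewards) / n,
    }


# Hypothetical rewards for four tasks; these are NOT results from the paper.
example = summarize_rewards([0.0, 0.5, 1.0, 0.25])
print(example)
```

Under this definition, a 91.9% nonzero rate on 124 tasks is consistent with 114 tasks receiving a nonzero reward (114/124 ≈ 0.919).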

Paper

Concentrate or Collapse: When Reinforcement Learning Meets Diffusion Language Models for Web Planning

Training Details

  • Dataset: FormFactory (992 train / 124 val / 124 test tasks, 25 form types, 8 domains)
  • Infrastructure: NVIDIA L40S (ReFusion) / A10G (FS-DFM) on Modal.com
  • Framework: PyTorch + PEFT (LoRA/QLoRA)
  • Training prompts: 50 (sequence-level), G=4 rollouts per prompt
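
The numbers above imply a fairly small RL budget, which a short sketch makes explicit. The class name `RunBudget` and its fields are hypothetical; the values come from the list above. The rollout total assumes each training prompt is sampled exactly once with G rollouts, which the card does not state.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RunBudget:
    # Values taken from the Training Details list; names are illustrative.
    train_tasks: int = 992
    val_tasks: int = 124
    test_tasks: int = 124
    training_prompts: int = 50
    rollouts_per_prompt: int = 4  # G

    @property
    def total_tasks(self) -> int:
        return self.train_tasks + self.val_tasks + self.test_tasks

    @property
    def rollouts_per_pass(self) -> int:
        # Assumption: one pass samples every training prompt once.
        return self.training_prompts * self.rollouts_per_prompt

    def split_fractions(self) -> Tuple[float, float, float]:
        total = self.total_tasks
        return (
            self.train_tasks / total,
            self.val_tasks / total,
            self.test_tasks / total,
        )


budget = RunBudget()
print(budget.total_tasks, budget.rollouts_per_pass, budget.split_fractions())
```

With these values the dataset is an 80/10/10 split of 1,240 tasks, and one pass over the 50 training prompts yields 200 rollouts.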

Citation

@article{brillian2026flowgrpo,
  title={Concentrate or Collapse: When Reinforcement Learning Meets Diffusion Language Models for Web Planning},
  author={Brillian, Muhammad Enrizky},
  year={2026}
}