metadata
tags:
- discrete-flow-matching
- web-action-planning
- formfactory
- reinforcement-learning
- openbrowser
license: apache-2.0
ReFusion-8B-ESPO
ReFusion 8B trained with ESPO v19 (ELBO-based Sequence-level Policy Optimization). Optimizing at the sequence level avoids the training collapse observed with token-level RL methods. On the test split, the nonzero rate improves by 1.6 percentage points over the SFT baseline. This model is part of the STAD80 project, Generative Action Planning via Discrete Flow Matching.
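To make the percentage-point comparison concrete, here is a minimal sketch of how a nonzero rate and its difference between two models could be computed. It assumes "nonzero rate" means the fraction of tasks whose score is strictly greater than zero; that definition, the function names, and the toy scores are illustrative assumptions, not taken from the paper or its evaluation code.

```python
from typing import Sequence


def nonzero_rate(scores: Sequence[float]) -> float:
    """Fraction of tasks with a strictly positive score (assumed definition)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s > 0) / len(scores)


def pp_delta(rate_new: float, rate_base: float) -> float:
    """Difference between two rates, expressed in percentage points."""
    return (rate_new - rate_base) * 100.0


# Toy per-task scores, made up purely for illustration (not real results).
sft_scores = [0.0, 0.5, 0.0, 1.0, 0.0]
rl_scores = [0.0, 0.5, 0.2, 1.0, 0.0]

delta = pp_delta(nonzero_rate(rl_scores), nonzero_rate(sft_scores))
print(f"nonzero-rate change: {delta:+.1f} pp")

# Resolution of the metric on a 124-task split: one task is 100/124 pp.
print(f"one task on a 124-task split = {100 / 124:.2f} pp")
```

On a 124-task split a single task corresponds to roughly 0.8 percentage points, so under this assumed definition a 1.6 pp change is on the order of two tasks.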
Paper
Concentrate or Collapse: When Reinforcement Learning Meets Diffusion Language Models for Web Planning
- Author: Muhammad Enrizky Brillian
- Institution: University of Toronto Scarborough
Training Details
- Dataset: FormFactory (992 train / 124 val / 124 test tasks, 25 form types, 8 domains)
- Infrastructure: Single NVIDIA A10G GPU (24 GB VRAM) on Anyscale
- Framework: PyTorch + PEFT (LoRA/QLoRA)
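The single-GPU setup listed above is consistent with a back-of-the-envelope estimate of weight memory. The sketch below assumes exactly 8e9 parameters, treats the 24 GB of VRAM as 24 GiB, and counts model weights only (no activations, optimizer state, LoRA adapter weights, or quantization overhead). The 4-bit row is an assumption about how QLoRA might be applied, not a confirmed training setting.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 8e9   # assumption: "8B" taken as exactly 8 billion parameters
VRAM_GIB = 24.0  # assumption: the A10G's 24 GB treated as 24 GiB

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = weight_memory_gib(N_PARAMS, bits)
    print(f"{label:>6} weights: {gib:5.1f} GiB, headroom: {VRAM_GIB - gib:5.1f} GiB")
```

Under these assumptions, 16-bit weights alone take up most of the card, while 4-bit weights leave most of it free for activations and adapter training.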
Citation
If you use this model, please cite:
@article{brillian2026flowgrpo,
title={Concentrate or Collapse: When Reinforcement Learning Meets Diffusion Language Models for Web Planning},
author={Brillian, Muhammad Enrizky},
year={2026}
}
This model was trained and evaluated on the FormFactory benchmark:
@misc{li2025formfactory,
title={FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents},
author={Bobo Li and Yuheng Wang and Hao Fei and Juncheng Li and Wei Ji and Mong-Li Lee and Wynne Hsu},
year={2025},
eprint={2506.01520},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.01520}
}