---
license: apache-2.0
tags:
- multimodal
- discrete-flow-matching
- unified-model
---

## 1. Introduction
The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing **FUDOKI**, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.

<table>
  <tr>
    <td width="25%"><img src="teaser.png" alt="image"></td>
    <td width="30%"><img src="understanding.gif" alt="image"></td>
    <td width="29%"><img src="generation.gif" alt="image"></td>
  </tr>
</table>

Paper: [FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities](https://arxiv.org/abs/2505.20147)
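
To give a rough intuition for how a metric-induced probability path differs from masking-based corruption, the toy sketch below assumes the path takes a softmax-of-negative-distance form over token embeddings, controlled by a sharpness parameter `beta`. The function name `metric_path`, the embedding values, the squared Euclidean metric, and the `beta` values are all hypothetical choices for illustration; they are not taken from the FUDOKI codebase, and the model's actual metric and schedule may differ.

```python
import math

def metric_path(target, embeddings, beta):
    """Toy distribution over tokens: softmax(-beta * squared distance to the target).

    Assumption: squared Euclidean distance between embeddings plays the role
    of the metric. beta = 0 gives a uniform (fully corrupted) distribution;
    a large beta concentrates nearly all mass on the target token.
    """
    tgt = embeddings[target]
    dists = [sum((a - b) ** 2 for a, b in zip(e, tgt)) for e in embeddings]
    logits = [-beta * d for d in dists]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical 4-token vocabulary with 2-D embeddings.
EMB = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (3.0, 3.0)]

uniform = metric_path(target=0, embeddings=EMB, beta=0.0)  # pure noise
mid = metric_path(target=0, embeddings=EMB, beta=2.0)      # partially denoised
sharp = metric_path(target=0, embeddings=EMB, beta=1e3)    # nearly clean
```

In this sketch, an intermediate `beta` puts more mass on tokens whose embeddings are close to the target than on distant ones, so a partially corrupted token still carries information about the clean one. A mask token, by contrast, carries none, which is one way to read the abstract's claim about iterative refinement with self-correction.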


## 2. Quick Start

Please refer to the [**GitHub Repository**](https://github.com/fudoki-hku/FUDOKI).

## 3. Citation

```bibtex
@article{wang2025fudokidiscreteflowbasedunified,
    title={FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities}, 
    author={Jin Wang and Yao Lai and Aoxue Li and Shifeng Zhang and Jiacheng Sun and Ning Kang and Chengyue Wu and Zhenguo Li and Ping Luo},
    year={2025},
    eprint={2505.20147},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2505.20147}
}
```

## 4. Contact

**Point of Contact:** [Jin Wang](mailto:wj0529@connect.hku.hk)