---
datasets:
- CSU-JPG/VisPrompt5M
- CSU-JPG/VPBench
language:
- en
license: apache-2.0
pipeline_tag: image-to-image
tags:
- flow-matching
- image-generation
- image-editing
- vision-centric
---

<div align="center">
  <h2 align="center" style="margin-top: 0; margin-bottom: 15px;">
    <span style="color:#0052CC">F</span><span style="color:#135FD0">l</span><span style="color:#266CD4">o</span><span style="color:#3979D7">w</span><span style="color:#4C86DB">I</span><span style="color:#6093DF">n</span><span style="color:#73A0E3">O</span><span style="color:#86ADE7">n</span><span style="color:#99BAEB">e</span>: Unifying Multimodal Generation as 
    <span style="color:#0052CC">I</span><span style="color:#0958CE">m</span><span style="color:#125ED0">a</span><span style="color:#1B64D2">g</span><span style="color:#246AD4">e</span><span style="color:#2D70D6">-</span><span style="color:#3676D8">i</span><span style="color:#3F7CDA">n</span><span style="color:#4882DC">,</span>&nbsp;<span style="color:#5188DE">I</span><span style="color:#5A8EE0">m</span><span style="color:#6394E2">a</span><span style="color:#6C9AE4">g</span><span style="color:#75A0E6">e</span><span style="color:#7EA6E8">-</span><span style="color:#87ACEA">o</span><span style="color:#90B2EC">u</span><span style="color:#99B8EE">t</span> Flow Matching
  </h2>
  <p align="center" style="font-size: 15px;">
    <span style="color:#E74C3C; font-weight: bold;">TL;DR:</span> <strong>The first vision-centric image-in, image-out image generation model.</strong>
  </p>
  <p align="center" style="font-size: 16px;">
    <a href="https://csu-jpg.github.io/FlowInOne.github.io/" style="text-decoration: none;">🌐 Homepage</a> | 
    <a href="https://github.com/CSU-JPG/FlowInOne" style="text-decoration: none;">💻 Code</a> | 
    <a href="https://huggingface.co/papers/2604.06757" style="text-decoration: none;">📄 Paper</a> | 
    <a href="https://huggingface.co/datasets/CSU-JPG/VisPrompt5M" style="text-decoration: none;">📁 Dataset</a> | 
    <a href="https://huggingface.co/datasets/CSU-JPG/VPBench" style="text-decoration: none;">🌏 Benchmark</a> | 
    <a href="https://huggingface.co/CSU-JPG/FlowInOne" style="text-decoration: none;">🤗 Model</a>
</p>
</div>

## Authors
Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li, Qisheng Su, Zhengyuan Yang, Lijuan Wang, Xiaofeng Zhu, Alex Jinpeng Wang.

## About
FlowInOne is a framework that reformulates multimodal generation as a **purely visual flow**, converting all inputs into visual prompts and enabling a clean **image-in, image-out** pipeline governed by a single flow matching model. 

This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, **unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm**.

## 🚀 Setup

```bash
# Create conda environment
conda create -n flowinone python=3.10 -y
conda activate flowinone

# Install required packages
git clone https://github.com/CSU-JPG/FlowInOne.git
cd FlowInOne/scripts
sh setup.sh
```

## ✨ Usage

### 1. Download Weights
Download the model weights and the model preparation files with the following commands:
```bash
# model weights (create the target directory first; wget -O does not create it)
mkdir -p checkpoints
wget -O checkpoints/flowinone_256px.pth https://huggingface.co/CSU-JPG/FlowInOne/resolve/main/flowinone_256px.pth

# model preparation
wget https://huggingface.co/CSU-JPG/FlowInOne/resolve/main/preparation.tar.gz
tar -xzvf "preparation.tar.gz"
```
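
If you re-run the download step often, a small wrapper can skip files that are already present. The sketch below is a hypothetical helper, not part of the FlowInOne repository: the function names `hf_url` and `fetch` are made up for illustration, and the only facts it relies on are the `resolve/main` URL pattern and the file names shown above. It treats any non-empty file as complete, so it will not detect a partial download.

```bash
# Hypothetical helpers (not part of the FlowInOne repository).

# Build the Hugging Face "resolve" URL for a file in the CSU-JPG/FlowInOne model repo.
hf_url() {
  # $1 = file name inside the model repo
  printf 'https://huggingface.co/CSU-JPG/FlowInOne/resolve/main/%s\n' "$1"
}

# Download a file unless a non-empty copy already exists at the destination.
fetch() {
  # $1 = file name in the repo, $2 = local destination path
  mkdir -p "$(dirname "$2")"
  if [ ! -s "$2" ]; then
    wget -O "$2" "$(hf_url "$1")"
  fi
}

# Example usage (performs network downloads, so it is left commented out):
# fetch flowinone_256px.pth checkpoints/flowinone_256px.pth
# fetch preparation.tar.gz preparation.tar.gz && tar -xzvf preparation.tar.gz
```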

### 2. Inference
Run inference with the script provided in the repository (the `scripts/inference.sh` path is relative to the repository root):
```bash
sh scripts/inference.sh
```
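
Before launching inference, it can help to confirm that the checkpoint from step 1 is in place. The following is a minimal sketch of such a pre-flight check. The `check_weights` function is hypothetical, and it assumes the inference script reads the weights from `checkpoints/flowinone_256px.pth`; the script itself may be configured to look elsewhere.

```bash
# Hypothetical helper: succeed only if the given file exists and is non-empty.
check_weights() {
  # $1 = path to the checkpoint file
  if [ -s "$1" ]; then
    echo "found weights: $1"
  else
    echo "missing or empty weights file: $1" >&2
    return 1
  fi
}

# Example usage (run from the repository root; left commented out):
# check_weights checkpoints/flowinone_256px.pth && sh scripts/inference.sh
```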

Our training and inference scripts are fully available on [GitHub](https://github.com/CSU-JPG/FlowInOne).

## Citation

If you find our work useful, please consider citing:
```bibtex
@article{yi2026flowinoneunifyingmultimodalgenerationimagein,
      title={FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching},
      author={Junchao Yi and Rui Zhao and Jiahao Tang and Weixian Lei and Linjie Li and Qisheng Su and Zhengyuan Yang and Lijuan Wang and Xiaofeng Zhu and Alex Jinpeng Wang},
      journal={arXiv preprint arXiv:2604.06757},
      year={2026}
}
```