---
datasets:
- CSU-JPG/VisPrompt5M
- CSU-JPG/VPBench
language:
- en
license: apache-2.0
pipeline_tag: image-to-image
tags:
- flow-matching
- image-generation
- image-editing
- vision-centric
---
<div align="center">
<h2 align="center" style="margin-top: 0; margin-bottom: 15px;">
<span style="color:#0052CC">F</span><span style="color:#135FD0">l</span><span style="color:#266CD4">o</span><span style="color:#3979D7">w</span><span style="color:#4C86DB">I</span><span style="color:#6093DF">n</span><span style="color:#73A0E3">O</span><span style="color:#86ADE7">n</span><span style="color:#99BAEB">e</span>: Unifying Multimodal Generation as
<span style="color:#0052CC">I</span><span style="color:#0958CE">m</span><span style="color:#125ED0">a</span><span style="color:#1B64D2">g</span><span style="color:#246AD4">e</span><span style="color:#2D70D6">-</span><span style="color:#3676D8">i</span><span style="color:#3F7CDA">n</span><span style="color:#4882DC">,</span> <span style="color:#5188DE">I</span><span style="color:#5A8EE0">m</span><span style="color:#6394E2">a</span><span style="color:#6C9AE4">g</span><span style="color:#75A0E6">e</span><span style="color:#7EA6E8">-</span><span style="color:#87ACEA">o</span><span style="color:#90B2EC">u</span><span style="color:#99B8EE">t</span> Flow Matching
</h2>
<p align="center" style="font-size: 15px;">
<span style="color:#E74C3C; font-weight: bold;">TL;DR:</span> <strong>The first vision-centric image-in, image-out image generation model.</strong>
</p>
<p align="center" style="font-size: 16px;">
<a href="https://csu-jpg.github.io/FlowInOne.github.io/" style="text-decoration: none;">Homepage</a> |
<a href="https://github.com/CSU-JPG/FlowInOne" style="text-decoration: none;">Code</a> |
<a href="https://huggingface.co/papers/2604.06757" style="text-decoration: none;">Paper</a> |
<a href="https://huggingface.co/datasets/CSU-JPG/VisPrompt5M" style="text-decoration: none;">Dataset</a> |
<a href="https://huggingface.co/datasets/CSU-JPG/VPBench" style="text-decoration: none;">Benchmark</a> |
<a href="https://huggingface.co/CSU-JPG/FlowInOne" style="text-decoration: none;">Model</a>
</p>
</div>
## Authors
Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li, Qisheng Su, Zhengyuan Yang, Lijuan Wang, Xiaofeng Zhu, Alex Jinpeng Wang.
## About
FlowInOne is a framework that reformulates multimodal generation as a **purely visual flow**, converting all inputs into visual prompts and enabling a clean **image-in, image-out** pipeline governed by a single flow matching model.
This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, **unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm**.
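For readers unfamiliar with flow matching, the standard linear-path objective is sketched below. This is the generic textbook formulation, not one taken from this card: the symbols $x_0$, $x_1$, and $v_\theta$, and the reading of $x_0$ as the encoded visual prompt and $x_1$ as the encoded target image, are assumptions made for illustration. See the paper for the exact objective FlowInOne uses.

```latex
% Generic linear-path flow matching (assumed notation, not from this card).
% x_0: source sample (assumed here: encoded visual prompt)
% x_1: target sample (assumed here: encoded output image)
x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1]

% The network v_\theta regresses the constant velocity of the straight path.
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}
  \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|_2^2

% Sampling integrates the learned velocity field from t = 0 to t = 1.
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t)
```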
## Setup
```bash
# Create conda environment
conda create -n flowinone python=3.10 -y
conda activate flowinone
# Install required packages
git clone https://github.com/CSU-JPG/FlowInOne.git
cd FlowInOne/scripts
sh setup.sh
```
## Usage
### 1. Download Weights
You can download the model weights and model preparation files using the following commands:
```bash
# model weights (create the target directory first; wget -O does not create it)
mkdir -p checkpoints
wget -O checkpoints/flowinone_256px.pth https://huggingface.co/CSU-JPG/FlowInOne/resolve/main/flowinone_256px.pth
# model preparation
wget https://huggingface.co/CSU-JPG/FlowInOne/resolve/main/preparation.tar.gz
tar -xzvf "preparation.tar.gz"
```
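Before running inference, it can help to confirm that both downloads completed. The following is a minimal sketch and not part of the repository: the `check_file` helper is hypothetical, and the paths simply mirror the `wget` commands above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the FlowInOne repository):
# succeeds only if the given path exists and is non-empty.
check_file() {
    if [ -s "$1" ]; then
        echo "ok: $1"
    else
        echo "missing or empty: $1" >&2
        return 1
    fi
}

# Paths assumed from the download commands above.
check_file checkpoints/flowinone_256px.pth || echo "re-download the model weights"
check_file preparation.tar.gz || echo "re-download the preparation archive"
```

An interrupted `wget` can leave a zero-byte or truncated file behind; the `-s` test catches the empty case, but it does not verify integrity of a partial download.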
### 2. Inference
Run inference with the provided script in the repository:
```bash
sh scripts/inference.sh
```
Our training and inference scripts are fully available on [GitHub](https://github.com/CSU-JPG/FlowInOne).
## Citation
If you find our work useful, please consider citing:
```bibtex
@article{yi2026flowinoneunifyingmultimodalgenerationimagein,
title={FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching},
author={Junchao Yi and Rui Zhao and Jiahao Tang and Weixian Lei and Linjie Li and Qisheng Su and Zhengyuan Yang and Lijuan Wang and Xiaofeng Zhu and Alex Jinpeng Wang},
journal={arXiv preprint arXiv:2604.06757},
year={2026}
}
```