license: apache-2.0
datasets:
- CSU-JPG/VisPrompt5M
language:
- en
metrics:
- code_eval
pipeline_tag: image-to-image
FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching
TL;DR: The first vision-centric image-in, image-out image generation model.
Homepage | Code | Paper | Dataset | Benchmark | Model
About
We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow: all inputs are converted into visual prompts, giving a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, and it unifies text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems. It establishes a new foundation for fully vision-centric generative modeling, in which perception and creation coexist within a single continuous visual space.
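To make the image-in, image-out idea concrete, here is a toy sketch of sampling along a flow with fixed-step Euler integration. It is not FlowInOne's implementation. It assumes a straight-line path between a source and a target, uses short lists of floats in place of image latents, and replaces the trained network with a hypothetical `oracle_velocity` function. The names `euler_sample`, `source`, and `target` are illustrative only.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    x0: Vector,
    velocity: Callable[[Vector, float], Vector],
    steps: int = 8,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        # One Euler step: move along the predicted velocity for dt.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-ins for a flattened input image (visual prompt) and output image.
source: Vector = [0.0, 0.5, 1.0, 0.25]
target: Vector = [1.0, 0.0, 0.5, 0.75]


def oracle_velocity(x: Vector, t: float) -> Vector:
    # ASSUMPTION: straight-line path x_t = (1 - t) * source + t * target,
    # whose velocity is the constant (target - source). A trained model
    # would predict this from x and t instead of reading the target.
    return [b - a for a, b in zip(source, target)]


result = euler_sample(source, oracle_velocity, steps=8)
```

With the oracle velocity, integration starts at the source and ends at the target, so the sampler needs no separate noise input in this toy setting. In practice the velocity would come from a learned model, and the number of steps trades speed against accuracy.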
Usage
Our training and inference scripts are now available on GitHub!