---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: any-to-any
library_name: bagel-mot
---
# BAGEL: Unified Model for Multimodal Understanding and Generation
We present BAGEL, an open-source multimodal foundation model with 7B active parameters (14B total), trained on large-scale interleaved multimodal data.
BAGEL outperforms leading open-source VLMs such as Qwen2.5-VL and InternVL-2.5 on standard benchmarks, and delivers text-to-image quality competitive with specialist generators such as SD3.
It supports:
- Free-form visual manipulation
- Multiview synthesis
- World navigation
- Advanced image editing beyond traditional models
## Installation & Usage

Please refer to our GitHub repository for:
- Setup instructions
- Example scripts
- Demo usage
## Method
BAGEL uses a Mixture-of-Transformer-Experts (MoT) architecture with:
- Dual encoders: capturing pixel-level and semantic-level features
- Training objective: Next Group of Token Prediction
- Vision token compression via FLUX.1 VAE
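The core routing idea behind a Mixture-of-Transformer-Experts layer can be sketched in a few lines: each token in the interleaved sequence is dispatched to a modality-specific expert, while the sequence order (and, in the real model, the shared attention context) is preserved. The snippet below is a minimal illustrative toy under that assumption; the function and expert names are hypothetical and are not BAGEL's actual implementation.

```python
# Toy sketch of Mixture-of-Transformer-Experts (MoT) routing.
# Each token carries a modality tag; a per-modality "expert" transforms it,
# while the interleaved token order is kept intact.
# All names here are illustrative stand-ins, not BAGEL's real code.

def text_expert(x: float) -> float:
    # Stand-in for the understanding expert's feed-forward block.
    return x + 1.0

def vision_expert(x: float) -> float:
    # Stand-in for the generation expert's feed-forward block.
    return x * 2.0

EXPERTS = {"text": text_expert, "vision": vision_expert}

def mot_layer(tokens):
    """Route each (modality, value) token to its expert, preserving order."""
    return [(mod, EXPERTS[mod](val)) for mod, val in tokens]

interleaved = [("text", 1.0), ("vision", 3.0), ("text", 2.0)]
print(mot_layer(interleaved))  # [('text', 2.0), ('vision', 6.0), ('text', 3.0)]
```

In the actual architecture the experts are full transformer parameter sets (attention and FFN weights) rather than scalar functions, and all tokens still attend to one another within a shared sequence.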
## Emerging Properties
Capabilities emerge in sequence as pretraining scales:
- Multimodal understanding
- Generation
- Basic image editing
- Advanced multimodal reasoning and 3D/world modeling
## Benchmarks

### Visual Understanding

| Model | MME ↑ | MMBench ↑ | MMMU ↑ | MM-Vet ↑ | MathVista ↑ |
|---|---|---|---|---|---|
| Janus-Pro-7B | – | 79.2 | 41.0 | 50.0 | – |
| Qwen2.5-VL-7B | 2347 | 83.5 | 58.6 | 67.1 | 68.2 |
| BAGEL | 2388 | 85.0 | 55.3 | 67.2 | 73.1 |
### Text-to-Image Generation (GenEval)

| Model | Overall ↑ |
|---|---|
| FLUX-1-dev | 0.82 |
| SD3-Medium | 0.74 |
| Janus-Pro-7B | 0.80 |
| BAGEL | 0.88 |
### Image Editing

| Model | GEdit-Bench-EN Semantic Consistency ↑ | GEdit-Bench-EN Perceptual Quality ↑ | GEdit-Bench-EN Overall ↑ | IntelligentBench ↑ |
|---|---|---|---|---|
| Step1X-Edit | 7.09 | 6.76 | 6.70 | 14.9 |
| Gemini-2-exp. | 6.73 | 6.61 | 6.32 | 57.6 |
| BAGEL | 7.36 | 6.83 | 6.52 | 44.0 |
| BAGEL+CoT | – | – | – | 55.3 |
## License
BAGEL is licensed under the Apache 2.0 License.
Finetuned from: Qwen/Qwen2.5-7B-Instruct
## Citation
```bibtex
@article{deng2025bagel,
  title   = {Emerging Properties in Unified Multimodal Pretraining},
  author  = {Deng, Chaorui and Zhu, Deyao and Li, Kunchang and Gou, Chenhui and Li, Feng and Wang, Zeyu and Zhong, Shu and Yu, Weihao and Nie, Xiaonan and Song, Ziang and Shi, Guang and Fan, Haoqi},
  journal = {arXiv preprint arXiv:2505.14683},
  year    = {2025}
}
```