---
license: apache-2.0
base_model:
  - Qwen/Qwen2.5-7B-Instruct
pipeline_tag: any-to-any
library_name: bagel-mot
---

BAGEL

🥯 BAGEL: Unified Model for Multimodal Understanding and Generation


We present BAGEL, an open-source multimodal foundation model with 7B active parameters (14B total) trained on large-scale interleaved multimodal data.

BAGEL outperforms leading open-source VLMs like Qwen2.5-VL and InternVL-2.5 on standard benchmarks and delivers text-to-image quality competitive with specialist generators such as SD3.

It supports:

  • Free-form visual manipulation
  • Multiview synthesis
  • World navigation
  • Advanced image editing beyond traditional models

🔧 Installation & Usage

Please refer to our GitHub Repository for:

  • Setup instructions
  • Example scripts
  • Demo usage


🧠 Method

BAGEL uses a Mixture-of-Transformer-Experts (MoT) architecture with:

  • Dual encoders: capturing pixel-level and semantic-level features
  • Training objective: Next Group of Token Prediction
  • Vision token compression via FLUX.1 VAE
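
The MoT routing and VAE-based token compression above can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than BAGEL's actual implementation: the expert functions are toy stand-ins, and the compression factors assume a FLUX-style 8× spatial VAE downsampling followed by 2×2 patching.

```python
# Illustrative MoT sketch: all tokens would share self-attention (elided
# here), but each token's feed-forward pass is routed to a
# modality-specific expert. Expert bodies are hypothetical stand-ins.

def ffn_understanding(feats):
    # hypothetical expert for semantic-level (understanding) tokens
    return [v * 2.0 for v in feats]

def ffn_generation(feats):
    # hypothetical expert for pixel-level (generation) tokens
    return [v + 1.0 for v in feats]

def mot_layer(tokens):
    """Route each (modality, features) token to its expert."""
    experts = {"und": ffn_understanding, "gen": ffn_generation}
    return [(m, experts[m](f)) for m, f in tokens]

def num_vision_tokens(h, w, vae_down=8, patch=2):
    """Assumed compression: 8x spatial VAE downsampling, then 2x2 patching."""
    return (h // vae_down // patch) * (w // vae_down // patch)

print(mot_layer([("und", [1.0, 2.0]), ("gen", [3.0, 4.0])]))
print(num_vision_tokens(256, 256))  # a 256x256 image -> 256 vision tokens
```

The routing table (rather than a learned router) reflects that in a MoT design each token's expert is fixed by its modality, not chosen dynamically per token as in a standard MoE.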


🌱 Emerging Properties

Capabilities emerge in sequence as pretraining scales:

  • Multimodal understanding
  • Generation
  • Basic image editing
  • Advanced multimodal reasoning and 3D/world modeling

📊 Benchmarks

๐Ÿ–ผ๏ธ Visual Understanding

| Model | MME ↑ | MMBench ↑ | MMMU ↑ | MM-Vet ↑ | MathVista ↑ |
| --- | --- | --- | --- | --- | --- |
| Janus-Pro-7B | – | 79.2 | 41.0 | 50.0 | – |
| Qwen2.5-VL-7B | 2347 | 83.5 | 58.6 | 67.1 | 68.2 |
| BAGEL | 2388 | 85.0 | 55.3 | 67.2 | 73.1 |

๐Ÿ–Œ๏ธ Text-to-Image Generation (GenEval)

| Model | Overall ↑ |
| --- | --- |
| FLUX-1-dev | 0.82 |
| SD3-Medium | 0.74 |
| Janus-Pro-7B | 0.80 |
| BAGEL | 0.88 |

🪄 Image Editing

| Model | GEdit-Bench-EN (SC) ↑ | GEdit-Bench-EN (PQ) ↑ | GEdit-Bench-EN (O) ↑ | IntelligentBench ↑ |
| --- | --- | --- | --- | --- |
| Step1X-Edit | 7.09 | 6.76 | 6.70 | 14.9 |
| Gemini-2-exp. | 6.73 | 6.61 | 6.32 | 57.6 |
| BAGEL | 7.36 | 6.83 | 6.52 | 44.0 |
| BAGEL+CoT | – | – | – | 55.3 |

โš–๏ธ License

BAGEL is licensed under the Apache 2.0 License.

Finetuned from: Qwen/Qwen2.5-7B-Instruct

📚 Citation

@article{deng2025bagel,
  title   = {Emerging Properties in Unified Multimodal Pretraining},
  author  = {Deng, Chaorui and Zhu, Deyao and Li, Kunchang and Gou, Chenhui and Li, Feng and Wang, Zeyu and Zhong, Shu and Yu, Weihao and Nie, Xiaonan and Song, Ziang and Shi, Guang and Fan, Haoqi},
  journal = {arXiv preprint arXiv:2505.14683},
  year    = {2025}
}