Improve model card for Bifrost-1
#2
by nielsr - opened

README.md CHANGED
---
pipeline_tag: text-to-image
library_name: transformers
license: apache-2.0
---

# Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

This repository contains the pretrained checkpoints for **Bifrost-1**, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables. Bifrost-1 enables high-fidelity, controllable image generation at substantially reduced training cost, without compromising the strong reasoning capabilities of MLLMs.

**Paper**: [Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents](https://huggingface.co/papers/2508.05954)

**Project Page**: [https://bifrost-1.github.io](https://bifrost-1.github.io)

**GitHub Repository**: [https://github.com/hanlincs/Bifrost-1](https://github.com/hanlincs/Bifrost-1)

Bifrost-1 is designed for:
- **High-Fidelity Generation**: Patch-level CLIP latents are natively aligned with the MLLM's visual encoder, enabling high-quality image generation.
- **Training Efficiency**: Achieves better image generation quality than architecture variants built on non-MLLM-aligned visual features, under controlled experimental settings and with substantially lower training compute.
- **Preserved Visual Reasoning**: Fully inherits the strong visual understanding capabilities of the backbone MLLM by equipping it with a visual generation branch initialized from the original MLLM parameters.

<br>
<img width="800" src="teaser.png"/>
<br>
<img width="800" src="https://github.com/hanlincs/Bifrost-1/raw/main/assets/bifrost_model_architecture.png"/>
<br>
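The division of labor described above can be sketched as a minimal mock of the interface: the MLLM's generation branch predicts patch-level CLIP latents, and a diffusion decoder renders pixels conditioned on them. Everything here is a hypothetical placeholder for illustration only — the function names, the shapes `(729, 1152)`, the random "MLLM", and the toy decoder are not the real Bifrost-1 components.

```python
import numpy as np

# Hypothetical shapes for illustration only (not the real Bifrost-1 config):
# an image is represented as a grid of patch-level CLIP embeddings.
NUM_PATCHES, CLIP_DIM = 729, 1152


def mllm_predict_clip_latents(prompt: str) -> np.ndarray:
    """Stand-in for the MLLM generation branch: maps a text prompt to
    patch-level CLIP latents. The real model predicts these with its
    visual generation branch; here we just draw seeded random values."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal((NUM_PATCHES, CLIP_DIM)).astype(np.float32)


def diffusion_decode(clip_latents: np.ndarray) -> np.ndarray:
    """Stand-in for the diffusion decoder: renders an image conditioned on
    the patch-level CLIP latents (here, a trivial per-patch reduction)."""
    side = int(NUM_PATCHES ** 0.5)          # 27x27 patch grid
    patch_means = clip_latents.mean(axis=1)  # one scalar per patch
    return patch_means.reshape(side, side)


latents = mllm_predict_clip_latents("a red cube on a blue sphere")
image = diffusion_decode(latents)
assert image.shape == (27, 27)
```

The point of the sketch is the interface, not the math: the latents live in the CLIP embedding space the MLLM already understands, so the two halves can be trained and swapped independently.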

## Environment Setup

```shell
conda create -n bifrost1 python=3.11
conda activate bifrost1
pip install -r requirements.txt
```
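Since the environment above targets Python 3.11, a tiny interpreter guard at the top of a script can fail faster and more clearly than a late import error. This is just an optional convenience sketch, not part of the official setup:

```python
import sys


def python_at_least(major: int, minor: int) -> bool:
    """Return True if the running interpreter is at least major.minor."""
    return sys.version_info[:2] >= (major, minor)


if not python_at_least(3, 11):
    # Warn rather than exit, so the scripts still run on nearby versions.
    print("warning: the Bifrost-1 environment targets Python 3.11")
```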

## Inference

### Model Checkpoints

The model checkpoint can be downloaded from the Hugging Face Hub [here](https://huggingface.co/hanlincs/Bifrost-1).

You can download it to your specified `local_dir` with the following code:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hanlincs/Bifrost-1",
    repo_type="model",
    local_dir="xxxxxxxx",  # Replace with your local directory path
)
```

### Run Inference Scripts

Generate images from GenEval prompts:

```bash
python inference_geneval_dpgbench.py --eval_geneval --output_dir "./outputs" --local_checkpoint_path XXXXX  # Replace XXXXX with your local checkpoint path
```
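For the exact flags and prompt format of `inference_geneval_dpgbench.py`, see the GitHub repository. As a rough sketch of what such an evaluation loop does — read one prompt per line, generate an image, save it under the output directory — here is a generic, self-contained version in which `generate_image` is a placeholder for the real Bifrost-1 pipeline and the JSONL prompt format is an assumption:

```python
import json
import tempfile
from pathlib import Path


def generate_image(prompt: str) -> bytes:
    # Placeholder: the real pipeline would return encoded PNG bytes.
    return b"PNG-bytes-for: " + prompt.encode()


def run_eval(prompt_file: Path, output_dir: Path) -> int:
    """Read one JSON object per line ({"prompt": ...}), generate an image
    per prompt, and save it under output_dir. Returns the image count."""
    output_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(prompt_file, encoding="utf-8") as f:
        for i, line in enumerate(f):
            prompt = json.loads(line)["prompt"]
            (output_dir / f"{i:05d}.png").write_bytes(generate_image(prompt))
            count += 1
    return count


# Tiny self-contained demo with two prompts.
with tempfile.TemporaryDirectory() as tmp:
    prompts = Path(tmp) / "prompts.jsonl"
    prompts.write_text('{"prompt": "a red cube"}\n{"prompt": "two dogs"}\n')
    print(run_eval(prompts, Path(tmp) / "outputs"))
```

Numbering the output files by prompt index, as above, is the usual convention for GenEval-style benchmarks, where generated images are scored against their source prompts afterwards.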

## BibTeX

Please let us know in the issues or PRs if you have any questions. If you find our project useful in your research or application development, citing our paper would be the best support for us!

```bibtex
@misc{lin2025bifrost1bridgingmultimodalllms,
  title={Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents},
  author={Han Lin and Jaemin Cho and Amir Zadeh and Chuan Li and Mohit Bansal},
  year={2025},
  eprint={2508.05954},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.05954},
}
```

## Acknowledgements

The development of Bifrost-1 has been greatly inspired by the following amazing works and teams:

- [BLIP3o](https://github.com/JiuhaiChen/BLIP3o)
- [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
- [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev)

We hope that releasing this model and codebase helps the community continue pushing these creative tools forward in an open and responsible way.