Add library name and link to project page

#1 opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +130 -3
README.md CHANGED
@@ -1,7 +1,134 @@
---
base_model:
- stabilityai/stable-diffusion-xl-base-1.0
language:
- en
pipeline_tag: image-to-image
library_name: diffusers
---

# Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment

This repository contains the LoRA weights for the Hummingbird model, presented in [Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment](https://huggingface.co/papers/2502.05153).
Hummingbird generates high-quality, diverse images from a multimodal context, preserving scene attributes and object interactions from both a reference image and text guidance.

[Project page](https://roar-ai.github.io/hummingbird) | [Paper](https://openreview.net/forum?id=6kPBThI6ZJ)

### Official implementation of the paper: [Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment](https://openreview.net/pdf?id=6kPBThI6ZJ)

![image/png](/assets/teaser_comparison_v1.png)

## Prerequisites

### Installation

1. Clone this repository and navigate to the `hummingbird-1` folder:
```
git clone https://github.com/roar-ai/hummingbird-1
cd hummingbird-1
```

2. Create a `conda` virtual environment with Python 3.9; PyTorch 2.0+ is recommended:
```
conda create -n hummingbird python=3.9
conda activate hummingbird
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```

3. Install additional packages for faster training and inference:
```
pip install flash-attn --no-build-isolation
```
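After installation, it can help to confirm the environment resolved correctly before downloading weights. A minimal sketch (the `check_modules` helper is illustrative, not part of this repo; `flash_attn` is the import name of the flash-attn package):

```python
import importlib

def check_modules(names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            mod = importlib.import_module(name)
            versions[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    for name, ver in check_modules(["torch", "torchvision", "flash_attn"]).items():
        print(f"{name}: {ver or 'NOT installed'}")
```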

### Download necessary models

1. Clone our Hummingbird LoRA weights for the UNet denoiser:
```
git clone https://huggingface.co/lmquan/hummingbird
```

2. Refer to [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main) to download the SDXL pre-trained model and place it in the Hummingbird weight directory as `./hummingbird/stable-diffusion-xl-base-1.0`.

3. Download [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k/tree/main) for the `feature extractor` and `image encoder` in the Hummingbird framework:
```
cp -r CLIP-ViT-bigG-14-laion2B-39B-b160k ./hummingbird/stable-diffusion-xl-base-1.0/image_encoder

mv CLIP-ViT-bigG-14-laion2B-39B-b160k ./hummingbird/stable-diffusion-xl-base-1.0/feature_extractor
```

4. Replace the `model_index.json` of the pre-trained `stable-diffusion-xl-base-1.0` with our customized version for the Hummingbird framework:
```
cp -r ./hummingbird/model_index.json ./hummingbird/stable-diffusion-xl-base-1.0/
```

5. Download the [HPSv2 weights](https://drive.google.com/file/d/1T4e6WqsS5lcs92HdmzQYonrfDH1Ub53T/view?usp=sharing) and put them at `hpsv2/HPS_v2_compressed.pt`.

6. Download the [PickScore model weights](https://drive.google.com/file/d/1UhR0zFXiEI-spt2QdX67FY9a0dcqa9xy/view?usp=sharing) and put them at `pickscore/pickmodel/model.safetensors`.

### Double-check that everything is in place
```
|-- hummingbird-1/
    |-- hpsv2
        |-- HPS_v2_compressed.pt
    |-- pickscore
        |-- pickmodel
            |-- config.json
            |-- model.safetensors
    |-- hummingbird
        |-- model_index.json
        |-- lora_unet_65000
            |-- adapter_config.json
            |-- adapter_model.safetensors
        |-- stable-diffusion-xl-base-1.0
            |-- model_index.json (replaced by our customized version, see step 4 above)
            |-- feature_extractor (cloned from CLIP-ViT-bigG-14-laion2B-39B-b160k)
            |-- image_encoder (cloned from CLIP-ViT-bigG-14-laion2B-39B-b160k)
            |-- text_encoder
            |-- text_encoder_2
            |-- tokenizer
            |-- tokenizer_2
            |-- unet
            |-- vae
            |-- ...
    |-- ...
```
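The layout above can also be verified programmatically. A minimal sketch (the `missing_files` helper and the exact path list are illustrative, not part of the repo):

```python
from pathlib import Path

# Relative paths that the download steps above are expected to produce.
EXPECTED = [
    "hpsv2/HPS_v2_compressed.pt",
    "pickscore/pickmodel/config.json",
    "pickscore/pickmodel/model.safetensors",
    "hummingbird/model_index.json",
    "hummingbird/lora_unet_65000/adapter_config.json",
    "hummingbird/lora_unet_65000/adapter_model.safetensors",
    "hummingbird/stable-diffusion-xl-base-1.0/model_index.json",
]

def missing_files(root="."):
    """Return the expected paths that are not present under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]

if __name__ == "__main__":
    missing = missing_files()
    print("All set!" if not missing else f"Missing: {missing}")
```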

## Quick Start

Given a reference image, Hummingbird can generate diverse variants of it while preserving specific properties/attributes, for example:
```
python3 inference.py --reference_image ./examples/image-2.jpg --attribute "color of skateboard wheels" --output_path output.jpg
```
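Since this PR tags the repository with `library_name: diffusers`, the weights can in principle also be loaded programmatically. A hedged sketch, assuming the local paths from the setup above and that `load_lora_weights` can read this adapter directory; the repo's `inference.py` remains the authoritative loading path:

```python
def load_hummingbird_pipeline(
    base_dir="./hummingbird/stable-diffusion-xl-base-1.0",
    lora_dir="./hummingbird/lora_unet_65000",
    device="cuda",
):
    """Load the SDXL base pipeline and attach the Hummingbird LoRA weights.

    Imports are deferred so this file can be inspected without torch installed.
    """
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(base_dir, torch_dtype=torch.float16)
    # Assumption: the adapter directory is in a format `load_lora_weights` accepts.
    pipe.load_lora_weights(lora_dir)
    return pipe.to(device)
```

Note that the multimodal context conditioning relies on the customized `model_index.json` from step 4; for results faithful to the paper, prefer `inference.py`.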

## Training
You can train Hummingbird with the following script:
```
sh run_hummingbird.sh
```

## Synthetic Data Generation
You can generate synthetic data with the Hummingbird framework, e.g., with the MME Perception dataset:

```
python3 image_generation.py --generator hummingbird --dataset mme --save_image_gen ./synthetic_mme
```

## Testing
Evaluate the fidelity of generated images w.r.t. the reference image using Test-Time Augmentation on MLLMs (LLaVA/InternVL2):
```
python3 test_hummingbird_mme.py --dataset mme --model llava --synthetic_dir ./synthetic_mme
```

## Acknowledgement
Our implementation builds on [TextCraftor](https://github.com/snap-research/textcraftor). We thank [BLIP-2 QFormer](https://github.com/salesforce/LAVIS), [HPSv2](https://github.com/tgxs002/HPSv2), [PickScore](https://github.com/yuvalkirstain/PickScore), and [Aesthetic](https://laion.ai/blog/laion-aesthetics/) for the reward models, and the MLLMs [LLaVA](https://github.com/haotian-liu/LLaVA) and [InternVL2](https://github.com/OpenGVLab/InternVL), which serve as context descriptors in our framework.

## Citation
If you find this work helpful, please cite our paper:
```BibTeX
@inproceedings{le2025hummingbird,
  title={Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment},
  author={Minh-Quan Le and Gaurav Mittal and Tianjian Meng and A S M Iftekhar and Vishwas Suryanarayanan and Barun Patra and Dimitris Samaras and Mei Chen},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=6kPBThI6ZJ}
}
```