Commit 814246b, committed by lizhaoyang
1 parent: d4bf1d3
Update README
README.md CHANGED
@@ -48,7 +48,8 @@ license: apache-2.0
 
 ## Overview
 BindWeave is a unified subject-consistent video generation framework for single- and multi-subject prompts, built on an MLLM-DiT architecture that couples a pretrained multimodal large language model with a diffusion transformer.
-It achieves cross-modal integration via entity grounding and representation alignment, leveraging the MLLM to parse complex prompts and produce subject-aware hidden states that condition the DiT for high-fidelity generation.
+It achieves cross-modal integration via entity grounding and representation alignment, leveraging the MLLM to parse complex prompts and produce subject-aware hidden states that condition the DiT for high-fidelity generation. For more details or tutorials refer to [ByteDance/BindWeave](https://github.com/bytedance/BindWeave)
+
 
 ### OpenS2V-Eval Performance
 BindWeave achieves a solid score of 57.61 on the [OpenS2V-Eval](https://huggingface.co/spaces/BestWishYsh/OpenS2V-Eval) benchmark, highlighting its robust capabilities across multiple evaluation dimensions and demonstrating competitive performance against several leading open-source and commercial systems.