Image-to-Video
Diffusers
Safetensors
English
Chinese
video generation
conversational video generation
talking human video generation
Instructions to use MeiGen-AI/MeiGen-MultiTalk with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Diffusers
How to use MeiGen-AI/MeiGen-MultiTalk with Diffusers:
pip install -U diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image, export_to_video

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("MeiGen-AI/MeiGen-MultiTalk", dtype=torch.bfloat16, device_map="cuda")
pipe.to("cuda")

prompt = "A man with short gray hair plays a red electric guitar."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
)

output = pipe(image=image, prompt=prompt).frames[0]
export_to_video(output, "output.mp4")
- Notebooks
- Google Colab
- Kaggle
Update README.md
add method description
README.md
CHANGED
@@ -11,7 +11,7 @@ pipeline_tag: image-to-video
 ---
 
 <p align="center">
-<img src="assets/logo2.jpeg" alt="MultiTalk" width="
+<img src="assets/logo2.jpeg" alt="MultiTalk" width="300"/>
 </p>
 
 # MeiGen-MultiTalk • Audio-Driven Multi-Person Conversational Video Generation
@@ -56,6 +56,11 @@ This repository hosts the model weights for **MultiTalk**. For installation, usa
 
 
 ## Method
+We propose a novel framework, MultiTalk, for audio-driven multi-person conversational video generation. We investigate several schemes for audio injection and introduce
+the Label Rotary Position Embedding (L-RoPE) method. By assigning identical labels to audio embeddings and video latents, it effectively activates specific regions within the audio cross-attention
+map, thereby resolving incorrect binding issues. To localize the region of the specified person, we introduce the adaptive person localization by computing the similarity
+between the features of the given region of a person in the reference image and all the features of the whole video.
+
 <p align="left"><img src="assets/pipe.png" width="80%"></p>
 
 
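The Method text added in this commit names two mechanisms: label-based rotary embeddings (L-RoPE) and adaptive person localization by feature similarity. The sketch below illustrates both ideas in plain Python and is not the MultiTalk implementation. Everything in it is an assumption made for illustration: the function names (`localize_person`, `rotate_by_label`, `label_attention_score`), the mean-pooling of the reference features, the cosine similarity measure, the fixed threshold, the rotary base of 10000, and the example label values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def localize_person(ref_region_feats, video_feats, threshold=0.5):
    """Return a 0/1 mask marking video tokens similar to the reference region.

    ref_region_feats: feature vectors from the person's region in the reference image.
    video_feats: one feature vector per video token.
    Mean-pooling and the fixed threshold are illustrative assumptions.
    """
    dim = len(ref_region_feats[0])
    count = len(ref_region_feats)
    mean_ref = [sum(f[i] for f in ref_region_feats) / count for i in range(dim)]
    return [1 if cosine(mean_ref, v) >= threshold else 0 for v in video_feats]


def rotate_by_label(vec, label, base=10000.0):
    """Apply a rotary embedding to vec, using an integer label as the position.

    vec must have even length; each consecutive pair is rotated by label * frequency.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = label * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def label_attention_score(video_vec, video_label, audio_vec, audio_label):
    """Unnormalized attention score between a label-rotated query and key."""
    q = rotate_by_label(video_vec, video_label)
    k = rotate_by_label(audio_vec, audio_label)
    return sum(a * b for a, b in zip(q, k))


# Localization: only the first video token resembles the reference region.
mask = localize_person(
    [[1.0, 0.0], [0.9, 0.1]],
    [[1.0, 0.05], [0.0, 1.0], [-1.0, 0.0]],
)

# Label binding: identical content scores higher when the labels match.
vec = [1.0, 0.0, 1.0, 0.0]
same_label = label_attention_score(vec, 2, vec, 2)
diff_label = label_attention_score(vec, 2, vec, 22)
```

In this toy setup, a video token and an audio token with the same label keep their full dot product, while mismatched labels rotate the two vectors apart and lower the score. That is the intuition behind using shared labels to bind each audio stream to the right person's region.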