Update README.md
add method description
README.md
CHANGED
@@ -11,7 +11,7 @@ pipeline_tag: image-to-video
 ---
 
 <p align="center">
-<img src="assets/logo2.jpeg" alt="MultiTalk" width="
+<img src="assets/logo2.jpeg" alt="MultiTalk" width="300"/>
 </p>
 
 # MeiGen-MultiTalk • Audio-Driven Multi-Person Conversational Video Generation
@@ -56,6 +56,11 @@ This repository hosts the model weights for **MultiTalk**. For installation, usa
 
 
 ## Method
+We propose a novel framework, MultiTalk, for audio-driven multi-person conversational video generation. We investigate several schemes for audio injection and introduce
+the Label Rotary Position Embedding (L-RoPE) method. By assigning identical labels to audio embeddings and video latents, it effectively activates specific regions within the audio cross-attention
+map, thereby resolving incorrect binding issues. To localize the region of the specified person, we introduce adaptive person localization, which computes the similarity
+between the features of the given region of a person in the reference image and the features of the whole video.
+
 <p align="left"><img src="assets/pipe.png" width="80%"></p>
 
 
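
The two mechanisms named in the added Method text, adaptive person localization and label-based rotary embedding, can be illustrated with a toy example. What follows is a minimal pure-Python sketch under stated assumptions, not the MultiTalk implementation: the function names (`localize_person`, `rotate_by_label`, `attention_score`), the four-dimensional features, and the choice to use the person index directly as the rotary label are all hypothetical. It gives each video token the label of its most similar reference region, then compares the attention score of an audio key whose label matches the video query against one whose label does not.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def localize_person(reference_features, video_features):
    """Label each video token with the index of its most similar reference person.

    Assumption: one feature vector per person region in the reference image.
    """
    labels = []
    for token in video_features:
        sims = [cosine_similarity(token, ref) for ref in reference_features]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels

def rotate_by_label(vec, label, base=10000.0):
    """Apply a standard rotary embedding, using `label` in place of a position."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = label * base ** (-i / dim)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def attention_score(query, key):
    """Unnormalized dot-product attention score."""
    return sum(q * k for q, k in zip(query, key))

# Toy features (hypothetical): two people in the reference image, three video tokens.
reference_features = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
video_features = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.0, 0.8, 0.2], [1.0, 0.0, 0.1, 0.0]]
video_labels = localize_person(reference_features, video_features)
print(video_labels)  # [0, 1, 0]

# A video query belonging to person 1, and one audio key per speaker.
# Identical raw vectors are used so that only the labels differ.
raw = [1.0, 1.0, 1.0, 1.0]
query = rotate_by_label(raw, label=1)
key_same_person = rotate_by_label(raw, label=1)
key_other_person = rotate_by_label(raw, label=0)

matched = attention_score(query, key_same_person)
mismatched = attention_score(query, key_other_person)
print(matched > mismatched)  # True
```

A rotary embedding makes the query-key score depend only on the difference between the two labels, so equal labels cancel and leave the score unchanged, while unequal labels reduce it in this toy setup. That property is all the sketch relies on; the actual label values and feature extractors used by the model are not specified in this diff.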