Update README.md

---
license: apache-2.0
---

# MusicInfuser

[Project Page](https://susunghong.github.io/MusicInfuser/)
[arXiv](https://arxiv.org/abs/2503.14505)

MusicInfuser adapts a text-to-video diffusion model to align with music, generating dance videos that follow both the music and the text prompt.

## Requirements

We have tested this repository on Python 3.10 with `torch>=2.4.1+cu118`, `torchaudio>=2.4.1+cu118`, and `torchvision>=0.19.1+cu118`. Training and inference require a single A100 GPU.
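
To sanity-check the environment before running anything, a quick version check like the following can help (a minimal sketch; it only assumes the CUDA 11.8 builds listed above are installed):

```bash
# Print GPU info and the installed torch/torchvision/torchaudio versions (sketch; assumes the CUDA 11.8 builds above)
nvidia-smi
python -c "import torch, torchvision, torchaudio; print(torch.__version__, torchvision.__version__, torchaudio.__version__, torch.cuda.is_available())"
```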

## Installation

```bash
# Clone the repository
git clone https://github.com/SusungHong/MusicInfuser
cd MusicInfuser

# Create and activate the conda environment
conda create -n musicinfuser python=3.10
conda activate musicinfuser

# Install dependencies
pip install -r requirements.txt
pip install -e ./mochi --no-build-isolation

# Download model weights
python ./music_infuser/download_weights.py weights/
```
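
After the download finishes, you can optionally confirm that the checkpoints are in place (this simply lists the `weights/` directory used above):

```bash
# Optional: verify the downloaded weights
ls -lh weights/
```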

## Inference

To generate videos from music inputs:
```bash
python inference.py --input-file {MP3 or MP4 to extract audio from} \
                    --prompt {prompt} \
                    --num-frames {number of frames}
```

with the following arguments:
- `--input-file`: Input file (MP3 or MP4) to extract audio from.
- `--prompt`: Prompt for the dancer generation. More specific prompts generally give better results, but greater specificity also reduces the influence of the audio. Default: `"a professional female dancer dancing K-pop in an advanced dance setting in a studio with a white background, captured from a front view"`
- `--num-frames`: Number of frames to generate. Although originally trained with 73 frames, MusicInfuser can extrapolate to longer sequences. Default: `145`

Also consider:
- `--seed`: Random seed for generation. The resulting dance also depends on the seed, so feel free to change it. Default: `None`
- `--cfg-scale`: Classifier-Free Guidance (CFG) scale for the text prompt. Default: `6.0`
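
For example, a full invocation combining the arguments above might look like this (the input file name `song.mp3` is a placeholder; only the flags documented above are used):

```bash
# Hypothetical example: generate 145 frames driven by a local MP3 (song.mp3 is a placeholder)
python inference.py --input-file song.mp3 \
                    --prompt "a professional female dancer dancing K-pop in a studio with a white background, captured from a front view" \
                    --num-frames 145 \
                    --seed 42 \
                    --cfg-scale 6.0
```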

## Dataset

For the AIST dataset, please see the terms of use and download it at [the AIST Dance Video Database](https://aistdancedb.ongaaccel.jp/).

## Training

To train the model on your dataset:

1. Preprocess your data:
```bash
bash music_infuser/preprocess.bash -v {dataset path} -o {processed video output dir} -w {path to pretrained mochi} --num_frames {number of frames}
```
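
For instance, preprocessing a set of dance videos into 73-frame clips might look like the following (the directory paths are placeholders; the options are exactly those shown above):

```bash
# Hypothetical example: all paths are placeholders, options as documented above
bash music_infuser/preprocess.bash -v data/aist_videos \
                                   -o data/aist_processed \
                                   -w weights/ \
                                   --num_frames 73
```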

2. Run training:
```bash
bash music_infuser/run.bash -c music_infuser/configs/music_infuser.yaml -n 1
```

**Note:** The current implementation only supports single-GPU training, which requires approximately 80GB of VRAM to train with 73-frame sequences.
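
To keep an eye on memory headroom during a run, a generic `nvidia-smi` query (not specific to this repository) is enough:

```bash
# Generic GPU memory monitoring; not MusicInfuser-specific
watch -n 5 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```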

## VLM Evaluation

For evaluating the model using vision-language models (VLMs):

1. Follow the instructions in `vlm_eval/README.md` to set up the VideoLLaMA2 evaluation framework.
2. It is recommended to use a separate environment from MusicInfuser for the evaluation, as sketched below.
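
A minimal sketch of that separate environment (the environment name and Python version are assumptions; the actual dependencies come from `vlm_eval/README.md`):

```bash
# Sketch: isolate the VideoLLaMA2 evaluation from the MusicInfuser environment
# (environment name and Python version are assumptions; install dependencies per vlm_eval/README.md)
conda create -n videollama2_eval python=3.10
conda activate videollama2_eval
```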

## Citation

```bibtex
@article{hong2025musicinfuser,
  title={MusicInfuser: Making Video Diffusion Listen and Dance},
  author={Hong, Susung and Kemelmacher-Shlizerman, Ira and Curless, Brian and Seitz, Steven M},
  journal={arXiv preprint arXiv:2503.14505},
  year={2025}
}
```

## Acknowledgements

This code builds upon the following awesome repositories:
- [Mochi](https://github.com/genmoai/mochi)
- [VideoLLaMA2](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
- [VideoChat2](https://github.com/OpenGVLab/Ask-Anything)

We thank the authors for open-sourcing their code and models, which made this work possible.