Alex Ergasti committed on
Commit
9ca6da9
·
1 Parent(s): 88fb8a2
Files changed (1)
  1. README.md +149 -14
README.md CHANGED
@@ -1,14 +1,149 @@
1
- ---
2
- title: R FLAV
3
- emoji: 🐢
4
- colorFrom: blue
5
- colorTo: yellow
6
- sdk: gradio
7
- sdk_version: 5.20.1
8
- app_file: app.py
9
- pinned: false
10
- license: cc-by-nc-4.0
11
- short_description: 'R-FLAV: infinite Audio Video generation with flow matching'
12
- ---
13
-
14
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # $^R$-FLAV: Rolling Flow matching for infinite Audio Video generation
2
+
3
+ This is the official implementation of $^R$-FLAV: Rolling Flow matching for infinite Audio Video generation.
4
+
5
+ An overview of our model is shown here:
6
+
7
+ <center><img src="imgs/FLAV.png" alt="drawing" width="500"/></center>
8
+
9
+
10
+ In our paper we explore three different model configurations, illustrated here:
11
+ <center><img src="imgs/blocks.png" alt="drawing" width="800"/></center>
12
+
13
+ ## Results
14
+
15
+ ### Examples of short videos generated on AIST and Landscape
16
+
17
+ https://github.com/user-attachments/assets/d7d544d0-7c62-4870-b783-4f0efa8eebee
18
+
19
+ https://github.com/user-attachments/assets/aa6e0dfa-cbee-4127-b4e6-c96386cc0870
20
+
21
+ https://github.com/user-attachments/assets/027f8a5a-ba7f-404b-863b-f3fabbcad9a6
22
+
23
+ https://github.com/user-attachments/assets/0cbbdd84-393d-4d7b-af82-537a4398d2d1
24
+
25
+
26
+ ### Examples of long videos generated on AIST
27
+
28
+ https://github.com/user-attachments/assets/233661cd-1cc0-4759-83be-faff0c988151
29
+
30
+ https://github.com/user-attachments/assets/5223acf3-04bc-4d34-924d-c7483e07f1e2
31
+
32
+ ## Setup
33
+
34
+ Create the conda environment:
35
+ ```bash
36
+ conda create -y -n FLAV python=3.12
37
+ conda activate FLAV
38
+ conda install -y pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
39
+ pip install pysoundfile transformers diffusers einops accelerate librosa timm
40
+ pip install onnx onnxruntime onnxsim omegaconf
41
+ pip install moviepy
42
+ pip install pyav
43
+ pip install git+https://github.com/facebookresearch/segment-anything.git
44
+ ```
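After installation, it can help to confirm that the packages are importable before launching training or sampling. The snippet below is a minimal sketch, not part of the repository: the helper `missing_modules` is hypothetical, and the import names in `REQUIRED_MODULES` are assumptions about how the packages installed above are imported (for example `pysoundfile` as `soundfile` and `pyav` as `av`).

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
# These are assumptions, e.g. pysoundfile -> soundfile, pyav -> av,
# segment-anything -> segment_anything.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio",
    "soundfile", "transformers", "diffusers", "einops",
    "accelerate", "librosa", "timm",
    "onnx", "onnxruntime", "onnxsim", "omegaconf",
    "moviepy", "av", "segment_anything",
]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Run it inside the activated `FLAV` environment; any name it reports should be reinstalled or checked against the package's documentation.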
45
+
46
+ ## Inference
47
+ Will be published soon.
48
+
49
+ Command-line options must match those of the loaded model (e.g. number of classes, predicted frames, etc.) to avoid loading errors:
50
+ ```bash
51
+ python sample-metrics.py \
52
+ --model FLAV-B/1 \
53
+ --data-path <datapath> \
54
+ --batch-size 32 --num-classes <classes> \
55
+ --image-size 256 \
56
+ --experiment-dir <exp-dir> \
57
+ --results-dir results \
58
+ --video-length 16 \
59
+ --predict-frames 10 \
60
+ --causal-attn \
61
+ --num-videos 2048 \
62
+ --audio-scale <audio-scale> \
63
+ --num-workers 16 \
64
+ --num_timesteps 20 \
65
+ --use_sd_vae \
66
+ --ignore-cache --vocoder-ckpt <vocoder-ckpt>
67
+ ```
68
+
69
+ Where `<exp-dir>` is:
70
+ ```
71
+ └──checkpoint
72
+     └──ema.pth
73
+ ```
74
+
75
+ Where `<vocoder-ckpt>` is:
76
+ ```
77
+ └──vocoder
78
+     ├──config.json
79
+     └──vocoder.pt
80
+ ```
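Since a wrong directory layout only surfaces as a loading error at run time, a quick pre-flight check can save a failed launch. The following is a minimal sketch under stated assumptions: `check_layout` is a hypothetical helper, and the relative paths assume that the trees above are rooted at the directories passed as `<exp-dir>` and `<vocoder-ckpt>` (that is, `<exp-dir>/checkpoint/ema.pth` and `<vocoder-ckpt>/vocoder/...`). Adjust the lists if your checkpoints are organised differently.

```python
from pathlib import Path

# Assumed layouts, read from the directory trees above (not confirmed by the code base).
EXP_DIR_FILES = ["checkpoint/ema.pth"]
VOCODER_FILES = ["vocoder/config.json", "vocoder/vocoder.pt"]


def check_layout(root, relative_files):
    """Return the expected files (relative to root) that do not exist on disk."""
    root = Path(root)
    return [rel for rel in relative_files if not (root / rel).is_file()]
```

For example, `check_layout("<exp-dir>", EXP_DIR_FILES)` returns an empty list when the EMA checkpoint is where the sketch expects it, and the list of missing paths otherwise.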
81
+
82
+ ## Training
83
+ ```bash
84
+ accelerate launch --multi_gpu --num_processes=... \
85
+ train.py \
86
+ --model FLAV-B/1 \
87
+ --data-path <datapath> \
88
+ --image-size 256 \
89
+ --batch-size 16 --num-classes <classes> \
90
+ --experiment-dir <experiment-dir> \
91
+ --results-dir results/ \
92
+ --sample-every 20000 \
93
+ --ckpt-every 5000 \
94
+ --log-every 100 \
95
+ --video-length 50 \
96
+ --predict-frames 10 \
97
+ --sampling logit \
98
+ --num-workers 16 \
99
+ --grad-ckpt \
100
+ --causal-attn \
101
+ --use_sd_vae \
102
+ --audio-scale <audio-scale>
103
+ ```
104
+
105
+ Where `<datapath>` is the dataset folder organised as follows:
106
+
107
+ If the dataset does not have classes:
108
+ ```
109
+ dataset-folder:
110
+ ├──train
111
+ │   ├──file1.mp4
112
+ │   └──file2.mp4
113
+ └──test
114
+     └──file3.mp4
115
+ ```
116
+ If the dataset does have classes:
117
+ ```
118
+ dataset-folder:
119
+ ├──train
120
+ │   ├──class0
121
+ │   │   └──file1.mp4
122
+ │   └──class1
123
+ │       └──file2.mp4
124
+ └──test
125
+     ├──class0
126
+     │   └──file3.mp4
127
+     └──class1
128
+         └──file4.mp4
129
+ ```
130
+ `<classes>` is the number of classes in the dataset.
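For a dataset organised with class subfolders as above, the class count can be read directly from the `train` split. This is a minimal sketch, assuming every class appears as a subdirectory of `train`; `count_classes` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def count_classes(dataset_root, split="train"):
    """Count the class subdirectories inside <dataset_root>/<split>.

    Returns 0 when the split contains only files (the layout without classes).
    """
    split_dir = Path(dataset_root) / split
    return sum(1 for entry in split_dir.iterdir() if entry.is_dir())
```

A result of 0 simply indicates the flat layout; which value `--num-classes` expects in that case is not stated here, so check the training script before relying on it.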
131
+
132
+ `<audio-scale>` is:
133
+ - For AIST++: `3.5009668382765917`
134
+ - For Landscape: `3.0951129410195515`
135
+
136
+
137
+ ## Citation
138
+
139
+ ```
140
+ @misc{ergasti2025rflavrollingflowmatching,
141
+ title={$^R$FLAV: Rolling Flow matching for infinite Audio Video generation},
142
+ author={Alex Ergasti and Giuseppe Gabriele Tarollo and Filippo Botti and Tomaso Fontanini and Claudio Ferrari and Massimo Bertozzi and Andrea Prati},
143
+ year={2025},
144
+ eprint={2503.08307},
145
+ archivePrefix={arXiv},
146
+ primaryClass={cs.CV},
147
+ url={https://arxiv.org/abs/2503.08307},
148
+ }
149
+ ```