---
inference: false
tags:
- text-to-video
- text-to-image
pipeline_tag: text-to-video
datasets:
- TempoFunk/tempofunk-sdance
- TempoFunk/small
- TempoFunk/map
license: agpl-3.0
language: en
library_name: diffusers
---

# Make-A-Video SD JAX Model Card

**A latent diffusion model for text-to-video synthesis.**

**[Try it with an interactive demo on HuggingFace spaces.](https://huggingface.co/spaces/TempoFunk/makeavid-sd-jax)**

Training code, PyTorch and FLAX implementations are available here: <https://github.com/lopho/makeavid-sd-tpu>

This model extends an inpainting LDM image generation model ([Stable Diffusion v1.5 Inpaint](https://huggingface.co/runwayml/stable-diffusion-inpainting))
with temporal convolution and temporal self-attention ported from [Make-A-Video PyTorch](https://github.com/lucidrains/make-a-video-pytorch).
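
The pseudo-3D trick behind these temporal layers is a factorization: a 2D layer acts on each frame independently, then a 1D layer acts along the frame axis at each spatial position. A minimal NumPy sketch of the reshaping involved (the `spatial_op`/`temporal_op` placeholders are hypothetical stand-ins, not the model's actual layers):

```python
import numpy as np

def spatial_op(x2d):
    # Placeholder for a 2D conv / spatial attention over (B*T, C, H, W).
    return x2d

def temporal_op(x1d):
    # Placeholder for a 1D conv / temporal attention over (B*H*W, C, T).
    return x1d

def pseudo3d(x):
    # x: (batch, channels, frames, height, width)
    b, c, t, h, w = x.shape
    # Spatial pass: fold the frame axis into the batch axis.
    xs = x.transpose(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    xs = spatial_op(xs)
    xs = xs.reshape(b, t, c, h, w).transpose(0, 2, 1, 3, 4)
    # Temporal pass: fold the spatial axes into the batch axis.
    xt = xs.transpose(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
    xt = temporal_op(xt)
    xt = xt.reshape(b, h, w, c, t).transpose(0, 3, 4, 1, 2)
    return xt

# Shape check with the latent layout used below: 4 channels, 24 frames.
x = np.zeros((1, 4, 24, 8, 8), dtype=np.float32)
assert pseudo3d(x).shape == (1, 4, 24, 8, 8)
```

Because the temporal layers can be initialized to an identity mapping, the pretrained image model's behavior is preserved at the start of fine-tuning.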

It was then fine-tuned for ~150k steps on a [dataset](https://huggingface.co/datasets/TempoFunk/tempofunk-sdance) of 10,000 videos themed around dance,
then for an additional ~50k steps with [extra data](https://huggingface.co/datasets/TempoFunk/small) of generic videos mixed into the original set.

The model was initialized with weights pretrained by [lxj616](https://huggingface.co/lxj616/make-a-stable-diffusion-video-timelapse) on 286 timelapse video clips.



## Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Limitations](#limitations)
- [Training](#training)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
  - [Hyperparameters](#hyperparameters)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)


## Model Details

* **Developed by:** [Lopho](https://huggingface.co/lopho), [Chavinlo](https://huggingface.co/chavinlo)
* **Model type:** Diffusion-based text-to-video generation model
* **Language(s):** English
* **License:** (pending) GNU Affero General Public License 3.0
* **Further resources:** [Model implementation & training code](https://github.com/lopho/makeavid-sd-tpu), [Weights & Biases training statistics](https://wandb.ai/tempofunk/makeavid-sd-tpu)

## Uses

* Understanding the limitations and biases of generative video models
* Development of educational or creative tools
* Artistic usage
* Whatever you want

## Limitations

* Limited knowledge of temporal concepts not seen during training (see linked datasets)
* Tendency to produce flashing lights, most likely due to training on dance videos, which include many scenes with bright, neon and flashing lights
* The model has only been trained with English captions and will not perform as well in other languages

## Training

### Training Data

* [S(mall)dance](https://huggingface.co/datasets/TempoFunk/tempofunk-sdance): 10,000 video-caption pairs of dancing videos (as encoded image latents, text embeddings and metadata)
* [small](https://huggingface.co/datasets/TempoFunk/small): 7,000 video-caption pairs of general videos (as encoded image latents, text embeddings and metadata)
* [Mapping](https://huggingface.co/datasets/TempoFunk/map): video source URLs for the above datasets

### Training Procedure

* From each video, a random range of 24 consecutive frames is selected
* Each video clip is encoded into a latent representation of shape 4 x 24 x H/8 x W/8
* The latent of the first frame of each clip is repeated along the frame dimension as additional guidance (referred to as the hint image)
* The hint latent and the video latent are stacked to produce a shape of 8 x 24 x H/8 x W/8
* The last input channel is reserved for masking purposes (not used during training, set to zero)
* Text prompts are encoded by the CLIP text encoder
* The video latents with added noise and the CLIP-encoded text prompts are fed into the UNet, which predicts the added noise
* The loss is the reconstruction objective between the added noise and the predicted noise, via mean squared error (MSE/L2)
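
The steps above can be sketched with NumPy, shapes only. The noise addition, channel ordering and UNet call below are simplified stand-ins for illustration, not the actual training code (a real diffusion scheduler scales latent and noise by its alpha terms):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H8, W8 = 24, 64, 64  # 24 frames; 512x512 images give 64x64 latents

# Encoded video latents: (4, total_frames, H/8, W/8); pick a random 24-frame range.
video = rng.normal(size=(4, 120, H8, W8)).astype(np.float32)
start = rng.integers(0, video.shape[1] - T + 1)
clip = video[:, start:start + T]                       # (4, 24, 64, 64)

# Hint image: the first frame repeated along the frame dimension.
hint = np.repeat(clip[:, :1], T, axis=1)               # (4, 24, 64, 64)

# Noise the video latent (simplified; no scheduler scaling here).
noise = rng.normal(size=clip.shape).astype(np.float32)
noisy = clip + noise

# Stack noisy video and hint (8 channels) plus one zeroed mask channel.
mask = np.zeros((1, T, H8, W8), dtype=np.float32)
unet_in = np.concatenate([noisy, hint, mask], axis=0)  # (9, 24, 64, 64)

# The UNet (stand-in below) predicts the added noise; the loss is MSE.
pred = noise  # placeholder for unet(unet_in, text_embeddings)
loss = np.mean((noise - pred) ** 2)
```

With the mask channel included, the UNet input matches the 9-channel layout of the inpainting base model.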

### Hyperparameters

* **Batch size:** 1 x 4
* **Image size:** 512 x 512
* **Frame count:** 24
* **Schedule:**
  * 2 x 10 epochs: LR warmup for 2 epochs, then held constant at 5e-5 (10,000 samples per epoch)
  * 2 x 20 epochs: LR warmup for 2 epochs, then held constant at 5e-5 (10,000 samples per epoch)
  * 1 x 9 epochs: LR warmup for 1 epoch to 5e-5, then cosine annealing to 1e-8
  * Additional data mixed in, see [Training Data](#training-data)
  * 1 x 5 epochs: LR warmup for 1 epoch to 2.5e-5, then held constant (17,000 samples per epoch)
  * 1 x 5 epochs: LR warmup for 0.25 epochs to 5e-6, then cosine annealing to 2.5e-6 (17,000 samples per epoch)
  * Some restarts were required due to NaNs appearing in the gradients (see training logs)
* **Total update steps:** ~200,000
* **Hardware:** 4 x TPUv4 (provided by Google Cloud for the [HuggingFace JAX/Diffusers Sprint Event](https://github.com/huggingface/community-events/tree/main/jax-controlnet-sprint))
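
A warmup-then-cosine schedule like the ones listed above can be written as follows (an illustrative sketch with the stated peak and final rates, not the code actually used for training):

```python
import math

def lr_schedule(step: int, warmup_steps: int, total_steps: int,
                peak_lr: float = 5e-5, final_lr: float = 1e-8) -> float:
    """Linear warmup to peak_lr, then cosine annealing down to final_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

The constant-LR phases correspond to stopping the schedule at the end of warmup and holding `peak_lr`.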

Training statistics are available at [Weights and Biases](https://wandb.ai/tempofunk/makeavid-sd-tpu).

## Acknowledgements

* [CompVis](https://github.com/CompVis/) for [Latent Diffusion Models](https://github.com/CompVis/latent-diffusion) + [Stable Diffusion](https://github.com/CompVis/stable-diffusion)
* [Meta AI's Make-A-Video](https://arxiv.org/abs/2209.14792) for the research on applying pseudo-3D convolution and attention to existing image models
* [Phil Wang](https://github.com/lucidrains) for the PyTorch implementation of [Make-A-Video Pseudo3D convolution and attention](https://github.com/lucidrains/make-a-video-pytorch/)
* [lxj616](https://huggingface.co/lxj616) for the initial proof of feasibility of LDM + Make-A-Video

## Citation

```bibtex
@misc{TempoFunk2023,
  author = {Lopho and Chavinlo},
  title = {TempoFunk: Extending LDM models to Video},
  url = {https://github.com/lopho/makeavid-sd-tpu},
  month = {5},
  year = {2023}
}
```

---

*This model card was written by: [Lopho](https://huggingface.co/lopho), [Chavinlo](https://huggingface.co/chavinlo), [Julian Herrera](https://huggingface.co/puffy310) and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*