Update README.md
README.md
CHANGED
|
@@ -1,3 +1,106 @@
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
---
| 4 |
+

|
| 5 |
+
## Paella
|
| 6 |
+
We are releasing a new Paella model which builds on top of our initial paper https://arxiv.org/abs/2211.07292.
|
| 7 |
+
Paella is a text-to-image model that works in a quantized latent space and learns similarly to MUSE and Diffusion models.
|
| 8 |
+
Since the paper release, we have worked intensively to bring Paella to a similar level as other
|
| 9 |
+
state-of-the-art models. With this release we are coming a step closer to that goal. However, our main intention is not
|
| 10 |
+
to make the greatest text-to-image model out there (at least for now); it is to bring text-to-image models closer
|
| 11 |
+
to people outside the field on a technical basis. For example, many models have codebases with many thousands of lines of
|
| 12 |
+
code, which makes it pretty hard for people to dive into the code and easily understand it. And that is the contribution
|
| 13 |
+
we are the most proud of with Paella. The training and sampling code for Paella is minimalistic and can be understood in a few
|
| 14 |
+
minutes, making further extensions, quick tests, idea testing etc. extremely fast. For instance, the entire
|
| 15 |
+
sampling code can be written in just **12 lines** of code.
|
| 16 |
+
|
| 17 |
+
### How does Paella work?
|
| 18 |
+
Paella works in a quantized latent space (a compressed latent space, as in Stable Diffusion etc.) to reduce the computational power needed.
|
| 19 |
+
Images will be encoded to a smaller latent space and converted to visual tokens of shape *h x w*. Now during training,
|
| 20 |
+
these visual tokens are noised by replacing a random number of tokens with other randomly selected tokens
|
| 21 |
+
from the codebook of the VQGAN. The noised image will be given to the model, along with a timestep and the conditional
|
| 22 |
+
information, which is text in our case. The model is tasked to predict the un-noised version of the tokens.
|
| 23 |
+
And that's it. The model is optimized with the cross-entropy loss between the original tokens and the predicted tokens.
|
| 24 |
+
The amount of noise added during the training is just a linear schedule, meaning that we uniformly sample a percentage
|
| 25 |
+
between 0% and 100% and noise that fraction of the tokens.<br><br>
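
As a rough illustration of the noising step described above, the following sketch replaces each token with a random codebook index with probability `t`. It is written in plain Python so it runs without PyTorch. The function name `add_noise`, its signature, the 4 x 4 token grid and the codebook size of 8192 are assumptions made for this example; they are not taken from the Paella codebase.

```python
import random

def add_noise(tokens, t, codebook_size, rng):
    """Replace each token with a random codebook index with probability t.

    Returns the noised grid and a boolean mask marking replaced positions.
    """
    noised, mask = [], []
    for row in tokens:
        noised_row, mask_row = [], []
        for tok in row:
            replace = rng.random() < t
            noised_row.append(rng.randrange(codebook_size) if replace else tok)
            mask_row.append(replace)
        noised.append(noised_row)
        mask.append(mask_row)
    return noised, mask

rng = random.Random(0)
codebook_size = 8192  # assumed value, for illustration only
# A toy 4 x 4 grid of visual tokens standing in for an encoded image.
tokens = [[rng.randrange(codebook_size) for _ in range(4)] for _ in range(4)]
t = rng.random()  # noise level sampled uniformly from [0, 1)
noised, mask = add_noise(tokens, t, codebook_size, rng)
```

With `t = 0` the tokens come back unchanged, and with `t = 1` every position is resampled (a resampled token can still coincide with the original by chance).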
|
| 26 |
+
|
| 27 |
+
<figure>
|
| 28 |
+
<img src="https://user-images.githubusercontent.com/61938694/231248435-d21170c1-57b4-4a8f-90a6-62cf3e7effcd.png" width="400">
|
| 29 |
+
<figcaption>Images are noised and then fed to the model during training.</figcaption>
|
| 30 |
+
</figure>
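
The training objective mentioned above, cross-entropy between the original tokens and the predicted tokens, can be sketched in plain Python as follows. The toy logits and targets are made up for illustration; in a PyTorch training loop one would typically use `torch.nn.functional.cross_entropy` instead of a hand-written function like this.

```python
import math

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target tokens.

    logits: one list of scores over the codebook per token position.
    targets: the original (un-noised) token index per position.
    """
    total = 0.0
    for position_logits, target in zip(logits, targets):
        m = max(position_logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in position_logits))
        total += log_norm - position_logits[target]
    return total / len(targets)

# Toy example: 3 token positions, codebook of size 4.
logits = [
    [2.0, 0.0, 0.0, 0.0],  # fairly confident in token 0
    [0.0, 0.0, 3.0, 0.0],  # fairly confident in token 2
    [0.0, 0.0, 0.0, 0.0],  # no preference at all
]
targets = [0, 2, 1]
loss = cross_entropy(logits, targets)
```

A position with uniform logits contributes exactly `log(codebook_size)` to the loss, which is a handy sanity check for an untrained model.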
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
Sampling is also extremely simple: we start with the entire image being random tokens. Then we feed the latent image,
|
| 34 |
+
the timestep and the condition into the model and let it predict the final image. The model outputs a distribution
|
| 35 |
+
over every token, which we sample from with standard multinomial sampling.
|
| 36 |
+
Since there are infinitely many possibilities for what the result could look like, just doing a single step results in very basic
|
| 37 |
+
shapes without any details. That is why we add noise to the image again and feed it back to the model. And we repeat
|
| 38 |
+
that process a number of times, with less noise added each time, and slowly arrive at our final image.
|
| 39 |
+
You can see how images emerge [here](https://user-images.githubusercontent.com/61938694/231252449-d9ac4d15-15ef-4aed-a0de-91fa8746a415.png).<br>
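
To make "less noise being added every time" concrete, here is a small sketch of the noise levels visited during sampling. It uses the default values `steps=12` and `renoise_steps=11` from the `sample` function in this README. The helper `linspace` is a plain-Python stand-in for `torch.linspace`, written here only for illustration.

```python
def linspace(start, end, n):
    """n evenly spaced values from start to end inclusive (requires n >= 2)."""
    return [start + (end - start) * i / (n - 1) for i in range(n)]

steps, renoise_steps = 12, 11
timesteps = linspace(1.0, 0.0, steps + 1)

schedule = []
for i in range(steps):
    # After predicting at noise level timesteps[i], the sample is renoised to the
    # next, lower level; the last step is kept as the final prediction.
    renoise_to = timesteps[i + 1] if i < renoise_steps else None
    schedule.append((i, timesteps[i], renoise_to))

for i, t, renoise_to in schedule:
    target = "final image" if renoise_to is None else f"renoise to {renoise_to:.3f}"
    print(f"step {i:2d}: predict at t={t:.3f}, {target}")
```

The first prediction is made from pure noise (`t = 1.0`), and every later prediction starts from a less noisy version of the previous result.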
|
| 40 |
+
The following is the entire sampling code needed to generate images:
|
| 41 |
+
```python
import torch

# `model` is assumed to be a trained Paella model that is already in scope. It must
# expose `num_labels` (the codebook size) and `add_noise`, as used below.
def sample(model_inputs, latent_shape, unconditional_inputs, steps=12, renoise_steps=11, temperature=(0.7, 0.3), cfg=8.0):
    with torch.inference_mode():
        # Start from an image that consists entirely of random tokens.
        sampled = torch.randint(0, model.num_labels, size=latent_shape)
        initial_noise = sampled.clone()
        # The noise level runs from 1.0 down to 0.0, the temperature from temperature[0] to temperature[1].
        timesteps = torch.linspace(1.0, 0.0, steps+1)
        temperatures = torch.linspace(temperature[0], temperature[1], steps)
        for i, t in enumerate(timesteps[:steps]):
            t = torch.ones(latent_shape[0]) * t

            logits = model(sampled, t, **model_inputs)
            if cfg:
                # Classifier-free guidance: mix conditional and unconditional logits.
                logits = logits * cfg + model(sampled, t, **unconditional_inputs) * (1-cfg)
            # Flatten to (batch*h*w, num_labels) probabilities and sample one token per position.
            sampled = logits.div(temperatures[i]).softmax(dim=1).permute(0, 2, 3, 1).reshape(-1, logits.size(1))
            sampled = torch.multinomial(sampled, 1)[:, 0].view(logits.size(0), *logits.shape[2:])

            if i < renoise_steps:
                # Renoise to the next (lower) noise level, reusing the initial random tokens.
                t_next = torch.ones(latent_shape[0]) * timesteps[i+1]
                sampled = model.add_noise(sampled, t_next, random_x=initial_noise)[0]
    return sampled
```
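
To see what the guidance and temperature lines in the sampling code do for a single token position, here is a toy walk-through in plain Python. The helper names `guided_logits` and `softmax` and the three-entry logits are made up for this example; only the mixing formula `cond * cfg + uncond * (1 - cfg)` and the division by the temperature come from the code above.

```python
import math
import random

def guided_logits(cond, uncond, cfg):
    """Classifier-free guidance: cond * cfg + uncond * (1 - cfg), element-wise."""
    return [c * cfg + u * (1 - cfg) for c, u in zip(cond, uncond)]

def softmax(logits, temperature):
    """Temperature-scaled softmax; lower temperatures give sharper distributions."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits over a codebook of size 3 for one token position.
cond = [2.0, 1.0, 0.0]    # prediction with the text condition
uncond = [1.0, 1.0, 1.0]  # prediction without the condition

logits = guided_logits(cond, uncond, cfg=8.0)  # [9.0, 1.0, -7.0]
probs = softmax(logits, temperature=0.7)
# Multinomial sampling: draw one token index according to probs.
token = random.Random(0).choices(range(len(probs)), weights=probs, k=1)[0]
```

The mixing formula is the same as `uncond + cfg * (cond - uncond)`: guidance pushes the logits further in the direction in which the condition changed them, and `cfg = 1` recovers the purely conditional prediction.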
|
| 62 |
+
|
| 63 |
+
### Results
|
| 64 |
+
<img src="https://user-images.githubusercontent.com/61938694/231598512-2410c172-5a9d-43f4-947c-6ff7eaee77e7.png">
|
| 65 |
+
Since Paella is also conditioned on CLIP image embeddings, the following things are also possible:<br><br>
|
| 66 |
+
<img src="https://user-images.githubusercontent.com/61938694/231278319-16551a8d-bfd1-49c9-b604-c6da3955a6d4.png">
|
| 67 |
+
<img src="https://user-images.githubusercontent.com/61938694/231287637-acd0b9b2-90c7-4518-9b9e-d7edefc6c3af.png">
|
| 68 |
+
<img src="https://user-images.githubusercontent.com/61938694/231287119-42fe496b-e737-4dc5-8e53-613bdba149da.png">
|
| 69 |
+
|
| 70 |
+
### Technical Details
|
| 71 |
+
Model-Architecture: U-Net (Mix of....) <br>
|
| 72 |
+
Dataset: Laion-A, Laion Aesthetic > 6.0 <br>
|
| 73 |
+
Training Steps: 1.3M <br>
|
| 74 |
+
Batch Size: 2048 <br>
|
| 75 |
+
Resolution: 256 <br>
|
| 76 |
+
VQGAN Compression: f4 <br>
|
| 77 |
+
Condition: ByT5-XL (95%), CLIP-H Image Embedding (10%), CLIP-H Text Embedding (10%) <br>
|
| 78 |
+
Optimizer: AdamW <br>
|
| 79 |
+
Hardware: 128 A100 @ 80GB <br>
|
| 80 |
+
Training Time: ~3 weeks <br>
|
| 81 |
+
Learning Rate: 1e-4 <br>
|
| 82 |
+
More details on the approach, training and sampling can be found in the paper and on GitHub.
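
A few back-of-the-envelope numbers follow directly from the figures listed above. The sketch below assumes that "f4" means a downsampling factor of 4 per spatial dimension (the usual VQGAN notation) and that "~3 weeks" is exactly 21 days, so the results are rough estimates rather than reported values.

```python
training_steps = 1_300_000
batch_size = 2048
resolution = 256
compression = 4   # "f4", assumed to mean 4x downsampling per side
gpus = 128
weeks = 3         # "~3 weeks", treated as exactly 21 days here

# Training samples processed (counts repeats, not unique images).
samples_seen = training_steps * batch_size   # 2_662_400_000

# Size of the token grid the model works on.
latent_side = resolution // compression      # 64
tokens_per_image = latent_side ** 2          # 4096

# Approximate compute budget.
gpu_hours = gpus * weeks * 7 * 24            # 64_512

print(f"samples seen:     {samples_seen:,}")
print(f"token grid:       {latent_side} x {latent_side} = {tokens_per_image} tokens")
print(f"approx GPU-hours: {gpu_hours:,}")
```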
|
| 83 |
+
|
| 84 |
+
### Paper, Code Release
|
| 85 |
+
Paper: https://arxiv.org/abs/2211.07292 <br>
|
| 86 |
+
Code: https://github.com/dome272/Paella <br>
|
| 87 |
+
|
| 88 |
+
### Goal
|
| 89 |
+
So you see, there are no heavy math formulas or theorems needed to achieve good sampling quality. Moreover,
|
| 90 |
+
there are no constants such as alpha, beta, alpha_cum_prod etc. necessary as in diffusion models. This makes this
|
| 91 |
+
method really suitable for people new to the field of generative AI. We hope we can set the foundation for further
|
| 92 |
+
research in that direction and hope to contribute to a world where AI is accessible and can be understood by everyone.
|
| 93 |
+
|
| 94 |
+
### Limitations & Conclusion
|
| 95 |
+
There are still many things to improve for Paella to get on par with standard diffusion models or to even outperform
|
| 96 |
+
them. One primary thing we notice is that even though we only condition the model on CLIP image embeddings 10% of the
|
| 97 |
+
time, during inference the model heavily relies on the image embeddings generated by a prior model (mapping CLIP text
|
| 98 |
+
embeddings to image embeddings as proposed in DALL-E 2). We counteract this by decreasing the importance of the image
|
| 99 |
+
embeddings by reweighting the attention scores. There is probably a way to avoid this already during training.
|
| 100 |
+
Other limitations such as lack of composition, text depiction, unawareness of concepts etc. could also be reduced by
|
| 101 |
+
continuing the training for longer. As a reference, Paella has only seen as many images as SD 1.4 and due to earlier
|
| 102 |
+
To conclude, this is still a work in progress, but it is our first model that works a million times better than the first
|
| 103 |
+
versions we trained months ago. We hope that more people become interested in this approach, since we believe it has
|
| 104 |
+
a lot of potential to become much better than this and to enable new people to have an easy-to-understand introduction
|
| 105 |
+
to the field of generative AI.
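
The limitations above mention reweighting the attention scores to reduce the influence of the image embedding, without saying how this is done. One possible way to do it, shown here purely as a hypothetical sketch and not as the Paella implementation, is to scale the unnormalized attention weight of the image-embedding key by a factor below 1 before normalizing. The function name, its arguments and the toy scores are all assumptions.

```python
import math

def reweighted_attention(scores, image_token_index, image_weight):
    """Softmax attention weights with one key down-weighted before normalization.

    Adding log(image_weight) to a score multiplies that key's unnormalized
    weight by image_weight (hypothetical scheme, for illustration only).
    """
    adjusted = list(scores)
    adjusted[image_token_index] += math.log(image_weight)
    m = max(adjusted)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in adjusted]
    total = sum(exps)
    return [e / total for e in exps]

# Toy attention scores of one query over 4 conditioning tokens,
# where index 0 is assumed to be the CLIP image embedding.
scores = [3.0, 1.0, 0.5, 0.0]
plain = reweighted_attention(scores, image_token_index=0, image_weight=1.0)
damped = reweighted_attention(scores, image_token_index=0, image_weight=0.25)
```

With `image_weight=1.0` this reduces to an ordinary softmax; smaller values shift attention mass from the image embedding to the remaining tokens.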
|
| 106 |
+
|