dome272
/

Paella


dome272 committed on Apr 12, 2023

Commit

1bc181e

1 Parent(s): a8ad3eb

Update README.md


README.md +103 -0


@@ -1,3 +1,106 @@
 ---
 license: mit
 ---

+![](https://user-images.githubusercontent.com/61938694/231021615-38df0a0a-d97e-4f7a-99d9-99952357b4b1.png)
+## Paella
+We are releasing a new Paella model that builds on our initial paper: https://arxiv.org/abs/2211.07292.
+Paella is a text-to-image model that works in a quantized latent space and learns similarly to MUSE and diffusion models.
+Since the paper's release we have worked intensively to bring Paella to a level similar to that of other
+state-of-the-art models. With this release we are a step closer to that goal. However, our main intention is not
+to make the greatest text-to-image model out there (at least for now); it is to make text-to-image models technically
+accessible to people outside the field. For example, many models have codebases with many thousands of lines of
+code, which makes it hard for people to dive into the code and understand it. That simplicity is what we see as
+Paella's main contribution. The training and sampling code for Paella is minimalistic and can be understood in a few
+minutes, which makes extensions, quick tests, and trying out ideas extremely fast. For instance, the entire
+sampling code can be written in just **12 lines** of code.
+### How does Paella work?
+Like Stable Diffusion and similar models, Paella works in a compressed latent space to reduce the computational power needed; in Paella's case the latent space is quantized by a VQGAN.
+Images are encoded into this smaller latent space and converted to visual tokens of shape *h x w*. During training,
+these visual tokens are noised by replacing a random fraction of them with other randomly selected tokens
+from the VQGAN codebook. The noised tokens are given to the model, along with a timestep and the conditioning
+information, which is text in our case. The model is tasked with predicting the un-noised version of the tokens.
+And that's it. The model is optimized with the cross-entropy loss between the original tokens and the predicted tokens.
+The amount of noise added during training follows a simple linear schedule: we uniformly sample a percentage
+between 0 and 100% and noise that fraction of the tokens.<br><br>
+<figure>
+  <img src="https://user-images.githubusercontent.com/61938694/231248435-d21170c1-57b4-4a8f-90a6-62cf3e7effcd.png" width="400">
+  <figcaption>Images are noised and then fed to the model during training.</figcaption>
+</figure>
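The training-time noising and loss described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `noise_tokens` and `cross_entropy` are hypothetical, tokens are kept in a flat list instead of an *h x w* tensor, and each token is replaced independently with probability `t`, which may differ from how the released code selects tokens.

```python
import math
import random


def noise_tokens(tokens, t, codebook_size, rng):
    """Replace roughly a fraction t of the tokens with random codebook indices.

    Hypothetical helper: each position is replaced independently with
    probability t (an assumption, not necessarily the released behaviour).
    """
    noised, replaced = [], []
    for token in tokens:
        if rng.random() < t:
            noised.append(rng.randrange(codebook_size))
            replaced.append(True)
        else:
            noised.append(token)
            replaced.append(False)
    return noised, replaced


def cross_entropy(logits, targets):
    """Mean cross-entropy between per-token logits and the original tokens."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
    return total / len(targets)


rng = random.Random(0)
codebook_size = 8
original = [rng.randrange(codebook_size) for _ in range(16)]  # a flattened 4 x 4 latent

t = rng.random()  # linear schedule: noise level drawn uniformly from [0, 1)
noised, replaced = noise_tokens(original, t, codebook_size, rng)

# Stand-in for the model: uniform logits, i.e. no knowledge about any token.
uniform_logits = [[0.0] * codebook_size for _ in original]
loss = cross_entropy(uniform_logits, original)
```

With uniform logits the loss equals `log(codebook_size)`, the loss of a predictor that treats every codebook entry as equally likely.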
+Sampling is also extremely simple. We start with the entire image being random tokens. Then we feed the latent image,
+the timestep and the condition into the model and let it predict the final image. The model outputs a distribution
+over every token, which we sample from with standard multinomial sampling.
+Since the result could look like infinitely many different images, a single step produces only very basic
+shapes without any details. That is why we add noise to the image again and feed it back to the model. We repeat
+this process a number of times, adding less noise each time, and slowly arrive at our final image.
+You can see how images emerge [here](https://user-images.githubusercontent.com/61938694/231252449-d9ac4d15-15ef-4aed-a0de-91fa8746a415.png).<br>
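To make the "less noise each time" idea concrete, the sketch below lists the noise levels that the `sample` function in this README would step through with its defaults (`steps=12`, `renoise_steps=11`). It is written in plain Python instead of `torch.linspace`, and the helper names `linspace` and `sampling_schedule` are hypothetical, introduced for illustration only.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (like torch.linspace)."""
    if num == 1:
        return [float(start)]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def sampling_schedule(steps=12, renoise_steps=11, temperature=(0.7, 0.3)):
    """Return one (t, temperature, t_next) triple per sampling step.

    t_next is the noise level the prediction is re-noised to, or None when
    the step's prediction is kept as is.
    """
    timesteps = linspace(1.0, 0.0, steps + 1)
    temperatures = linspace(temperature[0], temperature[1], steps)
    schedule = []
    for i in range(steps):
        t_next = timesteps[i + 1] if i < renoise_steps else None
        schedule.append((timesteps[i], temperatures[i], t_next))
    return schedule


for i, (t, temp, t_next) in enumerate(sampling_schedule()):
    renoise = "no re-noising" if t_next is None else f"re-noise to t={t_next:.3f}"
    print(f"step {i:2d}: t={t:.3f}  temperature={temp:.3f}  {renoise}")
```

With the defaults, every step except the last re-noises its prediction to the next, lower noise level; the last step's prediction is returned unchanged.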
+The following is the entire sampling code needed to generate images:
+```python
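+import torch
+
+# `model` is expected to be defined in the enclosing scope (it is not passed in as an argument).
+def sample(model_inputs, latent_shape, unconditional_inputs, steps=12, renoise_steps=11, temperature=(0.7, 0.3), cfg=8.0):
+    with torch.inference_mode():
+        # Start from purely random token indices.
+        sampled = torch.randint(0, model.num_labels, size=latent_shape)
+        initial_noise = sampled.clone()
+        # Noise level goes from 1.0 to 0.0; the temperature is annealed linearly across the steps.
+        timesteps = torch.linspace(1.0, 0.0, steps+1)
+        temperatures = torch.linspace(temperature[0], temperature[1], steps)
+        for i, t in enumerate(timesteps[:steps]):
+            t = torch.ones(latent_shape[0]) * t
+            logits = model(sampled, t, **model_inputs)
+            if cfg:
+                # Classifier-free guidance: combine conditional and unconditional logits.
+                logits = logits * cfg + model(sampled, t, **unconditional_inputs) * (1-cfg)
+            # Temperature-scaled softmax over dim 1, flattened to one row of probabilities per position.
+            sampled = logits.div(temperatures[i]).softmax(dim=1).permute(0, 2, 3, 1).reshape(-1, logits.size(1))
+            # Draw one token per position and restore the batch and spatial layout.
+            sampled = torch.multinomial(sampled, 1)[:, 0].view(logits.size(0), *logits.shape[2:])
+            if i < renoise_steps:
+                # Re-noise the prediction to the next (lower) noise level, passing the initial random tokens as random_x.
+                t_next = torch.ones(latent_shape[0]) * timesteps[i+1]
+                sampled = model.add_noise(sampled, t_next, random_x=initial_noise)[0]
+    return sampled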
+```
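The guidance line in the loop, `logits * cfg + uncond * (1 - cfg)`, is the usual classifier-free guidance rule written in a compact form. The small sketch below works on a single made-up logit vector in plain Python, with the hypothetical names `guide`, `guide_extrapolated` and `softmax_with_temperature`. It checks that the compact form equals `uncond + cfg * (cond - uncond)` and shows how dividing by a temperature below 1 sharpens the distribution that is then sampled from.

```python
import math
import random


def guide(cond, uncond, cfg):
    """Guidance exactly as written in the sampling loop."""
    return [c * cfg + u * (1 - cfg) for c, u in zip(cond, uncond)]


def guide_extrapolated(cond, uncond, cfg):
    """Algebraically equivalent form: start at uncond, move cfg times towards cond."""
    return [u + cfg * (c - u) for c, u in zip(cond, uncond)]


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; lower temperature gives a sharper distribution."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


cond = [2.0, 0.5, -1.0]    # made-up conditional logits for one token position
uncond = [1.0, 0.8, -0.5]  # made-up unconditional logits for the same position

guided = guide(cond, uncond, cfg=8.0)
same = guide_extrapolated(cond, uncond, cfg=8.0)

probs_warm = softmax_with_temperature(cond, 1.0)
probs_cool = softmax_with_temperature(cond, 0.3)

# Multinomial sampling: draw one token index according to the probabilities.
token = random.Random(0).choices(range(len(probs_cool)), weights=probs_cool, k=1)[0]
```

A guidance scale above 1 pushes the logits past the conditional prediction, away from the unconditional one, which is why `cfg=8.0` strengthens the influence of the condition.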
+### Results
+<img src="https://user-images.githubusercontent.com/61938694/231598512-2410c172-5a9d-43f4-947c-6ff7eaee77e7.png">
+Since Paella is also conditioned on CLIP image embeddings the following things are also possible:<br><br>
+<img src="https://user-images.githubusercontent.com/61938694/231278319-16551a8d-bfd1-49c9-b604-c6da3955a6d4.png">
+<img src="https://user-images.githubusercontent.com/61938694/231287637-acd0b9b2-90c7-4518-9b9e-d7edefc6c3af.png">
+<img src="https://user-images.githubusercontent.com/61938694/231287119-42fe496b-e737-4dc5-8e53-613bdba149da.png">
+### Technical Details
+Model-Architecture: U-Net (Mix of....) <br>
+Dataset: Laion-A, Laion Aesthetic > 6.0 <br>
+Training Steps: 1.3M <br>
+Batch Size: 2048 <br>
+Resolution: 256 <br>
+VQGAN Compression: f4 <br>
+Condition: ByT5-XL (95%), CLIP-H Image Embedding (10%), CLIP-H Text Embedding (10%) <br>
+Optimizer: AdamW <br>
+Hardware: 128 A100 @ 80GB <br>
+Training Time: ~3 weeks <br>
+Learning Rate: 1e-4 <br>
+More details on the approach, training and sampling can be found in the paper and on GitHub.
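A quick back-of-the-envelope calculation from the numbers above. It assumes that `f4` means the VQGAN downsamples each spatial side by a factor of 4 (the usual reading of that notation) and that every training step processes one full batch; neither assumption is stated explicitly here.

```python
training_steps = 1_300_000
batch_size = 2048
resolution = 256         # pixels per side
compression_factor = 4   # assumed meaning of "f4"

samples_seen = training_steps * batch_size
latent_side = resolution // compression_factor
tokens_per_image = latent_side * latent_side

print(f"samples seen during training: {samples_seen:,}")
print(f"latent grid: {latent_side} x {latent_side} = {tokens_per_image} tokens per image")
```

Under these assumptions the model saw roughly 2.66 billion training samples, each represented as a 64 x 64 grid of tokens.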
+### Paper, Code Release
+Paper: https://arxiv.org/abs/2211.07292 <br>
+Code: https://github.com/dome272/Paella <br>
+### Goal
+So you see, no heavy math formulas or theorems are needed to achieve good sampling quality. Moreover,
+no constants such as alpha, beta, alpha_cum_prod etc. are necessary, as they are in diffusion models. This makes the
+method well suited to people new to the field of generative AI. We hope we can set the foundation for further
+research in that direction and contribute to a world where AI is accessible and can be understood by everyone.
+### Limitations & Conclusion
+There are still many things to improve before Paella is on par with standard diffusion models, let alone outperforms
+them. One primary thing we notice is that even though we only condition the model on CLIP image embeddings 10% of the
+time, during inference the model relies heavily on the image embeddings generated by a prior model (which maps CLIP text
+embeddings to image embeddings, as proposed in DALL-E 2). We counteract this by decreasing the importance of the image
+embeddings by reweighting the attention scores. There is probably a way to prevent this during training already.
+Other limitations such as lack of composition, text depiction, unawareness of concepts etc. could also be reduced by
+continuing the training for longer. As a reference, Paella has only seen as many images as SD 1.4 and due to earlier
+To conclude, this is still a work in progress, but it is our first model that works far better than the first
+versions we trained months ago. We hope that more people become interested in this approach, since we believe it has
+the potential to become much better than this and to give newcomers an easy-to-understand introduction
+to the field of generative AI.
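The attention reweighting mentioned in the limitations above is not spelled out in this README. One generic way to down-weight a conditioning token is to scale its attention weight and renormalize, which is the same as adding `log(scale)` to its pre-softmax score. The sketch below illustrates only that generic idea for a single query; the function name `reweighted_attention`, the scale of 0.25, and the token layout are assumptions and may not match what Paella actually does.

```python
import math


def reweighted_attention(scores, scales):
    """Softmax over scores after multiplying each key's weight by its scale.

    Implemented by adding log(scale) to the pre-softmax score, which is
    equivalent to scaling the softmax output and renormalizing.
    Every scale must be greater than zero.
    """
    shifted = [s + math.log(w) for s, w in zip(scores, scales)]
    peak = max(shifted)
    exps = [math.exp(v - peak) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# One query attending over three text tokens and one image-embedding token (last).
scores = [0.2, 0.1, 0.4, 2.0]
plain = reweighted_attention(scores, [1.0, 1.0, 1.0, 1.0])
damped = reweighted_attention(scores, [1.0, 1.0, 1.0, 0.25])
```

In this toy example the image-embedding token receives a smaller share of the attention in `damped` than in `plain`, while the weights still sum to 1.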