Pedro Cuenca committed on
Commit ·
febec22
1
Parent(s): f1801ff
Update README with some explanations and links.
README.md
CHANGED
@@ -5,12 +5,32 @@ language:

 ## DALL·E mini - Generate images from text
-Model
-
-
-
-

+### Model Description

+This is an attempt to replicate OpenAI's [DALL·E](https://openai.com/blog/dall-e/), a model capable of generating arbitrary images from a text prompt that describes the desired result.

+This model's architecture is a simplification of the original, and leverages previous open source efforts and available pre-trained models. Results have lower quality than OpenAI's, but the model can be trained and used on less demanding hardware. Our training was performed on a single TPU v3-8 for a few days.

+### Components of the Architecture

+The system relies on the Flax/JAX infrastructure, which is ideal for TPU training. TPUs are not required, however: both Flax and JAX run very efficiently on GPU backends.
+
+The main components of the architecture include:
+
+* An encoder, based on [BART](https://arxiv.org/abs/1910.13461). The encoder's mission is to transform a sequence of input text tokens into a sequence of image tokens. The input tokens are extracted from the text prompt using the model's tokenizer. The image tokens form a fixed-length sequence, and they represent indices in a VQGAN-based pre-trained codebook.
+
+* A decoder, which converts the image tokens to an image for visualization. As mentioned above, the decoder is based on a [VQGAN model](https://compvis.github.io/taming-transformers/).
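
To make the link between image tokens and the codebook concrete, here is a minimal sketch of the lookup step that sits between the two components. It is not the project's code: the random arrays stand in for the real encoder output and the pre-trained VQGAN codebook, and the sizes (`EMBED_DIM`, `GRID`) are hypothetical. Reading the `16384` in the VQGAN model name as the codebook size is also an assumption.

```python
import numpy as np

# Assumed sizes, for illustration only (none are stated in the README).
CODEBOOK_SIZE = 16384            # assumed from the "16384" in vqgan_f16_16384
EMBED_DIM = 8                    # hypothetical; kept small for the example
GRID = 16                        # hypothetical 16x16 grid of image tokens
NUM_IMAGE_TOKENS = GRID * GRID   # fixed-length image token sequence

rng = np.random.default_rng(0)

# Stand-in for the pre-trained codebook: one embedding vector per index.
codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM)).astype(np.float32)

# Stand-in for the encoder output: a fixed-length sequence of codebook indices.
image_tokens = rng.integers(0, CODEBOOK_SIZE, size=NUM_IMAGE_TOKENS)


def tokens_to_latents(tokens, codebook, grid):
    """Look up each token in the codebook and arrange the vectors on a grid."""
    vectors = codebook[tokens]              # shape: (num_tokens, embed_dim)
    return vectors.reshape(grid, grid, -1)  # shape: (grid, grid, embed_dim)


latents = tokens_to_latents(image_tokens, codebook, GRID)
print(latents.shape)
```

In the real system, a grid of codebook vectors like `latents` is what the VQGAN decoder would turn into pixels; that decoding step is not reproduced here.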
+
+The model definition we use for the encoder can be downloaded from our [GitHub repo](https://github.com/borisdayma/dalle-mini). The encoder is represented by the class `CustomFlaxBartForConditionalGeneration`.
+
+To use the decoder, follow the instructions for our accompanying VQGAN model on the Hub, [flax-community/vqgan_f16_16384](https://huggingface.co/flax-community/vqgan_f16_16384).
+
+### How to Use
+
+The easiest way to get familiar with the code and the models is to follow the inference notebook we provide in our [GitHub repo](https://github.com/borisdayma/dalle-mini/blob/main/dev/inference/inference_pipeline.ipynb). For your convenience, you can open it in Google Colaboratory: [Open in Colab](https://colab.research.google.com/github/borisdayma/dalle-mini/blob/main/dev/inference/inference_pipeline.ipynb).
+
+If you just want to test the trained model and see what it comes up with, please visit [our demo](https://huggingface.co/spaces/flax-community/dalle-mini), available as a Space on the Hugging Face Hub.
+
+### Additional Details
+
+Our [report](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA) contains a lot of details about how the model was trained and shows many examples that demonstrate its capabilities.