Omar Sanseviero committed
Commit 149555d
Parent(s): e5dd4ed
Add all files
Files changed:
- README.md +45 -0
- config.json +71 -0
- environment.yaml +10 -0
- flax_model.msgpack +3 -0
- img/demo_screenshot.png +0 -0
- merges.txt +0 -0
- pipeline.py +110 -0
- requirements.txt +3 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- vocab.json +0 -0
README.md
ADDED
@@ -0,0 +1,45 @@
---
language:
- en
pipeline_tag: text-to-image
inference: false
---

## DALL·E mini - Generate images from text

<img style="text-align:center; display:block;" src="https://raw.githubusercontent.com/borisdayma/dalle-mini/main/img/logo.png" width="200">

* [Technical Report](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA)
* [Demo](https://huggingface.co/spaces/flax-community/dalle-mini)

### Model Description

This is an attempt to replicate OpenAI's [DALL·E](https://openai.com/blog/dall-e/), a model capable of generating arbitrary images from a text prompt that describes the desired result.

![Demo screenshot](img/demo_screenshot.png)

This model's architecture is a simplification of the original that leverages previous open-source efforts and available pre-trained models. Results are of lower quality than OpenAI's, but the model can be trained and used on less demanding hardware. Our training was performed on a single TPU v3-8 for a few days.
### Components of the Architecture

The system relies on Flax/JAX infrastructure, which is ideal for TPU training. TPUs are not required, however: both Flax and JAX run very efficiently on GPU backends.

The main components of the architecture include:

* An encoder, based on [BART](https://arxiv.org/abs/1910.13461). The encoder transforms a sequence of input text tokens into a sequence of image tokens. The input tokens are extracted from the text prompt using the model's tokenizer. The image tokens are a fixed-length sequence, and each one is an index into a pre-trained VQGAN codebook.

* A decoder, which converts the image tokens to image pixels. As mentioned above, the decoder is based on a [VQGAN model](https://compvis.github.io/taming-transformers/).

The model definition we use for the encoder can be downloaded from our [Github repo](https://github.com/borisdayma/dalle-mini). The encoder is represented by the class `CustomFlaxBartForConditionalGeneration`.

To use the decoder, you need to follow the instructions in our accompanying VQGAN model on the hub, [flax-community/vqgan_f16_16384](https://huggingface.co/flax-community/vqgan_f16_16384).
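
For orientation, here is a minimal loading sketch based on the files in this commit; the local path `.` and the `pipeline` import are assumptions (the class is defined in `pipeline.py`, shown later in this commit).

```python
# Minimal loading sketch, assuming this repo is cloned locally and
# pipeline.py (from this commit) is importable from the working directory.
from transformers import BartTokenizer
from vqgan_jax.modeling_flax_vqgan import VQModel  # installed via requirements.txt
from pipeline import CustomFlaxBartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained(".")   # tokenizer files in this repo
model = CustomFlaxBartForConditionalGeneration.from_pretrained(".")  # flax_model.msgpack
vqgan = VQModel.from_pretrained("flax-community/vqgan_f16_16384")
```
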
### How to Use

The easiest way to get familiar with the code and the models is to follow the inference notebook we provide in our [github repo](https://github.com/borisdayma/dalle-mini/blob/main/dev/inference/inference_pipeline.ipynb). For your convenience, you can open it in Google Colaboratory: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/dalle-mini/blob/main/dev/inference/inference_pipeline.ipynb)
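
As a sketch of the generation loop the notebook implements (names continue from the loading sketch above; the prompt is a placeholder, and the sampling arguments mirror `pipeline.py`):

```python
import random

import jax
import numpy as np
from PIL import Image

prompt = "a watercolor painting of a lighthouse"   # placeholder prompt
inputs = tokenizer(prompt, return_tensors="jax", padding="max_length",
                   truncation=True, max_length=128)

# Sample one sequence of image tokens, then decode it with the VQGAN.
key = jax.random.PRNGKey(random.randint(0, 2**32 - 1))
encoded = model.generate(**inputs, do_sample=True, num_beams=1, prng_key=key)
codes = encoded.sequences[..., 1:]                 # drop the BOS token
pixels = vqgan.decode_code(codes).squeeze().clip(0.0, 1.0)
Image.fromarray(np.asarray(pixels * 255, dtype=np.uint8)).save("sample.png")
```
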
If you just want to test the trained model and see what it comes up with, please visit [our demo](https://huggingface.co/spaces/flax-community/dalle-mini), available in 🤗 Spaces.
### Additional Details

Our [report](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA) contains more details about how the model was trained and shows many examples that demonstrate its capabilities.
config.json
ADDED
@@ -0,0 +1,71 @@
{
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_final_layer_norm": false,
  "architectures": [
    "CustomFlaxBartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 16384,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 16384,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 16385,
  "force_bos_token_to_be_generated": false,
  "forced_eos_token_id": null,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "length_penalty": 2.0,
  "max_length": 257,
  "max_position_embeddings": 1024,
  "max_position_embeddings_decoder": 257,
  "min_length": 257,
  "model_type": "bart",
  "no_repeat_ngram_size": 3,
  "normalize_before": false,
  "num_beams": 4,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "pos_token_id": 16384,
  "prefix": " ",
  "scale_embedding": false,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 142,
      "min_length": 56,
      "no_repeat_ngram_size": 3,
      "num_beams": 4
    }
  },
  "tie_word_embeddings": false,
  "transformers_version": "4.8.2",
  "use_cache": true,
  "vocab_size": 50264,
  "vocab_size_output": 16385
}
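
Note that `vocab_size_output` and `max_position_embeddings_decoder` are not standard BART fields; they are custom entries consumed by the classes in pipeline.py below. A quick way to inspect them, assuming the repo root as working directory and relying on `transformers` keeping unknown config keys as plain attributes:

```python
from transformers import BartConfig

config = BartConfig.from_pretrained(".")       # reads the config.json above
print(config.vocab_size_output)                # 16385: 16384 VQGAN codes + 1 BOS
print(config.max_position_embeddings_decoder)  # 257: 256 image tokens + 1 BOS
```
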
environment.yaml
ADDED
@@ -0,0 +1,10 @@
name: dalle
channels:
  - defaults
dependencies:
  - python=3.9.5
  - pip=21.1.3
  - ipython=7.22.0
  - cudatoolkit
  - pip:
    - -r requirements.txt
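
To reproduce this environment with conda, `conda env create -f environment.yaml` followed by `conda activate dalle` should work; the `pip:` entry pulls in the packages from requirements.txt below.
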
flax_model.msgpack
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:856b78e6e59f979e319eef43005e913bf2e94ced9e3e93d87d3675373cf0673d
size 1756329653
img/demo_screenshot.png
ADDED
merges.txt
ADDED
The diff for this file is too large to render. See raw diff.
pipeline.py
ADDED
@@ -0,0 +1,110 @@
import random

import jax
import flax.linen as nn

from transformers.models.bart.modeling_flax_bart import (
    FlaxBartModule,
    FlaxBartForConditionalGenerationModule,
    FlaxBartForConditionalGeneration,
    FlaxBartEncoder,
    FlaxBartDecoder
)

from transformers import BartConfig, BartTokenizer

from vqgan_jax.modeling_flax_vqgan import VQModel
import numpy as np
from PIL import Image


# Model hyperparameters, for convenience
OUTPUT_VOCAB_SIZE = 16384 + 1  # encoded image token space + 1 for bos
OUTPUT_LENGTH = 256 + 1  # number of encoded tokens + 1 for bos
BOS_TOKEN_ID = 16384
BASE_MODEL = 'facebook/bart-large-cnn'  # we currently have issues with bart-large

class CustomFlaxBartModule(FlaxBartModule):
    def setup(self):
        # check config is valid, otherwise set default values
        self.config.vocab_size_output = getattr(self.config, 'vocab_size_output', OUTPUT_VOCAB_SIZE)
        self.config.max_position_embeddings_decoder = getattr(self.config, 'max_position_embeddings_decoder', OUTPUT_LENGTH)

        # we keep shared to easily load pre-trained weights
        self.shared = nn.Embed(
            self.config.vocab_size,
            self.config.d_model,
            embedding_init=jax.nn.initializers.normal(self.config.init_std, self.dtype),
            dtype=self.dtype,
        )
        # a separate embedding is used for the decoder
        self.decoder_embed = nn.Embed(
            self.config.vocab_size_output,
            self.config.d_model,
            embedding_init=jax.nn.initializers.normal(self.config.init_std, self.dtype),
            dtype=self.dtype,
        )
        self.encoder = FlaxBartEncoder(self.config, dtype=self.dtype, embed_tokens=self.shared)

        # the decoder has a different config: shorter sequence, image-token vocabulary
        decoder_config = BartConfig(**self.config.to_dict())
        decoder_config.max_position_embeddings = self.config.max_position_embeddings_decoder
        decoder_config.vocab_size = self.config.vocab_size_output
        self.decoder = FlaxBartDecoder(decoder_config, dtype=self.dtype, embed_tokens=self.decoder_embed)

class CustomFlaxBartForConditionalGenerationModule(FlaxBartForConditionalGenerationModule):
    def setup(self):
        # check config is valid, otherwise set default values
        self.config.vocab_size_output = getattr(self.config, 'vocab_size_output', OUTPUT_VOCAB_SIZE)

        self.model = CustomFlaxBartModule(config=self.config, dtype=self.dtype)
        # the LM head projects to the image-token vocabulary, not the text one
        self.lm_head = nn.Dense(
            self.config.vocab_size_output,
            use_bias=False,
            dtype=self.dtype,
            kernel_init=jax.nn.initializers.normal(self.config.init_std, self.dtype),
        )
        self.final_logits_bias = self.param("final_logits_bias", self.bias_init, (1, self.config.vocab_size_output))

class CustomFlaxBartForConditionalGeneration(FlaxBartForConditionalGeneration):
    module_class = CustomFlaxBartForConditionalGenerationModule

class PreTrainedPipeline():
    def __init__(self, path=""):
        # Preload everything needed at inference: tokenizer, model and VQGAN.
        # This function is only called once, so do all the heavy I/O here.
        self.tokenizer = BartTokenizer.from_pretrained(path)
        self.model = CustomFlaxBartForConditionalGeneration.from_pretrained(path)

        self.vqgan = VQModel.from_pretrained("flax-community/vqgan_f16_16384", revision="90cc46addd2dd8f5be21586a9a23e1b95aa506a9")

    def __call__(self, inputs: str):
        """
        Args:
            inputs (:obj:`str`):
                a string containing some text
        Return:
            A :obj:`PIL.Image` with the raw image representation as PIL.
        """
        tokenized_prompt = self.tokenizer(inputs, return_tensors='jax', padding='max_length', truncation=True, max_length=128)
        # sample image tokens autoregressively with a fresh PRNG key
        key = jax.random.PRNGKey(random.randint(0, 2**32 - 1))
        encoded_image = self.model.generate(**tokenized_prompt, do_sample=True, num_beams=1, prng_key=key)

        # remove first token (BOS)
        encoded_image = encoded_image.sequences[..., 1:]
        decoded_image = self.vqgan.decode_code(encoded_image)
        clipped_image = decoded_image.squeeze().clip(0., 1.)

        return Image.fromarray(np.asarray(clipped_image * 255, dtype=np.uint8))
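
A hypothetical invocation of the pipeline above (the path, prompt, and output filename are placeholders):

```python
# Hypothetical usage, run from a local checkout of this repo.
from pipeline import PreTrainedPipeline

pipe = PreTrainedPipeline(path=".")   # loads tokenizer, BART model and VQGAN once
image = pipe("an armchair in the shape of an avocado")   # placeholder prompt
image.save("generated.png")
```
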
requirements.txt
ADDED
@@ -0,0 +1,3 @@
transformers
flax
git+https://github.com/patil-suraj/vqgan-jax.git
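
These can be installed with `pip install -r requirements.txt`; the git URL installs the `vqgan_jax` package that pipeline.py imports.
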
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff.
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", "add_prefix_space": false, "errors": "replace", "sep_token": "</s>", "cls_token": "<s>", "pad_token": "<pad>", "mask_token": "<mask>", "model_max_length": 1024, "special_tokens_map_file": null, "name_or_path": "./artifacts/model-4oh3u7ca:v54", "tokenizer_class": "BartTokenizer"}
vocab.json
ADDED
The diff for this file is too large to render. See raw diff.