Does this model contain the MLP that translates the ViT's embedding space into the decoder's embedding space?