Instructions to use adept/fuyu-8b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use adept/fuyu-8b with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="adept/fuyu-8b")
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("adept/fuyu-8b")
model = AutoModelForImageTextToText.from_pretrained("adept/fuyu-8b")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use adept/fuyu-8b with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "adept/fuyu-8b"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "adept/fuyu-8b",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```shell
docker model run hf.co/adept/fuyu-8b
```
- SGLang
How to use adept/fuyu-8b with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "adept/fuyu-8b" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "adept/fuyu-8b",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "adept/fuyu-8b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "adept/fuyu-8b",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use adept/fuyu-8b with Docker Model Runner:
```shell
docker model run hf.co/adept/fuyu-8b
```
How does the Fuyu model get images?
The question above, because from what I'm seeing, you take an image, split it into rows of patches, and give that to the model, and supposedly there is no real architectural difference from Persimmon-8b. How are the images going in? From what I can tell, you're not making image embeddings, so how is the model understanding images?
Hi @VatsaDev, I'm not sure I understand your question exactly, but the model does have a vision layer. It is simply linear, but it does create an embedding vector of the required dimension from each patch. Then, as you said, the embeddings are combined with the text embeddings from the prompt tokens and fed into a Persimmon-8b-like architecture.
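A rough sketch of that flow, with made-up dimensions (the 4096 hidden size and plain concatenation are simplifying assumptions here; the real code scatters the image embeddings at image-token positions, as the linked modeling code shows):

```python
import torch
import torch.nn as nn

hidden = 4096                  # assumed decoder hidden size
num_patches, num_text = 4, 7   # toy sequence lengths

# Stand-ins for what the real model produces:
text_embeds = torch.randn(1, num_text, hidden)          # from the token embedding table
patch_vecs = torch.randn(1, num_patches, 3 * 30 * 30)   # flattened 30x30 RGB patches

# The "vision layer": a single Linear projecting each patch to the hidden size.
vision_proj = nn.Linear(3 * 30 * 30, hidden)
image_embeds = vision_proj(patch_vecs)                  # (1, 4, 4096)

# Combined sequence fed to the Persimmon-8b-style decoder
# (concatenation here is a simplification of the real scatter).
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([1, 11, 4096])
```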
I recommend inspecting the modeling code here to get a better sense of what the model is doing: https://github.com/huggingface/transformers/blob/9beb2737d758160e845b66742a0c01201e38007f/src/transformers/models/fuyu/modeling_fuyu.py#L154C1-L158C10
OK, so your vision layer is turning images into embeddings through an nn.Linear class?
Did you really have to train it, or does image-to-embedding just work?
Also, I'm sorry if this is too much, but I'm new to PyTorch and still learning it. Could you give me a code example of image -> embedding -> image?
The linear layer has to be trained.
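For the requested image -> embedding -> image example, here is a minimal PyTorch sketch. It mirrors the idea (30x30 patches projected by a Linear), but the dimensions are assumed, the weights are randomly initialised rather than trained, and the reverse direction is purely illustrative: Fuyu has no embedding-to-image decoder, so the reconstruction below is noise until both Linears are trained.

```python
import torch
import torch.nn as nn

# Toy image: batch of 1, RGB, 60x60 -> 2x2 = 4 non-overlapping 30x30 patches.
image = torch.rand(1, 3, 60, 60)

patch_dim = 3 * 30 * 30   # 2700 raw values per flattened patch
hidden = 4096             # assumed model dimension

# image -> embedding: cut into patches, flatten, project with a Linear.
unfold = nn.Unfold(kernel_size=30, stride=30)
patches = unfold(image).transpose(1, 2)      # (1, 4, 2700): one row per patch
to_embedding = nn.Linear(patch_dim, hidden)  # untrained here; Fuyu's is trained
embeddings = to_embedding(patches)           # (1, 4, 4096)

# embedding -> image: project back and re-assemble the patches.
to_patches = nn.Linear(hidden, patch_dim)
fold = nn.Fold(output_size=(60, 60), kernel_size=30, stride=30)
reconstructed = fold(to_patches(embeddings).transpose(1, 2))  # (1, 3, 60, 60)

print(embeddings.shape, reconstructed.shape)
```

With random weights the embeddings carry no meaning; training is what makes the projection useful to the language model, which is why the answer above says the linear layer has to be trained.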