AI Journey 2023 Baseline solution
This solution is inspired by the methodologies of FROMAGe and Kosmos-1. Following these approaches, it fine-tunes only the linear mappings from the visual and audio embedding spaces into the vector space of the language-model decoder; the response is then generated by the unmodified language model.
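The linear mapping described above can be sketched as a single projection layer. This is a minimal illustration, not the baseline's actual code: the dimensions (1024 for ImageBind embeddings, 4096 for the LM hidden size) and the class name `ModalityProjector` are assumptions for the example.

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration: ImageBind outputs 1024-d vectors,
# and the language model's hidden size is taken to be 4096.
MODALITY_DIM = 1024
LM_HIDDEN_DIM = 4096

class ModalityProjector(nn.Module):
    """Maps a modality embedding into the LM's token-embedding space.

    Only this layer is trained; the encoder and LM stay frozen.
    """
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

projector = ModalityProjector(MODALITY_DIM, LM_HIDDEN_DIM)
image_embedding = torch.randn(1, MODALITY_DIM)  # stand-in for an ImageBind output
lm_token = projector(image_embedding)           # shape: (1, 4096)
```

The projected vector can then be inserted into the LM's input sequence as if it were an ordinary token embedding.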
As the modality encoder we use ImageBind, which was trained to represent images, audio, text, and other data formats in a shared embedding space.
During the training phase, the weights of the encoder and the language model remain frozen. The only exceptions are the embeddings of the additional tokens that mark the beginning and end of each modality in the language model: <SOI>, <EOI> and <SOA>, <EOA> (S, E = Start, End; I, A = Image, Audio).
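One common way to train only the new token embeddings while keeping the rest frozen is a gradient mask on the embedding matrix. The sketch below uses toy sizes (vocabulary of 100, hidden size 16) purely for illustration; the real sizes depend on the language model used.

```python
import torch
import torch.nn as nn

# Toy stand-in for the LM's input-embedding matrix; the real vocabulary
# and hidden sizes depend on the actual language model (assumed values).
vocab_size, hidden = 100, 16
new_tokens = ["<SOI>", "<EOI>", "<SOA>", "<EOA>"]

# Embedding matrix extended with four new rows for the modality tokens.
embeddings = nn.Embedding(vocab_size + len(new_tokens), hidden)

# Train only the new rows: zero out gradients for the original vocabulary.
mask = torch.zeros_like(embeddings.weight)
mask[vocab_size:] = 1.0
embeddings.weight.register_hook(lambda grad: grad * mask)

# Forward/backward with one old token (id 0) and one new token (<SOI>).
ids = torch.tensor([[0, vocab_size]])
embeddings(ids).sum().backward()
# After backward, gradients for rows 0..vocab_size-1 are all zero.
```

With Hugging Face models, the same effect is usually achieved by adding the tokens to the tokenizer, calling `resize_token_embeddings`, and masking or selectively enabling gradients in the same way.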
Training uses four datasets: VisualDialogues, COCO Captions, Clotho v2.1, and Clotho-AQA. The core training objective is next-token prediction with CrossEntropy loss. The general architecture is illustrated here:
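The next-token-prediction objective can be written out as a short sketch. The batch, sequence, and vocabulary sizes below are arbitrary toy values; in the actual pipeline the logits would come from the frozen LM after the projected modality vectors are interleaved with text-token embeddings.

```python
import torch
import torch.nn.functional as F

# Toy sizes for illustration only (assumed, not from the baseline).
batch, seq_len, vocab = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab)          # stand-in LM outputs
labels = torch.randint(0, vocab, (batch, seq_len))   # stand-in target ids

# Shift so that position t predicts token t + 1, then apply cross-entropy.
shift_logits = logits[:, :-1, :].reshape(-1, vocab)
shift_labels = labels[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
```

Only the projector (and the four new token embeddings) receive gradients from this loss; everything else is frozen.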
Training
To reproduce the training, install the requirements and then run the notebook:
pip install -r requirements.txt