AI Journey 2023 Baseline solution
This solution is inspired by the methodologies of FROMAGe and Kosmos-1. Following these approaches, it fine-tunes only the linear mappings from the visual and audio embedding spaces into the vector space of the language-model decoder; the response is then generated by the unmodified language model.
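The linear mapping described above can be sketched as a single projection layer. This is a minimal illustration, not the baseline's actual code: the dimensions (1024 for ImageBind embeddings, 4096 for the LM hidden size) and the class name `ModalityProjector` are assumptions for the example.

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration: ImageBind outputs 1024-d vectors,
# and the language model's hidden size is taken to be 4096.
MODALITY_DIM = 1024
LM_HIDDEN_DIM = 4096

class ModalityProjector(nn.Module):
    """Maps a modality embedding into the LM's token-embedding space.

    Only this layer is trained; the encoder and LM stay frozen.
    """
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

projector = ModalityProjector(MODALITY_DIM, LM_HIDDEN_DIM)
image_embedding = torch.randn(1, MODALITY_DIM)  # stand-in for an ImageBind output
lm_token = projector(image_embedding)           # shape: (1, 4096)
```

The projected vector can then be inserted into the LM's input sequence as if it were an ordinary token embedding.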
As the modality encoder we use ImageBind, which was trained to represent images, audio, text, and other data formats in a shared embedding space.
During the training phase, the weights of the encoder and the language model remain frozen. The only exceptions are the embeddings of the additional tokens that mark the beginning and end of each modality in the language model: <SOI>, <EOI> and <SOA>, <EOA> (S, E = Start, End; I, A = Image, Audio).
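One common way to train only the new token embeddings while keeping the rest frozen is a gradient mask on the embedding matrix. The sketch below uses toy sizes (vocabulary of 100, hidden size 16) purely for illustration; the real sizes depend on the language model used.

```python
import torch
import torch.nn as nn

# Toy stand-in for the LM's input-embedding matrix; the real vocabulary
# and hidden sizes depend on the actual language model (assumed values).
vocab_size, hidden = 100, 16
new_tokens = ["<SOI>", "<EOI>", "<SOA>", "<EOA>"]

# Embedding matrix extended with four new rows for the modality tokens.
embeddings = nn.Embedding(vocab_size + len(new_tokens), hidden)

# Train only the new rows: zero out gradients for the original vocabulary.
mask = torch.zeros_like(embeddings.weight)
mask[vocab_size:] = 1.0
embeddings.weight.register_hook(lambda grad: grad * mask)

# Forward/backward with one old token (id 0) and one new token (<SOI>).
ids = torch.tensor([[0, vocab_size]])
embeddings(ids).sum().backward()
# After backward, gradients for rows 0..vocab_size-1 are all zero.
```

With Hugging Face models, the same effect is usually achieved by adding the tokens to the tokenizer, calling `resize_token_embeddings`, and masking or selectively enabling gradients in the same way.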
Training uses four datasets: VisualDialogues, COCO Captions, Clotho v2.1, and Clotho-AQA. The core training objective is next-token prediction with CrossEntropy loss. The general architecture is illustrated here:
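The next-token-prediction objective can be written out as a short sketch. The batch, sequence, and vocabulary sizes below are arbitrary toy values; in the actual pipeline the logits would come from the frozen LM after the projected modality vectors are interleaved with text-token embeddings.

```python
import torch
import torch.nn.functional as F

# Toy sizes for illustration only (assumed, not from the baseline).
batch, seq_len, vocab = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab)          # stand-in LM outputs
labels = torch.randint(0, vocab, (batch, seq_len))   # stand-in target ids

# Shift so that position t predicts token t + 1, then apply cross-entropy.
shift_logits = logits[:, :-1, :].reshape(-1, vocab)
shift_labels = labels[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
```

Only the projector (and the four new token embeddings) receive gradients from this loss; everything else is frozen.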
Training
To reproduce the training, install the requirements and then run the notebook:
pip install -r requirements.txt