X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou, Le Xue*, Ning Yu*, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles
*equal mentorship contribution.
Overview
X-InstructBLIP is a simple yet effective multimodal framework built on top of a frozen LLM, capable of seamlessly integrating and managing an ad-hoc number of modalities. Although each modality projection is trained individually, X-InstructBLIP demonstrates joint reasoning abilities on par with models trained specifically on combined-modality datasets, such as video-audio.

Installation
LAVIS Repository
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS-XInstructBLIP
pip install -e .
Evaluation Dependencies
For the evaluation code, make sure that Java is installed on the system by running
apt-get update
apt-get install default-jre
KNN Cuda
Install KNN_CUDA, which is required by the 3D encoder.
pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
Make sure that ninja is installed:
wget -P /usr/bin https://github.com/unlimblue/KNN_CUDA/raw/master/ninja
Install NLTK-CMUDICT
>>> import nltk
>>> nltk.download('cmudict')
Pretrained Models
Language Model Weights
First download the Vicuna v1.1 weights following the instructions here. Update the parameter llm_model in configs/models/blip2/blip2_xinstruct_vicuna7b.yaml and configs/models/blip2/blip2_xinstruct_vicuna13b.yaml and in the demo configs under projects/xinstructblip/demo/configs to the path of the downloaded model folder.
X-InstructBLIP Weights
Weights of the model are released here (7b) and here (13b). When loading the model through the LAVIS codebase, they are downloaded automatically.
from lavis.models import load_model
model = load_model("blip2_vicuna_xinstruct", "vicuna7b")
BEATs pretrained model
Download the pretrained weights of the BEATs encoder here and update the key
audio_encoder_kwargs : {"checkpoint_path": "/path/to/BEATs_iter3_plus_AS2M.pt"}
in the X-InstructBLIP (7b) and (13b) configs found in configs/models/blip2/blip2_xinstruct_vicuna7b.yaml and configs/models/blip2/blip2_xinstruct_vicuna13b.yaml respectively.
Config
X-InstructBLIP can be modified in various ways through its config. Each input field is documented below.
General Settings
- arch: Architecture of the model, blip2_vicuna_xinstruct.
- model_type: Specific model variant, vicuna7b.
- load_pretrained: Boolean to load a pre-trained model (True/False).
Pretrained Model Paths
- pretrained: Path to the pretrained model.
- load_finetuned: Load a finetuned model version (True/False).
- finetuned: Path to the finetuned model.
Model Components
- image_model, pc_model, video_model, audio_model: Models for image, point cloud, video, and audio encoders.
Pretrained Component Paths
- pretrained_image_qformer, pretrained_pc_qformer, pretrained_video_qformer: Paths to pretrained models for image, point cloud, and video qformers.
Load Attention and Projection Settings
- Parameters for loading attention mechanisms and projections in image, PC, video, and audio components.
- projection_only_{modality}: Use a projection instead of a Q-Former for the specified modality.
- load_attention_{modality}_qformer: Load the pretrained Q-Former cross-attention to text for the given modality.
- load_ln_type_{modality}: Key specification of the encoder layer norm to load from the state_dict; keys are of the form {load_ln_type}_ln.
- load_qformer_type_{modality}: Key specification of the Q-Former to load from the state_dict; keys are of the form {load_qformer_type}_qformer.
- load_projection_{modality}: Boolean indicating whether to load the pretrained LLM projection for the modality.
Encoder Settings
- image_encoder_kwargs, pc_encoder_kwargs, video_encoder_kwargs, audio_encoder_kwargs: Encoder arguments for various modalities.
Precision and Freeze Settings
- image_precision, pc_precision, video_precision, audio_precision: Precision (e.g., "fp16") for different modalities.
- freeze_image, freeze_pc, freeze_video, freeze_audio: Freeze respective model components.
Query and Text Settings
- num_query_token: Number of query tokens.
- llm_model: Path to the language model.
- prompt: Query prompt format.
- max_txt_len, max_output_txt_len: Max lengths for input and output text.
- apply_lemmatizer: Use lemmatizer on text.
Text Input
- qformer_text_input, llm_text_input: Whether to use text input for q-former and language model.
Modality and Cues
- modalities: List of used modalities, e.g., ["audio"].
- use_cues: Use cues in the model.
Shared Q-Former Settings
- Shared qformer related parameters, including paths, load settings, and dimensions.
Additional Settings
- prefix, postfix: Texts for model input.
- predict_with_gen: Model prediction includes generation.
- proj_dim: Projection hidden dimension, defaults to 1.
- special_qformer_input_prompt: Use a Q-Former input prompt different from the one given to the LLM.
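To make the fields above concrete, here is a minimal illustrative config restricted to the audio modality, expressed as a Python dict mirroring the YAML keys documented above. All paths are placeholders, and the specific values (e.g. the audio_model name and num_query_token) are assumptions for illustration, not the released settings:

```python
# Minimal illustrative X-InstructBLIP config (audio only), mirroring the
# documented YAML fields. Paths and several values are placeholders.
config = {
    # General settings
    "arch": "blip2_vicuna_xinstruct",
    "model_type": "vicuna7b",
    "load_pretrained": True,
    # Pretrained model paths (placeholders)
    "pretrained": "/path/to/pretrained.pth",
    "load_finetuned": False,
    # Modalities and components
    "modalities": ["audio"],
    "audio_model": "beats",  # assumed name for the BEATs encoder
    "audio_encoder_kwargs": {"checkpoint_path": "/path/to/BEATs_iter3_plus_AS2M.pt"},
    "audio_precision": "fp16",
    "freeze_audio": True,
    # Query and text settings
    "num_query_token": 32,  # assumed, following common BLIP-2 defaults
    "llm_model": "/path/to/vicuna-7b-v1.1",
    "qformer_text_input": True,
    "llm_text_input": True,
}

# Sanity-check that the required general settings are present.
for key in ("arch", "model_type", "modalities"):
    assert key in config, f"missing required config field: {key}"
```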
Demo
A Gradio-based demo is available:
cd projects/xinstructblip/demo
python run_demo.py --model_type vicuna7b
The model configs can be adapted and are found in projects/xinstructblip/demo/configs.
Result Replication
A collection of configs for training and evaluation of all the numbers reported in the paper is available at
lavis/projects/xinstructblip/train and lavis/projects/xinstructblip/eval. For internal reproducibility, the data paths in these configs are absolute paths on our servers; replace them with the corresponding data links and storage paths for your data, as downloaded in the Data section.
Files can be run as follows:
python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path path/to/train/or/eval/config
For example, to train the 3D Q-Former, run
python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/xinstruct_blip/train/vicuna7b/pc_training.yaml
It is recommended to use GPUs with 40GB of RAM. If this is not available, results can be partially replicated by loading the Vicuna LLM in 8-bit instead of 16-bit. More details here.
Data
X-InstructBLIP is trained and evaluated on a collection of public and generated datasets, as shown in the figure below.

The data configuration files in LAVIS-xgen_mm/configs/datasets automatically download the data annotation files when the datasets are loaded. The (image, audio, video, 3D) data can be downloaded from the links below. Once the datasets are downloaded, update the storage field in the corresponding config file with the path where the data is located.
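Updating the storage field can be scripted if you download many datasets. The helper below is an illustrative sketch, not part of the codebase; it assumes the dataset configs are plain YAML with a `storage:` key and simply rewrites its value via a regex:

```python
import re

# Illustrative helper (not part of LAVIS): point every `storage:` key in a
# dataset config's YAML text at a local data directory.
def set_storage(yaml_text: str, new_path: str) -> str:
    # Replace the value after each `storage:` key, keeping the key and indent.
    return re.sub(r"(storage:\s*).*", r"\g<1>" + new_path, yaml_text)

example = "datasets:\n  coco_caption:\n    storage: /export/server/coco\n"
print(set_storage(example, "/data/coco"))
# The storage value now points at /data/coco.
```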
Image
Train
- Conceptual Captions 12m
- SBU Captions
- Visual Genome
- COCO
- CapFilt combination of CC3M, CC12M, and SBU
- LLaVa150k
- VGQA: see Visual Genome
- VQAv2
- OKVQA
- AOKVQA
- OCRVQA
Eval
Audio
Train
Eval
Video
Train
Eval
3D
Data Augmentation
We release scripts to generate data, both for training-data augmentation and for generating the DisCRn dataset.
QA Data Augmentation
The files projects/xinstructblip/data_aug/3d_qa_data_generation.py and projects/xinstructblip/data_aug/audio_qa_data_generation.py are used to generate the 3D and Audio QA data from Cap3D and Audiocaps respectively.
3DQA
Download the captions for the Objaverse data here. The script then takes 4 arguments:
- shard: 1 through 4; for speed, the data is processed in 4 batches, which can be spread across machines or run consecutively on the same machine.
- mode: color_removal removes color concepts from captions (since the encoder does not encode color), qa_gen generates question-answer pairs, rtc performs the round-trip consistency check, and all runs all of these steps in this order.
- split: train or val. Assumes a separation of the data, which can be found in these train and dev files. If the split is not added to the CSV, it can be set arbitrarily and all the data in the file will be processed.
- original_data_file: the path to the Objaverse captions downloaded above.
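The argument set above can be sketched as an argparse interface. This is a hypothetical illustration of how the CLI might be structured; the actual flag names and defaults in 3d_qa_data_generation.py may differ:

```python
import argparse

# Hypothetical sketch of the CLI described above; the real script's flag
# names and defaults may differ.
parser = argparse.ArgumentParser(description="3D QA data generation (sketch)")
parser.add_argument("--shard", type=int, choices=[1, 2, 3, 4],
                    help="which of the 4 data batches to process")
parser.add_argument("--mode",
                    choices=["color_removal", "qa_gen", "rtc", "all"],
                    help="processing stage to run ('all' runs every stage in order)")
parser.add_argument("--split", choices=["train", "val"],
                    help="data split to process")
parser.add_argument("--original_data_file",
                    help="path to the downloaded Objaverse captions")

# Example invocation with placeholder values.
args = parser.parse_args(["--shard", "1", "--mode", "all",
                          "--split", "train",
                          "--original_data_file", "captions.csv"])
```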
AudioQA
Download the Audiocaps captions from here. The script takes similar arguments to the above:
- shard: 1 through 4; for speed, the data is processed in 4 batches, which can be spread across machines or run consecutively on the same machine.
- mode: qa_gen generates question-answer pairs, rtc performs the round-trip consistency check, and all runs all of these steps in this order.
- split: train, val, or test.
- original_data_file: the path to the Audiocaps captions downloaded above for the relevant split.
DisCRn
The dataset is found here: Audio-Video and Image-3D.
The files projects/xinstructblip/discrn/data_generation/objaverse_img_3d.py and projects/xinstructblip/discrn/data_generation/audiocaps_video_audio.py generate the image-3D and audio-video cross-modal reasoning pairs for the DisCRn task.
Image-3D
The arguments are as above, with the same 3D caption data:
- shard, original_data_file, split: as above.
- mode: in this case it can take the values property to generate properties, get_pairs to identify new pairs, instruction_gen to generate instruction-answer pairs, and rtc to perform the round-trip consistency check; all runs them all in order.
- rnd: adds an identifier to output files in the case of multiple generations.
Audio Video
The arguments are as above, with the same audio caption data. Note that you should update VIDEO_PATH in the script with the path to the corresponding videos downloaded from Audiocaps. An easy way to obtain them is to use the youtube-id associated with each clip and a package such as youtube-dl.
- shard, original_data_file, split: as above.
- mode: in this case it can take the values property to generate properties, get_pairs to identify new pairs, instruction_gen to generate instruction-answer pairs, and rtc to perform the round-trip consistency check; all runs them all in order.
- rnd: adds an identifier to output files in the case of multiple generations.
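One way to fetch the videos is to build youtube-dl commands from the YouTube ids in the Audiocaps CSV. The snippet below is a sketch under assumed names: the id list, output directory, and filename template are placeholders, and you should substitute the ids from your downloaded caption file:

```python
# Illustrative sketch: build youtube-dl download commands for Audiocaps clips
# from their YouTube ids. The id list and output directory are placeholders.
youtube_ids = ["abc123XYZ"]  # placeholder ids taken from the Audiocaps CSV
VIDEO_PATH = "/data/audiocaps/videos"  # should match VIDEO_PATH in the script

commands = [
    f"youtube-dl -o '{VIDEO_PATH}/{vid}.%(ext)s' "
    f"'https://www.youtube.com/watch?v={vid}'"
    for vid in youtube_ids
]
print(commands[0])
```

Each command can then be executed with a shell or subprocess call; failed downloads (removed videos) should be skipped and their captions filtered out.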
Cite
@misc{panagopoulou2023xinstructblip,
title={X-InstructBLIP: A Framework for aligning X-Modal instruction-aware
representations to LLMs and Emergent Cross-modal Reasoning},
author={Artemis Panagopoulou and Le Xue and Ning Yu and Junnan Li and Dongxu Li and
Shafiq Joty and Ran Xu and Silvio Savarese and Caiming Xiong and Juan Carlos Niebles},
year={2023},
eprint={2311.18799},
archivePrefix={arXiv},
primaryClass={cs.CV}
}