---
language:
- en
pipeline_tag: visual-question-answering
library_name: transformers
inference: false
---

<br>
<br>

# BLIVA Model Card

## Model details

**Model type:**
BLIVA is an open-source vision-language model trained by initializing from InstructBLIP and aligning it with Vicuna on multimodal instruction-finetuning data.
It is composed of an EVA-CLIP vision encoder, a Q-Former, a projection layer, and an auto-regressive language model based on the decoder-only transformer architecture.
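
As a rough mental model of how these components fit together, the sketch below wires toy stand-ins for them in PyTorch. It is purely illustrative: the module choices, dimensions, and names (`ToyBLIVA`, `VISION_DIM`, etc.) are placeholder assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

# Toy dimensions, chosen only for illustration.
VISION_DIM, QFORMER_DIM, LLM_DIM, NUM_QUERIES = 64, 48, 96, 8


class ToyBLIVA(nn.Module):
    """Illustrative wiring of the components named above (not the actual implementation)."""

    def __init__(self):
        super().__init__()
        # Stand-in for the EVA-CLIP vision encoder: maps flattened image patches to visual features.
        self.vision_encoder = nn.Linear(3 * 16 * 16, VISION_DIM)
        # Stand-in for the Q-Former: learned query tokens that cross-attend to the visual features.
        self.query_tokens = nn.Parameter(torch.randn(1, NUM_QUERIES, QFORMER_DIM))
        self.qformer = nn.MultiheadAttention(QFORMER_DIM, num_heads=4, kdim=VISION_DIM,
                                             vdim=VISION_DIM, batch_first=True)
        # Projection layer: maps Q-Former outputs into the language model's embedding space.
        self.projection = nn.Linear(QFORMER_DIM, LLM_DIM)
        # Stand-in for the decoder-only language model (Vicuna).
        self.llm = nn.TransformerEncoderLayer(d_model=LLM_DIM, nhead=4, batch_first=True)

    def forward(self, patch_pixels, text_embeds):
        visual_feats = self.vision_encoder(patch_pixels)              # (B, N_patches, VISION_DIM)
        queries = self.query_tokens.expand(patch_pixels.size(0), -1, -1)
        q_out, _ = self.qformer(queries, visual_feats, visual_feats)  # (B, NUM_QUERIES, QFORMER_DIM)
        visual_tokens = self.projection(q_out)                        # project into LLM space
        # Visual tokens are prepended to the text embeddings and fed to the language model.
        return self.llm(torch.cat([visual_tokens, text_embeds], dim=1))


model = ToyBLIVA()
out = model(torch.randn(2, 196, 3 * 16 * 16), torch.randn(2, 10, LLM_DIM))
print(out.shape)  # torch.Size([2, 18, 96])
```

In the real model each stand-in is replaced by the pretrained EVA-CLIP ViT, the InstructBLIP Q-Former, and the Vicuna decoder, respectively.
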
**Model date:**
BLIVA_Vicuna was trained in July 2023.

**Paper or resources for more information:**
https://gordonhu608.github.io/bliva/

**License:**
Non-commercial bespoke license

**Where to send questions or comments about the model:**
https://github.com/mlpc-ucsd/BLIVA

## Intended use
**Primary intended uses:**
The primary use of BLIVA is research on large multimodal models.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## Training dataset
Pre-training data: 558K filtered image-text pairs from LAION, CC-3M, and SBU, selected by LLaVA.

Instruction-finetuning data: COCO-Caption, TextCaps, VQAv2, OKVQA, A-OKVQA, LLaVA-150K, OCR-VQA.

## Evaluation dataset
For zero-shot evaluation on general image tasks, we selected NoCaps, Flickr30K, VizWiz, Visual Spatial Reasoning (VSR), IconQA, Visual Dialog, ScienceQA, MSRVTT QA, TextVQA, and Hateful Memes.

For zero-shot evaluation on text-rich image OCR tasks, we selected ST-VQA, OCR-VQA, TextVQA, and DocVQA.

More details are available in our GitHub repository: https://github.com/mlpc-ucsd/BLIVA