---
base_model: BLIP-2
library_name: peft
---
|
|
|
|
|
# Model Card for BLIP-2: Bootstrapping Language-Image Pre-training
|
|
|
|
|
BLIP-2 is a unified vision-language model designed for tasks such as image captioning and visual question answering. It employs a pre-training strategy that leverages frozen pre-trained image encoders and frozen large language models (LLMs) to efficiently bridge the modality gap between vision and language.
|
|
|
|
|
## Model Details
|
|
|
|
|
### Model Description
|
|
|
|
|
BLIP-2 (Bootstrapping Language-Image Pre-training) introduces a lightweight Querying Transformer (Q-Former) that connects a frozen image encoder with a frozen LLM. This architecture enables effective vision-language understanding and generation without the need for end-to-end training of large-scale models. The model is capable of zero-shot image-to-text generation and can follow natural language instructions.
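
Because both the image encoder and the LLM stay frozen, the base model can be used directly for zero-shot captioning. Below is a minimal inference sketch using the `transformers` integration of BLIP-2; the checkpoint id (`Salesforce/blip2-opt-2.7b`) and the sample image URL are illustrative assumptions rather than details from this card.

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed base checkpoint; substitute the checkpoint this adapter was trained from.
checkpoint = "Salesforce/blip2-opt-2.7b"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=dtype).to(device)

# Zero-shot image captioning: pass only the image, no text prompt.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image (assumption)
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```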
|
|
|
|
|
- **Developed by:** Salesforce AI Research
- **Funded by:** Salesforce
- **Shared by:** Official BLIP-2 repository
- **Model type:** Vision-language model
- **Language(s):** English
- **Finetuned from model:** BLIP-2 base pretrained on the COCO dataset
|
|
|
|
|
### Model Sources
|
|
|
|
|
- **Repository:** [BLIP-2 Official GitHub](https://github.com/salesforce/LAVIS/tree/main/projects/blip2)
- **Paper:** [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597)
- **Dataset:** [Recipes Dataset](https://www.kaggle.com/datasets/pes12017000148/food-ingredients-and-recipe-dataset-with-images)
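
Since this card describes a PEFT adapter (`library_name: peft`) fine-tuned from the BLIP-2 base, the adapter weights are loaded on top of the frozen base model. The sketch below shows this under stated assumptions: the base checkpoint id and the adapter repository id are placeholders, not identifiers from this card.

```python
import torch
from peft import PeftModel
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Placeholder ids (assumptions); replace with the actual base checkpoint and adapter repo.
base_checkpoint = "Salesforce/blip2-opt-2.7b"
adapter_id = "your-username/blip2-recipes-peft"

processor = Blip2Processor.from_pretrained(base_checkpoint)
base_model = Blip2ForConditionalGeneration.from_pretrained(
    base_checkpoint, torch_dtype=torch.float16
)

# Wrap the frozen base model with the trained adapter weights.
model = PeftModel.from_pretrained(base_model, adapter_id)
model.eval()
```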
|
|
|