BLIP2 for retrieval

#27

by deleted - opened Feb 26, 2024

Discussion

deleted

Feb 26, 2024

Is there a way to use the Hugging Face model for cross-modal retrieval tasks?

nielsr

Feb 26, 2024

There's an effort to add it: https://github.com/huggingface/transformers/pull/29261

skycyou

Sep 12, 2024

It seems there is no suitable checkpoint on Hugging Face for Blip2ForImageTextRetrieval. The existing models cannot be loaded correctly for retrieval tasks.

nielsr

Sep 12, 2024

•

edited Sep 12, 2024

The PR above has been merged, so the Blip2ForImageTextRetrieval class is now available.

There are 2 checkpoints available:

https://huggingface.co/Salesforce/blip2-itm-vit-g

https://huggingface.co/Salesforce/blip2-itm-vit-g-coco

skycyou

Sep 12, 2024

•

edited Sep 12, 2024

As far as I know, blip2-itm-vit-g does not work well.

Here are the logs:
Some weights of the model checkpoint at ../Salesforce/blip2-itm-vit-g were not used when initializing Blip2ForImageTextRetrieval: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'qformer.embeddings.LayerNorm.bias', 'qformer.embeddings.LayerNorm.weight', 'qformer.embeddings.position_embeddings.weight', 'qformer.embeddings.word_embeddings.weight', 'temp', 'text_proj.bias', 'text_proj.weight', 'vision_proj.bias', 'vision_proj.weight']

This IS expected if you are initializing Blip2ForImageTextRetrieval from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing Blip2ForImageTextRetrieval from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Blip2ForImageTextRetrieval were not initialized from the model checkpoint at /home/pyr/pretrained_models/Salesforce/blip2-itm-vit-g and are newly initialized: ['embeddings.position_embeddings.weight', 'embeddings.word_embeddings.weight', 'qformer.layernorm.bias', 'qformer.layernorm.weight', 'text_projection.bias', 'text_projection.weight', 'vision_projection.bias', 'vision_projection.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Expanding inputs for image tokens in BLIP-2 should be done in processing. Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.

nielsr

Sep 14, 2024

Hi,

It looks like you may need to update your Transformers version. The following code snippet works for me:

import torch
from PIL import Image
import requests
from transformers import AutoProcessor, Blip2ForImageTextRetrieval

device = "cuda" if torch.cuda.is_available() else "cpu"

model = Blip2ForImageTextRetrieval.from_pretrained("Salesforce/blip2-itm-vit-g", torch_dtype=torch.float16)
processor = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")

model.to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "two cats laying on a pink blanket"

inputs = processor(images=image, text=text, return_tensors="pt").to(device, torch.float16)
itm_out = model(**inputs, use_image_text_matching_head=True)
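# Note: softmax is applied twice (on this line and the next), so `probs` is a softmax over probabilities rather than over the raw ITM logits.
logits_per_image = torch.nn.functional.softmax(itm_out.logits_per_image, dim=1)
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities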

print("Probs:", probs)

which prints:

Expanding inputs for image tokens in BLIP-2 should be done in processing. Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Probs: tensor([[0.2693, 0.7305]], dtype=torch.float16, grad_fn=<SoftmaxBackward0>)
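
One detail worth noting when reading these numbers: the snippet above calls softmax twice, once via torch.nn.functional.softmax and once via .softmax(dim=1). The following is a minimal standard-library sketch of what a second softmax does to two-class probabilities. The logits are made-up values chosen for illustration, not outputs of the model, and the helper name softmax is local to this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical two-class logits (illustrative only, not taken from the model).
logits = [-3.0, 4.0]

once = softmax(logits)  # probabilities from the logits
twice = softmax(once)   # softmax applied again, this time to probabilities

# With two classes, the inputs to the second softmax lie in [0, 1],
# so its largest output can never exceed e / (1 + e).
upper_bound = math.e / (1.0 + math.e)

print("softmax once: ", [round(p, 4) for p in once])
print("softmax twice:", [round(p, 4) for p in twice])
print("upper bound after second softmax:", round(upper_bound, 4))
```

For two classes, a second softmax cannot return more than e / (1 + e), roughly 0.731. The 0.7305 printed above sits close to that cap, so the underlying match probability after a single softmax is probably much higher. Dropping one of the two softmax calls should show it.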

skycyou

Sep 18, 2024

•

edited Sep 18, 2024

Thank you for your reply; it was very helpful. It does seem to be a version issue.
May I ask another question? In the LAVIS library examples, the extracted image features and text features each consist of multiple low-dimensional vectors.
However, I noticed that using

image_emb = model.extract_features(sample, mode="image").image_embeds[:, 0, :]  # size (768)
text_emb = model.extract_features(sample, mode="text").text_embeds[:, 0, :]  # size (768)

also seems to work.
What is the difference between these two methods? Is there detailed documentation available somewhere? Thank you.
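
One way to picture the difference is with toy vectors. This sketch assumes the image-text contrastive scoring described in the BLIP-2 paper, where an image yields several query embeddings and the image-text similarity is the highest similarity between any query embedding and the single text embedding. The vectors, their sizes, and all names below are made up for illustration; they are not LAVIS or Transformers outputs.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins: three query embeddings for one image and one text embedding.
image_queries = [
    normalize([1.0, 0.0, 0.0]),
    normalize([0.0, 1.0, 0.0]),
    normalize([0.6, 0.8, 0.0]),
]
text_embed = normalize([0.0, 1.0, 0.0])

# Similarity of every query embedding with the text embedding.
sims = [dot(query, text_embed) for query in image_queries]

first_query_score = sims[0]   # keeps only index 0, like [:, 0, :]
max_query_score = max(sims)   # keeps the best-matching query

print("per-query similarities:", sims)
print("first query only:", first_query_score)
print("max over queries:", max_query_score)
```

In this toy case the first query happens to be orthogonal to the text, so indexing only position 0 misses a match that the max over queries finds. Whether the same gap appears with real BLIP-2 features depends on the model and the data.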
