BLIP2 for retrieval
There's an effort to add it: https://github.com/huggingface/transformers/pull/29261
However, there seems to be no suitable checkpoint on Hugging Face for Blip2ForImageTextRetrieval; the existing models cannot be loaded correctly for retrieval tasks.
The PR above has been merged, so the Blip2ForImageTextRetrieval class is now available.
There are 2 checkpoints available:
- https://huggingface.co/Salesforce/blip2-itm-vit-g
- https://huggingface.co/Salesforce/blip2-itm-vit-g-coco
As far as I know, blip2-itm-vit-g does not work well.
Here are the logs:
Some weights of the model checkpoint at ../Salesforce/blip2-itm-vit-g were not used when initializing Blip2ForImageTextRetrieval: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'qformer.embeddings.LayerNorm.bias', 'qformer.embeddings.LayerNorm.weight', 'qformer.embeddings.position_embeddings.weight', 'qformer.embeddings.word_embeddings.weight', 'temp', 'text_proj.bias', 'text_proj.weight', 'vision_proj.bias', 'vision_proj.weight']
- This IS expected if you are initializing Blip2ForImageTextRetrieval from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Blip2ForImageTextRetrieval from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Blip2ForImageTextRetrieval were not initialized from the model checkpoint at /home/pyr/pretrained_models/Salesforce/blip2-itm-vit-g and are newly initialized: ['embeddings.position_embeddings.weight', 'embeddings.word_embeddings.weight', 'qformer.layernorm.bias', 'qformer.layernorm.weight', 'text_projection.bias', 'text_projection.weight', 'vision_projection.bias', 'vision_projection.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Expanding inputs for image tokens in BLIP-2 should be done in processing. Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
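One way to read the two warnings above is to compare the key names side by side: several of the unused checkpoint keys look like differently named versions of the newly initialized ones (for example `text_proj` versus `text_projection`). The sketch below pairs them up. The rename table is purely an assumption inferred from the log, not taken from the Transformers source, and `HYPOTHETICAL_RENAMES`, `apply_renames`, and `pair_keys` are illustrative names.

```python
# Key names copied from the two warnings in the log above.
unused = [
    "cls.predictions.bias", "cls.predictions.decoder.bias",
    "cls.predictions.decoder.weight",
    "cls.predictions.transform.LayerNorm.bias",
    "cls.predictions.transform.LayerNorm.weight",
    "cls.predictions.transform.dense.bias",
    "cls.predictions.transform.dense.weight",
    "qformer.embeddings.LayerNorm.bias", "qformer.embeddings.LayerNorm.weight",
    "qformer.embeddings.position_embeddings.weight",
    "qformer.embeddings.word_embeddings.weight",
    "temp", "text_proj.bias", "text_proj.weight",
    "vision_proj.bias", "vision_proj.weight",
]
newly_initialized = [
    "embeddings.position_embeddings.weight", "embeddings.word_embeddings.weight",
    "qformer.layernorm.bias", "qformer.layernorm.weight",
    "text_projection.bias", "text_projection.weight",
    "vision_projection.bias", "vision_projection.weight",
]

# ASSUMPTION: prefix renames guessed from the log; order matters
# (the more specific LayerNorm rule must come before the generic one).
HYPOTHETICAL_RENAMES = [
    ("qformer.embeddings.LayerNorm.", "qformer.layernorm."),
    ("qformer.embeddings.", "embeddings."),
    ("text_proj.", "text_projection."),
    ("vision_proj.", "vision_projection."),
]

def apply_renames(key):
    """Return the key with the first matching prefix rename applied."""
    for old, new in HYPOTHETICAL_RENAMES:
        if key.startswith(old):
            return new + key[len(old):]
    return key

def pair_keys(unused_keys, new_keys):
    """Pair unused keys with newly initialized keys under the guessed renames."""
    remaining = set(new_keys)
    pairs, leftover_unused = [], []
    for key in unused_keys:
        candidate = apply_renames(key)
        if candidate in remaining:
            pairs.append((key, candidate))
            remaining.discard(candidate)
        else:
            leftover_unused.append(key)
    return pairs, leftover_unused, sorted(remaining)

pairs, leftover_unused, leftover_new = pair_keys(unused, newly_initialized)
for old, new in pairs:
    print(f"{old}  ->  {new}")
print("unused with no counterpart:", leftover_unused)
print("newly initialized with no counterpart:", leftover_new)
```

If every newly initialized key has a plausible counterpart among the unused ones, that points toward a key-naming mismatch between the checkpoint and the installed library rather than genuinely missing weights.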
Hi,
It looks like you may need to update your Transformers version. The following code snippet works for me:
import torch
from PIL import Image
import requests
from transformers import AutoProcessor, Blip2ForImageTextRetrieval
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Blip2ForImageTextRetrieval.from_pretrained("Salesforce/blip2-itm-vit-g", torch_dtype=torch.float16)
processor = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")
model.to(device)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "two cats laying on a pink blanket"
inputs = processor(images=image, text=text, return_tensors="pt").to(device, torch.float16)
itm_out = model(**inputs, use_image_text_matching_head=True)
logits_per_image = torch.nn.functional.softmax(itm_out.logits_per_image, dim=1)  # softmax over the two image-text matching classes
probs = logits_per_image.softmax(dim=1)  # note: this applies softmax a second time, so the printed values are compressed toward 0.5
print("Probs:", probs)
which prints:
Expanding inputs for image tokens in BLIP-2 should be done in processing. Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Probs: tensor([[0.2693, 0.7305]], dtype=torch.float16, grad_fn=<SoftmaxBackward0>)
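Note that the snippet above applies softmax twice. For two classes, a second softmax over values that already lie in [0, 1] can produce an output ratio of at most e, which is about what the printed probabilities show. A minimal stdlib-only sketch of the effect follows; the logits -3.5 and 3.5 are made-up values chosen for illustration, not the model's actual output.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# ASSUMPTION: hypothetical two-class logits, not taken from the model.
hypothetical_logits = [-3.5, 3.5]

once = softmax(hypothetical_logits)   # sharp distribution, close to one-hot
twice = softmax(once)                 # softmax of probabilities: much flatter

print("softmax once: ", [round(p, 4) for p in once])
print("softmax twice:", [round(p, 4) for p in twice])
```

If a single softmax is what is intended, calling `itm_out.logits_per_image.softmax(dim=1)` once should give the sharper distribution.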
Thank you for your reply; it was very helpful to me. It indeed seems to be a version issue.
May I also ask another question? In the examples of the LAVIS library, the extracted image features and text features are each a set of multiple low-dimensional vectors.
However, I noticed that using
image_emb = model.extract_features(sample, mode="image").image_embeds[:,0,:] # size (768)
text_emb = model.extract_features(sample, mode="text").text_embeds[:,0,:] # size (768)
seems to also work.
What is the difference between these two methods? Or is there detailed documentation available somewhere? Thank you.
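As a rough illustration of the difference being asked about: the BLIP-2 paper describes the image-text contrastive score as the highest similarity between any of the query output vectors and the text vector, whereas indexing with `[:,0,:]` keeps only the first vector. The sketch below uses random stand-in vectors and the standard library only. The sizes (32 query vectors of dimension 256) and all names are assumptions for illustration, and nothing here calls LAVIS.

```python
import math
import random

random.seed(0)

# ASSUMPTION: illustrative sizes, not read from any model config.
NUM_QUERIES, DIM = 32, 256

def random_unit_vector(dim):
    """Random vector scaled to unit length, so dot products are cosine similarities."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Stand-ins for one image's per-query vectors and one text vector.
image_vectors = [random_unit_vector(DIM) for _ in range(NUM_QUERIES)]
text_vector = random_unit_vector(DIM)

# Cosine similarity of every query vector with the text vector.
sims = [dot(q, text_vector) for q in image_vectors]

first_only = sims[0]          # analogous to indexing [:, 0, :]
max_over_queries = max(sims)  # the max-over-queries score described in the paper

print("first vector only:", first_only)
print("max over queries: ", max_over_queries)
```

Both scores can be used to rank candidates, but they are not the same quantity: the first-vector score ignores the other query outputs, while the max-over-queries score picks whichever query aligns best with the text.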