Can InstructBLIP process videos?
I recently looked at the source for blip2_vicuna-instruct7b in the Salesforce/LAVIS repository and found code for handling videos. I don't know whether this is also in the Hugging Face InstructBLIP model, so I'm asking: can InstructBLIP handle videos, and if so, how do I go about it?
Hi,
Thanks for your interest in InstructBLIP. Support for videos is not yet present in the Transformers library. Did the authors release any checkpoints trained on video?
I'm not aware of any at the moment; I'll check whether there are. What I saw was just a line of code for handling videos with a low frame count.
Also interested in processing videos.
Can you share the snippet for handling videos from the original authors? It can probably be adapted a bit to work with the Transformers model.
Hi,
I'm trying to run the demo at the end of the page https://huggingface.co/docs/transformers/main/en/model_doc/instructblip#transformers.InstructBlipForConditionalGeneration and the model does not generate text. Instead it gives this:
Loading checkpoint shards: 100%|██████████| 4/4 [00:24<00:00, 6.10s/it]
/home/tanya.kaintura/Project/myenv/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:412: UserWarning: do_sample is set to False. However, top_p is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
warnings.warn(
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
Hi,
For videos I recommend taking a look at VideoBLIP: https://huggingface.co/models?other=video-to-text
Update, InstructBLIP-Video is now supported! https://huggingface.co/docs/transformers/main/en/model_doc/instructblipvideo
Unfortunately, the generation example on the page is not working. Additionally, could you please provide an example of video and text feature extraction? Error message: TypeError: InstructBlipVideoForConditionalGeneration.forward() got an unexpected keyword argument 'videos'
@CennetOguz the example code had typos, will fix it on main soon. You can use the following code to generate:
from transformers import InstructBlipVideoProcessor, InstructBlipVideoForConditionalGeneration
import torch
from huggingface_hub import hf_hub_download
import av
import numpy as np
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])
model = InstructBlipVideoForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b", device_map="auto")
processor = InstructBlipVideoProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
file_path = hf_hub_download(repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset")
container = av.open(file_path)
# sample uniformly 4 frames from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 4).astype(int)
clip = read_video_pyav(container, indices)
prompt = "What is happening in the video?"
inputs = processor(text=prompt, images=clip, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=False,
    num_beams=5,
    max_length=256,
    repetition_penalty=1.5,
    length_penalty=1.0,
)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
Regarding feature extraction:
From the project repo, it looks like InstructBLIP models do not support feature extraction, since they have no projection head to map text and vision embeddings into the same latent space. However, BLIP-2 does support feature extraction, as shown in this notebook. The PR adding ITM capability to the Transformers BLIP-2 implementation is in progress; you can track it here
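For anyone adapting the frame sampling in the snippet above, here is a minimal sketch of a helper that always returns a fixed number of indices. The name `sample_frame_indices` and its guard against short clips are my own assumptions, not part of Transformers or PyAV; the spacing is meant to match the `np.arange(0, total_frames, total_frames / 4)` line in the example.

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_frames: int = 4) -> np.ndarray:
    """Return `num_frames` evenly spaced frame indices in [0, total_frames).

    Hypothetical helper for illustration only.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if total_frames < num_frames:
        # Fewer frames than requested would produce duplicate indices, and
        # read_video_pyav (which tests `i in indices`) would then return
        # fewer frames than expected.
        raise ValueError(
            f"need at least {num_frames} frames, got {total_frames}"
        )
    # Index k lands at floor(k * total_frames / num_frames).
    return (np.arange(num_frames) * total_frames / num_frames).astype(int)


indices = sample_frame_indices(100)
print(indices.tolist())
```

The result can be passed to `read_video_pyav(container, indices)` in place of the inline `np.arange` expression. Raising on short clips is a design choice for this sketch; padding or repeating frames would be another option.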
Can this only handle 4 frames? When I try more frames, it causes errors like:
inputs_embeds[special_image_mask] = language_model_inputs.flatten().to(inputs_embeds.device)
RuntimeError: shape mismatch: value tensor of shape [1441792] cannot be broadcast to indexing result of shape [524288]
@dasdasxcaxja yes, it works only with 4 frames. We used the same inference as in the original implementation, where the frame count is not configurable.
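The two sizes in the error above are consistent with that fixed frame count. As a rough check, assuming 32 Q-Former query tokens per frame, a language-model hidden size of 4096 for Vicuna-7B, and a batch size of 1 (none of these values are stated in this thread, so treat them as assumptions), the flattened tensor sizes can be turned back into frame counts. `frames_from_flat_size` is a hypothetical helper written for this illustration.

```python
# Assumed model constants, not taken from this thread:
NUM_QUERY_TOKENS = 32   # Q-Former query tokens per frame (assumption)
HIDDEN_SIZE = 4096      # Vicuna-7B hidden size (assumption)


def frames_from_flat_size(flat_size: int) -> int:
    """Convert a flattened embedding size into a frame count (batch size 1)."""
    per_frame = NUM_QUERY_TOKENS * HIDDEN_SIZE  # values contributed by one frame
    if flat_size % per_frame != 0:
        raise ValueError("size is not a whole number of frames under these assumptions")
    return flat_size // per_frame


expected = frames_from_flat_size(524288)    # placeholder slots reserved in the prompt
provided = frames_from_flat_size(1441792)   # embeddings computed from the clip
print(expected, provided)
```

Under these assumptions the prompt reserves room for 4 frames while the clip supplied 11, so subsampling the clip to exactly 4 frames before calling the processor should avoid the mismatch.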