Can InstructBLIP process videos?

by UncleanCode - opened Sep 14, 2023

Discussion

UncleanCode

Sep 14, 2023

I recently looked at the source of blip2_vicuna-instruct7b in the Salesforce/LAVIS repository and found code for handling videos. I don't know whether this is also in the Hugging Face InstructBLIP model. So I'm asking: can InstructBLIP handle videos, and if so, how do I go about it?

nielsr

Sep 15, 2023

Hi,

Thanks for your interest in InstructBLIP. Support for videos is not yet present in the Transformers library. Did the authors release any checkpoints trained on video?

UncleanCode

Sep 16, 2023

I'm not aware of any at the moment; I'll check whether there are. What I saw was just a line of code for handling videos with a low frame count.

louis030195

Oct 8, 2023

also interested in processing videos

ybelkada

Oct 9, 2023

Hi @UncleanCode @louis030195

Can you share the snippet for handling videos from the original authors? It can probably be adapted to work with the Transformers model.

tkaintura

Feb 23, 2024

Hi,

I'm trying to run the demo at the end of the page https://huggingface.co/docs/transformers/main/en/model_doc/instructblip#transformers.InstructBlipForConditionalGeneration and the model does not generate text; instead it gives this:

Loading checkpoint shards: 100%|██████████| 4/4 [00:24<00:00, 6.10s/it]
/home/tanya.kaintura/Project/myenv/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:412: UserWarning: do_sample is set to False. However, top_p is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
warnings.warn(
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

nielsr

Feb 24, 2024

Hi,

For videos I recommend taking a look at VideoBLIP: https://huggingface.co/models?other=video-to-text

nielsr

Apr 11, 2024

PR is open for it now here: https://github.com/huggingface/transformers/pull/30182

nielsr

Jun 28, 2024

Update, InstructBLIP-Video is now supported! https://huggingface.co/docs/transformers/main/en/model_doc/instructblipvideo

CennetOguz

Jul 10, 2024

edited Jul 10, 2024

Unfortunately, the generation example on that page is not working. Additionally, could you please provide an example for video and text feature extraction? Error message:
TypeError: InstructBlipVideoForConditionalGeneration.forward() got an unexpected keyword argument 'videos'

nielsr

Jul 10, 2024

Pinging @RaushanTurganbay here

RaushanTurganbay

Jul 10, 2024

edited Jul 10, 2024

@CennetOguz the example code had typos, will fix it on main soon. You can use the following code to generate:

from transformers import InstructBlipVideoProcessor, InstructBlipVideoForConditionalGeneration
import torch
from huggingface_hub import hf_hub_download
import av
import numpy as np

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

model = InstructBlipVideoForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b", device_map="auto")
processor = InstructBlipVideoProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")

file_path = hf_hub_download(repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset")
container = av.open(file_path)

# sample uniformly 4 frames from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 4).astype(int)
clip = read_video_pyav(container, indices)

prompt = "What is happening in the video?"
inputs = processor(text=prompt, images=clip, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=False,
    num_beams=5,
    max_length=256,
    repetition_penalty=1.5,
    length_penalty=1.0,
)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)

Regarding feature extraction:
From the project repo, it looks like InstructBLIP models do not support feature extraction, since they have no projection head to project text and vision embeddings into the same latent space. However, BLIP-2 does support feature extraction, as shown in this notebook. The PR to add ITM capability to the Transformers BLIP-2 model is in progress; you can track it here.

dasdasxcaxja

May 4, 2025

Can this only capture 4 frames? When I try more frames, it causes errors like:

inputs_embeds[special_image_mask] = language_model_inputs.flatten().to(inputs_embeds.device)
RuntimeError: shape mismatch: value tensor of shape [1441792] cannot be broadcast to indexing result of shape [524288]
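
The two sizes in this error are consistent with the prompt reserving placeholder slots for 4 frames while the clip supplied more. The sketch below is only a rough arithmetic check, not library code. It assumes 32 Q-Former query tokens per frame and a language-model hidden size of 4096 for the Vicuna-7B checkpoint; neither value is stated in this thread, and the function names are hypothetical.

```python
# Assumed values (not given in the thread): 32 query tokens per frame and a
# 4096-wide language-model embedding for the Vicuna-7B checkpoint.
NUM_QUERY_TOKENS = 32
HIDDEN_SIZE = 4096


def flattened_embedding_size(num_frames: int) -> int:
    """Number of scalars in the flattened video embeddings for `num_frames` frames."""
    return num_frames * NUM_QUERY_TOKENS * HIDDEN_SIZE


def frames_from_flattened_size(flat_size: int) -> int:
    """Invert flattened_embedding_size; raise if the size is not a whole number of frames."""
    per_frame = NUM_QUERY_TOKENS * HIDDEN_SIZE
    if flat_size % per_frame != 0:
        raise ValueError(f"{flat_size} is not a multiple of {per_frame}")
    return flat_size // per_frame


# The "indexing result" side of the error matches 4 frames.
print(flattened_embedding_size(4))          # 524288
# The "value tensor" side matches an 11-frame clip under these assumptions.
print(frames_from_flattened_size(1441792))  # 11
```

Under these assumptions, the mismatch means the placeholders cover 4 frames of embeddings while the clip produced 11 frames' worth.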

RaushanTurganbay

May 6, 2025

@dasdasxcaxja yes, it works only with 4 frames. We used the same inference as in the original implementation, where the frame count is not configurable.
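
Since the frame count is fixed at 4, it can help to make the sampling step return exactly 4 indices for a video of any length. The helper below is a minimal sketch, not part of Transformers: `sample_frame_indices` is a hypothetical name, and it uses integer arithmetic in place of the float step passed to `np.arange` in the snippet above, so the number of indices never depends on rounding.

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_frames: int = 4) -> np.ndarray:
    """Return `num_frames` evenly spaced frame indices, starting at frame 0.

    Hypothetical helper: raises if the video is too short to supply
    `num_frames` distinct frames.
    """
    if total_frames < num_frames:
        raise ValueError(
            f"video has {total_frames} frames, need at least {num_frames}"
        )
    # Integer arithmetic: index k is floor(k * total_frames / num_frames).
    return (np.arange(num_frames) * total_frames) // num_frames


print(sample_frame_indices(100).tolist())  # [0, 25, 50, 75]
print(sample_frame_indices(7).tolist())    # [0, 1, 3, 5]
```

The result can be passed as `indices` to `read_video_pyav`. Checking `clip.shape[0] == 4` before calling the processor is a cheap way to catch a wrong frame count early.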
