Instructions to use TIGER-Lab/VLM2Vec-LLaVa-Next with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use TIGER-Lab/VLM2Vec-LLaVa-Next with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="TIGER-Lab/VLM2Vec-LLaVa-Next")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("TIGER-Lab/VLM2Vec-LLaVa-Next")
model = AutoModelForImageTextToText.from_pretrained("TIGER-Lab/VLM2Vec-LLaVa-Next")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use TIGER-Lab/VLM2Vec-LLaVa-Next with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "TIGER-Lab/VLM2Vec-LLaVa-Next"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TIGER-Lab/VLM2Vec-LLaVa-Next",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/TIGER-Lab/VLM2Vec-LLaVa-Next
```
- SGLang
How to use TIGER-Lab/VLM2Vec-LLaVa-Next with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "TIGER-Lab/VLM2Vec-LLaVa-Next" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TIGER-Lab/VLM2Vec-LLaVa-Next",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "TIGER-Lab/VLM2Vec-LLaVa-Next" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TIGER-Lab/VLM2Vec-LLaVa-Next",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

- Docker Model Runner
How to use TIGER-Lab/VLM2Vec-LLaVa-Next with Docker Model Runner:
```shell
docker model run hf.co/TIGER-Lab/VLM2Vec-LLaVa-Next
```
Could you release the best model (LLaVA-1.6 + LoRA) reported in the paper?
Hi, thank you for your great work on VLM2Vec! I have a quick question regarding the models you released.
According to the paper, the best-performing model on ImageNet-1K is LLaVA-1.6 finetuned with LoRA, which achieves top-1 accuracy of 0.745. However, the currently available TIGER-Lab/VLM2Vec-LLaVa-Next seems to be fully finetuned, as there's no adapter_config.json in the repo.
I evaluated this model (TIGER-Lab/VLM2Vec-LLaVa-Next) using the command below and obtained an ImageNet-1K accuracy of only 0.207, which is far from the reported result. Here's the command I used:
```shell
python eval.py \
  --model_name TIGER-Lab/VLM2Vec-LLaVa-Next \
  --model_backbone llava_next \
  --encode_output_path llava_next_outputs/ \
  --image_resolution high \
  --num_crops 4 \
  --max_len 256 \
  --pooling last \
  --normalize True \
  --dataset_name TIGER-Lab/MMEB-eval \
  --subset_name ImageNet-1K \
  --dataset_split test \
  --per_device_eval_batch_size 2 \
  --image_dir eval_images/
```
In contrast, when I evaluated TIGER-Lab/VLM2Vec-LoRA with the LoRA setup, I got an accuracy of 0.68, which is much closer to the expected performance.
Would it be possible to release the LLaVA-1.6 + LoRA model used in the paper, or provide instructions to reproduce it (e.g., adapter weights and configuration)?
Thanks again for your time and amazing work!
Same problem. When I use this command to evaluate the model, I only get 0.015 and 0.029 on MSCOCO_i2t and VisualNews_i2t.
Thanks for letting me know. I will take a look soon and update here.
BTW, this is the model fine-tuned with LoRA; I merged the adapter into the full model so that it's more convenient for people to use.
Hi @yibingwei @LightSunKing , thanks a lot for bringing up this issue.
Regarding the low results, they were caused by the --max_len 256 parameter, which truncated the image tokens.
You can simply remove this parameter, and the results should then be reproducible.
This parameter can be a bit confusing. For some models' processors, it represents max_text_length, in which case it's fine to use. But for others, it refers to the combined length of image and text tokens, in which case it should be removed.
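The failure mode can be sketched with a toy example (the token counts below are made up for illustration and the list-slice stands in for the processor's truncation, not the actual implementation): a LLaVA-Next-style prompt places a long run of image-patch tokens before the text, so truncating the combined sequence cuts into the visual input.

```python
# Toy sketch: hypothetical token layout, not the real processor code.
IMAGE_TOKENS = 576                 # assumed image-patch token count
text_tokens = list(range(40))      # 40 stand-in text tokens

combined = ["<image>"] * IMAGE_TOKENS + text_tokens   # 616 tokens total

# Interpretation A: max_len caps text only -> image tokens are untouched.
text_only_capped = ["<image>"] * IMAGE_TOKENS + text_tokens[:256]

# Interpretation B: max_len caps the combined sequence -> truncating to
# 256 keeps only part of the image run and drops all of the text.
combined_capped = combined[:256]

print(len(text_only_capped))              # 616: nothing lost here
print(combined_capped.count("<image>"))   # 256: 320 image tokens gone
print(len(combined_capped) - combined_capped.count("<image>"))  # 0 text tokens left
```

Either way, interpretation B corrupts the multimodal input, which is consistent with the large accuracy drop reported above.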
I'll update the documentation to clarify this and avoid future confusion. As a general rule, it's safer not to use this parameter.
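For reference, applying this fix to the command reported earlier in the thread just means dropping the `--max_len 256` flag (a sketch; paths and other flags are the original reporter's, unchanged):

```shell
python eval.py \
  --model_name TIGER-Lab/VLM2Vec-LLaVa-Next \
  --model_backbone llava_next \
  --encode_output_path llava_next_outputs/ \
  --image_resolution high \
  --num_crops 4 \
  --pooling last \
  --normalize True \
  --dataset_name TIGER-Lab/MMEB-eval \
  --subset_name ImageNet-1K \
  --dataset_split test \
  --per_device_eval_batch_size 2 \
  --image_dir eval_images/
```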
Also, just FYI, our best-performing models are now the VLM2Vec_Qwen series (https://huggingface.co/collections/TIGER-Lab/vlm2vec-6705f418271d085836e0cdd5). We’ll also be releasing the VLM2Vec_v2 series of code and models later this week, which will offer even better performance.