Instructions to use TIGER-Lab/VLM2Vec-LLaVa-Next with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use TIGER-Lab/VLM2Vec-LLaVa-Next with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="TIGER-Lab/VLM2Vec-LLaVa-Next")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("TIGER-Lab/VLM2Vec-LLaVa-Next")
model = AutoModelForImageTextToText.from_pretrained("TIGER-Lab/VLM2Vec-LLaVa-Next")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use TIGER-Lab/VLM2Vec-LLaVa-Next with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "TIGER-Lab/VLM2Vec-LLaVa-Next"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TIGER-Lab/VLM2Vec-LLaVa-Next",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/TIGER-Lab/VLM2Vec-LLaVa-Next
```
- SGLang
How to use TIGER-Lab/VLM2Vec-LLaVa-Next with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "TIGER-Lab/VLM2Vec-LLaVa-Next" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TIGER-Lab/VLM2Vec-LLaVa-Next",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "TIGER-Lab/VLM2Vec-LLaVa-Next" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TIGER-Lab/VLM2Vec-LLaVa-Next",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

- Docker Model Runner
How to use TIGER-Lab/VLM2Vec-LLaVa-Next with Docker Model Runner:
```shell
docker model run hf.co/TIGER-Lab/VLM2Vec-LLaVa-Next
```
Could you release the best model (LLaVA-1.6 + LoRA) reported in the paper?
Hi, thank you for your great work on VLM2Vec! I have a quick question regarding the models you released.
According to the paper, the best-performing model on ImageNet-1K is LLaVA-1.6 finetuned with LoRA, which achieves top-1 accuracy of 0.745. However, the currently available TIGER-Lab/VLM2Vec-LLaVa-Next seems to be fully finetuned, as there's no adapter_config.json in the repo.
I evaluated this model (TIGER-Lab/VLM2Vec-LLaVa-Next) using the command below and obtained an ImageNet-1K accuracy of only 0.207, which is far from the reported result. Here's the command I used:
```shell
python eval.py \
  --model_name TIGER-Lab/VLM2Vec-LLaVa-Next \
  --model_backbone llava_next \
  --encode_output_path llava_next_outputs/ \
  --image_resolution high \
  --num_crops 4 \
  --max_len 256 \
  --pooling last \
  --normalize True \
  --dataset_name TIGER-Lab/MMEB-eval \
  --subset_name ImageNet-1K \
  --dataset_split test \
  --per_device_eval_batch_size 2 \
  --image_dir eval_images/
```
In contrast, when I evaluated TIGER-Lab/VLM2Vec-LoRA with the LoRA setup, I got an accuracy of 0.68, which is much closer to the expected performance.
Would it be possible to release the LLaVA-1.6 + LoRA model used in the paper, or provide instructions to reproduce it (e.g., adapter weights and configuration)?
Thanks again for your time and amazing work!
Same problem. When I use this command to evaluate the model, I only get 0.015 and 0.029 on MSCOCO_i2t and VisualNews_i2t.
Thanks for letting me know. I will take a look soon and update here.
BTW, this is the model fine-tuned with LoRA; I merged the adapter into the full model so that it's more convenient for people to use.
Hi @yibingwei @LightSunKing , thanks a lot for bringing up this issue.
Regarding the low results, they were caused by the --max_len 256 parameter, which truncated the image tokens.
You can simply remove this parameter, and the results should then be reproducible.
This parameter can be a bit confusing. For some models' processors, it represents max_text_length, in which case it's fine to use. But for others, it refers to the combined length of image and text tokens, in which case it should be removed.
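The failure mode can be sketched with a toy example (the token counts below are made up for illustration and the list-slice stands in for the processor's truncation, not the actual implementation): a LLaVA-Next-style prompt places a long run of image-patch tokens before the text, so truncating the combined sequence cuts into the visual input.

```python
# Toy sketch: hypothetical token layout, not the real processor code.
IMAGE_TOKENS = 576                 # assumed image-patch token count
text_tokens = list(range(40))      # 40 stand-in text tokens

combined = ["<image>"] * IMAGE_TOKENS + text_tokens   # 616 tokens total

# Interpretation A: max_len caps text only -> image tokens are untouched.
text_only_capped = ["<image>"] * IMAGE_TOKENS + text_tokens[:256]

# Interpretation B: max_len caps the combined sequence -> truncating to
# 256 keeps only part of the image run and drops all of the text.
combined_capped = combined[:256]

print(len(text_only_capped))              # 616: nothing lost here
print(combined_capped.count("<image>"))   # 256: 320 image tokens gone
print(len(combined_capped) - combined_capped.count("<image>"))  # 0 text tokens left
```

Either way, interpretation B corrupts the multimodal input, which is consistent with the large accuracy drop reported above.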
I'll update the documentation to clarify this and avoid future confusion. As a general rule, it's safer not to use this parameter.
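For reference, applying this fix to the command reported earlier in the thread just means dropping the `--max_len 256` flag (a sketch; paths and other flags are the original reporter's, unchanged):

```shell
python eval.py \
  --model_name TIGER-Lab/VLM2Vec-LLaVa-Next \
  --model_backbone llava_next \
  --encode_output_path llava_next_outputs/ \
  --image_resolution high \
  --num_crops 4 \
  --pooling last \
  --normalize True \
  --dataset_name TIGER-Lab/MMEB-eval \
  --subset_name ImageNet-1K \
  --dataset_split test \
  --per_device_eval_batch_size 2 \
  --image_dir eval_images/
```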
Also, just FYI, our best-performing models are now the VLM2Vec_Qwen series (https://huggingface.co/collections/TIGER-Lab/vlm2vec-6705f418271d085836e0cdd5). We’ll also be releasing the VLM2Vec_v2 series of code and models later this week, which will offer even better performance.