Instructions to use adalvi/qwen2vl-lora-coco with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use adalvi/qwen2vl-lora-coco with PEFT:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "adalvi/qwen2vl-lora-coco")

Transformers

How to use adalvi/qwen2vl-lora-coco with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="adalvi/qwen2vl-lora-coco")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("adalvi/qwen2vl-lora-coco", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use adalvi/qwen2vl-lora-coco with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "adalvi/qwen2vl-lora-coco"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "adalvi/qwen2vl-lora-coco",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/adalvi/qwen2vl-lora-coco

SGLang

How to use adalvi/qwen2vl-lora-coco with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "adalvi/qwen2vl-lora-coco" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "adalvi/qwen2vl-lora-coco",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "adalvi/qwen2vl-lora-coco" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "adalvi/qwen2vl-lora-coco",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use adalvi/qwen2vl-lora-coco with Docker Model Runner:
```
docker model run hf.co/adalvi/qwen2vl-lora-coco
```

Qwen2-VL-7B LoRA Adapter — COCO Image Captioning

A LoRA adapter fine-tuned on top of Qwen/Qwen2-VL-7B-Instruct for image captioning, trained on the COCO Karpathy train split.

Model Details

Base Model: Qwen/Qwen2-VL-7B-Instruct
Model type: LoRA Adapter (PEFT)
Task: Image Captioning
Training Data: COCO Karpathy Train Split
License: Apache 2.0
Framework: Transformers + PEFT 0.18.0

How to Get Started

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

# Load base model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, "adalvi/qwen2vl-lora-coco")
model.eval()

# Load processor
processor = AutoProcessor.from_pretrained("adalvi/qwen2vl-lora-coco")

Captioning: Prompt & Inference Details

The base model (Qwen2VL Base) was evaluated using the prompt "Caption this image." with max_new_tokens=21.

from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "<path_or_url_to_image>"},
            {"type": "text", "text": "Caption this image."},
        ],
    }
]

# Apply chat template
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Process image inputs
image_inputs, video_inputs = process_vision_info(messages)

# Tokenize
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs if video_inputs else None,
    padding=True,
    return_tensors="pt",
).to("cuda")

# Generate caption
generated_ids = model.generate(**inputs, max_new_tokens=21)

# Trim prompt tokens and decode
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
caption = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(caption)

Evaluation Results

B@4 = BLEU-4, M = METEOR, C = CIDEr, S = SPICE, CLIP-Score and RefCLIP-Score

COCO Karpathy Test Split

Model	B@4	M	C	S	CLIP-S	RefCLIP-S
Qwen2VL Base	16.9	26.0	47.1	20.3	81.0	81.9
Qwen2VL Fine-tuned (this adapter)	40.0	30.7	137.5	24.2	78.6	84.0

NoCaps Val Split (Zero-shot) — CIDEr & SPICE by Domain

Model	In-C	In-S	Near-C	Near-S	Out-C	Out-S	Overall-C	Overall-S	CLIP-S
Qwen2VL Base	48.5	14.8	51.0	14.5	57.4	14.7	53.3	14.6	81.4
Qwen2VL Fine-tuned (this adapter)	118.4	15.3	120.0	15.6	123.1	15.7	122.3	15.6	79.2

Training Details

Training regime: bf16 mixed precision
PEFT version: 0.18.0

Citation

This adapter was trained for the paper below in order to provide a comparison baseline against the proposed method:

@misc{dalvi2026_HDFLIM,
  title={Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning}, 
  author={Abhishek Dalvi and Vasant Honavar},
  year={2026},
  eprint={2602.23588},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.23588}
}

Downloads last month: 1

Model tree for adalvi/qwen2vl-lora-coco

Base model

Qwen/Qwen2-VL-7B

Finetuned

Qwen/Qwen2-VL-7B-Instruct

Adapter

(208)

this model

Dataset used to train adalvi/qwen2vl-lora-coco

Paper for adalvi/qwen2vl-lora-coco

Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning

Paper • 2602.23588 • Published Feb 27