Qwen2-VL-7B LoRA Adapter β€” COCO Image Captioning

A LoRA adapter fine-tuned on top of Qwen/Qwen2-VL-7B-Instruct for image captioning, trained on the COCO Karpathy train split.

Model Details

  • Base Model: Qwen/Qwen2-VL-7B-Instruct
  • Model type: LoRA Adapter (PEFT)
  • Task: Image Captioning
  • Training Data: COCO Karpathy Train Split
  • License: Apache 2.0
  • Framework: Transformers + PEFT 0.18.0
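LoRA keeps the base weights frozen and learns a low-rank update ΔW = (α/r)·BA that is added to each targeted weight matrix. A minimal NumPy sketch with toy dimensions (the rank and alpha here are hypothetical, not the values used for this adapter):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16          # hypothetical sizes: hidden dim, LoRA rank, scaling

W = rng.normal(size=(d, d))     # frozen base weight
A = rng.normal(size=(r, d))     # trainable "down" projection
B = np.zeros((d, r))            # trainable "up" projection, zero-initialized

# Merged weight: base plus the scaled low-rank update
W_eff = W + (alpha / r) * (B @ A)

# With B zero-initialized, the adapter is a no-op before training starts
assert np.allclose(W_eff, W)
```

This is also why a trained adapter can be merged into the base model for inference (e.g. with PEFT's `merge_and_unload()`): the update is just an additive correction to each weight matrix.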

How to Get Started

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

# Load base model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, "adalvi/qwen2vl-lora-coco")
model.eval()

# Load processor
processor = AutoProcessor.from_pretrained("adalvi/qwen2vl-lora-coco")

Captioning: Prompt & Inference Details

The base model (Qwen2VL Base) was evaluated using the prompt "Caption this image." with max_new_tokens=21.

from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "<path_or_url_to_image>"},
            {"type": "text", "text": "Caption this image."},
        ],
    }
]

# Apply chat template
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Process image inputs
image_inputs, video_inputs = process_vision_info(messages)

# Tokenize
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs if video_inputs else None,
    padding=True,
    return_tensors="pt",
).to("cuda")

# Generate caption
generated_ids = model.generate(**inputs, max_new_tokens=21)

# Trim prompt tokens and decode
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
caption = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(caption)

Evaluation Results

Metric abbreviations: B@4 = BLEU-4, M = METEOR, C = CIDEr, S = SPICE, CLIP-S = CLIP-Score, RefCLIP-S = RefCLIP-Score.

COCO Karpathy Test Split

| Model | B@4 | M | C | S | CLIP-S | RefCLIP-S |
|---|---|---|---|---|---|---|
| Qwen2VL Base | 16.9 | 26.0 | 47.1 | 20.3 | 81.0 | 81.9 |
| Qwen2VL Fine-tuned (this adapter) | 40.0 | 30.7 | 137.5 | 24.2 | 78.6 | 84.0 |
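For intuition about the B@4 column: BLEU-4 combines clipped n-gram precisions (n = 1..4) with a brevity penalty. The sketch below is a simplified per-sentence version with uniform weights and no smoothing; the reported numbers are presumably corpus-level scores from the standard COCO caption evaluation toolkit:

```python
from collections import Counter
import math

def bleu4(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-4: geometric mean of clipped n-gram
    precisions (n = 1..4) times a brevity penalty. No smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_p = 0.0
    for n in range(1, 5):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        if clipped == 0:
            return 0.0  # unsmoothed: any zero precision zeroes the score
        log_p += math.log(clipped / total) / 4
    # Brevity penalty: punish candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return bp * math.exp(log_p)
```

A perfect match scores 1.0, e.g. `bleu4("a cat sits on a mat", "a cat sits on a mat")`; captions sharing no words score 0.0.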

NoCaps Val Split (Zero-shot) β€” CIDEr & SPICE by Domain

| Model | In-C | In-S | Near-C | Near-S | Out-C | Out-S | Overall-C | Overall-S | CLIP-S |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2VL Base | 48.5 | 14.8 | 51.0 | 14.5 | 57.4 | 14.7 | 53.3 | 14.6 | 81.4 |
| Qwen2VL Fine-tuned (this adapter) | 118.4 | 15.3 | 120.0 | 15.6 | 123.1 | 15.7 | 122.3 | 15.6 | 79.2 |

Training Details

  • Training regime: bf16 mixed precision
  • PEFT version: 0.18.0
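A PEFT training setup for an adapter like this would be configured roughly as below. Note that the rank, alpha, dropout, and target modules shown are placeholders; the card does not state the actual hyperparameters used:

```python
from peft import LoraConfig

# Hypothetical hyperparameters -- not the values used for this adapter.
lora_config = LoraConfig(
    r=16,                      # LoRA rank
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

The config would then be applied to the base model with `get_peft_model(model, lora_config)` before training in bf16.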

Citation

This adapter was trained as a comparison baseline for the proposed method in the paper below:

@misc{dalvi2026_HDFLIM,
  title={Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning}, 
  author={Abhishek Dalvi and Vasant Honavar},
  year={2026},
  eprint={2602.23588},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.23588}
}