---
library_name: peft
license: mit
base_model: Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- SinaLab/ImageEval2025Task2TrainDataset
tags:
- arabic
- image-captioning
- vision-language
- lora
- qwen2.5-vl
- cultural-heritage
language:
- ar
model-index:
- name: arabic-image-captioning-qwen2.5vl
  results: []
---

# Arabic Image Captioning - Qwen2.5-VL Fine-tuned

This model is a LoRA fine-tuned version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) for generating Arabic captions for images.

## Model Description

This model was developed as part of the [Arabic Image Captioning Shared Task 2025](https://sina.birzeit.edu/image_eval2025/index.html). It generates natural Arabic captions for images with focus on historical and cultural content related to Palestinian heritage.

please refer to the [training dataset](https://huggingface.co/datasets/SinaLab/ImageEval2025Task2TrainDataset) for more details.

## Usage

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch
from PIL import Image

# Load base model and processor
base_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "your-username/arabic-image-captioning-qwen2.5vl")

# Process image and generate caption
image = Image.open("your_image.jpg")
prompt = "اكتب وصفاً مختصراً لهذه الصورة باللغة العربية"

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
    
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)
```

## Training Details

### Dataset
- **Training data**: Arabic image captions dataset from the shared task
- **Languages**: Arabic (ar)
- **Dataset size**: ~2,700 training images with Arabic captions

### Training Procedure
- **Fine-tuning method**: LoRA (Low-Rank Adaptation)
- **Training epochs**: 15
- **Learning rate**: 2e-05
- **Batch size**: 1 with gradient accumulation (effective batch size: 16)
- **Optimizer**: AdamW with cosine learning rate scheduling
- **Hardware**: NVIDIA A100 GPU
- **Training time**: ~6 hours

### Framework Versions
- PEFT 0.15.2
- Transformers 4.49.0
- PyTorch 2.4.1+cu121


## Contact

For questions or support:
- abashiti@birzeit.edu
- aaljabari@birzeit.edu  
- hhamoud@dohainstitute.edu.qa