---
license: apache-2.0
base_model: Qwen/Qwen3-VL-8B-Instruct
tags:
- qwen3-vl
- vision-language
- lora
- fine-tuned
library_name: peft
---
# qwen3vl-8b-lora
This is a LoRA adapter fine-tuned on top of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct).
## Model Description
This model is a fine-tuned version of Qwen3-VL-8B-Instruct using LoRA (Low-Rank Adaptation) for efficient training.
The adapter weights can be merged with the base model for inference.
## Training Details
### Base Model
- **Model:** Qwen/Qwen3-VL-8B-Instruct
- **Architecture:** Vision-Language Model (VLM)
### LoRA Configuration
- **Rank (r):** 64
- **Alpha:** 128
- **Dropout:** 0.05
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Task Type:** Causal Language Modeling
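
The values above map onto a PEFT `LoraConfig` roughly as follows. This is a sketch reconstructed from the list, not the exact config used in training:

```python
from peft import LoraConfig, TaskType

# Sketch of a LoraConfig matching the values listed above (assumes PEFT's
# standard API; the exact training config may have differed).
lora_config = LoraConfig(
    r=64,                      # rank
    lora_alpha=128,            # effective scale = alpha / r = 2
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type=TaskType.CAUSAL_LM,
)
```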
### Training Hyperparameters
- **Learning Rate:** 1e-5
- **Batch Size:** 4 (per device)
- **Gradient Accumulation Steps:** 4
- **Epochs:** 2
- **Optimizer:** AdamW
- **Weight Decay:** 0
- **Warmup Ratio:** 0.03
- **LR Scheduler:** Cosine
- **Max Gradient Norm:** 1.0
- **Model Max Length:** 40960
- **Max Pixels:** 250880
- **Min Pixels:** 784
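
With a per-device batch of 4 and 4 gradient-accumulation steps, each optimizer step sees an effective batch of 16 samples per device (times however many devices were used, which is not stated here):

```python
# Effective batch size implied by the hyperparameters above.
per_device_batch_size = 4
gradient_accumulation_steps = 4

effective_batch_per_device = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_per_device)  # 16
```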
### Training Infrastructure
- **Framework:** PyTorch + DeepSpeed (ZeRO Stage 2)
- **Precision:** BF16
- **Gradient Checkpointing:** Enabled
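
A DeepSpeed configuration consistent with the setup above might look like the dict below. The field names follow DeepSpeed's JSON schema, but this is an illustrative sketch with common defaults, not the actual file used for this run:

```python
# Illustrative DeepSpeed config for BF16 + ZeRO Stage 2. Values beyond
# those stated in this card (e.g. overlap_comm) are assumptions.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_clipping": 1.0,              # matches "Max Gradient Norm: 1.0"
    "train_micro_batch_size_per_gpu": 4,   # matches the per-device batch size
    "gradient_accumulation_steps": 4,
}
```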
## Usage
### Requirements
```bash
pip install transformers peft torch pillow accelerate qwen-vl-utils
```
Note: `accelerate` is required for `device_map="auto"` in the examples below.
### Loading the Model
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

# Load base model (Qwen3-VL uses the Qwen3VL* classes, not Qwen2VL*)
base_model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "openhay/qwen3vl-8b-lora",
    torch_dtype=torch.bfloat16
)

# Load processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")
```
### Inference Example
```python
import torch
from qwen_vl_utils import process_vision_info

# Prepare messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens so only newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text[0])
```
### Merging LoRA Weights (Optional)
If you want to merge the LoRA weights into the base model for faster inference:
```python
import torch
from transformers import Qwen3VLForConditionalGeneration
from peft import PeftModel

# Load base model and adapter
base_model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "openhay/qwen3vl-8b-lora")

# Merge the adapter into the base weights and save a standalone checkpoint
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model")
# Save the processor alongside it so the merged checkpoint is self-contained:
# AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct").save_pretrained("./merged_model")
```
## Limitations
- This model inherits all limitations from the base Qwen3-VL-8B-Instruct model
- Performance depends on the quality and domain of the fine-tuning dataset
- LoRA adapters may not capture all nuances that full fine-tuning would achieve
## Citation
If you use this model, please cite:
```bibtex
@misc{qwen3vl_8b_lora,
author = {OpenHay},
title = {qwen3vl-8b-lora},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/openhay/qwen3vl-8b-lora}}
}
```
## Acknowledgements
- Base model: [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) by Alibaba Cloud
- Training framework: [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) or similar
- LoRA implementation: [PEFT](https://github.com/huggingface/peft) by Hugging Face