---
license: apache-2.0
base_model: Qwen/Qwen3-VL-8B-Instruct
tags:
- qwen3-vl
- vision-language
- lora
- fine-tuned
library_name: peft
---

# qwen3vl-8b-lora

This is a LoRA adapter fine-tuned on top of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct).

## Model Description

This model is a fine-tuned version of Qwen3-VL-8B-Instruct trained with LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.
The adapter weights can be merged into the base model for inference.

## Training Details

### Base Model
- **Model:** Qwen/Qwen3-VL-8B-Instruct
- **Architecture:** Vision-Language Model (VLM)

### LoRA Configuration
- **Rank (r):** 64
- **Alpha:** 128
- **Dropout:** 0.05
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Task Type:** Causal Language Modeling

### Training Hyperparameters
- **Learning Rate:** 1e-5
- **Batch Size:** 4 (per device)
- **Gradient Accumulation Steps:** 4
- **Epochs:** 2
- **Optimizer:** AdamW
- **Weight Decay:** 0
- **Warmup Ratio:** 0.03
- **LR Scheduler:** Cosine
- **Max Gradient Norm:** 1.0
- **Model Max Length:** 40960
- **Max Pixels:** 250880
- **Min Pixels:** 784
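
With a per-device batch size of 4 and 4 gradient accumulation steps, each optimizer step sees an effective batch of 16 samples per GPU (multiplied further by the number of GPUs under data parallelism):

```python
# Effective batch size per optimizer step, per GPU
per_device_batch_size = 4
gradient_accumulation_steps = 4
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16
```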

### Training Infrastructure
- **Framework:** PyTorch + DeepSpeed (ZeRO Stage 2)
- **Precision:** BF16
- **Gradient Checkpointing:** Enabled
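
A minimal DeepSpeed configuration consistent with the setup above might look like the following. This is a hypothetical sketch, not the file actually used for training; it only encodes the settings stated on this card (ZeRO Stage 2, BF16, gradient clipping at 1.0, batch sizes).

```python
# Hypothetical DeepSpeed config matching this card's training setup.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
}
```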

## Usage

### Requirements

```bash
pip install transformers peft torch pillow qwen-vl-utils
```

### Loading the Model

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

# Load base model (Qwen3-VL uses the Qwen3VL model class, not Qwen2VL)
base_model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "openhay/qwen3vl-8b-lora",
    torch_dtype=torch.bfloat16
)

# Load processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")
```

### Inference Example

```python
from qwen_vl_utils import process_vision_info
from PIL import Image

# Prepare messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Prepare for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)
    
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=False
)

print(output_text[0])
```

### Merging LoRA Weights (Optional)

If you want to merge the LoRA weights into the base model for faster inference:

```python
from transformers import Qwen3VLForConditionalGeneration
from peft import PeftModel
import torch

# Load base model and adapter
base_model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "openhay/qwen3vl-8b-lora")

# Merge the adapter into the base weights and save a standalone
# checkpoint that can be loaded without PEFT
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model")
```

## Limitations

- This model inherits all limitations from the base Qwen3-VL-8B-Instruct model
- Performance depends on the quality and domain of the fine-tuning dataset
- LoRA adapters may not capture all nuances that full fine-tuning would achieve

## Citation

If you use this model, please cite:

```bibtex
@misc{qwen3vl_8b_lora,
  author = {OpenHay},
  title = {qwen3vl-8b-lora},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/openhay/qwen3vl-8b-lora}}
}
```

## Acknowledgements

- Base model: [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) by Alibaba Cloud
- Training framework: [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) or similar
- LoRA implementation: [PEFT](https://github.com/huggingface/peft) by Hugging Face