---
license: apache-2.0
base_model: Qwen/Qwen3-VL-4B-Instruct
tags:
- qwen3_vl
- vision-language
- multimodal
- fine-tuned
- qlora
- safetensors
- coding
- design
language:
- id
- en
pipeline_tag: image-text-to-text
---
<div align="center">
<img src="https://snapgate.tech/img/snapgatelogo.jpg" alt="Snapgate Logo" width="120"/>
# snapgate-VL-4B
### Vision-Language AI · Fine-tuned for Coding & Design
[Apache-2.0 License](https://opensource.org/licenses/Apache-2.0) · [Base Model](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) · [Model Weights](https://huggingface.co/kadalicious22/snapgate-VL-4B) · [Website](https://snapgate.tech)
**snapgate-VL-4B** is a multimodal vision-language model fine-tuned from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) using **QLoRA**, optimized specifically for **developers** and **designers**: it understands both images and text with high precision.
*Developed by [Snapgate](https://snapgate.tech) · Made with ❤️ in Indonesia 🇮🇩*
</div>
---
## Core Capabilities
| Capability | Description |
|-----------|-----------|
| **Code Generation & Review** | Write, analyze, debug, and optimize code (Python, JS, TS, HTML/CSS, SQL, etc.) |
| **UI/UX Design Analysis** | Analyze interface screenshots, suggest design improvements, identify UX issues |
| **Design to Code** | Convert mockups, wireframes, or UI screenshots into HTML/CSS/React/Tailwind code |
| **Diagram & Architecture** | Understand flowcharts, system architecture, ERDs, and technical diagrams |
| **Code from Image** | Read and explain code from screenshots or photos |
| **Technical Documentation** | Generate clear, structured, and professional technical documentation |
---
## Training Configuration
<details>
<summary><b>Click to view training details</b></summary>

| Parameter | Value |
|-----------|-------|
| Base Model | `Qwen/Qwen3-VL-4B-Instruct` |
| Method | QLoRA (4-bit NF4) |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| Trainable Params | 33,030,144 **(0.74% of total)** |
| Epochs | 3 |
| Total Steps | 75 |
| Learning Rate | `1e-4` |
| Batch Size | 1 (gradient accumulation: 8) |
| Optimizer | `paged_adamw_8bit` |
| Precision | `bfloat16` |
| Hardware | NVIDIA T4 · Google Colab |
| Dataset | 200 internal Snapgate samples |
| Categories | 10 categories · 20 samples each |
| Format | ShareGPT |

**Dataset Categories:**
`code_generation` · `code_review` · `debugging` · `refactoring` · `ui_html_css` · `ui_react` · `ui_tailwind` · `design_system` · `ux_analysis` · `design_to_code`
</details>
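The adapter settings above can be expressed with Hugging Face `peft` and `bitsandbytes`. The sketch below is illustrative only (it is not the actual training script, and assumes `peft` and `bitsandbytes` are installed):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base model, as used by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter matching the table: rank 16, alpha 32, attention + MLP projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```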
---
## Training Progress
Training loss fell from **1.242** at step 5 to **0.444** at step 75, plateauing around **0.45** after step 45:
```
Step  5   Loss: 1.242
Step 10   Loss: 0.959
Step 15   Loss: 0.808
Step 20   Loss: 0.671
Step 25   Loss: 0.544
Step 30   Loss: 0.561
Step 35   Loss: 0.513
Step 40   Loss: 0.469
Step 45   Loss: 0.448
Step 50   Loss: 0.465
Step 55   Loss: 0.453
Step 60   Loss: 0.465
Step 65   Loss: 0.465
Step 70   Loss: 0.450
Step 75   Loss: 0.444
```
---
## Usage
### 1. Install Dependencies
```bash
pip install "transformers>=4.51.0" "accelerate>=0.30.0" qwen-vl-utils
```
### 2. Load Model
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

model_id = "kadalicious22/snapgate-VL-4B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

SYSTEM_PROMPT = (
    "You are Snapgate AI, a multimodal AI assistant by Snapgate "
    "specialized in coding and UI/UX design."
)
```
### 3. Inference with Image
```python
from qwen_vl_utils import process_vision_info

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.png"},
            {"type": "text", "text": "Analyze the UI from this image and generate its HTML/CSS code."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```
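For interactive use, tokens can be printed as they are generated instead of waiting for the full response. A minimal sketch using transformers' `TextStreamer`, continuing from the example above (it reuses the `model`, `processor`, and `inputs` objects already defined there):

```python
from transformers import TextStreamer

# Stream decoded text to stdout token-by-token; skip_prompt suppresses
# echoing the input prompt back.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)

with torch.no_grad():
    model.generate(**inputs, max_new_tokens=1024, do_sample=False, streamer=streamer)
```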
### 4. Text-Only Inference
```python
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Write a Python function to validate email using regex."},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(response)
```
---
## Limitations
- Trained on a small internal Snapgate dataset (200 samples); performance should improve as more data is added
- Optimized for Indonesian and English; other languages are untested
- Strongest on coding and UI analysis tasks; less reliable in other domains (e.g., science, law, medicine)
- A GPU with at least 8 GB of VRAM is recommended for comfortable inference
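If VRAM is tight, the model can also be loaded with 4-bit NF4 quantization via `bitsandbytes`. This is a hedged sketch, not an officially tested configuration (it assumes `bitsandbytes` is installed, and quantized inference trades some output quality for memory):

```python
import torch
from transformers import BitsAndBytesConfig, Qwen3VLForConditionalGeneration

# 4-bit NF4 weights to fit GPUs with less than 8 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "kadalicious22/snapgate-VL-4B",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```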
---
## License
Released under the **Apache 2.0** license, following the base model license of [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct).
---
## Links
| | |
|---|---|
| Website | [snapgate.tech](https://snapgate.tech) |
| Base Model | [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |
| Contact | Via Snapgate website |
---