---
license: apache-2.0
base_model: Qwen/Qwen3-VL-4B-Instruct
tags:
  - qwen3_vl
  - vision-language
  - multimodal
  - fine-tuned
  - qlora
  - safetensors
  - coding
  - design
language:
  - id
  - en
pipeline_tag: image-text-to-text
---

<div align="center">

<img src="https://snapgate.tech/img/snapgatelogo.jpg" alt="Snapgate Logo" width="120"/>

# 🌐 snapgate-VL-4B

### Vision-Language AI · Fine-tuned for Coding & Design

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Base Model](https://img.shields.io/badge/Base-Qwen3--VL--4B-orange)](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)
[![Language](https://img.shields.io/badge/Language-ID%20%7C%20EN-green)](https://huggingface.co/kadalicious22/snapgate-VL-4B)
[![Website](https://img.shields.io/badge/Website-snapgate.tech-purple)](https://snapgate.tech)

**snapgate-VL-4B** is a multimodal vision-language model fine-tuned from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) using **QLoRA**, optimized specifically for **developers** and **designers**: it understands both images and text with high precision.

*Developed by [Snapgate](https://snapgate.tech) · Made with ❤️ in Indonesia 🇮🇩*

</div>

---

## 🧠 Core Capabilities

| Capability | Description |
|-----------|-----------|
| 💻 **Code Generation & Review** | Write, analyze, debug, and optimize code (Python, JS, TS, HTML/CSS, SQL, etc.) |
| 🎨 **UI/UX Design Analysis** | Analyze interface screenshots, provide design suggestions, identify UX issues |
| 🖼️ **Design to Code** | Convert mockups, wireframes, or UI screenshots into HTML/CSS/React/Tailwind code |
| 🏗️ **Diagram & Architecture** | Understand flowcharts, system architecture, ERDs, and technical diagrams |
| 📸 **Code from Image** | Read and explain code from screenshots or photos |
| 📝 **Technical Documentation** | Generate clear, structured, and professional technical documentation |

---

## 🔧 Training Configuration

<details>
<summary><b>Click to view training details</b></summary>

| Parameter | Value |
|-----------|-------|
| 🤖 Base Model | `Qwen/Qwen3-VL-4B-Instruct` |
| ⚙️ Method | QLoRA (4-bit NF4) |
| 🔢 LoRA Rank | 16 |
| 🔢 LoRA Alpha | 32 |
| 🎯 Target Modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| 🔢 Trainable Params | 33,030,144 **(0.74% of total)** |
| 🔄 Epochs | 3 |
| 📶 Total Steps | 75 |
| 📈 Learning Rate | `1e-4` |
| 📦 Batch Size | 1 (gradient accumulation: 8) |
| ⚡ Optimizer | `paged_adamw_8bit` |
| 🎛️ Precision | `bfloat16` |
| 🖥️ Hardware | NVIDIA T4 · Google Colab |
| 📦 Dataset | 200 internal Snapgate samples |
| 🏷️ Categories | 10 categories · 20 samples each |
| 📊 Format | ShareGPT |

**Dataset Categories:**
`code_generation` · `code_review` · `debugging` · `refactoring` · `ui_html_css` · `ui_react` · `ui_tailwind` · `design_system` · `ux_analysis` · `design_to_code`
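For reference, a minimal sketch of what the configuration above might look like in code, assuming a standard `peft` + `bitsandbytes` QLoRA setup (the library choice and exact field names are assumptions; only the hyperparameter values come from the table):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization, matching the "QLoRA (4-bit NF4)" row above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the bfloat16 precision row
)

# LoRA adapter matching the rank, alpha, and target-module rows above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```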

</details>
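The step count in the table is consistent with the dataset size: with batch size 1 and gradient accumulation 8, each optimizer step consumes 8 samples, giving 25 steps per epoch over 200 samples and 75 steps over 3 epochs:

```python
# All values taken from the training table; integer division assumes
# the samples divide evenly into accumulation windows (they do here).
samples, batch_size, grad_accum, epochs = 200, 1, 8, 3

effective_batch = batch_size * grad_accum      # 8 samples per optimizer step
steps_per_epoch = samples // effective_batch   # 25
total_steps = steps_per_epoch * epochs         # 75

print(total_steps)  # 75
```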

---

## 📊 Training Progress

Training loss fell from **1.242 → 0.444**, decreasing steadily before plateauing near 0.45 in the final steps ✅

```
Step  5  │███░░░░░░░░░░░░░░░░░│  Loss: 1.242
Step 10  │██████░░░░░░░░░░░░░░│  Loss: 0.959
Step 15  │████████░░░░░░░░░░░░│  Loss: 0.808
Step 20  │██████████░░░░░░░░░░│  Loss: 0.671
Step 25  │████████████░░░░░░░░│  Loss: 0.544
Step 30  │████████████░░░░░░░░│  Loss: 0.561
Step 35  │█████████████░░░░░░░│  Loss: 0.513
Step 40  │█████████████░░░░░░░│  Loss: 0.469
Step 45  │██████████████░░░░░░│  Loss: 0.448
Step 50  │██████████████░░░░░░│  Loss: 0.465
Step 55  │██████████████░░░░░░│  Loss: 0.453
Step 60  │██████████████░░░░░░│  Loss: 0.465
Step 65  │██████████████░░░░░░│  Loss: 0.465
Step 70  │██████████████░░░░░░│  Loss: 0.450
Step 75  │██████████████░░░░░░│  Loss: 0.444
```
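As a summary statistic, the overall relative loss reduction can be computed from the values logged every 5 steps:

```python
# Loss values copied from the training log above (one per 5 steps).
losses = [1.242, 0.959, 0.808, 0.671, 0.544, 0.561, 0.513, 0.469,
          0.448, 0.465, 0.453, 0.465, 0.465, 0.450, 0.444]

reduction = (losses[0] - losses[-1]) / losses[0]
print(f"{reduction:.0%}")  # 64%
```

Most of the drop happens in the first 25 steps (the first epoch); from roughly step 45 onward the loss oscillates in the 0.44–0.47 band, which is typical convergence behavior on a small dataset.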

---

## 🚀 Usage

### 1. Install Dependencies

```bash
pip install "transformers>=4.51.0" "accelerate>=0.30.0" qwen-vl-utils
```

### 2. Load Model

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

model_id = "kadalicious22/snapgate-VL-4B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

SYSTEM_PROMPT = """You are Snapgate AI, a multimodal AI assistant by Snapgate \
specialized in coding and UI/UX design."""
```

### 3. Inference with Image

```python
from qwen_vl_utils import process_vision_info

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.png"},
            {"type": "text", "text": "Analyze the UI from this image and generate its HTML/CSS code."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```

### 4. Text-Only Inference

```python
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Write a Python function to validate email using regex."},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)[0]
print(response)
```

---

## ⚠️ Limitations

- 📦 Trained on a relatively small internal Snapgate dataset (200 samples); performance should improve as more data is added
- 🌐 Optimized for Indonesian and English; other languages have not been tested
- 🎯 Best performance on coding and UI analysis tasks; less reliable in other domains (e.g., science, law, medicine)
- 🖥️ A GPU with at least 8 GB of VRAM is recommended for comfortable inference
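The 8 GB figure follows from the weights alone: a roughly 4B-parameter model in `bfloat16` needs about 2 bytes per parameter. This is a back-of-envelope estimate that ignores activations and the KV cache, so real usage will be somewhat higher:

```python
def weight_memory_gib(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for model weights only, in GiB (ignores activations/KV cache)."""
    return n_params * bytes_per_param / 1024**3

# ~4B parameters at 2 bytes each (bfloat16)
print(round(weight_memory_gib(4e9), 1))  # 7.5
```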

---

## 📄 License

Released under the **Apache 2.0** license, following the base model license of [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct).

---

## 🔗 Links

| | |
|---|---|
| 🌐 Website | [snapgate.tech](https://snapgate.tech) |
| 🤗 Base Model | [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |
| 📧 Contact | Via Snapgate website |

---