---
license: apache-2.0
base_model: Qwen/Qwen3-VL-4B-Instruct
tags:
  - qwen3_vl
  - vision-language
  - multimodal
  - fine-tuned
  - qlora
  - safetensors
  - coding
  - design
language:
  - id
  - en
pipeline_tag: image-text-to-text
---
# 🌐 snapgate-VL-4B

### Vision-Language AI · Fine-tuned for Coding & Design

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Base Model](https://img.shields.io/badge/Base-Qwen3--VL--4B-orange)](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) [![Language](https://img.shields.io/badge/Language-ID%20%7C%20EN-green)](https://huggingface.co/kadalicious22/snapgate-VL-4B) [![Website](https://img.shields.io/badge/Website-snapgate.tech-purple)](https://snapgate.tech)

**snapgate-VL-4B** is a multimodal vision-language model fine-tuned from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) using **QLoRA**, specifically optimized for **developers** and **designers**: it understands both images and text with high precision.

*Developed by [Snapgate](https://snapgate.tech) · Made with ❤️ in Indonesia 🇮🇩*
---

## 🧠 Core Capabilities

| Capability | Description |
|------------|-------------|
| 💻 **Code Generation & Review** | Write, analyze, debug, and optimize code (Python, JS, TS, HTML/CSS, SQL, etc.) |
| 🎨 **UI/UX Design Analysis** | Analyze interface screenshots, provide design suggestions, identify UX issues |
| 🖼️ **Design to Code** | Convert mockups, wireframes, or UI screenshots into HTML/CSS/React/Tailwind code |
| 🏗️ **Diagram & Architecture** | Understand flowcharts, system architecture, ERDs, and technical diagrams |
| 📸 **Code from Image** | Read and explain code from screenshots or photos |
| 📝 **Technical Documentation** | Generate clear, structured, and professional technical documentation |

---

## 🔧 Training Configuration
<details>
<summary>Click to view training details</summary>

| Parameter | Value |
|-----------|-------|
| 🤖 Base Model | `Qwen/Qwen3-VL-4B-Instruct` |
| ⚙️ Method | QLoRA (4-bit NF4) |
| 🔢 LoRA Rank | 16 |
| 🔢 LoRA Alpha | 32 |
| 🎯 Target Modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| 🔢 Trainable Params | 33,030,144 **(0.74% of total)** |
| 🔄 Epochs | 3 |
| 📶 Total Steps | 75 |
| 📈 Learning Rate | `1e-4` |
| 📦 Batch Size | 1 (gradient accumulation: 8) |
| ⚡ Optimizer | `paged_adamw_8bit` |
| 🎛️ Precision | `bfloat16` |
| 🖥️ Hardware | NVIDIA T4 · Google Colab |
| 📦 Dataset | 200 internal Snapgate samples |
| 🏷️ Categories | 10 categories · 20 samples each |
| 📊 Format | ShareGPT |

**Dataset Categories:** `code_generation` · `code_review` · `debugging` · `refactoring` · `ui_html_css` · `ui_react` · `ui_tailwind` · `design_system` · `ux_analysis` · `design_to_code`

</details>
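The trainable-parameter count in the table can be sanity-checked from the LoRA rank alone: a LoRA adapter on a `d_in × d_out` linear layer adds `r · (d_in + d_out)` parameters. A minimal sketch, assuming the Qwen3-VL-4B text-decoder dimensions (hidden size 2560, 36 layers, 32 query / 8 KV heads of head dim 128, MLP intermediate size 9728 — these shapes are assumptions, not stated in this card):

```python
# Sanity-check the "33,030,144 trainable params" figure from the table.
# Each LoRA adapter on a d_in x d_out linear layer contributes
# r * (d_in + d_out) parameters (an r x d_in "A" matrix plus a
# d_out x r "B" matrix). Dimensions below are assumed decoder shapes.
r = 16                 # LoRA rank (from the table)
hidden = 2560          # hidden size
n_layers = 36          # decoder layers
q_dim = 32 * 128       # 32 query heads x head_dim 128
kv_dim = 8 * 128       # 8 key/value heads x head_dim 128
mlp = 9728             # MLP intermediate size

per_layer = (
    r * (hidden + q_dim)         # q_proj
    + 2 * r * (hidden + kv_dim)  # k_proj, v_proj
    + r * (q_dim + hidden)       # o_proj
    + 2 * r * (hidden + mlp)     # gate_proj, up_proj
    + r * (mlp + hidden)         # down_proj
)
total = n_layers * per_layer
print(total)  # 33030144, matching the table
```

The count matching exactly is a good sign that the adapters cover all seven projection modules in every decoder layer, with none frozen or skipped.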
---

## 📊 Training Progress

Loss decreased consistently throughout training, from **1.242** to **0.444** ✅

```
Step  5 │███░░░░░░░░░░░░░░░░░│ Loss: 1.242
Step 10 │██████░░░░░░░░░░░░░░│ Loss: 0.959
Step 15 │████████░░░░░░░░░░░░│ Loss: 0.808
Step 20 │██████████░░░░░░░░░░│ Loss: 0.671
Step 25 │████████████░░░░░░░░│ Loss: 0.544
Step 30 │████████████░░░░░░░░│ Loss: 0.561
Step 35 │█████████████░░░░░░░│ Loss: 0.513
Step 40 │█████████████░░░░░░░│ Loss: 0.469
Step 45 │██████████████░░░░░░│ Loss: 0.448
Step 50 │██████████████░░░░░░│ Loss: 0.465
Step 55 │██████████████░░░░░░│ Loss: 0.453
Step 60 │██████████████░░░░░░│ Loss: 0.465
Step 65 │██████████████░░░░░░│ Loss: 0.465
Step 70 │██████████████░░░░░░│ Loss: 0.450
Step 75 │██████████████░░░░░░│ Loss: 0.444
```

---

## 🚀 Usage

### 1. Install Dependencies

```bash
pip install "transformers>=4.57.0" "accelerate>=0.30.0" qwen-vl-utils
```

Note: the version specifiers must be quoted, otherwise the shell treats `>=` as a redirection. Qwen3-VL support requires a recent `transformers` release; upgrade if the architecture is not recognized.

### 2. Load the Model

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_id = "kadalicious22/snapgate-VL-4B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

SYSTEM_PROMPT = (
    "You are Snapgate AI, a multimodal AI assistant by Snapgate "
    "specialized in coding and UI/UX design."
)
```

### 3. Inference with an Image

```python
from qwen_vl_utils import process_vision_info

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.png"},
            {"type": "text", "text": "Analyze the UI from this image and generate its HTML/CSS code."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# Decode only the newly generated tokens, not the prompt
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```

### 4. Text-Only Inference

```python
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Write a Python function to validate email using regex."},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(response)
```

---

## ⚠️ Limitations

- 📦 Trained on a relatively small internal Snapgate dataset (200 samples); performance will improve as more data is added
- 🌐 Optimized for Indonesian and English; other languages have not been tested
- 🎯 Best performance on coding and UI-analysis tasks; less reliable in other domains (e.g., science, law, medicine)
- 🖥️ A GPU with at least 8 GB of VRAM is recommended for comfortable inference

---

## 📄 License

Released under the **Apache 2.0** license, following the license of the base model, [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct).

---

## 🔗 Links

| | |
|---|---|
| 🌐 Website | [snapgate.tech](https://snapgate.tech) |
| 🤗 Base Model | [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |
| 📧 Contact | Via the Snapgate website |

---
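### Appendix: Low-VRAM Loading (Optional)

If your GPU is below the roughly 8 GB VRAM guideline mentioned in the limitations, the model can also be loaded with the same 4-bit NF4 quantization that was used for QLoRA training. A minimal sketch, assuming `bitsandbytes` is installed (this trades some output quality for memory; it is an illustration, not an officially benchmarked configuration):

```python
# Low-VRAM loading sketch: quantize the base weights to 4-bit NF4 at
# load time, mirroring the QLoRA training setup described above.
# Requires: pip install bitsandbytes
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen3VLForConditionalGeneration,
)

model_id = "kadalicious22/snapgate-VL-4B"

# 4-bit NF4 quantization with bfloat16 compute, as in training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```

After loading, the inference snippets in the Usage section work unchanged, since quantization only affects how the weights are stored.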