---
license: apache-2.0
base_model: Qwen/Qwen3-VL-4B-Instruct
tags:
- qwen3_vl
- vision-language
- multimodal
- fine-tuned
- qlora
- safetensors
- coding
- design
language:
- id
- en
pipeline_tag: image-text-to-text
---

<div align="center">

<img src="https://snapgate.tech/img/snapgatelogo.jpg" alt="Snapgate Logo" width="120"/>

# 🌐 snapgate-VL-4B

### Vision-Language AI · Fine-tuned for Coding & Design

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Base Model](https://img.shields.io/badge/Base-Qwen3--VL--4B--Instruct-orange)](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)
[![Model](https://img.shields.io/badge/%F0%9F%A4%97-snapgate--VL--4B-yellow)](https://huggingface.co/kadalicious22/snapgate-VL-4B)
[![Website](https://img.shields.io/badge/Website-snapgate.tech-blue)](https://snapgate.tech)

**snapgate-VL-4B** is a multimodal vision-language model fine-tuned from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) using **QLoRA**, optimized specifically for **developers** and **designers**: it understands both images and text with high precision.

*Developed by [Snapgate](https://snapgate.tech) · Made with ❤️ in Indonesia 🇮🇩*

</div>

---

## 🧠 Core Capabilities

| Capability | Description |
|-----------|-------------|
| 💻 **Code Generation & Review** | Write, analyze, debug, and optimize code (Python, JS, TS, HTML/CSS, SQL, etc.) |
| 🎨 **UI/UX Design Analysis** | Analyze interface screenshots, provide design suggestions, identify UX issues |
| 🖼️ **Design to Code** | Convert mockups, wireframes, or UI screenshots into HTML/CSS/React/Tailwind code |
| 🏗️ **Diagram & Architecture** | Understand flowcharts, system architectures, ERDs, and technical diagrams |
| 📸 **Code from Image** | Read and explain code from screenshots or photos |
| 📝 **Technical Documentation** | Generate clear, structured, and professional technical documentation |

---

## 🔧 Training Configuration

<details>
<summary><b>Click to view training details</b></summary>

| Parameter | Value |
|-----------|-------|
| 🤖 Base Model | `Qwen/Qwen3-VL-4B-Instruct` |
| ⚙️ Method | QLoRA (4-bit NF4) |
| 🔢 LoRA Rank | 16 |
| 🔢 LoRA Alpha | 32 |
| 🎯 Target Modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| 🔢 Trainable Params | 33,030,144 **(0.74% of total)** |
| 🔄 Epochs | 3 |
| 📶 Total Steps | 75 |
| 📈 Learning Rate | `1e-4` |
| 📦 Batch Size | 1 (gradient accumulation: 8) |
| ⚡ Optimizer | `paged_adamw_8bit` |
| 🎛️ Precision | `bfloat16` |
| 🖥️ Hardware | NVIDIA T4 · Google Colab |
| 📦 Dataset | 200 internal Snapgate samples |
| 🏷️ Categories | 10 categories · 20 samples each |
| 📊 Format | ShareGPT |

**Dataset Categories:**
`code_generation` · `code_review` · `debugging` · `refactoring` · `ui_html_css` · `ui_react` · `ui_tailwind` · `design_system` · `ux_analysis` · `design_to_code`
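
For reference, ShareGPT-formatted training data pairs each image with a multi-turn conversation. Below is a minimal sketch of what one sample might look like; the `"from"`/`"value"` keys and the `"images"` list follow the common ShareGPT convention, not the (unpublished) schema of the internal dataset, and the file path is hypothetical:

```python
import json

# Hypothetical ShareGPT-style sample; field names follow the common
# convention, not a published Snapgate schema.
sample = {
    "images": ["screenshots/login_form.png"],  # hypothetical path
    "conversations": [
        {"from": "human",
         "value": "<image>\nConvert this login form into HTML/CSS."},
        {"from": "gpt",
         "value": "<!DOCTYPE html>\n<html>\n  <!-- generated markup -->\n</html>"},
    ],
}

line = json.dumps(sample)  # typically one sample per line in a JSONL file
print(line[:60])
```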

</details>
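
The QLoRA recipe in the table maps onto `peft` and `bitsandbytes` configuration objects roughly as follows. This is a sketch under assumptions: the actual training script is not published, and `lora_dropout` is not stated in the table.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with bfloat16 compute, as listed in the table
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter matching the reported rank, alpha, and target modules
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,  # assumption: not stated in the table
    task_type="CAUSAL_LM",
)
```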

---

## 📊 Training Progress

Training loss fell from **1.242** to **0.444**: a steep drop over the first ~25 steps, then a plateau around 0.45 from step 45 onward ✅

```
Step  5 │███░░░░░░░░░░░░░░░░░│ Loss: 1.242
Step 10 │██████░░░░░░░░░░░░░░│ Loss: 0.959
Step 15 │████████░░░░░░░░░░░░│ Loss: 0.808
Step 20 │██████████░░░░░░░░░░│ Loss: 0.671
Step 25 │████████████░░░░░░░░│ Loss: 0.544
Step 30 │████████████░░░░░░░░│ Loss: 0.561
Step 35 │█████████████░░░░░░░│ Loss: 0.513
Step 40 │█████████████░░░░░░░│ Loss: 0.469
Step 45 │██████████████░░░░░░│ Loss: 0.448
Step 50 │██████████████░░░░░░│ Loss: 0.465
Step 55 │██████████████░░░░░░│ Loss: 0.453
Step 60 │██████████████░░░░░░│ Loss: 0.465
Step 65 │██████████████░░░░░░│ Loss: 0.465
Step 70 │██████████████░░░░░░│ Loss: 0.450
Step 75 │██████████████░░░░░░│ Loss: 0.444
```

---

## 🚀 Usage

### 1. Install Dependencies

```bash
pip install "transformers>=4.51.0" "accelerate>=0.30.0" qwen-vl-utils
```

### 2. Load Model

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

model_id = "kadalicious22/snapgate-VL-4B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

SYSTEM_PROMPT = """You are Snapgate AI, a multimodal AI assistant by Snapgate \
specialized in coding and UI/UX design."""
```

### 3. Inference with Image

```python
from qwen_vl_utils import process_vision_info

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.png"},
            {"type": "text", "text": "Analyze the UI from this image and generate its HTML/CSS code."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```

### 4. Text-Only Inference

```python
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Write a Python function to validate email using regex."},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(response)
```

---

## ⚠️ Limitations

- 📦 Trained on a relatively small internal Snapgate dataset (200 samples); performance should improve as more data is added
- 🌏 Optimized for Indonesian and English; other languages have not been tested
- 🎯 Strongest on coding and UI-analysis tasks; less reliable in other domains (e.g., science, law, medicine)
- 🖥️ A GPU with at least 8 GB VRAM is recommended for comfortable inference
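
The VRAM recommendation can be sanity-checked with rough arithmetic. Taking the total parameter count implied by the training table (33,030,144 trainable ≈ 0.74% of total, i.e. roughly 4.5B parameters, an inference rather than a published figure) and counting weight memory only:

```python
# Back-of-the-envelope weight memory; activations and KV cache come on top.
trainable = 33_030_144
total = trainable / 0.0074        # ≈ 4.46e9 params implied by the 0.74% figure

bf16_gb = total * 2 / 1e9         # bfloat16: 2 bytes per parameter
nf4_gb = total * 0.5 / 1e9        # NF4: ~4 bits per parameter

print(f"bf16 weights ≈ {bf16_gb:.1f} GB, 4-bit weights ≈ {nf4_gb:.1f} GB")
```

By this estimate, bf16 weights alone approach 9 GB, so on an 8 GB card 4-bit loading (`load_in_4bit` via `BitsAndBytesConfig`) or CPU offload through `device_map="auto"` is the practical route.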

---

## 📄 License

Released under the **Apache 2.0** license, matching the license of the base model, [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct).

---

## 🔗 Links

|  |  |
|---|---|
| 🌐 Website | [snapgate.tech](https://snapgate.tech) |
| 🤗 Base Model | [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |
| 📧 Contact | Via the [Snapgate website](https://snapgate.tech) |

---