---
license: apache-2.0
base_model: Qwen/Qwen3-VL-4B-Instruct
tags:
  - qwen3_vl
  - vision-language
  - multimodal
  - fine-tuned
  - qlora
  - safetensors
  - coding
  - design
language:
  - id
  - en
pipeline_tag: image-text-to-text
---
# 🌐 snapgate-VL-4B

### Vision-Language AI · Fine-tuned for Coding & Design

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Base Model](https://img.shields.io/badge/Base-Qwen3--VL--4B-orange)](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) [![Language](https://img.shields.io/badge/Language-ID%20%7C%20EN-green)](https://huggingface.co/kadalicious22/snapgate-VL-4B) [![Website](https://img.shields.io/badge/Website-snapgate.tech-purple)](https://snapgate.tech)

**snapgate-VL-4B** is a multimodal vision-language model fine-tuned from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) using **QLoRA**, specifically optimized for **developers** and **designers**: it understands both images and text with high precision.

*Developed by [Snapgate](https://snapgate.tech) · Made with ❤️ in Indonesia 🇮🇩*
---

## 🧠 Core Capabilities

| Capability | Description |
|------------|-------------|
| 💻 **Code Generation & Review** | Write, analyze, debug, and optimize code (Python, JS, TS, HTML/CSS, SQL, etc.) |
| 🎨 **UI/UX Design Analysis** | Analyze interface screenshots, provide design suggestions, identify UX issues |
| 🖼️ **Design to Code** | Convert mockups, wireframes, or UI screenshots into HTML/CSS/React/Tailwind code |
| 🏗️ **Diagram & Architecture** | Understand flowcharts, system architecture, ERDs, and technical diagrams |
| 📸 **Code from Image** | Read and explain code from screenshots or photos |
| 📝 **Technical Documentation** | Generate clear, structured, and professional technical documentation |

---

## 🔧 Training Configuration
<details>
<summary>Click to view training details</summary>

| Parameter | Value |
|-----------|-------|
| 🤖 Base Model | `Qwen/Qwen3-VL-4B-Instruct` |
| ⚙️ Method | QLoRA (4-bit NF4) |
| 🔢 LoRA Rank | 16 |
| 🔢 LoRA Alpha | 32 |
| 🎯 Target Modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| 🔢 Trainable Params | 33,030,144 **(0.74% of total)** |
| 🔄 Epochs | 3 |
| 📶 Total Steps | 75 |
| 📈 Learning Rate | `1e-4` |
| 📦 Batch Size | 1 (gradient accumulation: 8) |
| ⚡ Optimizer | `paged_adamw_8bit` |
| 🎛️ Precision | `bfloat16` |
| 🖥️ Hardware | NVIDIA T4 · Google Colab |
| 📦 Dataset | 200 internal Snapgate samples |
| 🏷️ Categories | 10 categories · 20 samples each |
| 📊 Format | ShareGPT |

**Dataset Categories:** `code_generation` · `code_review` · `debugging` · `refactoring` · `ui_html_css` · `ui_react` · `ui_tailwind` · `design_system` · `ux_analysis` · `design_to_code`

</details>
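The trainable-parameter count in the table can be sanity-checked from the LoRA rank alone: a LoRA adapter on a `d_in × d_out` linear layer adds `r · (d_in + d_out)` parameters. A minimal sketch, assuming the Qwen3-VL-4B text-decoder dimensions (hidden size 2560, 36 layers, 32 query / 8 KV heads of head dim 128, MLP intermediate size 9728 — these shapes are assumptions, not stated in this card):

```python
# Sanity-check the "33,030,144 trainable params" figure from the table.
# Each LoRA adapter on a d_in x d_out linear layer contributes
# r * (d_in + d_out) parameters (an r x d_in "A" matrix plus a
# d_out x r "B" matrix). Dimensions below are assumed decoder shapes.
r = 16                 # LoRA rank (from the table)
hidden = 2560          # hidden size
n_layers = 36          # decoder layers
q_dim = 32 * 128       # 32 query heads x head_dim 128
kv_dim = 8 * 128       # 8 key/value heads x head_dim 128
mlp = 9728             # MLP intermediate size

per_layer = (
    r * (hidden + q_dim)         # q_proj
    + 2 * r * (hidden + kv_dim)  # k_proj, v_proj
    + r * (q_dim + hidden)       # o_proj
    + 2 * r * (hidden + mlp)     # gate_proj, up_proj
    + r * (mlp + hidden)         # down_proj
)
total = n_layers * per_layer
print(total)  # 33030144, matching the table
```

The count matching exactly is a good sign that the adapters cover all seven projection modules in every decoder layer, with none frozen or skipped.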
---

## 📊 Training Progress

Loss decreased consistently throughout training, from **1.242** to **0.444** ✅

```
Step  5 │███░░░░░░░░░░░░░░░░░│ Loss: 1.242
Step 10 │██████░░░░░░░░░░░░░░│ Loss: 0.959
Step 15 │████████░░░░░░░░░░░░│ Loss: 0.808
Step 20 │██████████░░░░░░░░░░│ Loss: 0.671
Step 25 │████████████░░░░░░░░│ Loss: 0.544
Step 30 │████████████░░░░░░░░│ Loss: 0.561
Step 35 │█████████████░░░░░░░│ Loss: 0.513
Step 40 │█████████████░░░░░░░│ Loss: 0.469
Step 45 │██████████████░░░░░░│ Loss: 0.448
Step 50 │██████████████░░░░░░│ Loss: 0.465
Step 55 │██████████████░░░░░░│ Loss: 0.453
Step 60 │██████████████░░░░░░│ Loss: 0.465
Step 65 │██████████████░░░░░░│ Loss: 0.465
Step 70 │██████████████░░░░░░│ Loss: 0.450
Step 75 │██████████████░░░░░░│ Loss: 0.444
```

---

## 🚀 Usage

### 1. Install Dependencies

```bash
pip install "transformers>=4.57.0" "accelerate>=0.30.0" qwen-vl-utils
```

Note: the version specifiers must be quoted, otherwise the shell treats `>=` as a redirection. Qwen3-VL support requires a recent `transformers` release; upgrade if the architecture is not recognized.

### 2. Load the Model

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_id = "kadalicious22/snapgate-VL-4B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

SYSTEM_PROMPT = (
    "You are Snapgate AI, a multimodal AI assistant by Snapgate "
    "specialized in coding and UI/UX design."
)
```

### 3. Inference with an Image

```python
from qwen_vl_utils import process_vision_info

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.png"},
            {"type": "text", "text": "Analyze the UI from this image and generate its HTML/CSS code."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# Decode only the newly generated tokens, not the prompt
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```

### 4. Text-Only Inference

```python
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Write a Python function to validate email using regex."},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(response)
```

---

## ⚠️ Limitations

- 📦 Trained on a relatively small internal Snapgate dataset (200 samples); performance will improve as more data is added
- 🌐 Optimized for Indonesian and English; other languages have not been tested
- 🎯 Best performance on coding and UI-analysis tasks; less reliable in other domains (e.g., science, law, medicine)
- 🖥️ A GPU with at least 8 GB of VRAM is recommended for comfortable inference

---

## 📄 License

Released under the **Apache 2.0** license, following the license of the base model, [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct).

---

## 🔗 Links

| | |
|---|---|
| 🌐 Website | [snapgate.tech](https://snapgate.tech) |
| 🤗 Base Model | [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |
| 📧 Contact | Via the Snapgate website |

---
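### Appendix: Low-VRAM Loading (Optional)

If your GPU is below the roughly 8 GB VRAM guideline mentioned in the limitations, the model can also be loaded with the same 4-bit NF4 quantization that was used for QLoRA training. A minimal sketch, assuming `bitsandbytes` is installed (this trades some output quality for memory; it is an illustration, not an officially benchmarked configuration):

```python
# Low-VRAM loading sketch: quantize the base weights to 4-bit NF4 at
# load time, mirroring the QLoRA training setup described above.
# Requires: pip install bitsandbytes
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen3VLForConditionalGeneration,
)

model_id = "kadalicious22/snapgate-VL-4B"

# 4-bit NF4 quantization with bfloat16 compute, as in training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```

After loading, the inference snippets in the Usage section work unchanged, since quantization only affects how the weights are stored.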