---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
tags:
- vision
- vlm
- agent
- mobile-ui
- react-native
- tool-use
- qwen3.5
- lora
- fine-tuned
pipeline_tag: image-text-to-text
language:
- en
---

# Stone Preview 4B: Multimodal UI Agent

A fine-tuned vision-language model that acts as a **React Native UI engineer**. Given a reference mobile app screenshot, it builds the matching screen by emitting tool calls (`Read`, `Write`, `Edit`, `Glob`, `Bash`, `Render`) and iterating on visual feedback until the render matches the reference.

*Sample outputs from Stone Preview 4B: screens built by the model from reference screenshots (Nike, How We Feel, Vivid).*

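The build, render, compare cycle described in the introduction can be sketched as a small driver loop. This is a minimal illustration, not the author's harness: `run_agent_loop`, `ToolCall`, `ModelTurn`, `generate`, `execute_tool`, and `max_turns` are all hypothetical names, and the stop condition (the model returns a turn with no tool calls) is an assumption about how "stop when the render matches" shows up in practice.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolCall:
    name: str                      # e.g. "Write" or "Render"
    arguments: dict[str, Any]      # hypothetical argument names, chosen by the caller


@dataclass
class ModelTurn:
    text: str                                        # raw assistant message
    tool_calls: list[ToolCall] = field(default_factory=list)


def run_agent_loop(
    generate: Callable[[list[dict]], ModelTurn],     # wraps the model endpoint
    execute_tool: Callable[[ToolCall], Any],         # runs one tool, returns its result
    messages: list[dict],
    max_turns: int = 20,
) -> list[dict]:
    """Alternate model turns and tool execution until the model stops calling tools."""
    history = list(messages)
    for _ in range(max_turns):
        turn = generate(history)
        history.append({"role": "assistant", "content": turn.text})
        if not turn.tool_calls:
            break  # assumption: no tool calls means the model considers the screen done
        for call in turn.tool_calls:
            # For Render, the result would carry the rendered screenshot so the
            # model can compare it against the reference on its next turn.
            result = execute_tool(call)
            history.append({"role": "tool", "name": call.name, "content": result})
    return history
```

In a real deployment `generate` would call the served model and parse its tool calls, and `execute_tool` would touch the project directory and renderer; both are left as injected callables here so the loop itself stays easy to test.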
## Model Details

| | |
|---|---|
| **Base model** | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
| **Architecture** | `Qwen3_5ForConditionalGeneration`: 32 layers, 2560 hidden, 16 attention heads |
| **Parameters** | ~4B (bfloat16) |
| **Fine-tuning** | LoRA (r=32, alpha=64, all-linear targets) via ms-swift 4.2.1 |
| **Training data** | 1,232 agent traces (v3), each a multi-turn tool-calling loop with visual feedback |
| **Context** | 32,768 tokens max |
| **Hardware** | 4x H200 SXM |
| **Format** | Merged safetensors (LoRA folded into base weights) |

## What the Model Does

The model has learned two core behaviors:

1. **Visual reasoning**: analyze a reference screenshot to identify layout, UI components, colors, spacing, and hierarchy
2. **Tool-call emission**: given the conversation history, emit properly formatted XML tool calls to read files, write code, render the result, and iterate

Each training example is a hydrated agent trace: the model sees a reference screenshot, explores the project structure, writes React Native/Expo code, renders it, compares the render against the reference, and iterates until the output matches.

### Tool Call Format

The model emits tool calls as inline XML in its responses:

```xml
app/(flows)/flow-001/screen-001.tsx
import React from 'react';
import { View, Text, StyleSheet } from 'react-native';
// ... component code
```

Available tools: `Read`, `Write`, `Edit`, `Glob`, `Grep`, `Bash`, `Render`, `ToolSearch`

## Usage

### With vLLM (recommended)

```bash
vllm serve Scrymore/stone-preview-4b \
  --dtype bfloat16 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml
```

```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Encode reference screenshot
with open("reference.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Scrymore/stone-preview-4b",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a React Native (Expo) UI engineer. Given a reference "
                "mobile-app screenshot, you build the matching screen by editing "
                "the project at WORKDIR. You have Read, Write, Edit, Glob, Grep, "
                "and Bash tools. Iterate by rendering and visually comparing to "
                "the reference, then stop when the render matches."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                {"type": "text", "text": (
                    "Reference screenshot above.\n"
                    "WORKDIR: /workspace/my-app\n"
                    "Target file: app/screen.tsx\n\n"
                    "Build the screen that matches the reference."
                )},
            ],
        },
    ],
    tools=[...],  # your tool schemas
)
```

### With transformers

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# The Auto class resolves the architecture from the checkpoint config
# (model type `qwen3_5`), so the class name does not need to be hard-coded.
model = AutoModelForImageTextToText.from_pretrained(
    "Scrymore/stone-preview-4b",
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
```

> **Note:** Requires `transformers >= 5.3` for `qwen3_5` model type support.

## Training Details

### Data

Each training record is a hydrated agent trace from a Claude-driven screen-building loop over 87 iOS apps (Mobbin corpus). Three SFT variants are derived per trace:

- **trajectory**: full multi-turn agent loop (system → user(ref-image) → assistant tool calls → tool results → ...)
- **oneshot**: reference image → final TSX code in one turn
- **turn**: predict the next assistant message given the history up to turn k

Final dataset (v3): 1,232 train / 66 val records, filtered to ≤23k tokens. Mean length is 16.1k tokens, with a mean of 4.3 render iterations per trace.

### Key Design Decisions

- **Inline XML tool calls**: ms-swift's encoder reads `message['content']` only, silently dropping the `tool_calls` field. Tool calls are therefore rendered as XML directly in the content string so they land in the loss-bearing token region.
- **Visual feedback loop**: `Render` tool results include the rendered screenshot as an image, so the model learns to compare its output against the reference and iterate.
- **Loss scale weighting**: post-Render assistant turns are weighted 2.5x to emphasize iteration behavior.

### Config

| | |
|---|---|
| **Framework** | ms-swift 4.2.1 |
| **LoRA** | r=32, alpha=64, all-linear targets |
| **Precision** | bfloat16 |
| **Attention** | FlashAttention 2 |
| **Max context** | 32,768 tokens |
| **Max pixels** | 602,112 (576 tokens/image) |
| **LR** | 5e-5, cosine schedule, 5% warmup |
| **Effective batch** | 8 (BS=1 x GA=2 x 4 GPUs) |

## GGUF Quantizations

| File | Size | Description |
|------|------|-------------|
| `stone-preview-4b-f16.gguf` | 8.4 GB | F16, unquantized |
| `stone-preview-4b-q4_k_m.gguf` | 2.7 GB | Q4_K_M quantized (5.13 BPW) |
| `stone-preview-4b-mmproj-f16.gguf` | 25.6 MB | Vision projector |

## Limitations

- Trained on iOS app screenshots only (87 apps from Mobbin). Android, web, and desktop UIs are untested.
- The app corpus skews toward consumer apps (social, fintech, health, travel). Enterprise/B2B UIs may produce lower-quality results.
- Works best when embedded in an agent loop with actual tool execution and visual feedback. Standalone generation without iterative rendering produces weaker results.
- Requires `transformers >= 5.3` for `qwen3_5` model type support.
- When serving with vLLM, the LoRA must be **merged** into the base weights first (as in this repository's merged safetensors), because vLLM's LoRA loader silently drops vision-tower deltas.

## Citation

```bibtex
@misc{stone-preview-4b-2026,
  title={Stone Preview 4B: A Multimodal Agent for Mobile UI Screen Building},
  author={Ejiro Pinnock},
  year={2026},
  url={https://huggingface.co/Scrymore/stone-preview-4b}
}
```
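The loss scale weighting noted under Key Design Decisions (post-Render assistant turns weighted 2.5x) can be illustrated with a small per-message weighting function. This is a sketch under stated assumptions, not the ms-swift configuration used for training: `assistant_loss_weights` is a hypothetical helper, tool results are assumed to carry a `name` field, and non-assistant messages are assumed to carry no loss.

```python
def assistant_loss_weights(messages, render_tool="Render", post_render_weight=2.5):
    """Return one loss weight per message.

    Assistant turns that follow a Render tool result get `post_render_weight`
    (2.5 in the model card); other assistant turns get 1.0; every other role
    gets 0.0 (assumed not to contribute to the loss).
    """
    weights = []
    saw_render = False  # has a Render result arrived since the last assistant turn?
    for msg in messages:
        if msg["role"] == "assistant":
            weights.append(post_render_weight if saw_render else 1.0)
            saw_render = False
        else:
            weights.append(0.0)
            # Hypothetical schema: tool results are tagged with the tool's name.
            if msg["role"] == "tool" and msg.get("name") == render_tool:
                saw_render = True
    return weights
```

Applied to a trace of the form system, user, assistant, `Write` result, assistant, `Render` result, assistant, only the final assistant turn receives the 2.5x weight.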