---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
tags:
  - vision
  - vlm
  - agent
  - mobile-ui
  - react-native
  - tool-use
  - qwen3.5
  - lora
  - fine-tuned
pipeline_tag: image-text-to-text
language:
  - en
---

# Stone Preview 4B: Multimodal UI Agent

A fine-tuned vision-language model that acts as a **React Native UI engineer**. Given a reference mobile app screenshot, it builds the matching screen by emitting tool calls (such as `Read`, `Write`, `Edit`, `Glob`, `Bash`, and `Render`) and iterating on visual feedback until the render matches the reference.

<p align="center">
  <img src="assets/showcase.png" alt="Sample outputs from Stone Preview 4B" width="800">
  <br>
  <em>Screens built by the model from reference screenshots: Nike, How We Feel, Vivid</em>
</p>

## Model Details

| | |
|---|---|
| **Base model** | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
| **Architecture** | `Qwen3_5ForConditionalGeneration`: 32 layers, 2560 hidden, 16 attention heads |
| **Parameters** | ~4B (bfloat16) |
| **Fine-tuning** | LoRA (r=32, alpha=64, all-linear targets) via ms-swift 4.2.1 |
| **Training data** | 1,232 agent traces (v3), each a multi-turn tool-calling loop with visual feedback |
| **Context** | 32,768 tokens max |
| **Hardware** | 4x H200 SXM |
| **Format** | Merged safetensors (LoRA folded into base weights) |

## What the Model Does

The model has learned two core behaviors:

1. **Visual reasoning**: analyze a reference screenshot to identify layout, UI components, colors, spacing, and hierarchy
2. **Tool-call emission**: given the conversation history, emit properly formatted XML tool calls to read files, write code, render the result, and iterate

Each training example is a hydrated agent trace: the model sees a reference screenshot, explores the project structure, writes React Native/Expo code, renders it, compares against the reference, and iterates until the output matches.

### Tool Call Format

The model emits tool calls as inline XML in its responses:

```xml
<tool_call>
<function=Write>
<parameter=file_path>
app/(flows)/flow-001/screen-001.tsx
</parameter>
<parameter=content>
import React from 'react';
import { View, Text, StyleSheet } from 'react-native';
// ... component code
</parameter>
</function>
</tool_call>
```

Available tools: `Read`, `Write`, `Edit`, `Glob`, `Grep`, `Bash`, `Render`, `ToolSearch`
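
When the model is run without a server-side tool parser, the inline XML has to be extracted from the response text by the caller. The following is a minimal sketch of such a parser, assuming the layout shown above (one `<function=...>` per `<tool_call>`, each parameter value on its own lines). The function name `parse_tool_calls` and the returned dictionary shape are illustrative choices, not part of the model's release.

```python
import re

# One <tool_call> wraps exactly one <function=NAME> ... </function> block.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# Each parameter value sits between the tags; strip one framing newline per side.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in a model response."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        name, body = match.group(1), match.group(2)
        arguments = {key: value for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


if __name__ == "__main__":
    response = (
        "I will create the screen.\n"
        "<tool_call>\n<function=Write>\n"
        "<parameter=file_path>\napp/screen.tsx\n</parameter>\n"
        "<parameter=content>\nimport React from 'react';\n</parameter>\n"
        "</function>\n</tool_call>"
    )
    print(parse_tool_calls(response))
```

This regex approach would break if a parameter value itself contained a literal `</parameter>` or `</function>` string, so a production parser may need something stricter.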

## Usage

### With vLLM (recommended)

```bash
vllm serve Scrymore/stone-preview-4b \
  --dtype bfloat16 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml
```

```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Encode reference screenshot
with open("reference.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Scrymore/stone-preview-4b",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a React Native (Expo) UI engineer. Given a reference "
                "mobile-app screenshot, you build the matching screen by editing "
                "the project at WORKDIR. You have Read, Write, Edit, Glob, Grep, "
                "and Bash tools. Iterate by rendering and visually comparing to "
                "the reference, then stop when the render matches."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                {"type": "text", "text": (
                    "Reference screenshot above.\n"
                    "WORKDIR: /workspace/my-app\n"
                    "Target file: app/screen.tsx\n\n"
                    "Build the screen that matches the reference."
                )},
            ],
        },
    ],
    tools=[...],  # your tool schemas
)
```
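
The `tools` argument above is left elided. As a rough illustration of what it could contain, the sketch below builds two entries in the OpenAI function-calling schema format. The `Write` parameters (`file_path`, `content`) follow the tool call example earlier in this card; the `Read` parameter name and all description strings are assumptions, and `make_tool` is a hypothetical helper.

```python
def make_tool(name, description, properties, required):
    """Build one tool entry in the OpenAI chat-completions `tools` format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


TOOLS = [
    # Parameter names taken from the Write example in "Tool Call Format".
    make_tool(
        "Write",
        "Write content to a file in the project (description is illustrative).",
        {"file_path": {"type": "string"}, "content": {"type": "string"}},
        ["file_path", "content"],
    ),
    # ASSUMPTION: Read takes a single `file_path`; the card does not specify this.
    make_tool(
        "Read",
        "Read a file from the project (description is illustrative).",
        {"file_path": {"type": "string"}},
        ["file_path"],
    ),
]
```

A list like `TOOLS` would be passed as `tools=TOOLS`; the remaining tools would need schemas matching whatever your executor implements.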

### With transformers

```python
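from transformers import AutoModelForImageTextToText, AutoProcessor

# The Auto class resolves the architecture from the checkpoint config
# (`qwen3_5` model type, i.e. Qwen3_5ForConditionalGeneration).
model = AutoModelForImageTextToText.from_pretrained(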
    "Scrymore/stone-preview-4b",
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
```

> **Note:** Requires `transformers >= 5.3` for `qwen3_5` model type support.

## Training Details

### Data

Each training record is a hydrated agent trace from a Claude-driven screen-building loop over 87 iOS apps (Mobbin corpus). Each trace yields three SFT variants:

- **trajectory**: full multi-turn agent loop (system → user(ref-image) → assistant tool calls → tool results → ...)
- **oneshot**: reference image → final TSX code in one turn
- **turn**: predict the next assistant message given history up to turn k

Final dataset (v3): 1,232 train / 66 val records, filtered to ≤23k tokens. Mean 16.1k tokens, mean 4.3 render iterations per trace.
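
To make the three variants concrete, here is a small sketch of how they might be derived from a single trace. It assumes a trace is a list of `{"role", "content"}` messages and that the final TSX source is available separately. The function names are hypothetical, and the card does not say how `k` is chosen for the **turn** variant, so this version simply emits one sample per assistant turn.

```python
def trajectory_variant(messages):
    """Full multi-turn loop, unchanged."""
    return list(messages)


def oneshot_variant(messages, final_tsx):
    """System prompt (if any) + first user turn, answered directly with final code."""
    system = [m for m in messages if m["role"] == "system"][:1]
    first_user = next(m for m in messages if m["role"] == "user")
    return system + [first_user, {"role": "assistant", "content": final_tsx}]


def turn_variants(messages):
    """One sample per assistant turn k: history up to k, with turn k as the target."""
    samples = []
    for k, message in enumerate(messages):
        if message["role"] == "assistant":
            samples.append(messages[: k + 1])
    return samples


if __name__ == "__main__":
    trace = [
        {"role": "system", "content": "You are a React Native UI engineer."},
        {"role": "user", "content": "<reference image> Build the screen."},
        {"role": "assistant", "content": "<tool_call>...Write...</tool_call>"},
        {"role": "tool", "content": "ok"},
        {"role": "assistant", "content": "<tool_call>...Render...</tool_call>"},
    ]
    print(len(trajectory_variant(trace)), len(turn_variants(trace)))
```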

### Key Design Decisions

- **Inline XML tool calls**: ms-swift's encoder reads `message['content']` only, silently dropping the `tool_calls` field. Tool calls are rendered as XML directly in the content string so they land in the loss-bearing token region.
- **Visual feedback loop**: `Render` tool results include the rendered screenshot as an image, so the model learns to compare its output against the reference and iterate.
- **Loss scale weighting**: post-Render assistant turns weighted 2.5x to emphasize iteration behavior.
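
The first and third decisions can be sketched as a preprocessing step. This is an illustration only, not the actual training pipeline: the message shape (a `tool_calls` list of `{"name", "arguments"}` dicts, tool results carrying a `name` field) is assumed, and "post-Render" is interpreted here as the assistant turn that immediately follows a `Render` tool result.

```python
def render_tool_call(name, arguments):
    """Serialize one tool call into the inline XML layout used by the model."""
    lines = ["<tool_call>", f"<function={name}>"]
    for key, value in arguments.items():
        lines += [f"<parameter={key}>", str(value), "</parameter>"]
    lines += ["</function>", "</tool_call>"]
    return "\n".join(lines)


def inline_tool_calls(message):
    """Fold a structured `tool_calls` field into the content string."""
    parts = [message.get("content") or ""]
    for call in message.get("tool_calls") or []:
        parts.append(render_tool_call(call["name"], call["arguments"]))
    content = "\n".join(part for part in parts if part)
    return {"role": message["role"], "content": content}


def assistant_loss_weights(messages, post_render_weight=2.5):
    """Per-assistant-turn weights: boosted right after a Render result (assumed rule)."""
    weights = []
    previous_was_render = False
    for message in messages:
        if message["role"] == "assistant":
            weights.append(post_render_weight if previous_was_render else 1.0)
        previous_was_render = (
            message["role"] == "tool" and message.get("name") == "Render"
        )
    return weights
```

How the weights are then passed to the trainer depends on the framework's loss-scale mechanism, which is not shown here.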

### Config

| | |
|---|---|
| **Framework** | ms-swift 4.2.1 |
| **LoRA** | r=32, alpha=64, all-linear targets |
| **Precision** | bfloat16 |
| **Attention** | FlashAttention 2 |
| **Max context** | 32,768 tokens |
| **Max pixels** | 602,112 (576 tokens/image) |
| **LR** | 5e-5, cosine schedule, 5% warmup |
| **Effective batch** | 8 (BS=1 x GA=2 x 4 GPUs) |

## GGUF Quantizations

| File | Size | Description |
|------|------|-------------|
| `stone-preview-4b-f16.gguf` | 8.4 GB | F16, unquantized |
| `stone-preview-4b-q4_k_m.gguf` | 2.7 GB | Q4_K_M quantized (5.13 BPW) |
| `stone-preview-4b-mmproj-f16.gguf` | 25.6 MB | Vision projector |

## Limitations

- Trained on iOS app screenshots only (87 apps from Mobbin). Android, web, and desktop UIs are untested.
- The app corpus skews toward consumer apps (social, fintech, health, travel). Enterprise/B2B UIs may produce lower-quality results.
- Works best when embedded in an agent loop with actual tool execution and visual feedback. Standalone generation without iterative rendering produces weaker results.
- Requires `transformers >= 5.3` for `qwen3_5` model type support.
- When serving with vLLM, the LoRA must be **merged** into base weights before serving. vLLM's LoRA loader silently drops vision-tower deltas.
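
Since the model is meant to run inside an agent loop, the skeleton below shows one possible shape for that loop. It is a sketch under stated assumptions: `generate`, `parse_tool_calls`, and `execute_tool` are hypothetical callables you supply (a model client, a parser for the inline XML, and a tool executor), and the stopping rule (stop when a reply contains no tool calls, or after `max_turns`) is an illustrative choice.

```python
def run_agent_loop(generate, parse_tool_calls, execute_tool, messages, max_turns=20):
    """Alternate model replies and tool execution until the model stops calling tools.

    generate(history) -> str           : returns the next assistant message text
    parse_tool_calls(text) -> list     : returns [{"name": ..., "arguments": {...}}, ...]
    execute_tool(name, arguments) -> str : runs the tool and returns its result
    """
    history = list(messages)
    for _ in range(max_turns):
        reply = generate(history)
        history.append({"role": "assistant", "content": reply})

        calls = parse_tool_calls(reply)
        if not calls:
            # No tool call in the reply: treat this as the model being done.
            break

        for call in calls:
            result = execute_tool(call["name"], call["arguments"])
            history.append({"role": "tool", "name": call["name"], "content": result})
    return history
```

In a real setup, the `Render` result would carry the rendered screenshot as an image so the model can compare it with the reference, as described under Key Design Decisions.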

## Citation

```bibtex
@misc{stone-preview-4b-2026,
  title={Stone Preview 4B: A Multimodal Agent for Mobile UI Screen Building},
  author={Ejiro Pinnock},
  year={2026},
  url={https://huggingface.co/Scrymore/stone-preview-4b}
}
```