---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
tags:
- vision
- vlm
- agent
- mobile-ui
- react-native
- tool-use
- qwen3.5
- lora
- fine-tuned
pipeline_tag: image-text-to-text
language:
- en
---
# Stone Preview 4B: Multimodal UI Agent

A fine-tuned vision-language model that acts as a **React Native UI engineer**. Given a reference mobile app screenshot, it builds the matching screen by emitting tool calls (for example `Read`, `Write`, `Edit`, `Glob`, `Bash`, and `Render`) and iterating on visual feedback until the render matches the reference.
<p align="center">
<img src="assets/showcase.png" alt="Sample outputs from Stone Preview 4B" width="800">
<br>
<em>Screens built by the model from reference screenshots: Nike, How We Feel, Vivid</em>
</p>
## Model Details
| | |
|---|---|
| **Base model** | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
| **Architecture** | `Qwen3_5ForConditionalGeneration`: 32 layers, hidden size 2560, 16 attention heads |
| **Parameters** | ~4B (bfloat16) |
| **Fine-tuning** | LoRA (r=32, alpha=64, all-linear targets) via ms-swift 4.2.1 |
| **Training data** | 1,232 agent traces (v3), each a multi-turn tool-calling loop with visual feedback |
| **Context** | 32,768 tokens max |
| **Hardware** | 4x H200 SXM |
| **Format** | Merged safetensors (LoRA folded into base weights) |
## What the Model Does
The model has learned two core behaviors:
1. **Visual reasoning**: analyze a reference screenshot to identify layout, UI components, colors, spacing, and hierarchy.
2. **Tool-call emission**: given the conversation history, emit properly formatted XML tool calls to read files, write code, render the result, and iterate.

Each training example is a hydrated agent trace: the model sees a reference screenshot, explores the project structure, writes React Native/Expo code, renders it, compares against the reference, and iterates until the output matches.
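
The loop described above can be sketched in a few lines. This is only an illustration of the control flow, not the harness used to collect the traces: `run_agent_loop`, `call_model`, and `execute_tool` are hypothetical names, messages are plain dicts, and stopping when a turn contains no tool calls is an assumption.

```python
from typing import Callable

Message = dict   # e.g. {"role": "assistant", "content": "..."} (illustrative shape)
ToolCall = dict  # e.g. {"name": "Render", "arguments": {...}} (illustrative shape)


def run_agent_loop(
    messages: list[Message],
    call_model: Callable[[list[Message]], tuple[str, list[ToolCall]]],
    execute_tool: Callable[[ToolCall], str],
    max_turns: int = 20,
) -> list[Message]:
    """Alternate model turns and tool execution until no tool call is emitted."""
    for _ in range(max_turns):
        text, tool_calls = call_model(messages)
        messages.append({"role": "assistant", "content": text})
        if not tool_calls:
            # Assumption: a turn without tool calls means the model is done.
            break
        for call in tool_calls:
            # A real Render result would also carry the rendered screenshot,
            # so the next model turn can compare it with the reference.
            messages.append({"role": "tool", "content": execute_tool(call)})
    return messages
```

In practice `call_model` would wrap a request to the served model and `execute_tool` would run the file, shell, and render operations inside the project directory.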
### Tool Call Format
The model emits tool calls as inline XML in its responses:
```xml
<tool_call>
<function=Write>
<parameter=file_path>
app/(flows)/flow-001/screen-001.tsx
</parameter>
<parameter=content>
import React from 'react';
import { View, Text, StyleSheet } from 'react-native';
// ... component code
</parameter>
</function>
</tool_call>
```
Available tools: `Read`, `Write`, `Edit`, `Glob`, `Grep`, `Bash`, `Render`, `ToolSearch`
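
If you consume raw completions rather than relying on a server-side tool-call parser, the format above can be extracted with two regular expressions. The following is a minimal sketch, assuming every call follows the exact layout shown in the example; `parse_tool_calls` is a hypothetical helper, and it will mis-parse a parameter value that itself contains a literal `</parameter>` or `</function>` string.

```python
import re

# One <tool_call> wrapping one <function=NAME> ... </function> body.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# One <parameter=KEY> value </parameter> pair; the newline directly after the
# opening tag and directly before the closing tag is not part of the value.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Return every inline XML tool call found in `text`, in order."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = {key: value for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls
```

All parameter values come back as strings; any type conversion is left to the tool executor.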
## Usage
### With vLLM (recommended)
```bash
vllm serve Scrymore/stone-preview-4b \
--dtype bfloat16 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml
```
```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Encode reference screenshot
with open("reference.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Scrymore/stone-preview-4b",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a React Native (Expo) UI engineer. Given a reference "
                "mobile-app screenshot, you build the matching screen by editing "
                "the project at WORKDIR. You have Read, Write, Edit, Glob, Grep, "
                "and Bash tools. Iterate by rendering and visually comparing to "
                "the reference, then stop when the render matches."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                {"type": "text", "text": (
                    "Reference screenshot above.\n"
                    "WORKDIR: /workspace/my-app\n"
                    "Target file: app/screen.tsx\n\n"
                    "Build the screen that matches the reference."
                )},
            ],
        },
    ],
    tools=[...],  # your tool schemas
)
```
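
The `tools` argument in the request above takes OpenAI-style function schemas. The schemas used during training are not listed in this README, so the following is only a sketch of the expected shape: the `Write` parameter names (`file_path`, `content`) come from the tool-call example earlier, while the `Read` parameter, all descriptions, and the `string_tool` helper are assumptions.

```python
def string_tool(name: str, description: str, params: dict[str, str]) -> dict:
    """Build an OpenAI-style function tool whose parameters are required strings."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {
                    key: {"type": "string", "description": desc}
                    for key, desc in params.items()
                },
                "required": list(params),
            },
        },
    }


tools = [
    # Assumed signature: Read is not documented in this README.
    string_tool("Read", "Read a file from the project.",
                {"file_path": "Path of the file to read"}),
    # Parameter names taken from the Write example in the tool-call format.
    string_tool("Write", "Create or overwrite a file.",
                {"file_path": "Path of the file to write",
                 "content": "Full file contents"}),
]
```

The remaining tools (`Edit`, `Glob`, `Grep`, `Bash`, `Render`, `ToolSearch`) would be declared the same way once their parameters are known.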
### With transformers
```python
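from transformers import AutoModelForImageTextToText, AutoProcessor

# Load through the Auto class so the architecture is taken from the checkpoint
# config (`Qwen3_5ForConditionalGeneration`) instead of being hard-coded.
model = AutoModelForImageTextToText.from_pretrained(
    "Scrymore/stone-preview-4b",
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")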
```
> **Note:** Requires `transformers >= 5.3` for `qwen3_5` model type support.
## Training Details
### Data
Each training record is a hydrated agent trace from a Claude-driven screen-building loop over 87 iOS apps (Mobbin corpus). Three SFT variants are derived from each trace:

- **trajectory**: full multi-turn agent loop (system → user (reference image) → assistant tool calls → tool results → ...)
- **oneshot**: reference image → final TSX code in one turn
- **turn**: predict the next assistant message given the history up to turn k

Final dataset (v3): 1,232 train / 66 val records, filtered to ≤23k tokens. Mean length is 16.1k tokens, with a mean of 4.3 render iterations per trace.
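
As an illustration of the **turn** variant, one trajectory can be split into one sample per assistant message. This is a sketch of the idea only, not the actual data pipeline; `turn_samples` and the `history`/`target` field names are hypothetical.

```python
def turn_samples(trajectory: list[dict]) -> list[dict]:
    """Split a trajectory into next-assistant-message prediction samples."""
    samples = []
    for k, message in enumerate(trajectory):
        if message["role"] == "assistant":
            # Everything before turn k is context; turn k is the target.
            samples.append({"history": trajectory[:k], "target": message})
    return samples
```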
### Key Design Decisions
- **Inline XML tool calls**: ms-swift's encoder reads only `message['content']` and silently drops the `tool_calls` field. Tool calls are therefore rendered as XML directly in the content string so they land in the loss-bearing token region.
- **Visual feedback loop**: `Render` tool results include the rendered screenshot as an image, so the model learns to compare its output against the reference and iterate.
- **Loss scale weighting**: post-Render assistant turns are weighted 2.5x to emphasize iteration behavior.
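
The first and third decisions can be sketched as a small preprocessing step. This is a hedged illustration, not the project's preprocessing code or ms-swift's loss-scale mechanism: the helper names, the message dict layout, and the `tool_name` key are hypothetical, and "post-Render" is interpreted here as "the assistant turn directly after a `Render` result".

```python
POST_RENDER_WEIGHT = 2.5  # weighting factor stated above


def render_tool_call(name: str, arguments: dict[str, str]) -> str:
    """Serialize one tool call into the inline XML format."""
    params = "".join(
        f"<parameter={key}>\n{value}\n</parameter>\n"
        for key, value in arguments.items()
    )
    return f"<tool_call>\n<function={name}>\n{params}</function>\n</tool_call>"


def inline_tool_calls(message: dict) -> dict:
    """Fold a message's `tool_calls` field into its `content` string."""
    xml = "\n".join(
        render_tool_call(call["name"], call["arguments"])
        for call in message.get("tool_calls", [])
    )
    parts = [part for part in (message.get("content") or "", xml) if part]
    return {"role": message["role"], "content": "\n".join(parts)}


def turn_loss_weights(messages: list[dict]) -> list[float]:
    """Per-message loss weight: 0 for non-assistant turns, 2.5 for assistant
    turns directly after a Render result, 1.0 for other assistant turns."""
    weights = []
    for i, message in enumerate(messages):
        if message["role"] != "assistant":
            weights.append(0.0)
        elif i > 0 and messages[i - 1].get("tool_name") == "Render":
            weights.append(POST_RENDER_WEIGHT)
        else:
            weights.append(1.0)
    return weights
```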
### Config
| | |
|---|---|
| **Framework** | ms-swift 4.2.1 |
| **LoRA** | r=32, alpha=64, all-linear targets |
| **Precision** | bfloat16 |
| **Attention** | FlashAttention 2 |
| **Max context** | 32,768 tokens |
| **Max pixels** | 602,112 (576 tokens/image) |
| **LR** | 5e-5, cosine schedule, 5% warmup |
| **Effective batch** | 8 (BS=1 x GA=2 x 4 GPUs) |
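
For reference, the learning-rate row corresponds to a schedule of roughly the following shape. This is a sketch under assumptions the table does not state: linear warmup from zero, cosine decay to zero, and warmup length computed as 5% of total steps. `lr_at` is a hypothetical helper, not an ms-swift function.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 5e-5,
          warmup_ratio: float = 0.05) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```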
## GGUF Quantizations
| File | Size | Description |
|------|------|-------------|
| `stone-preview-4b-f16.gguf` | 8.4 GB | F16 (unquantized, half precision) |
| `stone-preview-4b-q4_k_m.gguf` | 2.7 GB | Q4_K_M quantized (5.13 BPW) |
| `stone-preview-4b-mmproj-f16.gguf` | 25.6 MB | Vision projector |
## Limitations
- Trained on iOS app screenshots only (87 apps from Mobbin). Android, web, and desktop UIs are untested.
- The app corpus skews toward consumer apps (social, fintech, health, travel). Enterprise/B2B UIs may produce lower-quality results.
- Works best when embedded in an agent loop with actual tool execution and visual feedback. Standalone generation without iterative rendering produces weaker results.
- Requires `transformers >= 5.3` for `qwen3_5` model type support.
- When serving with vLLM, the LoRA must be **merged** into the base weights first (as the published safetensors already are), because vLLM's LoRA loader silently drops vision-tower deltas.
## Citation
```bibtex
@misc{stone-preview-4b-2026,
title={Stone Preview 4B: A Multimodal Agent for Mobile UI Screen Building},
author={Ejiro Pinnock},
year={2026},
url={https://huggingface.co/Scrymore/stone-preview-4b}
}
```