---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
tags:
  - vision
  - vlm
  - agent
  - mobile-ui
  - react-native
  - tool-use
  - qwen3.5
  - lora
  - fine-tuned
pipeline_tag: image-text-to-text
language:
  - en
---

# Stone Preview 4B — Multimodal UI Agent

A fine-tuned vision-language model that acts as a **React Native UI engineer**. Given a reference mobile app screenshot, it builds the matching screen by emitting tool calls (`Read`, `Write`, `Edit`, `Glob`, `Grep`, `Bash`, `Render`, `ToolSearch`) and iterating on visual feedback until the render matches the reference.

<p align="center">
  <img src="assets/showcase.png" alt="Sample outputs from Stone Preview 4B" width="800">
  <br>
  <em>Screens built by the model from reference screenshots — Nike, How We Feel, Vivid</em>
</p>

## Model Details

| | |
|---|---|
| **Base model** | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
| **Architecture** | `Qwen3_5ForConditionalGeneration` — 32 layers, 2560 hidden, 16 attention heads |
| **Parameters** | ~4B (bfloat16) |
| **Fine-tuning** | LoRA (r=32, alpha=64, all-linear targets) via ms-swift 4.2.1 |
| **Training data** | 1,232 agent traces (v3), each a multi-turn tool-calling loop with visual feedback |
| **Context** | 32,768 tokens max |
| **Hardware** | 4x H200 SXM |
| **Format** | Merged safetensors (LoRA folded into base weights) |

## What the Model Does

The model has learned two core behaviors:

1. **Visual reasoning** — analyze a reference screenshot to identify layout, UI components, colors, spacing, and hierarchy
2. **Tool-call emission** — given the conversation history, emit properly-formatted XML tool calls to read files, write code, render the result, and iterate

Each training example is a hydrated agent trace: the model sees a reference screenshot, explores the project structure, writes React Native/Expo code, renders it, compares against the reference, and iterates until the output matches.

### Tool Call Format

The model emits tool calls as inline XML in its responses:

```xml
<tool_call>
<function=Write>
<parameter=file_path>
app/(flows)/flow-001/screen-001.tsx
</parameter>
<parameter=content>
import React from 'react';
import { View, Text, StyleSheet } from 'react-native';
// ... component code
</parameter>
</function>
</tool_call>
```

Available tools: `Read`, `Write`, `Edit`, `Glob`, `Grep`, `Bash`, `Render`, `ToolSearch`
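
If you run the model without a server-side tool-call parser, the inline XML has to be extracted from the response text yourself. The following is a minimal sketch, assuming the layout shown in the example above is stable across outputs; `parse_tool_calls` and the two regular expressions are illustrative names, not part of any released tooling.

```python
import re

# Assumed layout: <tool_call><function=NAME><parameter=KEY>VALUE</parameter>...</function></tool_call>
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# One optional newline is trimmed on each side of the parameter value.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name": ..., "arguments": {...}} dicts found in text."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = {key: value for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


sample = """<tool_call>
<function=Write>
<parameter=file_path>
app/(flows)/flow-001/screen-001.tsx
</parameter>
<parameter=content>
import React from 'react';
</parameter>
</function>
</tool_call>"""

calls = parse_tool_calls(sample)
print(calls[0]["name"], calls[0]["arguments"]["file_path"])
```

A regex parser like this breaks if a parameter value itself contains the literal string `</parameter>`, so treat it as a starting point rather than a robust implementation.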

## Usage

### With vLLM (recommended)

```bash
vllm serve Scrymore/stone-preview-4b \
  --dtype bfloat16 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml
```

```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Encode reference screenshot
with open("reference.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Scrymore/stone-preview-4b",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a React Native (Expo) UI engineer. Given a reference "
                "mobile-app screenshot, you build the matching screen by editing "
                "the project at WORKDIR. You have Read, Write, Edit, Glob, Grep, "
                "and Bash tools. Iterate by rendering and visually comparing to "
                "the reference, then stop when the render matches."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                {"type": "text", "text": (
                    "Reference screenshot above.\n"
                    "WORKDIR: /workspace/my-app\n"
                    "Target file: app/screen.tsx\n\n"
                    "Build the screen that matches the reference."
                )},
            ],
        },
    ],
    tools=[...],  # your tool schemas
)
```
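
The `tools` argument above is left for you to fill in. As a rough sketch of what it could look like in the OpenAI function-calling format: the `Write` parameters (`file_path`, `content`) are taken from the tool-call example in this card, while the descriptions and the parameters for `Read` and `Render` are assumptions made for illustration and may not match the schemas used in training.

```python
def make_tool(name, description, properties, required):
    """Build one OpenAI-style function tool schema."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


tools = [
    # Parameter names for Write come from the tool-call example in this card.
    make_tool(
        "Write",
        "Create or overwrite a file in the project.",
        {"file_path": {"type": "string"}, "content": {"type": "string"}},
        ["file_path", "content"],
    ),
    # Hypothetical: assumes Read takes the same file_path parameter as Write.
    make_tool(
        "Read",
        "Read a file from the project.",
        {"file_path": {"type": "string"}},
        ["file_path"],
    ),
    # Hypothetical: assumes Render takes no arguments and returns a screenshot.
    make_tool("Render", "Render the current screen.", {}, []),
]
```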

### With transformers

```python
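from transformers import AutoModelForImageTextToText, AutoProcessor

# The Auto class resolves to the architecture recorded in the checkpoint
# config (`Qwen3_5ForConditionalGeneration`).
model = AutoModelForImageTextToText.from_pretrained(
    "Scrymore/stone-preview-4b",
    torch_dtype="bfloat16",
    device_map="auto",
)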
processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
```

> **Note:** Requires `transformers >= 5.3` for `qwen3_5` model type support.

## Training Details

### Data

Each training record is a hydrated agent trace from a Claude-driven screen-building loop over 87 iOS apps (Mobbin corpus). Each trace yields three SFT variants:

- **trajectory** — full multi-turn agent loop (system → user(ref-image) → assistant tool calls → tool results → ...)
- **oneshot** — reference image → final TSX code in one turn
- **turn** — predict the next assistant message given history up to turn k
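
The three variants can be sketched as a single expansion over one trace. This is an illustration only: the record layout, the `variant` field, and `build_variants` are hypothetical, and the card does not say whether the oneshot variant keeps the system prompt or whether a turn record is emitted for every k or only a sample. The sketch keeps the first two messages for oneshot and emits one turn record per assistant message.

```python
def build_variants(trace, final_code):
    """Expand one trace (a list of chat messages) into the three SFT record types."""
    # trajectory: the full multi-turn loop, unchanged.
    records = [{"variant": "trajectory", "messages": list(trace)}]

    # oneshot: assumed to keep the system + first user message, then the final TSX.
    prompt = list(trace[:2])
    records.append({
        "variant": "oneshot",
        "messages": prompt + [{"role": "assistant", "content": final_code}],
    })

    # turn: history up to and including assistant turn k; the last message is the target.
    for k, message in enumerate(trace):
        if message["role"] == "assistant":
            records.append({"variant": "turn", "messages": list(trace[: k + 1])})
    return records


trace = [
    {"role": "system", "content": "You are a React Native (Expo) UI engineer."},
    {"role": "user", "content": "Reference screenshot above."},
    {"role": "assistant", "content": "<tool_call>...</tool_call>"},
    {"role": "tool", "content": "ok"},
    {"role": "assistant", "content": "The render matches the reference."},
]
records = build_variants(trace, "export default function Screen() { return null; }")
print([r["variant"] for r in records])
```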

Final dataset (v3): 1,232 train / 66 val records, filtered to ≤23k tokens. Records average 16.1k tokens and 4.3 render iterations per trace.

### Key Design Decisions

- **Inline XML tool calls** — ms-swift's encoder reads `message['content']` only, silently dropping the `tool_calls` field. Tool calls are rendered as XML directly in the content string so they land in the loss-bearing token region.
- **Visual feedback loop** — `Render` tool results include the rendered screenshot as an image, so the model learns to compare its output against the reference and iterate.
- **Loss scale weighting** — post-Render assistant turns weighted 2.5x to emphasize iteration behavior.
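
The first and third decisions can be illustrated with a small preprocessing sketch. It is not ms-swift's API or the actual training pipeline: the simplified `tool_calls` shape (`name`, `arguments`), the `name` field on tool messages, and all function names are assumptions. "Post-Render" is read here as an assistant turn that directly follows a `Render` tool result.

```python
def render_tool_call(name, arguments):
    """Serialize one tool call into the inline XML layout used in this card."""
    lines = ["<tool_call>", f"<function={name}>"]
    for key, value in arguments.items():
        lines += [f"<parameter={key}>", str(value), "</parameter>"]
    lines += ["</function>", "</tool_call>"]
    return "\n".join(lines)


def flatten_message(message):
    """Fold a message's tool_calls into its content so they stay in the loss region."""
    parts = [message.get("content") or ""]
    for call in message.get("tool_calls", []):
        parts.append(render_tool_call(call["name"], call["arguments"]))
    return {"role": message["role"], "content": "\n".join(p for p in parts if p)}


def loss_weights(messages, render_weight=2.5):
    """Per-message loss weight: 0 for non-assistant turns, 2.5 right after a Render result."""
    weights = []
    for i, message in enumerate(messages):
        if message["role"] != "assistant":
            weights.append(0.0)
            continue
        prev = messages[i - 1] if i > 0 else None
        after_render = (
            prev is not None
            and prev["role"] == "tool"
            and prev.get("name") == "Render"  # hypothetical field naming the tool
        )
        weights.append(render_weight if after_render else 1.0)
    return weights


flat = flatten_message({
    "role": "assistant",
    "content": "Writing the screen.",
    "tool_calls": [{"name": "Write", "arguments": {"file_path": "app/screen.tsx"}}],
})
weights = loss_weights([
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."},
    {"role": "tool", "name": "Render", "content": "..."},
    {"role": "assistant", "content": "..."},
])
print(weights)
```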

### Config

| | |
|---|---|
| **Framework** | ms-swift 4.2.1 |
| **LoRA** | r=32, alpha=64, all-linear targets |
| **Precision** | bfloat16 |
| **Attention** | FlashAttention 2 |
| **Max context** | 32,768 tokens |
| **Max pixels** | 602,112 (576 tokens/image) |
| **LR** | 5e-5, cosine schedule, 5% warmup |
| **Effective batch** | 8 (BS=1 x GA=2 x 4 GPUs) |
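
For reference, the batch arithmetic and the learning-rate schedule from the table can be written out as below. This is a sketch under assumptions: warmup is taken to be linear from zero and the cosine decay is taken to end at zero, which the card does not state. The total step count is also not given, so `1000` is a placeholder.

```python
import math


def lr_at(step, total_steps, peak_lr=5e-5, warmup_ratio=0.05):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Effective batch = per-device batch size x gradient accumulation x GPU count.
effective_batch = 1 * 2 * 4

total_steps = 1000  # placeholder, not from the card
for step in (0, 25, 50, 525, 1000):
    print(step, lr_at(step, total_steps))
```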

## GGUF Quantizations

| File | Size | Description |
|------|------|-------------|
| `stone-preview-4b-f16.gguf` | 8.4 GB | F16 (half precision, unquantized) |
| `stone-preview-4b-q4_k_m.gguf` | 2.7 GB | Q4_K_M quantized (5.13 BPW) |
| `stone-preview-4b-mmproj-f16.gguf` | 25.6 MB | Vision projector |

## Limitations

- Trained on iOS app screenshots only (87 apps from Mobbin). Android, web, and desktop UIs are untested.
- The app corpus skews toward consumer apps (social, fintech, health, travel). Enterprise/B2B UIs may produce lower-quality results.
- Works best when embedded in an agent loop with actual tool execution and visual feedback. Standalone generation without iterative rendering produces weaker results.
- Requires `transformers >= 5.3` for `qwen3_5` model type support.
- When serving with vLLM, use weights with the LoRA **merged** into the base model rather than loading a separate adapter, because vLLM's LoRA loader silently drops vision-tower deltas. The published checkpoint is already merged.
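
The agent loop mentioned above can be sketched as follows. This is a minimal outline, not the harness used to produce the training traces: `run_agent`, `generate`, and `execute_tool` are hypothetical names, and the message shapes are simplified. In a real setup `generate` would call the served model, and the `Render` result would carry the rendered screenshot as an image rather than a string.

```python
def run_agent(generate, execute_tool, messages, max_steps=20):
    """Alternate model calls and tool execution until the model stops calling tools."""
    messages = list(messages)
    for _ in range(max_steps):
        # generate() is assumed to return {"content": str, "tool_calls": [...]}.
        reply = generate(messages)
        messages.append({"role": "assistant", **reply})
        if not reply.get("tool_calls"):
            break  # no tool call means the model considers the screen done
        for call in reply["tool_calls"]:
            result = execute_tool(call["name"], call["arguments"])
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": result}
            )
    return messages


# Stubs standing in for the model and the tool runtime.
def fake_generate(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"content": "The render matches the reference.", "tool_calls": []}
    return {
        "content": "",
        "tool_calls": [{"id": "call-1", "name": "Render", "arguments": {}}],
    }


def fake_execute(name, arguments):
    return f"{name} executed"


history = run_agent(
    fake_generate, fake_execute, [{"role": "user", "content": "Build the screen."}]
)
print([m["role"] for m in history])
```

The `max_steps` cap is a design choice to keep a loop that never converges from running indefinitely.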

## Citation

```bibtex
@misc{stone-preview-4b-2026,
  title={Stone Preview 4B: A Multimodal Agent for Mobile UI Screen Building},
  author={Ejiro Pinnock},
  year={2026},
  url={https://huggingface.co/Scrymore/stone-preview-4b}
}
```