Image-Text-to-Text
Transformers
Safetensors
GGUF
English
qwen3_5
vision
vlm
agent
mobile-ui
react-native
tool-use
qwen3.5
lora
fine-tuned
conversational
Instructions to use Scrymore/stone-preview-4b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Scrymore/stone-preview-4b with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Scrymore/stone-preview-4b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
model = AutoModelForImageTextToText.from_pretrained("Scrymore/stone-preview-4b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- llama-cpp-python
How to use Scrymore/stone-preview-4b with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Scrymore/stone-preview-4b",
    filename="stone-preview-4b-f16.gguf",
)
llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use Scrymore/stone-preview-4b with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Scrymore/stone-preview-4b:F16

# Run inference directly in the terminal:
llama-cli -hf Scrymore/stone-preview-4b:F16
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Scrymore/stone-preview-4b:F16

# Run inference directly in the terminal:
llama-cli -hf Scrymore/stone-preview-4b:F16
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Scrymore/stone-preview-4b:F16

# Run inference directly in the terminal:
./llama-cli -hf Scrymore/stone-preview-4b:F16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Scrymore/stone-preview-4b:F16

# Run inference directly in the terminal:
./build/bin/llama-cli -hf Scrymore/stone-preview-4b:F16
Use Docker
docker model run hf.co/Scrymore/stone-preview-4b:F16
- LM Studio
- Jan
- vLLM
How to use Scrymore/stone-preview-4b with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Scrymore/stone-preview-4b"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Scrymore/stone-preview-4b",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
Use Docker
docker model run hf.co/Scrymore/stone-preview-4b:F16
- SGLang
How to use Scrymore/stone-preview-4b with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Scrymore/stone-preview-4b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Scrymore/stone-preview-4b",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Scrymore/stone-preview-4b" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Scrymore/stone-preview-4b",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
- Ollama
How to use Scrymore/stone-preview-4b with Ollama:
ollama run hf.co/Scrymore/stone-preview-4b:F16
- Unsloth Studio
How to use Scrymore/stone-preview-4b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for Scrymore/stone-preview-4b to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for Scrymore/stone-preview-4b to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Scrymore/stone-preview-4b to start chatting
- Pi
How to use Scrymore/stone-preview-4b with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf Scrymore/stone-preview-4b:F16
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "Scrymore/stone-preview-4b:F16" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use Scrymore/stone-preview-4b with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf Scrymore/stone-preview-4b:F16
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Scrymore/stone-preview-4b:F16
Run Hermes
hermes
- Docker Model Runner
How to use Scrymore/stone-preview-4b with Docker Model Runner:
docker model run hf.co/Scrymore/stone-preview-4b:F16
- Lemonade
How to use Scrymore/stone-preview-4b with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull Scrymore/stone-preview-4b:F16
Run and chat with the model
lemonade run user.stone-preview-4b-F16
List all available models
lemonade list
Upload README.md with huggingface_hub
README.md
CHANGED
@@ -5,21 +5,21 @@ base_model: Qwen/Qwen3.5-4B
 tags:
 - vision
 - vlm
 - mobile-ui
--
 - qwen3.5
 - lora
 - fine-tuned
 pipeline_tag: image-text-to-text
 language:
 - en
-datasets:
-- Scrymore/scry-stage1-v2-data
 ---

-# Stone Preview 4B -

-A fine-tuned vision-language model that

 ## Model Details

@@ -28,50 +28,97 @@ A fine-tuned vision-language model that generates structured JSON descriptions o
 | **Base model** | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
 | **Architecture** | `Qwen3_5ForConditionalGeneration`: 32 layers, 2560 hidden, 16 attention heads |
 | **Parameters** | ~4B (bfloat16) |
-| **Fine-tuning** | LoRA (r=32, alpha=64, all-linear targets) |
-| **Training data** |
-| **
-| **
 | **Format** | Merged safetensors (LoRA folded into base weights) |

-##

-

-

-

-

-

-
-
-
-
-
-
-
-
-
-
-
-
-
-  "content_density": "sparse",
-  "design_patterns": ["progressive-disclosure", "floating-action-button"],
-  "illustration_style": "none",
-  "emotional_tone": "neutral",
-  "keyboard_visible": false
-}
 ```

 ## Usage

 ```python
 from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
-from PIL import Image

 model = Qwen3VLForConditionalGeneration.from_pretrained(
     "Scrymore/stone-preview-4b",
@@ -79,85 +126,62 @@ model = Qwen3VLForConditionalGeneration.from_pretrained(
     device_map="auto",
 )
 processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
-
-image = Image.open("screenshot.png")
-
-messages = [
-    {
-        "role": "user",
-        "content": [
-            {"type": "image", "image": image},
-            {"type": "text", "text": (
-                "Analyze this mobile app screenshot. Provide a structured JSON description including: "
-                "a natural language description, tags, screen type and subtype, visible text, "
-                "UI components, layout pattern, color scheme, primary action, content density, "
-                "design patterns, illustration style, emotional tone, and keyboard visibility."
-            )},
-        ],
-    }
-]
-
-text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
-
-output_ids = model.generate(**inputs, max_new_tokens=1024, enable_thinking=False)
-response = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
-print(response)
 ```

-> **
-
-### vLLM (recommended for batch inference)
-
-```python
-from vllm import LLM, SamplingParams

-
-# ~1.1s per image on RTX 3090
-```

-##

-

-

-

-
-|------|-------|-----|----------|
-| 1st | **Stone Preview 4B (this model)** | **1042** | **58.9%** |
-| 2nd | GPT-5-mini (teacher) | 1019 | 54.2% |
-| 3rd | Qwen3-VL-32B | 939 | 37.1% |

-

-###

-|
-|---
-|
-|

 ## Limitations

-- Trained on iOS app screenshots only
-- The
 - Requires `transformers >= 5.3` for `qwen3_5` model type support.
-
-## Training Details
-
-- **Framework**: Unsloth + transformers
-- **LoRA config**: r=32, alpha=64, all-linear targets, no bias
-- **Data**: 2,500 screenshot-description pairs generated by GPT-5-mini from 87 iOS apps, then filtered to 2,375 train / 125 val
-- **Hardware**: 2x NVIDIA RTX 3090
-- **Epochs**: ~3 (148 steps × effective batch 16 ÷ 2,375 samples)

 ## Citation

 ```bibtex
-@misc{
-    title={Stone Preview 4B:
     author={Ejiro Pinnock},
     year={2026},
     url={https://huggingface.co/Scrymore/stone-preview-4b}
@@ -5,21 +5,21 @@ base_model: Qwen/Qwen3.5-4B
 tags:
 - vision
 - vlm
+- agent
 - mobile-ui
+- react-native
+- tool-use
 - qwen3.5
 - lora
 - fine-tuned
 pipeline_tag: image-text-to-text
 language:
 - en
 ---

+# Stone Preview 4B: Multimodal UI Agent

+A fine-tuned vision-language model that acts as a **React Native UI engineer**. Given a reference mobile app screenshot, it builds the matching screen by emitting tool calls (`Read`, `Write`, `Edit`, `Glob`, `Bash`, `Render`) and iterating on visual feedback until the render matches the reference.

 ## Model Details

@@ -28,50 +28,97 @@ A fine-tuned vision-language model that generates structured JSON descriptions o
 | **Base model** | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
 | **Architecture** | `Qwen3_5ForConditionalGeneration`: 32 layers, 2560 hidden, 16 attention heads |
 | **Parameters** | ~4B (bfloat16) |
+| **Fine-tuning** | LoRA (r=32, alpha=64, all-linear targets) via ms-swift 4.2.1 |
+| **Training data** | 1,232 agent traces (v3), each a multi-turn tool-calling loop with visual feedback |
+| **Context** | 32,768 tokens max |
+| **Hardware** | 4x H200 SXM |
 | **Format** | Merged safetensors (LoRA folded into base weights) |

+## What the Model Does

+The model has learned two core behaviors:

+1. **Visual reasoning**: analyze a reference screenshot to identify layout, UI components, colors, spacing, and hierarchy
+2. **Tool-call emission**: given the conversation history, emit properly-formatted XML tool calls to read files, write code, render the result, and iterate

+Each training example is a hydrated agent trace: the model sees a reference screenshot, explores the project structure, writes React Native/Expo code, renders it, compares against the reference, and iterates until the output matches.

+### Tool Call Format

+The model emits tool calls as inline XML in its responses:

+```xml
+<tool_call>
+<function=Write>
+<parameter=file_path>
+app/(flows)/flow-001/screen-001.tsx
+</parameter>
+<parameter=content>
+import React from 'react';
+import { View, Text, StyleSheet } from 'react-native';
+// ... component code
+</parameter>
+</function>
+</tool_call>
 ```

+Available tools: `Read`, `Write`, `Edit`, `Glob`, `Grep`, `Bash`, `Render`, `ToolSearch`
+
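When the model is called without a server-side tool-call parser, the inline XML above has to be extracted from the raw response text. The following is a minimal sketch of such an extractor, assuming the layout shown in the example is stable; `parse_tool_calls` and the regular expressions are illustrative names, not part of the model's tooling, and parameter values that themselves contain a literal `</parameter>` are not handled.

```python
import re

# Patterns for the inline XML layout shown above (assumed, not an official grammar).
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name": ..., "arguments": {...}} dicts found in text."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in FUNCTION_RE.findall(block):
            # Each parameter value is the text between its open and close tags,
            # with one leading and one trailing newline stripped.
            arguments = {key: value for key, value in PARAM_RE.findall(body)}
            calls.append({"name": name, "arguments": arguments})
    return calls


sample = """<tool_call>
<function=Write>
<parameter=file_path>
app/(flows)/flow-001/screen-001.tsx
</parameter>
<parameter=content>
import React from 'react';
</parameter>
</function>
</tool_call>"""

calls = parse_tool_calls(sample)
print(calls[0]["name"], calls[0]["arguments"]["file_path"])
```

An agent loop would dispatch each parsed call to its own tool implementation and append the result to the conversation as the next message.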
 ## Usage

+### With vLLM (recommended)
+
+```bash
+vllm serve Scrymore/stone-preview-4b \
+  --dtype bfloat16 \
+  --enable-auto-tool-choice \
+  --tool-call-parser qwen3_xml
+```
+
+```python
+from openai import OpenAI
+import base64
+
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
+
+# Encode reference screenshot
+with open("reference.png", "rb") as f:
+    img_b64 = base64.b64encode(f.read()).decode()
+
+response = client.chat.completions.create(
+    model="Scrymore/stone-preview-4b",
+    messages=[
+        {
+            "role": "system",
+            "content": (
+                "You are a React Native (Expo) UI engineer. Given a reference "
+                "mobile-app screenshot, you build the matching screen by editing "
+                "the project at WORKDIR. You have Read, Write, Edit, Glob, Grep, "
+                "and Bash tools. Iterate by rendering and visually comparing to "
+                "the reference, then stop when the render matches."
+            ),
+        },
+        {
+            "role": "user",
+            "content": [
+                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
+                {"type": "text", "text": (
+                    "Reference screenshot above.\n"
+                    "WORKDIR: /workspace/my-app\n"
+                    "Target file: app/screen.tsx\n\n"
+                    "Build the screen that matches the reference."
+                )},
+            ],
+        },
+    ],
+    tools=[...],  # your tool schemas
+)
+```
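The `tools=[...]` placeholder above expects OpenAI-style function schemas. The sketch below shows one way to build them; the helper `make_tool`, the descriptions, and every parameter name other than `file_path` and `content` (which appear in the tool-call example) are assumptions for illustration, since the README does not publish the schemas used in training.

```python
def make_tool(name, description, properties, required):
    """Build one OpenAI-style function tool schema."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def string_param():
    """JSON Schema fragment for a plain string argument."""
    return {"type": "string"}


# Hypothetical schemas: only the Write arguments are taken from the README example.
tools = [
    make_tool("Read", "Read a file from the project.",
              {"file_path": string_param()}, ["file_path"]),
    make_tool("Write", "Create or overwrite a file.",
              {"file_path": string_param(), "content": string_param()},
              ["file_path", "content"]),
    make_tool("Render", "Render the target screen and return a screenshot.",
              {}, []),
]

print([tool["function"]["name"] for tool in tools])
```

The resulting list can be passed as the `tools` argument of `client.chat.completions.create`; the remaining tools (`Edit`, `Glob`, `Grep`, `Bash`, `ToolSearch`) would be declared the same way.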
+
+### With transformers
+
 ```python
 from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

 model = Qwen3VLForConditionalGeneration.from_pretrained(
     "Scrymore/stone-preview-4b",
@@ -79,85 +126,62 @@ model = Qwen3VLForConditionalGeneration.from_pretrained(
     device_map="auto",
 )
 processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
 ```

+> **Note:** Requires `transformers >= 5.3` for `qwen3_5` model type support.

+## Training Details

+### Data

+Each training record is a hydrated agent trace from a Claude-driven screen-building loop over 87 iOS apps (Mobbin corpus). Three SFT variants per trace:

+- **trajectory**: full multi-turn agent loop (system → user(ref-image) → assistant tool calls → tool results → ...)
+- **oneshot**: reference image → final TSX code in one turn
+- **turn**: predict the next assistant message given history up to turn k

+Final dataset (v3): 1,232 train / 66 val records, filtered to ≤23k tokens. Mean 16.1k tokens, mean 4.3 render iterations per trace.

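The three variants can be pictured as different slices of the same message list. The sketch below is a guess at the shape of that derivation, not the dataset-building code: the function names, the record layout, and the example trace are all hypothetical.

```python
def trajectory_example(messages):
    """trajectory: the full multi-turn loop, kept as one training record."""
    return {"messages": list(messages)}


def oneshot_example(reference_message, final_tsx):
    """oneshot: reference image in, final TSX code out, in a single turn."""
    return {"messages": [reference_message,
                         {"role": "assistant", "content": final_tsx}]}


def turn_examples(messages):
    """turn: one record per assistant message, with the history before it."""
    examples = []
    for k, message in enumerate(messages):
        if message["role"] == "assistant":
            examples.append({"history": messages[:k], "target": message})
    return examples


# Hypothetical five-message trace; image content is abbreviated to a string.
trace = [
    {"role": "system", "content": "You are a React Native (Expo) UI engineer."},
    {"role": "user", "content": "<reference screenshot>"},
    {"role": "assistant", "content": "<tool_call>...Write...</tool_call>"},
    {"role": "tool", "content": "ok"},
    {"role": "assistant", "content": "<tool_call>...Render...</tool_call>"},
]

turns = turn_examples(trace)
print(len(turns), [len(t["history"]) for t in turns])
```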
+### Key Design Decisions

+- **Inline XML tool calls**: ms-swift's encoder reads `message['content']` only, silently dropping the `tool_calls` field. Tool calls are rendered as XML directly in the content string so they land in the loss-bearing token region.
+- **Visual feedback loop**: `Render` tool results include the rendered screenshot as an image, so the model learns to compare its output against the reference and iterate.
+- **Loss scale weighting**: post-Render assistant turns weighted 2.5x to emphasize iteration behavior.

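Two of these decisions are easy to illustrate in isolation. The sketch below renders a structured tool call into the inline XML layout and assigns a per-turn loss weight; it is not the ms-swift configuration used for training. It assumes that "post-Render" means the assistant turn immediately following a `Render` tool result, and the `name` key on tool messages is a hypothetical field.

```python
def render_tool_call(name, arguments):
    """Serialize one tool call into the inline XML layout used in the README."""
    lines = ["<tool_call>", f"<function={name}>"]
    for key, value in arguments.items():
        lines += [f"<parameter={key}>", str(value), "</parameter>"]
    lines += ["</function>", "</tool_call>"]
    return "\n".join(lines)


def assistant_loss_weights(messages, post_render_weight=2.5):
    """Return one weight per assistant message, in order of appearance."""
    weights = []
    previous = None
    for message in messages:
        if message["role"] == "assistant":
            # Assumption: only the turn right after a Render result is up-weighted.
            after_render = (
                previous is not None
                and previous["role"] == "tool"
                and previous.get("name") == "Render"
            )
            weights.append(post_render_weight if after_render else 1.0)
        previous = message
    return weights


trace = [
    {"role": "user", "content": "Build the screen."},
    {"role": "assistant", "content": render_tool_call(
        "Write", {"file_path": "app/screen.tsx", "content": "// ..."})},
    {"role": "tool", "name": "Write", "content": "ok"},
    {"role": "assistant", "content": render_tool_call("Render", {})},
    {"role": "tool", "name": "Render", "content": "<rendered screenshot>"},
    {"role": "assistant", "content": "The header is too tall; editing."},
]

print(assistant_loss_weights(trace))
```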
+### Config

+| | |
+|---|---|
+| **Framework** | ms-swift 4.2.1 |
+| **LoRA** | r=32, alpha=64, all-linear targets |
+| **Precision** | bfloat16 |
+| **Attention** | FlashAttention 2 |
+| **Max context** | 32,768 tokens |
+| **Max pixels** | 602,112 (576 tokens/image) |
+| **LR** | 5e-5, cosine schedule, 5% warmup |
+| **Effective batch** | 8 (BS=1 x GA=2 x 4 GPUs) |
+
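The effective batch size in the table is the product of the per-device batch size, the gradient accumulation steps, and the GPU count. As a quick check, the snippet below recomputes it and derives the optimizer steps per epoch from the 1,232 training records; the variable names are illustrative, and the number of epochs is not stated in the README, so no total step count is derived.

```python
import math

per_device_batch = 1   # BS
grad_accum_steps = 2   # GA
num_gpus = 4
train_records = 1232   # v3 training split

effective_batch = per_device_batch * grad_accum_steps * num_gpus
steps_per_epoch = math.ceil(train_records / effective_batch)

print(effective_batch, steps_per_epoch)
```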
+## GGUF Quantizations
+
+| File | Size | Description |
+|------|------|-------------|
+| `stone-preview-4b-f16.gguf` | 8.4 GB | F16 full precision |
+| `stone-preview-4b-q4_k_m.gguf` | 2.7 GB | Q4_K_M quantized (5.13 BPW) |
+| `stone-preview-4b-mmproj-f16.gguf` | 25.6 MB | Vision projector |

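The file sizes and the bits-per-weight figure can be cross-checked against each other: the weight count is roughly the file size in bits divided by the bits per weight. The sketch below assumes decimal gigabytes (10^9 bytes) and ignores GGUF metadata overhead, so the results are approximate.

```python
def approx_params_billions(size_gb, bits_per_weight):
    """Estimate the weight count (in billions) from file size and bits per weight."""
    size_bits = size_gb * 1e9 * 8
    return size_bits / bits_per_weight / 1e9


f16_params = approx_params_billions(8.4, 16)    # F16 stores 16 bits per weight
q4_params = approx_params_billions(2.7, 5.13)   # BPW figure from the table

print(round(f16_params, 1), round(q4_params, 1))
```

Both estimates land near 4.2 billion weights, which is consistent with the "~4B" parameter count in the model details.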
 ## Limitations

+- Trained on iOS app screenshots only (87 apps from Mobbin). Android, web, and desktop UIs are untested.
+- The app corpus skews toward consumer apps (social, fintech, health, travel). Enterprise/B2B UIs may produce lower-quality results.
+- Works best when embedded in an agent loop with actual tool execution and visual feedback. Standalone generation without iterative rendering produces weaker results.
 - Requires `transformers >= 5.3` for `qwen3_5` model type support.
+- When serving with vLLM, the LoRA must be **merged** into base weights before serving. vLLM's LoRA loader silently drops vision-tower deltas.

 ## Citation

 ```bibtex
+@misc{stone-preview-4b-2026,
+    title={Stone Preview 4B: A Multimodal Agent for Mobile UI Screen Building},
     author={Ejiro Pinnock},
     year={2026},
     url={https://huggingface.co/Scrymore/stone-preview-4b}