Commit d7d17a8 (verified), committed by epinnock
1 Parent(s): 444a9b7

Upload README.md with huggingface_hub

  1. README.md +119 -95
README.md CHANGED
@@ -5,21 +5,21 @@ base_model: Qwen/Qwen3.5-4B
  tags:
  - vision
  - vlm
  - mobile-ui
- - screenshot-understanding
  - qwen3.5
  - lora
  - fine-tuned
  pipeline_tag: image-text-to-text
  language:
  - en
- datasets:
- - Scrymore/scry-stage1-v2-data
  ---

- # Stone Preview 4B: Mobile Screenshot → Structured Description

- A fine-tuned vision-language model that generates structured JSON descriptions of mobile app screenshots. Given a screenshot, it outputs a comprehensive annotation covering layout, components, color scheme, screen type, and more.

  ## Model Details

@@ -28,50 +28,97 @@ A fine-tuned vision-language model that generates structured JSON descriptions o
  | **Base model** | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
  | **Architecture** | `Qwen3_5ForConditionalGeneration`: 32 layers, 2560 hidden, 16 attention heads |
  | **Parameters** | ~4B (bfloat16) |
- | **Fine-tuning** | LoRA (r=32, alpha=64, all-linear targets) |
- | **Training data** | 2,375 GPT-5-mini-generated descriptions, 125 validation |
- | **Training** | 148 steps, batch size 4, gradient accumulation 4, LR=1e-4 |
- | **Context** | 2,048 tokens max |
  | **Format** | Merged safetensors (LoRA folded into base weights) |

- ## Intended Use
38
 
39
- This model is the **description stage** of the [Scry](https://github.com/Scrymore) UI search pipeline. It converts mobile app screenshots into structured metadata that can be indexed for text-based retrieval.
40
 
41
- ### Input
 
42
 
43
- A mobile app screenshot with the instruction prompt:
44
 
45
- > Analyze this mobile app screenshot. Provide a structured JSON description including: a natural language description, tags, screen type and subtype, visible text, UI components, layout pattern, color scheme, primary action, content density, design patterns, illustration style, emotional tone, and keyboard visibility.
46
 
47
- ### Output
48
 
49
- A JSON object with these fields:
50
-
51
- ```json
52
- {
53
- "description": "Natural language description of the screen",
54
- "tags": ["onboarding", "form", "minimal", ...],
55
- "screen_type": "Actions",
56
- "screen_subtype": "Account Setup",
57
- "visible_text": ["Sign up", "Continue", ...],
58
- "ui_components": ["text-input", "button", "progress-bar", ...],
59
- "layout_pattern": "single-column-form",
60
- "color_scheme": {"mode": "light", "primary_color": "blue", "style": "minimal"},
61
- "primary_action": "complete registration",
62
- "content_density": "sparse",
63
- "design_patterns": ["progressive-disclosure", "floating-action-button"],
64
- "illustration_style": "none",
65
- "emotional_tone": "neutral",
66
- "keyboard_visible": false
67
- }
68
  ```
69
 
 
 
  ## Usage

  ```python
  from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
- from PIL import Image

  model = Qwen3VLForConditionalGeneration.from_pretrained(
      "Scrymore/stone-preview-4b",
@@ -79,85 +126,62 @@ model = Qwen3VLForConditionalGeneration.from_pretrained(
      device_map="auto",
  )
  processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
-
- image = Image.open("screenshot.png")
-
- messages = [
-     {
-         "role": "user",
-         "content": [
-             {"type": "image", "image": image},
-             {"type": "text", "text": (
-                 "Analyze this mobile app screenshot. Provide a structured JSON description including: "
-                 "a natural language description, tags, screen type and subtype, visible text, "
-                 "UI components, layout pattern, color scheme, primary action, content density, "
-                 "design patterns, illustration style, emotional tone, and keyboard visibility."
-             )},
-         ],
-     }
- ]
-
- text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
-
- output_ids = model.generate(**inputs, max_new_tokens=1024, enable_thinking=False)
- response = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
- print(response)
  ```

- > **Important:** Use `enable_thinking=False`; this model was trained without thinking mode.
-
- ### vLLM (recommended for batch inference)
-
- ```python
- from vllm import LLM, SamplingParams

- llm = LLM(model="Scrymore/stone-preview-4b", dtype="bfloat16")
- # ~1.1s per image on RTX 3090
- ```

- ## Performance

- ### JSON Parse Rate

- 100% valid JSON on 750 benchmark screenshots (vLLM, RTX 3090).

- ### ELO Ranking (Claude Sonnet 4.6 judge, 900 pairwise comparisons)

- | Rank | Model | ELO | Win Rate |
- |------|-------|-----|----------|
- | 1st | **Stone Preview 4B (this model)** | **1042** | **58.9%** |
- | 2nd | GPT-5-mini (teacher) | 1019 | 54.2% |
- | 3rd | Qwen3-VL-32B | 939 | 37.1% |

- The fine-tuned 4B model **outperforms its teacher** (GPT-5-mini) and a model 8x its size (Qwen3-VL-32B) on description quality as judged by Claude Sonnet 4.6.

- ### Inference Speed

- | Backend | Hardware | Speed |
- |---------|----------|-------|
- | vLLM | RTX 3090 | ~1.1s/image |
- | llama.cpp (Q4_K_M) | EPYC 7551P (CPU) | ~35-40s/image, 10.4 tok/s |

  ## Limitations

- - Trained on iOS app screenshots only; performance on Android, web, or desktop UIs is untested.
- - The 87 apps in the training set skew toward consumer apps (social, fintech, health, travel). Enterprise/B2B UIs may produce lower-quality descriptions.
  - Requires `transformers >= 5.3` for `qwen3_5` model type support.
-
- ## Training Details
-
- - **Framework**: Unsloth + transformers
- - **LoRA config**: r=32, alpha=64, all-linear targets, no bias
- - **Data**: 2,500 screenshot-description pairs generated by GPT-5-mini from 87 iOS apps, then filtered to 2,375 train / 125 val
- - **Hardware**: 2x NVIDIA RTX 3090
- - **Epochs**: ~3 (148 steps × effective batch 16 ÷ 2,375 samples)

  ## Citation

  ```bibtex
- @misc{scry-stage1-5-2026,
-   title={Stone Preview 4B: Fine-tuned Qwen 3.5 4B for Mobile UI Screenshot Description},
    author={Ejiro Pinnock},
    year={2026},
    url={https://huggingface.co/Scrymore/stone-preview-4b}

  tags:
  - vision
  - vlm
+ - agent
  - mobile-ui
+ - react-native
+ - tool-use
  - qwen3.5
  - lora
  - fine-tuned
  pipeline_tag: image-text-to-text
  language:
  - en
  ---

+ # Stone Preview 4B: Multimodal UI Agent

+ A fine-tuned vision-language model that acts as a **React Native UI engineer**. Given a reference mobile app screenshot, it builds the matching screen by emitting tool calls (`Read`, `Write`, `Edit`, `Glob`, `Bash`, `Render`) and iterating on visual feedback until the render matches the reference.

  ## Model Details

 
  | **Base model** | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
  | **Architecture** | `Qwen3_5ForConditionalGeneration`: 32 layers, 2560 hidden, 16 attention heads |
  | **Parameters** | ~4B (bfloat16) |
+ | **Fine-tuning** | LoRA (r=32, alpha=64, all-linear targets) via ms-swift 4.2.1 |
+ | **Training data** | 1,232 agent traces (v3), each a multi-turn tool-calling loop with visual feedback |
+ | **Context** | 32,768 tokens max |
+ | **Hardware** | 4x H200 SXM |
  | **Format** | Merged safetensors (LoRA folded into base weights) |

+ ## What the Model Does

+ The model has learned two core behaviors:

+ 1. **Visual reasoning**: analyze a reference screenshot to identify layout, UI components, colors, spacing, and hierarchy
+ 2. **Tool-call emission**: given the conversation history, emit properly-formatted XML tool calls to read files, write code, render the result, and iterate

+ Each training example is a hydrated agent trace: the model sees a reference screenshot, explores the project structure, writes React Native/Expo code, renders it, compares against the reference, and iterates until the output matches.

+ ### Tool Call Format

+ The model emits tool calls as inline XML in its responses:

+ ```xml
+ <tool_call>
+ <function=Write>
+ <parameter=file_path>
+ app/(flows)/flow-001/screen-001.tsx
+ </parameter>
+ <parameter=content>
+ import React from 'react';
+ import { View, Text, StyleSheet } from 'react-native';
+ // ... component code
+ </parameter>
+ </function>
+ </tool_call>
  ```

+ Available tools: `Read`, `Write`, `Edit`, `Glob`, `Grep`, `Bash`, `Render`, `ToolSearch`
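
A client that runs its own agent loop needs to recover structured calls from this inline XML. The following is a minimal sketch, assuming responses follow the format exactly as in the example above (`<tool_call>`, `<function=NAME>`, `<parameter=KEY>` with the value on its own lines). `parse_tool_calls` is a hypothetical helper name, and the regexes are an approximation rather than an official grammar.

```python
import re

# Patterns for the inline XML format shown in the example (assumed, not an official grammar).
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>", re.DOTALL
)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract {"name": ..., "arguments": {...}} dicts from a model response."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        # Each parameter value is the text between the tags, minus one newline on each side.
        arguments = {key: value for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


example = """I'll create the screen.
<tool_call>
<function=Write>
<parameter=file_path>
app/screen.tsx
</parameter>
<parameter=content>
import React from 'react';
export default function Screen() { return null; }
</parameter>
</function>
</tool_call>"""

calls = parse_tool_calls(example)
print(calls[0]["name"], calls[0]["arguments"]["file_path"])
```

A regex is enough here only because the format is flat; file contents that themselves contain a literal `</parameter>` would need a stricter parser.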
+
  ## Usage

+ ### With vLLM (recommended)
+
+ ```bash
+ vllm serve Scrymore/stone-preview-4b \
+   --dtype bfloat16 \
+   --enable-auto-tool-choice \
+   --tool-call-parser qwen3_xml
+ ```
+
+ ```python
+ from openai import OpenAI
+ import base64
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
+
+ # Encode reference screenshot
+ with open("reference.png", "rb") as f:
+     img_b64 = base64.b64encode(f.read()).decode()
+
+ response = client.chat.completions.create(
+     model="Scrymore/stone-preview-4b",
+     messages=[
+         {
+             "role": "system",
+             "content": (
+                 "You are a React Native (Expo) UI engineer. Given a reference "
+                 "mobile-app screenshot, you build the matching screen by editing "
+                 "the project at WORKDIR. You have Read, Write, Edit, Glob, Grep, "
+                 "and Bash tools. Iterate by rendering and visually comparing to "
+                 "the reference, then stop when the render matches."
+             ),
+         },
+         {
+             "role": "user",
+             "content": [
+                 {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
+                 {"type": "text", "text": (
+                     "Reference screenshot above.\n"
+                     "WORKDIR: /workspace/my-app\n"
+                     "Target file: app/screen.tsx\n\n"
+                     "Build the screen that matches the reference."
+                 )},
+             ],
+         },
+     ],
+     tools=[...],  # your tool schemas
+ )
+ ```
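
The `tools=[...]` placeholder above is left to the caller. As a rough illustration only, a schema for the `Write` tool in the OpenAI function-calling format might look like the sketch below. The parameter names `file_path` and `content` come from the XML example in this README; the descriptions, the relative-path convention, and the `WRITE_TOOL` name are assumptions, not the schemas used in training.

```python
import json

# Hypothetical schema for the Write tool (OpenAI function-calling format).
# Parameter names follow the README's XML example; descriptions are assumed.
WRITE_TOOL = {
    "type": "function",
    "function": {
        "name": "Write",
        "description": "Write a file inside WORKDIR, replacing any existing content.",
        "parameters": {
            "type": "object",
            "properties": {
                "file_path": {
                    "type": "string",
                    "description": "Path of the file to write (assumed relative to WORKDIR).",
                },
                "content": {
                    "type": "string",
                    "description": "Full contents of the file.",
                },
            },
            "required": ["file_path", "content"],
        },
    },
}

tools = [WRITE_TOOL]  # extend with schemas for the other tools your loop executes
print(json.dumps(tools, indent=2))
```

Only tools that the surrounding loop can actually execute should be listed, since the model is expected to act on their results.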
+
+ ### With transformers
+
  ```python
  from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

  model = Qwen3VLForConditionalGeneration.from_pretrained(
      "Scrymore/stone-preview-4b",
 
      device_map="auto",
  )
  processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
  ```

+ > **Note:** Requires `transformers >= 5.3` for `qwen3_5` model type support.

+ ## Training Details

+ ### Data

+ Each training record is a hydrated agent trace from a Claude-driven screen-building loop over 87 iOS apps (Mobbin corpus). Three SFT variants per trace:

+ - **trajectory**: full multi-turn agent loop (system → user(ref-image) → assistant tool calls → tool results → ...)
+ - **oneshot**: reference image → final TSX code in one turn
+ - **turn**: predict the next assistant message given history up to turn k
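
To make the three variants concrete, here is a minimal sketch of how one trace could be expanded into them. It assumes a trace is a list of chat messages (`role`/`content` dicts) and that the final TSX code is available separately; the function names and message layout are illustrative and are not the actual data pipeline.

```python
def trajectory_variant(trace):
    """Full multi-turn loop: the trace is used as-is."""
    return list(trace)


def oneshot_variant(trace, final_tsx):
    """Reference image in, final TSX out: keep the prompt and answer in one turn."""
    first_assistant = next(i for i, m in enumerate(trace) if m["role"] == "assistant")
    prompt = trace[:first_assistant]  # system + user(ref-image)
    return prompt + [{"role": "assistant", "content": final_tsx}]


def turn_variants(trace):
    """One sample per assistant message: history up to turn k, then that message as target."""
    samples = []
    for k, message in enumerate(trace):
        if message["role"] == "assistant":
            samples.append({"history": trace[:k], "target": message})
    return samples


# Toy trace; contents are placeholders, not real training data.
trace = [
    {"role": "system", "content": "You are a React Native (Expo) UI engineer."},
    {"role": "user", "content": "<reference image> Build the screen."},
    {"role": "assistant", "content": "<tool_call>Glob</tool_call>"},
    {"role": "tool", "content": "app/screen.tsx"},
    {"role": "assistant", "content": "<tool_call>Write</tool_call>"},
]

full = trajectory_variant(trace)
oneshot = oneshot_variant(trace, "export default function Screen() { return null; }")
turns = turn_variants(trace)
print(len(full), len(oneshot), len(turns))
```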

+ Final dataset (v3): 1,232 train / 66 val records, filtered to ≤23k tokens. Mean 16.1k tokens, mean 4.3 render iterations per trace.

+ ### Key Design Decisions

+ - **Inline XML tool calls**: ms-swift's encoder reads `message['content']` only, silently dropping the `tool_calls` field. Tool calls are rendered as XML directly in the content string so they land in the loss-bearing token region.
+ - **Visual feedback loop**: `Render` tool results include the rendered screenshot as an image, so the model learns to compare its output against the reference and iterate.
+ - **Loss scale weighting**: post-Render assistant turns weighted 2.5x to emphasize iteration behavior.
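
The loss-scale idea can be pictured as a per-message weight: assistant turns that directly follow a `Render` result get 2.5, other assistant turns get 1.0. The sketch below is an illustration under assumed data structures, not the ms-swift configuration used for training. The `tool_name` key on tool messages is hypothetical, and giving non-assistant messages a weight of 0 is the usual SFT convention rather than something this README states.

```python
POST_RENDER_WEIGHT = 2.5  # multiplier stated in the list above
DEFAULT_WEIGHT = 1.0


def loss_weights(messages):
    """Return one loss weight per message in a trace."""
    weights = []
    previous = None
    for message in messages:
        if message["role"] != "assistant":
            weights.append(0.0)  # assumed: prompts and tool results carry no loss
        elif previous is not None and previous.get("tool_name") == "Render":
            weights.append(POST_RENDER_WEIGHT)  # turn that reacts to a rendered screenshot
        else:
            weights.append(DEFAULT_WEIGHT)
        previous = message
    return weights


# Toy trace; the tool_name field is a hypothetical annotation.
messages = [
    {"role": "system", "content": "You are a React Native (Expo) UI engineer."},
    {"role": "user", "content": "<reference image>"},
    {"role": "assistant", "content": "<tool_call>Render</tool_call>"},
    {"role": "tool", "tool_name": "Render", "content": "<rendered image>"},
    {"role": "assistant", "content": "<tool_call>Edit</tool_call>"},
]

weights = loss_weights(messages)
print(weights)
```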

+ ### Config

+ | | |
+ |---|---|
+ | **Framework** | ms-swift 4.2.1 |
+ | **LoRA** | r=32, alpha=64, all-linear targets |
+ | **Precision** | bfloat16 |
+ | **Attention** | FlashAttention 2 |
+ | **Max context** | 32,768 tokens |
+ | **Max pixels** | 602,112 (576 tokens/image) |
+ | **LR** | 5e-5, cosine schedule, 5% warmup |
+ | **Effective batch** | 8 (BS=1 x GA=2 x 4 GPUs) |
+
+ ## GGUF Quantizations
+
+ | File | Size | Description |
+ |------|------|-------------|
+ | `stone-preview-4b-f16.gguf` | 8.4 GB | F16 full precision |
+ | `stone-preview-4b-q4_k_m.gguf` | 2.7 GB | Q4_K_M quantized (5.13 BPW) |
+ | `stone-preview-4b-mmproj-f16.gguf` | 25.6 MB | Vision projector |
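
The two model file sizes can be cross-checked with size ≈ parameters × bits per weight ÷ 8. The sketch below derives the parameter count from the F16 file (16 bits per weight) and predicts the Q4_K_M size from the stated 5.13 BPW. It ignores metadata overhead and treats 1 GB as 10^9 bytes, both simplifying assumptions.

```python
F16_SIZE_GB = 8.4   # from the table
Q4_K_M_BPW = 5.13   # bits per weight, from the table

# F16 stores 16 bits per weight, so size_bytes * 8 / 16 gives the weight count.
params_billion = F16_SIZE_GB * 8 / 16

# Predicted Q4_K_M file size at the stated bits per weight.
q4_size_gb = params_billion * Q4_K_M_BPW / 8

print(f"{params_billion:.1f}B parameters, Q4_K_M ~ {q4_size_gb:.2f} GB")
```

The predicted size lands within rounding of the 2.7 GB listed in the table, and the implied parameter count agrees with the ~4B figure in Model Details.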

  ## Limitations

+ - Trained on iOS app screenshots only (87 apps from Mobbin). Android, web, and desktop UIs are untested.
+ - The app corpus skews toward consumer apps (social, fintech, health, travel). Enterprise/B2B UIs may produce lower-quality results.
+ - Works best when embedded in an agent loop with actual tool execution and visual feedback. Standalone generation without iterative rendering produces weaker results.
  - Requires `transformers >= 5.3` for `qwen3_5` model type support.
+ - When serving with vLLM, the LoRA must be **merged** into base weights before serving. vLLM's LoRA loader silently drops vision-tower deltas.

  ## Citation

  ```bibtex
+ @misc{stone-preview-4b-2026,
+   title={Stone Preview 4B: A Multimodal Agent for Mobile UI Screen Building},
    author={Ejiro Pinnock},
    year={2026},
    url={https://huggingface.co/Scrymore/stone-preview-4b}