epinnock committed on May 27

Commit d7d17a8 (verified) · 1 Parent(s): 444a9b7

Upload README.md with huggingface_hub

Files changed (1)

README.md +119 -95

@@ -5,21 +5,21 @@ base_model: Qwen/Qwen3.5-4B
 tags:
   - vision
   - vlm
   - mobile-ui
-  - screenshot-understanding
   - qwen3.5
   - lora
   - fine-tuned
 pipeline_tag: image-text-to-text
 language:
   - en
-datasets:
-  - Scrymore/scry-stage1-v2-data
 ---
-# Stone Preview 4B — Mobile Screenshot → Structured Description
-A fine-tuned vision-language model that generates structured JSON descriptions of mobile app screenshots. Given a screenshot, it outputs a comprehensive annotation covering layout, components, color scheme, screen type, and more.
 ## Model Details
@@ -28,50 +28,97 @@ A fine-tuned vision-language model that generates structured JSON descriptions o
 | **Base model** | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
 | **Architecture** | `Qwen3_5ForConditionalGeneration` — 32 layers, 2560 hidden, 16 attention heads |
 | **Parameters** | ~4B (bfloat16) |
-| **Fine-tuning** | LoRA (r=32, alpha=64, all-linear targets) |
-| **Training data** | 2,375 GPT-5-mini-generated descriptions, 125 validation |
-| **Training** | 148 steps, batch size 4, gradient accumulation 4, LR=1e-4 |
-| **Context** | 2,048 tokens max |
 | **Format** | Merged safetensors (LoRA folded into base weights) |
-## Intended Use
-This model is the **description stage** of the [Scry](https://github.com/Scrymore) UI search pipeline. It converts mobile app screenshots into structured metadata that can be indexed for text-based retrieval.
-### Input
-A mobile app screenshot with the instruction prompt:
-> Analyze this mobile app screenshot. Provide a structured JSON description including: a natural language description, tags, screen type and subtype, visible text, UI components, layout pattern, color scheme, primary action, content density, design patterns, illustration style, emotional tone, and keyboard visibility.
-### Output
-A JSON object with these fields:
-```json
-{
-  "description": "Natural language description of the screen",
-  "tags": ["onboarding", "form", "minimal", ...],
-  "screen_type": "Actions",
-  "screen_subtype": "Account Setup",
-  "visible_text": ["Sign up", "Continue", ...],
-  "ui_components": ["text-input", "button", "progress-bar", ...],
-  "layout_pattern": "single-column-form",
-  "color_scheme": {"mode": "light", "primary_color": "blue", "style": "minimal"},
-  "primary_action": "complete registration",
-  "content_density": "sparse",
-  "design_patterns": ["progressive-disclosure", "floating-action-button"],
-  "illustration_style": "none",
-  "emotional_tone": "neutral",
-  "keyboard_visible": false
-}
 ```
 ## Usage
 ```python
 from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
-from PIL import Image
 model = Qwen3VLForConditionalGeneration.from_pretrained(
     "Scrymore/stone-preview-4b",
@@ -79,85 +126,62 @@ model = Qwen3VLForConditionalGeneration.from_pretrained(
     device_map="auto",
 )
 processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
-image = Image.open("screenshot.png")
-messages = [
-    {
-        "role": "user",
-        "content": [
-            {"type": "image", "image": image},
-            {"type": "text", "text": (
-                "Analyze this mobile app screenshot. Provide a structured JSON description including: "
-                "a natural language description, tags, screen type and subtype, visible text, "
-                "UI components, layout pattern, color scheme, primary action, content density, "
-                "design patterns, illustration style, emotional tone, and keyboard visibility."
-            )},
-        ],
-    }
-]
-text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
-output_ids = model.generate(**inputs, max_new_tokens=1024, enable_thinking=False)
-response = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
-print(response)
 ```
-> **Important:** Use `enable_thinking=False` — this model was trained without thinking mode.
-### vLLM (recommended for batch inference)
-```python
-from vllm import LLM, SamplingParams
-llm = LLM(model="Scrymore/stone-preview-4b", dtype="bfloat16")
-# ~1.1s per image on RTX 3090
-```
-## Performance
-### JSON Parse Rate
-100% valid JSON on 750 benchmark screenshots (vLLM, RTX 3090).
-### ELO Ranking (Claude Sonnet 4.6 judge, 900 pairwise comparisons)
-| Rank | Model | ELO | Win Rate |
-|------|-------|-----|----------|
-| 1st | **Stone Preview 4B (this model)** | **1042** | **58.9%** |
-| 2nd | GPT-5-mini (teacher) | 1019 | 54.2% |
-| 3rd | Qwen3-VL-32B | 939 | 37.1% |
-The fine-tuned 4B model **outperforms its teacher** (GPT-5-mini) and a model 8x its size (Qwen3-VL-32B) on description quality as judged by Claude Sonnet 4.6.
-### Inference Speed
-| Backend | Hardware | Speed |
-|---------|----------|-------|
-| vLLM | RTX 3090 | ~1.1s/image |
-| llama.cpp (Q4_K_M) | EPYC 7551P (CPU) | ~35-40s/image, 10.4 tok/s |
 ## Limitations
-- Trained on iOS app screenshots only — performance on Android, web, or desktop UIs is untested.
-- The 87 apps in the training set skew toward consumer apps (social, fintech, health, travel). Enterprise/B2B UIs may produce lower-quality descriptions.
 - Requires `transformers >= 5.3` for `qwen3_5` model type support.
-## Training Details
-- **Framework**: Unsloth + transformers
-- **LoRA config**: r=32, alpha=64, all-linear targets, no bias
-- **Data**: 2,500 screenshot-description pairs generated by GPT-5-mini from 87 iOS apps, then filtered to 2,375 train / 125 val
-- **Hardware**: 2x NVIDIA RTX 3090
-- **Epochs**: ~3 (148 steps × effective batch 16 ÷ 2,375 samples)
 ## Citation
 ```bibtex
-@misc{scry-stage1-5-2026,
-  title={Stone Preview 4B: Fine-tuned Qwen 3.5 4B for Mobile UI Screenshot Description},
   author={Ejiro Pinnock},
   year={2026},
   url={https://huggingface.co/Scrymore/stone-preview-4b}

 tags:
   - vision
   - vlm
+  - agent
   - mobile-ui
+  - react-native
+  - tool-use
   - qwen3.5
   - lora
   - fine-tuned
 pipeline_tag: image-text-to-text
 language:
   - en
 ---
+# Stone Preview 4B — Multimodal UI Agent
+A fine-tuned vision-language model that acts as a **React Native UI engineer**. Given a reference mobile app screenshot, it builds the matching screen by emitting tool calls (such as `Read`, `Write`, `Edit`, `Glob`, `Bash`, and `Render`) and iterating on visual feedback until the render matches the reference.
 ## Model Details
 | **Base model** | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
 | **Architecture** | `Qwen3_5ForConditionalGeneration` — 32 layers, 2560 hidden, 16 attention heads |
 | **Parameters** | ~4B (bfloat16) |
+| **Fine-tuning** | LoRA (r=32, alpha=64, all-linear targets) via ms-swift 4.2.1 |
+| **Training data** | 1,232 agent traces (v3), each a multi-turn tool-calling loop with visual feedback |
+| **Context** | 32,768 tokens max |
+| **Hardware** | 4x H200 SXM |
 | **Format** | Merged safetensors (LoRA folded into base weights) |
+## What the Model Does
+The model has learned two core behaviors:
+1. **Visual reasoning** — analyze a reference screenshot to identify layout, UI components, colors, spacing, and hierarchy
+2. **Tool-call emission** — given the conversation history, emit properly-formatted XML tool calls to read files, write code, render the result, and iterate
+Each training example is a hydrated agent trace: the model sees a reference screenshot, explores the project structure, writes React Native/Expo code, renders it, compares against the reference, and iterates until the output matches.
+### Tool Call Format
+The model emits tool calls as inline XML in its responses:
+```xml
+<tool_call>
+<function=Write>
+<parameter=file_path>
+app/(flows)/flow-001/screen-001.tsx
+</parameter>
+<parameter=content>
+import React from 'react';
+import { View, Text, StyleSheet } from 'react-native';
+// ... component code
+</parameter>
+</function>
+</tool_call>
 ```
+Available tools: `Read`, `Write`, `Edit`, `Glob`, `Grep`, `Bash`, `Render`, `ToolSearch`
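
A client that consumes raw model text (rather than relying on a server-side tool-call parser) needs to pull these XML blocks back out of the response. The following is a minimal sketch of one way to do that with regular expressions. `parse_tool_calls` is a hypothetical helper, not something shipped with the model, and it assumes the exact layout shown in the example above (one `<function=...>` per `<tool_call>`, parameter values on their own lines).

```python
import re

# Hypothetical helper, not part of the model repo or of any library.
# Assumes the layout shown in the README's tool-call example.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
PARAM_RE = re.compile(
    r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>",
    re.DOTALL,
)


def parse_tool_calls(text):
    """Return a list of {"name": ..., "arguments": {...}} dicts found in text."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        # Each parameter value is kept as a raw string, minus one
        # leading and one trailing newline.
        arguments = {key: value for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


sample = """I'll write the screen now.
<tool_call>
<function=Write>
<parameter=file_path>
app/(flows)/flow-001/screen-001.tsx
</parameter>
<parameter=content>
import React from 'react';
</parameter>
</function>
</tool_call>"""

calls = parse_tool_calls(sample)
print(calls[0]["name"], calls[0]["arguments"]["file_path"])
```

This sketch breaks if a parameter value itself contains the literal text `</parameter>` or `</function>`, so a production parser would need a stricter strategy.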
 ## Usage
+### With vLLM (recommended)
+```bash
+vllm serve Scrymore/stone-preview-4b \
+  --dtype bfloat16 \
+  --enable-auto-tool-choice \
+  --tool-call-parser qwen3_xml
+```
+```python
+from openai import OpenAI
+import base64
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
+# Encode reference screenshot
+with open("reference.png", "rb") as f:
+    img_b64 = base64.b64encode(f.read()).decode()
+response = client.chat.completions.create(
+    model="Scrymore/stone-preview-4b",
+    messages=[
+        {
+            "role": "system",
+            "content": (
+                "You are a React Native (Expo) UI engineer. Given a reference "
+                "mobile-app screenshot, you build the matching screen by editing "
+                "the project at WORKDIR. You have Read, Write, Edit, Glob, Grep, "
+                "and Bash tools. Iterate by rendering and visually comparing to "
+                "the reference, then stop when the render matches."
+            ),
+        },
+        {
+            "role": "user",
+            "content": [
+                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
+                {"type": "text", "text": (
+                    "Reference screenshot above.\n"
+                    "WORKDIR: /workspace/my-app\n"
+                    "Target file: app/screen.tsx\n\n"
+                    "Build the screen that matches the reference."
+                )},
+            ],
+        },
+    ],
+    tools=[...],  # your tool schemas
+)
+```
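
The `tools=[...]` placeholder above takes OpenAI-style function schemas. As a rough illustration, the sketch below builds two of them. The `Write` parameter names (`file_path`, `content`) come from the tool-call example in this README; the `Read` parameter name, all descriptions, and the `function_tool` helper are assumptions made for illustration, and the real schemas should match whatever your tool executor implements.

```python
def function_tool(name, description, properties, required):
    """Build one OpenAI-style function tool schema (hypothetical helper)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


tools = [
    # Parameter names taken from the README's Write example.
    function_tool(
        "Write",
        "Create or overwrite a file with the given content.",
        {
            "file_path": {"type": "string", "description": "Path relative to WORKDIR."},
            "content": {"type": "string", "description": "Full file contents."},
        },
        ["file_path", "content"],
    ),
    # Assumption: Read takes a single file_path argument.
    function_tool(
        "Read",
        "Return the contents of a file.",
        {"file_path": {"type": "string", "description": "Path relative to WORKDIR."}},
        ["file_path"],
    ),
]

print([tool["function"]["name"] for tool in tools])
```

The resulting list can be passed as `tools=tools` in the `client.chat.completions.create` call.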
+### With transformers
 ```python
 from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
 model = Qwen3VLForConditionalGeneration.from_pretrained(
     "Scrymore/stone-preview-4b",
     device_map="auto",
 )
 processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
 ```
+> **Note:** Requires `transformers >= 5.3` for `qwen3_5` model type support.
+## Training Details
+### Data
+Each training record is a hydrated agent trace from a Claude-driven screen-building loop over 87 iOS apps (Mobbin corpus). Three SFT variants are derived from each trace:
+- **trajectory** — full multi-turn agent loop (system → user(ref-image) → assistant tool calls → tool results → ...)
+- **oneshot** — reference image → final TSX code in one turn
+- **turn** — predict the next assistant message given history up to turn k
+Final dataset (v3): 1,232 train / 66 val records, filtered to ≤23k tokens. Mean 16.1k tokens, mean 4.3 render iterations per trace.
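
To make the **turn** variant concrete, here is a minimal sketch of how next-message examples could be derived from one trajectory. It is not the author's data pipeline: `turn_examples` is a hypothetical function, and the message dictionaries use the common `role`/`content` chat layout as an assumption.

```python
def turn_examples(messages):
    """For every assistant message at index k, pair history[:k] with that message."""
    examples = []
    for k, message in enumerate(messages):
        if message["role"] == "assistant":
            examples.append({"history": messages[:k], "target": message})
    return examples


# Toy trajectory; real traces also carry images and XML tool calls.
trace = [
    {"role": "system", "content": "You are a React Native (Expo) UI engineer."},
    {"role": "user", "content": "Reference screenshot above."},
    {"role": "assistant", "content": "<tool_call>...Glob...</tool_call>"},
    {"role": "tool", "content": "app/screen.tsx"},
    {"role": "assistant", "content": "<tool_call>...Write...</tool_call>"},
]

examples = turn_examples(trace)
print(len(examples), [len(example["history"]) for example in examples])
```

Each trajectory therefore yields as many turn examples as it has assistant messages, with progressively longer histories.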
+### Key Design Decisions
+- **Inline XML tool calls** — ms-swift's encoder reads `message['content']` only, silently dropping the `tool_calls` field. Tool calls are rendered as XML directly in the content string so they land in the loss-bearing token region.
+- **Visual feedback loop** — `Render` tool results include the rendered screenshot as an image, so the model learns to compare its output against the reference and iterate.
+- **Loss scale weighting** — post-Render assistant turns weighted 2.5x to emphasize iteration behavior.
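
The loss-scale weighting can be pictured as a per-message weight vector. The sketch below is one possible reading, not the ms-swift configuration used for training: it assumes "post-Render" means the assistant message immediately following a `Render` tool result, that tool results are messages with `role == "tool"` and a `name` field, and that all other assistant turns keep a weight of 1.0. Only the 2.5 factor comes from this README.

```python
POST_RENDER_WEIGHT = 2.5  # factor stated in the README
DEFAULT_WEIGHT = 1.0      # assumption: ordinary assistant turns are unscaled


def assistant_loss_weights(messages):
    """Return one loss weight per message; non-assistant messages get 0.0."""
    weights = []
    previous = None
    for message in messages:
        if message["role"] != "assistant":
            # System, user, and tool messages carry no loss.
            weights.append(0.0)
        elif (
            previous is not None
            and previous["role"] == "tool"
            and previous.get("name") == "Render"
        ):
            # Assistant turn that reacts to a rendered screenshot.
            weights.append(POST_RENDER_WEIGHT)
        else:
            weights.append(DEFAULT_WEIGHT)
        previous = message
    return weights


trace = [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "write the file"},
    {"role": "tool", "name": "Write", "content": "ok"},
    {"role": "assistant", "content": "render it"},
    {"role": "tool", "name": "Render", "content": "<image>"},
    {"role": "assistant", "content": "fix the spacing"},
]

print(assistant_loss_weights(trace))
```

Under these assumptions only the final assistant turn, which responds to the rendered image, receives the higher weight.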
+### Config
+| | |
+|---|---|
+| **Framework** | ms-swift 4.2.1 |
+| **LoRA** | r=32, alpha=64, all-linear targets |
+| **Precision** | bfloat16 |
+| **Attention** | FlashAttention 2 |
+| **Max context** | 32,768 tokens |
+| **Max pixels** | 602,112 (576 tokens/image) |
+| **LR** | 5e-5, cosine schedule, 5% warmup |
+| **Effective batch** | 8 (BS=1 x GA=2 x 4 GPUs) |
+## GGUF Quantizations
+| File | Size | Description |
+|------|------|-------------|
+| `stone-preview-4b-f16.gguf` | 8.4 GB | F16 full precision |
+| `stone-preview-4b-q4_k_m.gguf` | 2.7 GB | Q4_K_M quantized (5.13 BPW) |
+| `stone-preview-4b-mmproj-f16.gguf` | 25.6 MB | Vision projector |
 ## Limitations
+- Trained on iOS app screenshots only (87 apps from Mobbin). Android, web, and desktop UIs are untested.
+- The app corpus skews toward consumer apps (social, fintech, health, travel). Enterprise/B2B UIs may produce lower-quality results.
+- Works best when embedded in an agent loop with actual tool execution and visual feedback. Standalone generation without iterative rendering produces weaker results.
 - Requires `transformers >= 5.3` for `qwen3_5` model type support.
+- When serving with vLLM, the LoRA must be **merged** into base weights before serving. vLLM's LoRA loader silently drops vision-tower deltas.
 ## Citation
 ```bibtex
+@misc{stone-preview-4b-2026,
+  title={Stone Preview 4B: A Multimodal Agent for Mobile UI Screen Building},
   author={Ejiro Pinnock},
   year={2026},
   url={https://huggingface.co/Scrymore/stone-preview-4b}