Instructions for using Scrymore/stone-preview-4b with libraries, inference providers, notebooks, and local apps.

Libraries

How to use Scrymore/stone-preview-4b with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Scrymore/stone-preview-4b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
model = AutoModelForImageTextToText.from_pretrained("Scrymore/stone-preview-4b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens (everything after the prompt).
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

llama-cpp-python

How to use Scrymore/stone-preview-4b with llama-cpp-python:

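```python
# !pip install llama-cpp-python

from llama_cpp import Llama

# Note: this loads only the language-model GGUF. In llama-cpp-python, image
# inputs generally also need a multimodal chat_handler built from the vision
# projector (mmproj) file; without one, the image may not be used.
llm = Llama.from_pretrained(
    repo_id="Scrymore/stone-preview-4b",
    filename="stone-preview-4b-f16.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    },
                },
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```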

Local Apps

llama.cpp

How to use Scrymore/stone-preview-4b with llama.cpp:

Install from brew

```bash
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Scrymore/stone-preview-4b:F16
# Run inference directly in the terminal:
llama-cli -hf Scrymore/stone-preview-4b:F16
```

Install from WinGet (Windows)

```powershell
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Scrymore/stone-preview-4b:F16
# Run inference directly in the terminal:
llama-cli -hf Scrymore/stone-preview-4b:F16
```

Use pre-built binary

```bash
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Scrymore/stone-preview-4b:F16
# Run inference directly in the terminal:
./llama-cli -hf Scrymore/stone-preview-4b:F16
```

Build from source code

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Scrymore/stone-preview-4b:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Scrymore/stone-preview-4b:F16
```

Use Docker

```bash
docker model run hf.co/Scrymore/stone-preview-4b:F16
```


vLLM

How to use Scrymore/stone-preview-4b with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Scrymore/stone-preview-4b"
# In another terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Scrymore/stone-preview-4b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```

Use Docker

```bash
docker model run hf.co/Scrymore/stone-preview-4b:F16
```

SGLang

How to use Scrymore/stone-preview-4b with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Scrymore/stone-preview-4b" \
    --host 0.0.0.0 \
    --port 30000
# In another terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Scrymore/stone-preview-4b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Scrymore/stone-preview-4b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Scrymore/stone-preview-4b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```

Ollama
How to use Scrymore/stone-preview-4b with Ollama:
```
ollama run hf.co/Scrymore/stone-preview-4b:F16
```

Unsloth Studio

How to use Scrymore/stone-preview-4b with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

```bash
curl -fsSL https://unsloth.ai/install.sh | sh
# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# and search for Scrymore/stone-preview-4b to start chatting.
```

Install Unsloth Studio (Windows)

```powershell
irm https://unsloth.ai/install.ps1 | iex
# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# and search for Scrymore/stone-preview-4b to start chatting.
```

Using Hugging Face Spaces for Unsloth

No setup is required: open https://huggingface.co/spaces/unsloth/studio in your browser and search for Scrymore/stone-preview-4b to start chatting.

Pi

How to use Scrymore/stone-preview-4b with Pi:

Start the llama.cpp server

```bash
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server (listens on port 8080 by default):
llama-server -hf Scrymore/stone-preview-4b:F16
```

Configure the model in Pi

```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add the following to `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Scrymore/stone-preview-4b:F16"
        }
      ]
    }
  }
}
```

Run Pi

```bash
# Start Pi in your project directory:
pi
```

Hermes Agent

How to use Scrymore/stone-preview-4b with Hermes Agent:

Start the llama.cpp server

```bash
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Scrymore/stone-preview-4b:F16
```

Configure Hermes

```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Scrymore/stone-preview-4b:F16
```

Run Hermes

```bash
hermes
```

Docker Model Runner
How to use Scrymore/stone-preview-4b with Docker Model Runner:
```
docker model run hf.co/Scrymore/stone-preview-4b:F16
```

Lemonade

How to use Scrymore/stone-preview-4b with Lemonade:

Pull the model

```bash
# Download Lemonade from https://lemonade-server.ai/
lemonade pull Scrymore/stone-preview-4b:F16
```

Run and chat with the model

```bash
lemonade run user.stone-preview-4b-F16
```

List all available models

```bash
lemonade list
```


---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
tags:
- vision
- vlm
- agent
- mobile-ui
- react-native
- tool-use
- qwen3.5
- lora
- fine-tuned
pipeline_tag: image-text-to-text
language:
- en
---

# Stone Preview 4B: Multimodal UI Agent

A fine-tuned vision-language model that acts as a React Native UI engineer. Given a reference mobile-app screenshot, it builds the matching screen by emitting tool calls (such as `Read`, `Write`, `Edit`, `Glob`, `Bash`, and `Render`) and iterating on visual feedback until the render matches the reference.

<p align="center">
<img src="assets/showcase.png" alt="Sample outputs from Stone Preview 4B" width="800">
<br>
<em>Screens built by the model from reference screenshots: Nike, How We Feel, Vivid</em>
</p>

## Model Details

| Property | Value |
|---|---|
| Base model | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
| Architecture | `Qwen3_5ForConditionalGeneration`: 32 layers, hidden size 2560, 16 attention heads |
| Parameters | ~4B (bfloat16) |
| Fine-tuning | LoRA (r=32, alpha=64, all-linear targets) via ms-swift 4.2.1 |
| Training data | 1,232 agent traces (v3), each a multi-turn tool-calling loop with visual feedback |
| Context | 32,768 tokens max |
| Hardware | 4× H200 SXM |
| Format | Merged safetensors (LoRA folded into base weights) |
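Because the LoRA update is folded into the base weights, no adapter has to be loaded at inference time. The formula below assumes the standard LoRA scaling α/r from the original LoRA formulation, which is also PEFT's default when rank-stabilized LoRA is off. Under that assumption, each adapted linear layer's merged weight is:

$$
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\;
A \in \mathbb{R}^{r \times d_{\text{in}}},\;
\frac{\alpha}{r} = \frac{64}{32} = 2
$$

The merged update is the learned rank-32 product scaled by 2. During fine-tuning, each adapted matrix adds r(d_in + d_out) trainable parameters.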

## What the Model Does

The model has learned two core behaviors:

1. Visual reasoning: analyze a reference screenshot to identify layout, UI components, colors, spacing, and hierarchy.
2. Tool-call emission: given the conversation history, emit properly formatted XML tool calls to read files, write code, render the result, and iterate.

Each training example is a hydrated agent trace. In it, the model sees a reference screenshot, explores the project structure, writes React Native/Expo code, and renders it. It then compares the render against the reference and iterates until the output matches.

### Tool Call Format

The model emits tool calls as inline XML in its responses:

```xml
<tool_call>
<function=Write>
<parameter=file_path>
app/(flows)/flow-001/screen-001.tsx
</parameter>
<parameter=content>
import React from 'react';
import { View, Text, StyleSheet } from 'react-native';
// ... component code
</parameter>
</function>
</tool_call>
```

Available tools: `Read`, `Write`, `Edit`, `Glob`, `Grep`, `Bash`, `Render`, `ToolSearch`
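Some runtimes do not parse tool calls for you, for example raw `transformers` generation or vLLM started without `--tool-call-parser`. In those cases you have to extract the XML blocks yourself. Below is a minimal, hypothetical parser for the format shown above. The `parse_tool_calls` name and its return shape are illustrative choices, not part of any library. The parser assumes parameter values never contain a literal `</parameter>` or `</function>` tag.

```python
import re

# One <tool_call> block: capture the function name and its body.
# re.DOTALL lets '.' match newlines, since file contents span many lines.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>", re.DOTALL
)
# One <parameter=name> ... </parameter> pair; the newline right after the
# opening tag and the one right before the closing tag are not part of the value.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract inline XML tool calls from an assistant message (hypothetical helper)."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = {key: value for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls
```

Each call comes back as `{"name": ..., "arguments": {...}}` with plain-string argument values, ready to dispatch to your own tool implementations. The non-greedy patterns keep consecutive tool calls in one message separate.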

## Usage

### With vLLM (recommended)

```bash
vllm serve Scrymore/stone-preview-4b \
  --dtype bfloat16 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml
```

```python
import base64

from openai import OpenAI

# vLLM does not check the API key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Encode the reference screenshot so it can be sent as a data URL.
with open("reference.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Scrymore/stone-preview-4b",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a React Native (Expo) UI engineer. Given a reference "
                "mobile-app screenshot, you build the matching screen by editing "
                "the project at WORKDIR. You have Read, Write, Edit, Glob, Grep, "
                "and Bash tools. Iterate by rendering and visually comparing to "
                "the reference, then stop when the render matches."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                {"type": "text", "text": (
                    "Reference screenshot above.\n"
                    "WORKDIR: /workspace/my-app\n"
                    "Target file: app/screen.tsx\n\n"
                    "Build the screen that matches the reference."
                )},
            ],
        },
    ],
    tools=[...],  # replace with your tool schemas (OpenAI function-calling format)
)

# With --enable-auto-tool-choice, parsed calls are returned in message.tool_calls.
message = response.choices[0].message
print(message.tool_calls or message.content)
```

### With transformers

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# The Auto class picks the model class registered for the checkpoint's
# qwen3_5 model type (Qwen3_5ForConditionalGeneration, per the config).
model = AutoModelForImageTextToText.from_pretrained(
    "Scrymore/stone-preview-4b",
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
```

> Note: Requires `transformers >= 5.3` for `qwen3_5` model type support.

## Training Details

### Data

Each training record is a hydrated agent trace from a Claude-driven screen-building loop over 87 iOS apps (Mobbin corpus). Each trace yields three SFT variants:

- `trajectory`: the full multi-turn agent loop (system → user (reference image) → assistant tool calls → tool results → ...)
- `oneshot`: reference image → final TSX code in one turn
- `turn`: predict the next assistant message given the history up to turn k
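Below is one plausible way to derive the three variants from a single trace. Several details are assumptions made for illustration, not the actual data pipeline. Messages are assumed to be OpenAI-style `role`/`content` dicts. The final TSX is assumed to be passed in separately as `final_code`. The `make_sft_variants` name is hypothetical.

```python
from typing import Any

Message = dict[str, Any]


def make_sft_variants(trace: list[Message], final_code: str) -> dict[str, list]:
    """Derive trajectory / oneshot / turn views of one agent trace (hypothetical sketch)."""
    # trajectory: the whole multi-turn loop, trained end to end.
    trajectory = list(trace)

    # oneshot: system prompt + reference-image user turn -> final TSX in a single reply.
    system = [m for m in trace if m["role"] == "system"][:1]
    first_user = next(m for m in trace if m["role"] == "user")
    oneshot = system + [first_user, {"role": "assistant", "content": final_code}]

    # turn: one example per assistant message k, prompted with the history before it.
    turns = [
        {"history": trace[:k], "target": msg}
        for k, msg in enumerate(trace)
        if msg["role"] == "assistant"
    ]
    return {"trajectory": trajectory, "oneshot": oneshot, "turn": turns}
```

A trace with n assistant messages yields one trajectory example, one oneshot example, and n turn examples.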

Final dataset (v3): 1,232 train / 66 validation records, filtered to ≤23k tokens. Traces average 16.1k tokens and 4.3 render iterations each.

### Key Design Decisions

- Inline XML tool calls: ms-swift's encoder reads only `message['content']` and silently drops the `tool_calls` field. Tool calls are therefore rendered as XML directly in the content string, so they fall inside the loss-bearing token region.
- Visual feedback loop: `Render` tool results include the rendered screenshot as an image, so the model learns to compare its output against the reference and iterate.
- Loss-scale weighting: assistant turns that follow a `Render` result are weighted 2.5× to emphasize iteration behavior.
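The loss-scale rule can be expressed as a per-message weight. This sketch illustrates the rule under stated assumptions and is not ms-swift's actual loss-scale mechanism. It assumes messages are OpenAI-style dicts and tool results carry a `name` field identifying the tool. It counts a turn as "post-Render" if any tool result since the previous assistant turn came from `Render`. `turn_loss_weights` and `weighted_loss` are hypothetical names.

```python
def turn_loss_weights(messages: list[dict], post_render_weight: float = 2.5) -> list[float]:
    """Per-message loss weights (hypothetical sketch, not ms-swift's API).

    system/user/tool messages get 0.0 (prompt tokens carry no loss);
    assistant turns get 1.0, or post_render_weight when a Render result
    arrived since the previous assistant turn.
    """
    weights = []
    render_since_last_reply = False
    for msg in messages:
        if msg["role"] == "assistant":
            weights.append(post_render_weight if render_since_last_reply else 1.0)
            render_since_last_reply = False
        else:
            weights.append(0.0)
            if msg["role"] == "tool" and msg.get("name") == "Render":
                render_since_last_reply = True
    return weights


def weighted_loss(turn_losses: list[float], weights: list[float]) -> float:
    """Weighted mean of per-message losses (one of several possible normalizations)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(loss * w for loss, w in zip(turn_losses, weights)) / total
```

Resetting the flag at every assistant turn means only the reply that directly reacts to a render gets the boost. Later turns in the same loop go back to weight 1.0.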

### Config

| Setting | Value |
|---|---|
| Framework | ms-swift 4.2.1 |
| LoRA | r=32, alpha=64, all-linear targets |
| Precision | bfloat16 |
| Attention | FlashAttention 2 |
| Max context | 32,768 tokens |
| Max pixels | 602,112 (576 tokens/image) |
| Learning rate | 5e-5, cosine schedule, 5% warmup |
| Effective batch size | 8 (batch size 1 × gradient accumulation 2 × 4 GPUs) |
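The schedule and batch arithmetic in the table can be checked with a few lines of Python. The sketch assumes the common linear-warmup-then-cosine-to-zero shape, the same shape as Hugging Face's `get_cosine_schedule_with_warmup`, with warmup steps rounded up. ms-swift's exact behavior may differ. This card does not state the number of epochs, so the one-epoch step count is only an example.

```python
import math

PEAK_LR = 5e-5       # learning rate from the config table
WARMUP_RATIO = 0.05  # 5% warmup


def lr_at(step: int, total_steps: int,
          peak_lr: float = PEAK_LR, warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at `step`: linear warmup from 0, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Effective batch = per-device batch size x gradient accumulation x GPUs.
effective_batch = 1 * 2 * 4
# Optimizer steps for one pass over the 1,232 training records (epoch count is an assumption).
steps_per_epoch = math.ceil(1232 / effective_batch)
```

At 8 records per optimizer step, one pass over the training set is 154 steps. Under this sketch, the first 8 of those steps (5%, rounded up) are warmup, and the rate falls to half its peak at the midpoint of the decay phase.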

## GGUF Quantizations

| File | Size | Description |
|---|---|---|
| `stone-preview-4b-f16.gguf` | 8.4 GB | F16, unquantized 16-bit weights |
| `stone-preview-4b-q4_k_m.gguf` | 2.7 GB | Q4_K_M quantized (5.13 bits per weight) |
| `stone-preview-4b-mmproj-f16.gguf` | 25.6 MB | Vision projector |

## Limitations

- Trained on iOS app screenshots only (87 apps from Mobbin). Android, web, and desktop UIs are untested.
- The app corpus skews toward consumer apps (social, fintech, health, travel), so enterprise/B2B UIs may produce lower-quality results.
- Works best when embedded in an agent loop with real tool execution and visual feedback. Standalone generation without iterative rendering produces weaker results.
- Requires `transformers >= 5.3` for `qwen3_5` model type support.
- When serving a LoRA checkpoint with vLLM, merge the adapter into the base weights first, because vLLM's LoRA loader silently drops vision-tower deltas. The weights published here are already merged.

## Citation

```bibtex
@misc{stone-preview-4b-2026,
  title={Stone Preview 4B: A Multimodal Agent for Mobile UI Screen Building},
  author={Ejiro Pinnock},
  year={2026},
  url={https://huggingface.co/Scrymore/stone-preview-4b}
}
```