---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
tags:
  - vision
  - vlm
  - agent
  - mobile-ui
  - react-native
  - tool-use
  - qwen3.5
  - lora
  - fine-tuned
pipeline_tag: image-text-to-text
language:
  - en
---

# Stone Preview 4B — Multimodal UI Agent

A fine-tuned vision-language model that acts as a **React Native UI engineer**. Given a reference mobile app screenshot, it builds the matching screen by emitting tool calls (such as `Read`, `Write`, `Edit`, `Glob`, `Bash`, and `Render`) and iterating on visual feedback until the render matches the reference.

<p align="center">
  <img src="assets/showcase.png" alt="Sample outputs from Stone Preview 4B" width="800">
  <br>
  <em>Screens built by the model from reference screenshots — Nike, How We Feel, Vivid</em>
</p>

## Model Details

| | |
|---|---|
| **Base model** | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
| **Architecture** | `Qwen3_5ForConditionalGeneration` — 32 layers, 2560 hidden, 16 attention heads |
| **Parameters** | ~4B (bfloat16) |
| **Fine-tuning** | LoRA (r=32, alpha=64, all-linear targets) via ms-swift 4.2.1 |
| **Training data** | 1,232 agent traces (v3), each a multi-turn tool-calling loop with visual feedback |
| **Context** | 32,768 tokens max |
| **Hardware** | 4x H200 SXM |
| **Format** | Merged safetensors (LoRA folded into base weights) |

## What the Model Does

The model has learned two core behaviors:

1. **Visual reasoning** — analyze a reference screenshot to identify layout, UI components, colors, spacing, and hierarchy
2. **Tool-call emission** — given the conversation history, emit properly formatted XML tool calls to read files, write code, render the result, and iterate

Each training example is a hydrated agent trace: the model sees a reference screenshot, explores the project structure, writes React Native/Expo code, renders it, compares against the reference, and iterates until the output matches.

### Tool Call Format

The model emits tool calls as inline XML in its responses:

```xml
<tool_call>
<function=Write>
<parameter=file_path>
app/(flows)/flow-001/screen-001.tsx
</parameter>
<parameter=content>
import React from 'react';
import { View, Text, StyleSheet } from 'react-native';
// ... component code
</parameter>
</function>
</tool_call>
```

Available tools: `Read`, `Write`, `Edit`, `Glob`, `Grep`, `Bash`, `Render`, `ToolSearch`
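
When the model is run without a server-side tool-call parser, the inline XML has to be extracted from the response text by the caller. The following is a minimal sketch of such a parser, assuming the exact tag layout shown above; `parse_tool_calls` is a hypothetical helper, not part of this repository, and it does not handle a parameter value that itself contains a literal `</parameter>`.

```python
import re

# Patterns follow the tag layout shown in the example above (an assumption
# about how strictly the model sticks to that layout).
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name": ..., "arguments": {...}} dicts found in text."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = {key: value for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


sample = """<tool_call>
<function=Write>
<parameter=file_path>
app/(flows)/flow-001/screen-001.tsx
</parameter>
<parameter=content>
import React from 'react';
</parameter>
</function>
</tool_call>"""

calls = parse_tool_calls(sample)
print(calls[0]["name"], calls[0]["arguments"]["file_path"])
```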

## Usage

### With vLLM (recommended)

```bash
vllm serve Scrymore/stone-preview-4b \
  --dtype bfloat16 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml
```

```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Encode reference screenshot
with open("reference.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Scrymore/stone-preview-4b",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a React Native (Expo) UI engineer. Given a reference "
                "mobile-app screenshot, you build the matching screen by editing "
                "the project at WORKDIR. You have Read, Write, Edit, Glob, Grep, "
                "and Bash tools. Iterate by rendering and visually comparing to "
                "the reference, then stop when the render matches."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                {"type": "text", "text": (
                    "Reference screenshot above.\n"
                    "WORKDIR: /workspace/my-app\n"
                    "Target file: app/screen.tsx\n\n"
                    "Build the screen that matches the reference."
                )},
            ],
        },
    ],
    tools=[...],  # your tool schemas
)
```
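
The `tools=[...]` argument above is left for the caller to fill in. As an illustration only, here is one way the schemas could be written in the OpenAI function-calling format. The `file_path` and `content` parameter names for `Write` come from the Tool Call Format example; the descriptions, the `Read` and `Render` parameters, and the `make_tool` helper are assumptions rather than the schemas used in training.

```python
def make_tool(name, description, parameter_names):
    """Build one OpenAI-style function tool schema with string parameters."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": "string"} for p in parameter_names},
                "required": list(parameter_names),
            },
        },
    }


tools = [
    # Parameter names for Write match the Tool Call Format example.
    make_tool("Write", "Create or overwrite a file.", ["file_path", "content"]),
    # The entries below are hypothetical; adjust them to your own tool runtime.
    make_tool("Read", "Read a file from the project.", ["file_path"]),
    make_tool("Render", "Render a screen and return a screenshot.", ["file_path"]),
]

print([t["function"]["name"] for t in tools])
```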

### With transformers

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# The Auto class resolves to the architecture declared in the checkpoint's
# config (Qwen3_5ForConditionalGeneration, see Model Details).
model = AutoModelForImageTextToText.from_pretrained(
    "Scrymore/stone-preview-4b",
    dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
```

> **Note:** Requires `transformers >= 5.3` for `qwen3_5` model type support.

## Training Details

### Data

Each training record is a hydrated agent trace from a Claude-driven screen-building loop over 87 iOS apps (Mobbin corpus). Each trace yields three SFT variants:

- **trajectory** — full multi-turn agent loop (system → user(ref-image) → assistant tool calls → tool results → ...)
- **oneshot** — reference image → final TSX code in one turn
- **turn** — predict the next assistant message given history up to turn k

Final dataset (v3): 1,232 train / 66 val records, filtered to ≤23k tokens. Mean 16.1k tokens, mean 4.3 render iterations per trace.
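
To make the three variants concrete, the sketch below derives them from a toy trace. The message layout, the `final_code` argument, and the choice to emit one `turn` record per assistant message are assumptions for illustration; the actual data pipeline is not published here.

```python
def trajectory_variant(messages):
    """Full multi-turn loop, unchanged."""
    return list(messages)


def oneshot_variant(messages, final_code):
    """Reference image in, final TSX out, in a single assistant turn."""
    system = [m for m in messages if m["role"] == "system"][:1]
    first_user = next(m for m in messages if m["role"] == "user")
    return system + [first_user, {"role": "assistant", "content": final_code}]


def turn_variants(messages):
    """One (history, target) pair per assistant message at index k."""
    return [
        {"history": messages[:k], "target": m}
        for k, m in enumerate(messages)
        if m["role"] == "assistant"
    ]


# Toy trace with placeholder contents.
trace = [
    {"role": "system", "content": "You are a React Native (Expo) UI engineer."},
    {"role": "user", "content": "<reference image>"},
    {"role": "assistant", "content": "<tool_call>...</tool_call>"},
    {"role": "tool", "content": "<tool result>"},
    {"role": "assistant", "content": "Done."},
]

print(len(trajectory_variant(trace)), len(turn_variants(trace)))
```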

### Key Design Decisions

- **Inline XML tool calls** — ms-swift's encoder reads `message['content']` only, silently dropping the `tool_calls` field. Tool calls are rendered as XML directly in the content string so they land in the loss-bearing token region.
- **Visual feedback loop** — `Render` tool results include the rendered screenshot as an image, so the model learns to compare its output against the reference and iterate.
- **Loss scale weighting** — post-Render assistant turns weighted 2.5x to emphasize iteration behavior.
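
The first and third decisions can be sketched as two small preprocessing steps. This is an illustration under stated assumptions, not the training code: the simplified `tool_calls` layout (`name` plus an `arguments` dict), the `name` field on tool messages, the reading of "post-Render" as "immediately after a `Render` result", and the zero weight on non-assistant messages are all hypothetical.

```python
def render_tool_call(name, arguments):
    """Serialize one call in the inline XML layout from Tool Call Format."""
    lines = ["<tool_call>", f"<function={name}>"]
    for key, value in arguments.items():
        lines += [f"<parameter={key}>", str(value), "</parameter>"]
    lines += ["</function>", "</tool_call>"]
    return "\n".join(lines)


def inline_tool_calls(message):
    """Fold a structured tool_calls field into the content string."""
    parts = [message.get("content") or ""]
    for call in message.get("tool_calls", []):
        parts.append(render_tool_call(call["name"], call["arguments"]))
    return {"role": message["role"], "content": "\n".join(p for p in parts if p)}


def loss_weights(messages, render_weight=2.5):
    """Per-message loss weight: 0 for non-assistant turns, render_weight for an
    assistant turn directly after a Render result, 1 for other assistant turns."""
    weights = []
    for i, m in enumerate(messages):
        prev = messages[i - 1] if i > 0 else {}
        if m["role"] != "assistant":
            weights.append(0.0)
        elif prev.get("role") == "tool" and prev.get("name") == "Render":
            weights.append(render_weight)
        else:
            weights.append(1.0)
    return weights


msg = {
    "role": "assistant",
    "content": "Writing the screen.",
    "tool_calls": [{"name": "Write", "arguments": {"file_path": "app/screen.tsx"}}],
}
print(inline_tool_calls(msg)["content"])
```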

### Config

| | |
|---|---|
| **Framework** | ms-swift 4.2.1 |
| **LoRA** | r=32, alpha=64, all-linear targets |
| **Precision** | bfloat16 |
| **Attention** | FlashAttention 2 |
| **Max context** | 32,768 tokens |
| **Max pixels** | 602,112 (576 tokens/image) |
| **LR** | 5e-5, cosine schedule, 5% warmup |
| **Effective batch** | 8 (BS=1 x GA=2 x 4 GPUs) |
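
For readers who want to check the arithmetic, the sketch below reproduces the effective batch size and a cosine schedule with 5% linear warmup. The schedule shape (linear warmup, then cosine decay to zero) is an assumption modeled on common trainer defaults, and the number of epochs is not stated in this card, so the example covers a single pass over the 1,232 training records.

```python
import math


def lr_at(step, total_steps, peak_lr=5e-5, warmup_ratio=0.05):
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


effective_batch = 1 * 2 * 4  # per-device batch x gradient accumulation x GPUs
steps_per_epoch = math.ceil(1232 / effective_batch)  # optimizer steps per pass

print(effective_batch, steps_per_epoch)
print(lr_at(0, steps_per_epoch), lr_at(steps_per_epoch, steps_per_epoch))
```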

## GGUF Quantizations

| File | Size | Description |
|------|------|-------------|
| `stone-preview-4b-f16.gguf` | 8.4 GB | F16 (unquantized, half precision) |
| `stone-preview-4b-q4_k_m.gguf` | 2.7 GB | Q4_K_M quantized (5.13 BPW) |
| `stone-preview-4b-mmproj-f16.gguf` | 25.6 MB | Vision projector |

## Limitations

- Trained on iOS app screenshots only (87 apps from Mobbin). Android, web, and desktop UIs are untested.
- The app corpus skews toward consumer apps (social, fintech, health, travel). Enterprise/B2B UIs may produce lower-quality results.
- Works best when embedded in an agent loop with actual tool execution and visual feedback. Standalone generation without iterative rendering produces weaker results.
- Requires `transformers >= 5.3` for `qwen3_5` model type support.
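- When serving a LoRA fine-tune of this kind with vLLM, the adapter must be **merged** into the base weights before serving, because vLLM's LoRA loader silently drops vision-tower deltas. The weights published here are already merged (see Model Details).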
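
The agent-loop point above can be sketched as follows. This is a minimal outline with scripted stand-ins for the model and the tool runtime so that it runs on its own; `run_agent`, its callable arguments, and the `file_path` parameter on `Render` are hypothetical, and a real loop would return the rendered screenshot as an image rather than a text string.

```python
import re

CALL_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def run_agent(generate, execute_tool, messages, max_turns=10):
    """Alternate model turns and tool execution until no tool call is emitted.

    generate(messages) -> str and execute_tool(name, args) -> str are
    caller-supplied callables (hypothetical interfaces).
    """
    for _ in range(max_turns):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        calls = CALL_RE.findall(reply)
        if not calls:
            break  # no tool call in the reply, so treat the task as finished
        for name, body in calls:
            args = dict(PARAM_RE.findall(body))
            result = execute_tool(name, args)
            messages.append({"role": "tool", "name": name, "content": result})
    return messages


# Scripted stand-ins so the loop can run without a model or a renderer.
scripted = iter([
    "<tool_call>\n<function=Render>\n<parameter=file_path>\n"
    "app/screen.tsx\n</parameter>\n</function>\n</tool_call>",
    "The render matches the reference.",
])
history = run_agent(
    generate=lambda msgs: next(scripted),
    execute_tool=lambda name, args: f"{name} ok: {args['file_path']}",
    messages=[{"role": "user", "content": "Build the screen."}],
)
print([m["role"] for m in history])
```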

## Citation

```bibtex
@misc{stone-preview-4b-2026,
  title={Stone Preview 4B: A Multimodal Agent for Mobile UI Screen Building},
  author={Ejiro Pinnock},
  year={2026},
  url={https://huggingface.co/Scrymore/stone-preview-4b}
}
```