---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
tags:
- vision
- vlm
- agent
- mobile-ui
- react-native
- tool-use
- qwen3.5
- lora
- fine-tuned
pipeline_tag: image-text-to-text
language:
- en
---

# Stone Preview 4B: Multimodal UI Agent

A fine-tuned vision-language model that acts as a **React Native UI engineer**. Given a reference mobile app screenshot, it builds the matching screen by emitting tool calls (`Read`, `Write`, `Edit`, `Glob`, `Bash`, `Render`) and iterating on visual feedback until the render matches the reference.

<p align="center">
<img src="assets/showcase.png" alt="Sample outputs from Stone Preview 4B" width="800">
<br>
<em>Screens built by the model from reference screenshots: Nike, How We Feel, Vivid</em>
</p>

## Model Details

| | |
|---|---|
| **Base model** | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) (VLM with DeltaNet hybrid attention) |
| **Architecture** | `Qwen3_5ForConditionalGeneration`: 32 layers, 2560 hidden, 16 attention heads |
| **Parameters** | ~4B (bfloat16) |
| **Fine-tuning** | LoRA (r=32, alpha=64, all-linear targets) via ms-swift 4.2.1 |
| **Training data** | 1,232 agent traces (v3), each a multi-turn tool-calling loop with visual feedback |
| **Context** | 32,768 tokens max |
| **Hardware** | 4x H200 SXM |
| **Format** | Merged safetensors (LoRA folded into base weights) |

## What the Model Does

The model has learned two core behaviors:

1. **Visual reasoning**: analyze a reference screenshot to identify layout, UI components, colors, spacing, and hierarchy
2. **Tool-call emission**: given the conversation history, emit properly formatted XML tool calls to read files, write code, render the result, and iterate

Each training example is a hydrated agent trace: the model sees a reference screenshot, explores the project structure, writes React Native/Expo code, renders it, compares against the reference, and iterates until the output matches.

### Tool Call Format

The model emits tool calls as inline XML in its responses:

```xml
<tool_call>
<function=Write>
<parameter=file_path>
app/(flows)/flow-001/screen-001.tsx
</parameter>
<parameter=content>
import React from 'react';
import { View, Text, StyleSheet } from 'react-native';
// ... component code
</parameter>
</function>
</tool_call>
```

Available tools: `Read`, `Write`, `Edit`, `Glob`, `Grep`, `Bash`, `Render`, `ToolSearch`
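
If you consume raw completions rather than relying on a server-side tool-call parser, the inline XML shown above has to be extracted by hand. The following is a minimal sketch of one way to do that with regular expressions; `parse_tool_calls` and both patterns are illustrative names, not part of the model or any library, and the approach assumes parameter values never contain a literal `</parameter>` tag.

```python
import re
from typing import Dict, List, Tuple

# Patterns mirror the tag layout of the tool-call example in this README.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>\n?(.*?)\n?</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> List[Tuple[str, Dict[str, str]]]:
    """Return (tool_name, {parameter: value}) for each tool call found in text."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        # One leading and one trailing newline around each value are dropped.
        params = {key: value for key, value in PARAM_RE.findall(body)}
        calls.append((name, params))
    return calls


if __name__ == "__main__":
    sample = (
        "<tool_call>\n<function=Write>\n"
        "<parameter=file_path>\napp/screen.tsx\n</parameter>\n"
        "<parameter=content>\nimport React from 'react';\n</parameter>\n"
        "</function>\n</tool_call>"
    )
    for tool_name, tool_params in parse_tool_calls(sample):
        print(tool_name, sorted(tool_params))
```

Text outside the `<tool_call>` tags is ignored, so any prose the model writes before or after a call is left for the caller to handle.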

## Usage

### With vLLM (recommended)

```bash
vllm serve Scrymore/stone-preview-4b \
  --dtype bfloat16 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml
```

```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Encode reference screenshot
with open("reference.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Scrymore/stone-preview-4b",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a React Native (Expo) UI engineer. Given a reference "
                "mobile-app screenshot, you build the matching screen by editing "
                "the project at WORKDIR. You have Read, Write, Edit, Glob, Grep, "
                "and Bash tools. Iterate by rendering and visually comparing to "
                "the reference, then stop when the render matches."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                {"type": "text", "text": (
                    "Reference screenshot above.\n"
                    "WORKDIR: /workspace/my-app\n"
                    "Target file: app/screen.tsx\n\n"
                    "Build the screen that matches the reference."
                )},
            ],
        },
    ],
    tools=[...],  # your tool schemas
)
```
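
The `tools` argument in the request above is left to the caller. As a rough illustration only, here is what two entries could look like in the OpenAI function-calling schema. The `Write` parameters (`file_path`, `content`) follow the tool-call example in this README; the `Read` parameter name, all descriptions, and the `make_tool` helper are assumptions, not the schemas used in training.

```python
import json
from typing import Any, Dict, List


def make_tool(name: str, description: str, properties: Dict[str, Any],
              required: List[str]) -> Dict[str, Any]:
    """Build one tool entry in the OpenAI chat-completions `tools` format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical schemas: names match the README, descriptions are placeholders.
tools = [
    make_tool(
        "Write",
        "Create or overwrite a file in the project.",
        {
            "file_path": {"type": "string", "description": "Path relative to WORKDIR."},
            "content": {"type": "string", "description": "Full file contents."},
        },
        ["file_path", "content"],
    ),
    make_tool(
        "Read",
        "Return the contents of a file in the project.",
        {"file_path": {"type": "string", "description": "Path relative to WORKDIR."}},
        ["file_path"],
    ),
]

if __name__ == "__main__":
    print(json.dumps(tools, indent=2))
```

The remaining tools would need their own entries, with parameter names matched to whatever your agent harness actually executes.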

### With transformers

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# The auto class resolves to the architecture declared in the checkpoint's
# config (`Qwen3_5ForConditionalGeneration`).
model = AutoModelForImageTextToText.from_pretrained(
    "Scrymore/stone-preview-4b",
    torch_dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Scrymore/stone-preview-4b")
```

> **Note:** Requires `transformers >= 5.3` for `qwen3_5` model type support.

## Training Details

### Data

Each training record is a hydrated agent trace from a Claude-driven screen-building loop over 87 iOS apps (Mobbin corpus). Each trace yields three SFT variants:

- **trajectory**: full multi-turn agent loop (system → user(ref-image) → assistant tool calls → tool results → ...)
- **oneshot**: reference image → final TSX code in one turn
- **turn**: predict the next assistant message given history up to turn k

Final dataset (v3): 1,232 train / 66 val records, filtered to ≤23k tokens. Mean 16.1k tokens, mean 4.3 render iterations per trace.
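
To make the three variants concrete, here is a minimal sketch of how they could be derived from a single trace. The message layout (a list of `{"role", "content"}` dicts) and the function names are assumptions for illustration; the actual data pipeline is not described in this README.

```python
from typing import Dict, List

Message = Dict[str, str]


def trajectory_variant(messages: List[Message]) -> List[Message]:
    """Full multi-turn loop, unchanged."""
    return list(messages)


def oneshot_variant(messages: List[Message], final_code: str) -> List[Message]:
    """Everything up to the first user turn, then the final code as the only reply."""
    first_user = next(i for i, m in enumerate(messages) if m["role"] == "user")
    return messages[: first_user + 1] + [{"role": "assistant", "content": final_code}]


def turn_variants(messages: List[Message]) -> List[List[Message]]:
    """One sample per assistant turn k: history up to k, with turn k as the target."""
    return [messages[: k + 1] for k, m in enumerate(messages) if m["role"] == "assistant"]


if __name__ == "__main__":
    trace = [
        {"role": "system", "content": "You are a React Native (Expo) UI engineer."},
        {"role": "user", "content": "<reference image> Build the screen."},
        {"role": "assistant", "content": "<tool_call>...</tool_call>"},
        {"role": "tool", "content": "<render result>"},
        {"role": "assistant", "content": "The render matches the reference."},
    ]
    print(len(trajectory_variant(trace)))
    print(len(oneshot_variant(trace, "export default function Screen() {}")))
    print([len(sample) for sample in turn_variants(trace)])
```

In each turn sample the last message is the prediction target and everything before it is context.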

### Key Design Decisions

- **Inline XML tool calls**: ms-swift's encoder reads `message['content']` only, silently dropping the `tool_calls` field. Tool calls are rendered as XML directly in the content string so they land in the loss-bearing token region.
- **Visual feedback loop**: `Render` tool results include the rendered screenshot as an image, so the model learns to compare its output against the reference and iterate.
- **Loss scale weighting**: post-Render assistant turns are weighted 2.5x to emphasize iteration behavior.
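
The first and third decisions can be sketched in a few lines. This is an illustration under stated assumptions, not the project's preprocessing code: the `tool_calls` layout (a list of `{"name", "arguments"}` dicts), the `name` field on tool messages, and all function names are hypothetical.

```python
from typing import Any, Dict, List


def render_tool_call(name: str, arguments: Dict[str, Any]) -> str:
    """Serialize one tool call in the inline XML layout used by the model."""
    parts = ["<tool_call>", f"<function={name}>"]
    for key, value in arguments.items():
        parts += [f"<parameter={key}>", str(value), "</parameter>"]
    parts += ["</function>", "</tool_call>"]
    return "\n".join(parts)


def inline_tool_calls(message: Dict[str, Any]) -> Dict[str, str]:
    """Fold `tool_calls` into `content` so a content-only encoder still sees them."""
    rendered = [render_tool_call(c["name"], c["arguments"])
                for c in message.get("tool_calls") or []]
    content = "\n".join([message.get("content") or ""] + rendered).strip()
    return {"role": message["role"], "content": content}


def assistant_loss_weights(messages: List[Dict[str, Any]],
                           post_render_weight: float = 2.5) -> List[float]:
    """One weight per assistant turn; turns right after a Render result get more."""
    weights, after_render = [], False
    for message in messages:
        if message["role"] == "assistant":
            weights.append(post_render_weight if after_render else 1.0)
            after_render = False
        elif message["role"] == "tool":
            after_render = message.get("name") == "Render"
    return weights


if __name__ == "__main__":
    msg = {"role": "assistant", "content": "Writing the screen.",
           "tool_calls": [{"name": "Write", "arguments": {"file_path": "app/screen.tsx"}}]}
    print(inline_tool_calls(msg)["content"])
```

How the per-turn weights are passed to the trainer depends on the framework's loss-scale mechanism, which is not shown here.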

### Config

| | |
|---|---|
| **Framework** | ms-swift 4.2.1 |
| **LoRA** | r=32, alpha=64, all-linear targets |
| **Precision** | bfloat16 |
| **Attention** | FlashAttention 2 |
| **Max context** | 32,768 tokens |
| **Max pixels** | 602,112 (576 tokens/image) |
| **LR** | 5e-5, cosine schedule, 5% warmup |
| **Effective batch** | 8 (BS=1 x GA=2 x 4 GPUs) |

## GGUF Quantizations

| File | Size | Description |
|------|------|-------------|
| `stone-preview-4b-f16.gguf` | 8.4 GB | F16 (unquantized) |
| `stone-preview-4b-q4_k_m.gguf` | 2.7 GB | Q4_K_M quantized (5.13 BPW) |
| `stone-preview-4b-mmproj-f16.gguf` | 25.6 MB | Vision projector |
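
As a quick sanity check on the table, file size and bits per weight (BPW) together imply a parameter count. The sketch below assumes 1 GB = 10^9 bytes and ignores GGUF metadata overhead; `implied_params_billion` is an illustrative helper, not part of any tool.

```python
def implied_params_billion(file_size_gb: float, bits_per_weight: float) -> float:
    """Parameter count (in billions) implied by a file size and an average BPW."""
    return file_size_gb * 8 / bits_per_weight


if __name__ == "__main__":
    f16 = implied_params_billion(8.4, 16.0)     # F16 stores 16 bits per weight
    q4_k_m = implied_params_billion(2.7, 5.13)  # BPW as listed in the table
    print(f"F16: {f16:.2f}B, Q4_K_M: {q4_k_m:.2f}B")
```

Both files come out at roughly 4.2 billion weights, in line with the ~4B parameter figure stated under Model Details.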

## Limitations

- Trained on iOS app screenshots only (87 apps from Mobbin). Android, web, and desktop UIs are untested.
- The app corpus skews toward consumer apps (social, fintech, health, travel). Enterprise/B2B UIs may produce lower-quality results.
- Works best when embedded in an agent loop with actual tool execution and visual feedback. Standalone generation without iterative rendering produces weaker results.
- Requires `transformers >= 5.3` for `qwen3_5` model type support.
- When serving with vLLM, the LoRA must be **merged** into base weights before serving. vLLM's LoRA loader silently drops vision-tower deltas.
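
Since the model depends on real tool execution, a harness has to run each emitted call and feed the result back. Below is a minimal sketch of an executor for three of the file tools, confined to the project directory. The parameter names for `Read` (`file_path`) and `Glob` (`pattern`), the return strings, and both function names are assumptions; `Edit`, `Grep`, `Bash`, and `Render` are omitted.

```python
from pathlib import Path
from typing import Dict


def resolve_in_workdir(workdir: Path, relative: str) -> Path:
    """Resolve a path and refuse anything that escapes the working directory."""
    root = workdir.resolve()
    target = (root / relative).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"Path escapes WORKDIR: {relative}")
    return target


def execute_tool(workdir: Path, name: str, params: Dict[str, str]) -> str:
    """Run one tool call and return the text to send back as the tool result."""
    if name == "Write":
        target = resolve_in_workdir(workdir, params["file_path"])
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(params["content"], encoding="utf-8")
        return f"Wrote {len(params['content'])} characters to {params['file_path']}"
    if name == "Read":
        return resolve_in_workdir(workdir, params["file_path"]).read_text(encoding="utf-8")
    if name == "Glob":
        root = workdir.resolve()
        matches = sorted(p.relative_to(root).as_posix() for p in root.glob(params["pattern"]))
        return "\n".join(matches)
    return f"Unsupported tool: {name}"


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        print(execute_tool(workdir, "Write", {"file_path": "app/screen.tsx", "content": "// TODO"}))
        print(execute_tool(workdir, "Glob", {"pattern": "app/*.tsx"}))
```

A full loop would alternate between a model request and `execute_tool` until the model stops emitting calls, with the `Render` result returned as an image so the model can compare it against the reference.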

## Citation

```bibtex
@misc{stone-preview-4b-2026,
  title={Stone Preview 4B: A Multimodal Agent for Mobile UI Screen Building},
  author={Ejiro Pinnock},
  year={2026},
  url={https://huggingface.co/Scrymore/stone-preview-4b}
}
```