# micromodel-ship
Offline Apple Silicon inference bundle for Qwen3-4B with DFlash exact speculative decoding.
Source code: github.com/florianleibert/micromodels
This repository hosts the shippable offline tarball (`micromodel-ship-offline.tar.gz`), which is too large to live in the GitHub repo. The tarball contains the full runnable runtime, both model payloads (target + draft), and the helper scripts needed to serve a local OpenAI-compatible API.
## Contents

`micromodel-ship-offline.tar.gz` bundles:
- target model: `mlx-community/Qwen3-4B-bf16`
- DFlash draft: `z-lab/Qwen3-4B-DFlash-b16`
- MLX-based runtime with exact speculative decoding (see the sketch after this list)
- minimal OpenAI-compatible API server (`POST /v1/chat/completions`)
- run, chat, serve, and benchmark scripts
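
For orientation, here is a minimal, self-contained sketch of the draft-and-verify loop behind exact (greedy) speculative decoding. The `draft_next`/`target_next` functions are hypothetical toy stand-ins, not the bundle's API, and the real MLX runtime verifies all drafted positions in one batched target pass (the parallel-replay verifier mentioned under Performance) rather than one at a time.

```python
def draft_next(tokens):
    # Hypothetical cheap draft model: guesses the next token.
    return (tokens[-1] + 1) % 100


def target_next(tokens):
    # Hypothetical target model: the ground truth under greedy decoding.
    # It disagrees with the draft whenever the last token is divisible by 7.
    if tokens[-1] % 7 == 0:
        return (tokens[-1] + 2) % 100
    return (tokens[-1] + 1) % 100


def speculative_decode(prompt, max_new_tokens, k=4):
    """Produce exactly the tokens greedy decoding with target_next would,
    while checking up to k drafted tokens per target pass."""
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1. Draft k candidate tokens with the cheap model.
        drafted, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            drafted.append(t)
            ctx.append(t)
        # 2. Verify: replay the drafted positions with the target model.
        #    (A real runtime scores all k positions in one batched forward pass.)
        accepted, ctx = [], list(tokens)
        for t in drafted:
            expected = target_next(ctx)
            if t == expected:
                accepted.append(t)         # draft agreed with the target: keep it
                ctx.append(t)
            else:
                accepted.append(expected)  # mismatch: take the target's token, stop
                break
        take = accepted[: max_new_tokens - generated]
        tokens.extend(take)
        generated += len(take)
    return tokens


print(speculative_decode([1, 2, 3], max_new_tokens=16))
```

Because every emitted token is either a draft token the target agrees with or the target's own correction, the output is identical to plain greedy decoding with the target model; the speedup comes entirely from amortizing target passes over multiple accepted tokens.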
## Quick start

```bash
curl -L -o micromodel-ship-offline.tar.gz \
  https://huggingface.co/florianleibert/micromodel-ship/resolve/main/micromodel-ship-offline.tar.gz
tar -xzf micromodel-ship-offline.tar.gz
cd micromodel-ship
uv sync
./scripts/serve.sh
```
Health check:
```bash
curl http://127.0.0.1:8051/healthz
```
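
Once the server is up, you can exercise the chat completions endpoint directly. Below is a minimal sketch using only the Python standard library; the `model` value is an assumption, so substitute whatever name the server reports (check `scripts/serve.sh` or the startup log).

```python
# Minimal chat completions request against the local server.
# "qwen3-4b" is a guessed model name — replace it with the name your server expects.
import json
import urllib.request

payload = {
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://127.0.0.1:8051/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])
```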
## Performance
Measured on an Apple M5 Max (macOS 26.4) with the parallel-replay verifier and a 101-token prompt:
| Max new tokens | Runtime | Generation tok/s | End-to-end tok/s |
|---|---|---|---|
| 512 | Plain MLX-LM BF16 | 55.13 | 48.67 |
| 512 | DFlash BF16 | 190.73 | 186.89 |
| 1024 | Plain MLX-LM BF16 | 48.18 | 44.05 |
| 1024 | DFlash BF16 | 159.35 | 157.98 |
Observed at 512 new tokens: a 3.46x generation and 3.84x end-to-end speedup over plain MLX-LM. Full numbers are in the GitHub repo's PERFORMANCE.md.
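
For a rough local sanity check of end-to-end throughput without the bundled benchmark script, a sketch like the following works against the same endpoint. It assumes the server returns an OpenAI-style `usage` block; if it does not, count completion tokens yourself. This is not the methodology behind PERFORMANCE.md.

```python
# Rough end-to-end tok/s probe — a sanity check, not the PERFORMANCE.md methodology.
import json
import time
import urllib.request

payload = {
    "model": "qwen3-4b",  # guessed name — match what your server expects
    "messages": [{"role": "user", "content": "Write a short paragraph about the ocean."}],
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://127.0.0.1:8051/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
start = time.perf_counter()
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
elapsed = time.perf_counter() - start

# Assumes an OpenAI-style usage block; falls back gracefully if the server omits it.
n = body.get("usage", {}).get("completion_tokens", 0)
if n:
    print(f"{n} tokens in {elapsed:.2f}s -> {n / elapsed:.1f} tok/s end-to-end")
else:
    print(f"response in {elapsed:.2f}s (no usage block; count tokens manually)")
```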
## Requirements
- Apple Silicon (M-series)
- macOS
- Python with `uv`
## Links
- GitHub repo (source, issues, releases): florianleibert/micromodels