bonsai-image-binary-4B-mlx-1bit

Binary-weight (1-bit) text-to-image diffusion transformer deployment for Apple Silicon

0.93 GB transformer | 8.3× smaller than FP16 | 9.4 s / 512² on iPhone 17 Pro Max | 6 s / 512² on M4 Pro | runs on Mac, iPhone, iPad

Highlights

0.93 GB diffusion transformer, down from 7.75 GB for the FP16 FLUX.2 Klein 4B transformer
Binary {−1, +1} transformer weights with FP16 group-wise scaling in the matrix-heavy transformer layers (Q/K/V projections, output projections, MLP weights)
3.42 GB Apple Silicon deployment payload including the 4-bit text encoder and FP16 VAE — text encoder is offloaded after prompt encode, so the denoising loop only keeps the compact transformer and VAE resident
4-step FlowMatch-Euler sampler with guidance = 1.0 and shift = 3.0 — no CFG, no negative prompts needed
MLX-native 1-bit format for Apple Silicon, the same kernel path as our 1-bit language-model releases
Cross-platform companion: also available as gemlite 1-bit for NVIDIA GPUs

Resources

Whitepaper — full benchmarks, kernels, and memory analysis
Demo repo — one-command setup for Mac / Linux / Windows
Discord — community + support
Kernels: MLX fork (Apple Silicon) · mlx-swift fork (iOS / macOS) — upstream PRs pending

Model Overview

Item	Specification
Base architecture	FLUX.2 Klein 4B (MMDiT diffusion transformer)
Parameters	~4.0B (transformer trunk)
Blocks	25 MMDiT blocks: 5 double-stream + 20 single-stream
Sampler	FlowMatchEuler, 4 steps, guidance = 1.0, shift = 3.0
Text encoder	Qwen3-4B at 4-bit (≈ 2.28 GB on-device, offloaded after prompt encode)
VAE	Flux2 32-channel latent, tiled decode (128 px tiles)
Native resolution	1024×1024 (also supports 512×512 and arbitrary multiples of 32)
Weight format	MLX 1-bit g128, binary values + FP16 group-wise scales
Transformer size	0.93 GB (8.3× smaller than 7.75 GB FP16)
Total payload	3.42 GB (4.7× smaller than the 15.97 GB FP16 transformer + text encoder + VAE)
1-bit coverage	All 100 matmul-heavy linears in the 25 MMDiT blocks
License	Apache 2.0

Binary Weight Representation: 1-bit g128

Each binary weight takes a value from {−1, +1} with one shared FP16 scale per group of 128 weights:

w_i = scale_g * b_i,    b_i in {−1, +1}
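As a rough illustration of the formula above, the sketch below binarizes one group of 128 weights and reconstructs it. It rests on two assumptions that are not taken from this release: the scale is the mean absolute value of the group, and sign bits are packed least-significant-bit first. The function names are hypothetical; the shipped MLX format and the way its scales are obtained may differ.

```python
import random

GROUP_SIZE = 128  # g128: one shared scale per 128 weights


def quantize_group(weights):
    """Binarize one group into sign values plus a single shared scale.

    ASSUMPTION: scale = mean absolute value of the group. The release
    does not state how its scales are computed.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def dequantize_group(scale, signs):
    """Reconstruct w_i = scale_g * b_i for one group."""
    return [scale * b for b in signs]


def pack_signs(signs):
    """Pack +1/-1 values into bytes, one bit per weight (+1 -> 1, -1 -> 0).

    ASSUMPTION: LSB-first bit order, chosen only for illustration.
    """
    out = bytearray((len(signs) + 7) // 8)
    for i, b in enumerate(signs):
        if b > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE)]  # toy weights
scale, signs = quantize_group(group)
restored = dequantize_group(scale, signs)
packed = pack_signs(signs)

print(f"{len(packed)} bytes of sign bits + 2 bytes of FP16 scale per group")
```

Every reconstructed weight in a group has the same magnitude, so only the sign pattern and one scale need to be stored.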

Binary values carry exactly 1 bit of information per weight. With one FP16 scale per group of 128, the effective storage is

b_eff = 1 + 16/128 = 1.125 bits/weight

This gives an idealized 14.2× reduction relative to FP16 for the binary transformer layers. A small set of precision-sensitive supporting tensors remains in FP16, so the final 1-bit Bonsai Image 4B diffusion transformer is 0.93 GB, an 8.3× reduction from the 7.75 GB FP16 FLUX.2 Klein 4B transformer.
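The storage arithmetic can be checked in a few lines. This is only a worked calculation using the figures quoted in this card; the variable names are illustrative.

```python
FP16_BITS = 16
GROUP_SIZE = 128

# 1 sign bit per weight plus one FP16 scale amortized over the group.
bits_per_weight = 1 + FP16_BITS / GROUP_SIZE

# Idealized ratio if every tensor used the binary format.
ideal_ratio = FP16_BITS / bits_per_weight

# Sizes quoted in this card (GB).
fp16_transformer_gb = 7.75
bonsai_transformer_gb = 0.93

# Realized ratio, lower than the ideal because some tensors stay FP16.
actual_ratio = fp16_transformer_gb / bonsai_transformer_gb
reduction_pct = 100 * (1 - bonsai_transformer_gb / fp16_transformer_gb)

print(f"bits/weight: {bits_per_weight}")
print(f"ideal ratio: {ideal_ratio:.1f}x, actual ratio: {actual_ratio:.1f}x")
print(f"size reduction: {reduction_pct:.1f}%")
```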

The binary representation is applied to the matrix-heavy transformer layers, including Q / K / V projections, attention output projections, MLP linears, and the double-stream add-K / Q / V linears. Supporting tensors (less than 5% of the total parameters) such as modulation streams, embedders, the output norm, and the final output projection remain FP16 for image quality and stability.

Memory

Format	Transformer size	Reduction	Ratio
FP16 FLUX.2 Klein 4B	7.75 GB	—	1.0×
1-bit Bonsai Image 4B	0.93 GB	88.0%	8.3×

Apple Silicon deployment:

Component	Size
MLX 1-bit diffusion transformer	0.97 GB
Compressed text encoder	2.28 GB
FP16 VAE	0.17 GB
Total payload	3.42 GB

At runtime, the text encoder is offloaded after prompt encoding. The repeated denoising loop therefore keeps only the compact binary diffusion transformer and the VAE resident, rather than the full 3.42 GB payload.
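A small calculation makes the effect of the offload concrete. It uses only the component sizes from the table above; the resident figure counts weights only and ignores activations and framework overhead, so it is a lower bound rather than a measured memory number.

```python
# Component sizes from the Apple Silicon deployment table (GB).
components_gb = {
    "transformer_1bit": 0.97,
    "text_encoder_4bit": 2.28,
    "vae_fp16": 0.17,
}

total_payload_gb = sum(components_gb.values())

# Weights still resident once the text encoder has been offloaded.
# Excludes activations, latents, and runtime overhead.
resident_weights_gb = total_payload_gb - components_gb["text_encoder_4bit"]

print(f"total payload:         {total_payload_gb:.2f} GB")
print(f"resident in denoising: {resident_weights_gb:.2f} GB (weights only)")
```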

End-to-end Mac M4 Pro mean-active memory pressure at 1024² is 1.95 GB — a 7.4× reduction vs the stock FP16 MFLUX pipeline (14.39 GB).

Best Practices

Sampler: FlowMatchEuler-discrete with 4 steps, guidance = 1.0 (no classifier-free guidance), shift = 3.0. The model is designed for 4 steps; running more steps does not improve quality significantly and can introduce artifacts.
Resolution: native 1024² is the design target; 512² works for quick previews.
Aspect ratios: multiples of 32 are supported, including 832×1248 and 1248×832.
Prompting: natural-language prompts. Negative prompts are not required.
Runtime memory: the text encoder is offloaded after prompt encoding, so the denoising loop is memory-light.
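To show what the sampler settings above amount to, here is a minimal flow-matching Euler loop. It assumes uniformly spaced noise levels warped by the static time shift sigma' = shift * sigma / (1 + (shift - 1) * sigma), which is a common convention in flow-matching schedulers; whether this release uses exactly this schedule is an assumption. `shifted_sigmas`, `euler_sample`, and the toy velocity function are hypothetical names, and the real pipeline calls the diffusion transformer where the toy function is used.

```python
def shifted_sigmas(num_steps=4, shift=3.0):
    """Return num_steps + 1 noise levels running from 1.0 down to 0.0.

    ASSUMPTION: uniform spacing followed by the static time shift.
    """
    sigmas = []
    for i in range(num_steps + 1):
        s = 1.0 - i / num_steps
        sigmas.append(shift * s / (1.0 + (shift - 1.0) * s))
    return sigmas


def euler_sample(velocity_fn, x, sigmas):
    """Integrate dx/dsigma = v(x, sigma) with explicit Euler steps.

    With guidance = 1.0 there is a single model call per step
    (no separate unconditional pass).
    """
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma)
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


# Toy check: a constant velocity field (noise - target) carries the
# initial noise exactly onto the target once sigma reaches 0.
noise = [1.0, -2.0]
target = [0.5, 0.25]


def toy_velocity(x, sigma):
    return [n - t for n, t in zip(noise, target)]


sigmas = shifted_sigmas()
sample = euler_sample(toy_velocity, noise, sigmas)
print(sigmas)
print(sample)
```

Under these assumptions, 4 steps with shift = 3.0 give the noise levels 1.0, 0.9, 0.75, 0.5, 0.0, so the shift spends more of the step budget at high noise.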

Quickstart

MLX (Python)

The simplest path is the Bonsai Image Demo repo, which sets up the full Bonsai Studio (FastAPI backend + Next.js frontend):

git clone https://github.com/PrismML-Eng/Bonsai-Image-Demo.git
cd Bonsai-Image-Demo
./setup.sh
BONSAI_VARIANT=binary ./scripts/download_model.sh
BONSAI_VARIANT=binary ./scripts/serve.sh

For a one-shot render without the studio frontend:

BONSAI_VARIANT=binary ./scripts/generate.sh --prompt "A bonsai tree in a quiet ceramic studio, soft morning light"

MLX Swift (iOS / macOS)

Binary Bonsai Image 4B runs natively on iPhone and iPad via MLX Swift. Bonsai Studio for iPhone is available on the App Store; under the hood, it loads this model with the kernels in our mlx-swift fork.

Throughput (MLX / Apple Silicon)

Mac M4 Pro (48 GB unified memory), 4 denoising steps, fixed prompt and seed:

Resolution	s / step	s / image (mean ± std)	vs stock MFLUX FP16
512 × 512	1.50	6.01 ± 0.31 s	3.03×
1024 × 1024	6.02	24.07 ± 0.03 s	5.60×
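The table can be cross-checked with simple arithmetic. The sketch below multiplies s/step by the 4 denoising steps and derives the FP16 time implied by the reported speedup. The implied FP16 times are computed here from the table, not measurements reported in this card.

```python
STEPS = 4

# resolution: (s/step, mean s/image, speedup vs stock MFLUX FP16)
mac_m4_pro = {
    512: (1.50, 6.01, 3.03),
    1024: (6.02, 24.07, 5.60),
}

derived = {}
for res, (s_per_step, s_per_image, speedup) in mac_m4_pro.items():
    derived[res] = {
        # Time spent in the denoising loop alone.
        "denoise_s": STEPS * s_per_step,
        # FP16 time implied by the reported speedup (derived, not measured).
        "implied_fp16_s": s_per_image * speedup,
    }

for res, d in derived.items():
    print(f"{res}x{res}: denoise {d['denoise_s']:.2f} s, "
          f"implied FP16 {d['implied_fp16_s']:.1f} s")
```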

iPhone 17 Pro Max (A19 Pro, 12 GB unified memory), MLX Swift, same methodology:

Resolution	s / step	s / image
128 × 128	0.68	2.7 s
256 × 256	0.95	3.8 s
512 × 512	2.35	9.4 s
1024 × 1024	8.15	32.6 s
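One way to read the iPhone numbers is to normalize the per-step time by pixel count. The sketch below does that with the values from the table. Treating the trend as a sign of fixed per-step overhead at small resolutions is an interpretation, not a claim made by this card.

```python
STEPS = 4

# resolution: s/step on iPhone 17 Pro Max, from the table above.
iphone_s_per_step = {128: 0.68, 256: 0.95, 512: 2.35, 1024: 8.15}

s_per_megapixel = {}
for res, s_step in iphone_s_per_step.items():
    megapixels = res * res / 1e6
    s_per_megapixel[res] = s_step / megapixels

for res in sorted(iphone_s_per_step):
    total = STEPS * iphone_s_per_step[res]
    print(f"{res}x{res}: {total:.1f} s/image, "
          f"{s_per_megapixel[res]:.1f} s/step per megapixel")
```

The per-megapixel cost falls as resolution grows, so small previews are fast in absolute terms but less efficient per pixel.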

Stock FP16 FLUX.2 Klein 4B does not fit within iPhone 17 Pro Max's 12 GB unified memory budget; Bonsai Image 4B models do.

Benchmarks

All models in the comparison set were evaluated on an H100 with matched generation settings. GenEval uses the official 512×512 protocol. For HPSv3 and DPG-Bench, larger-backbone rows are evaluated at 1024×1024, while smaller-backbone rows are evaluated at their native 512×512 setting. Higher is better for all three benchmarks.

Model	Transformer (GB)	GenEval	HPSv3	DPG-Bench
Bonsai Image · Binary 4B	0.93	0.671	11.15	0.822
Bonsai Image · Ternary 4B	1.21	0.723	12.22	0.851
FLUX.2 Klein 4B	7.75	0.819	12.84	0.853
FLUX.1-schnell	23.8	0.716	12.67	0.848
SDXL	5.14	0.300	10.05	0.740
PixArt-Σ XL 2	1.20	0.541	11.93	0.769
Stable Diffusion 1.5	1.72	0.396	4.20	0.601
BK-SDM-Small	0.98	0.297	3.05	0.559

The benchmark results show the intended quality-footprint trade-off. 1-bit Bonsai Image 4B is the footprint-oriented variant: it reduces the diffusion transformer below 1 GB while still delivering strong GenEval, HPSv3, and DPG-Bench results. The ternary companion is the quality-oriented variant: it uses a slightly larger representation to reach visual quality and prompt fidelity very close to the original FLUX.2 Klein 4B model.
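The trade-off can be quantified directly from the table. The sketch below expresses each Bonsai variant's scores as a fraction of the FLUX.2 Klein 4B scores, next to the size ratio. The retention figures are derived here and are not reported in this card; the `retention` helper is an illustrative name.

```python
# model: (transformer GB, GenEval, HPSv3, DPG-Bench), from the table above.
rows = {
    "Bonsai Image Binary 4B": (0.93, 0.671, 11.15, 0.822),
    "Bonsai Image Ternary 4B": (1.21, 0.723, 12.22, 0.851),
    "FLUX.2 Klein 4B": (7.75, 0.819, 12.84, 0.853),
}
BASELINE = "FLUX.2 Klein 4B"


def retention(name):
    """Size ratio and per-benchmark score fraction relative to the baseline."""
    size, geneval, hps, dpg = rows[name]
    b_size, b_geneval, b_hps, b_dpg = rows[BASELINE]
    return {
        "size_ratio": b_size / size,
        "geneval": geneval / b_geneval,
        "hpsv3": hps / b_hps,
        "dpg": dpg / b_dpg,
    }


for name in rows:
    if name == BASELINE:
        continue
    r = retention(name)
    print(f"{name}: {r['size_ratio']:.1f}x smaller, "
          f"GenEval {r['geneval']:.1%}, HPSv3 {r['hpsv3']:.1%}, "
          f"DPG-Bench {r['dpg']:.1%} of baseline")
```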

Together, the Bonsai Image variants move the quality-footprint frontier: they bring modern diffusion-transformer behavior into a memory range previously occupied by much smaller, lower-capability models.

Use Cases

Local creative tooling: image generation directly on Mac, iPhone, and iPad
Private generation: prompts and generated assets can remain local
Rapid iteration: lower local latency and no remote queue for iterative creative workflows
Mobile deployment: image generation on devices with unified-memory, thermal, and connectivity constraints
Commodity-GPU serving: lower transformer footprint and reduced memory pressure for serving on CUDA GPUs
Enterprise and controlled inference: local or private environments for data residency and compliance-sensitive workflows

Limitations

1-bit Bonsai Image 4B is not bit-identical to the FP16 FLUX.2 Klein 4B model; it is a compact binary-weight deployment designed to deliver similar practical behavior at much smaller size.
Image-generation quality remains prompt- and workflow-dependent. Small text, fine details, object counts, and strict compositional constraints should be evaluated for the target use case.
Current commodity inference stacks do not yet expose fully native binary execution as a standard hardware path. This release uses practical MLX low-bit kernel paths on Apple Silicon and gemlite low-bit GEMM on CUDA.
After the diffusion transformer is made compact, other components such as the VAE can become more visible memory bottlenecks. The runtime mitigates this with text-encoder offload and tiled VAE decoding.
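To illustrate why tiled decoding bounds VAE memory, the sketch below enumerates the tile boxes for a given output size, so that only one tile's worth of activations needs to be live at a time. It assumes non-overlapping 128 px tiles in pixel space. Practical tiled decoders often overlap and blend tiles to hide seams, and the actual tiling used by this runtime is not described here. `tile_boxes` is a hypothetical helper.

```python
def tile_boxes(height, width, tile=128):
    """Yield (top, left, bottom, right) boxes that cover a height x width image.

    ASSUMPTION: non-overlapping tiles; edge tiles are clipped to the image.
    """
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            yield top, left, min(top + tile, height), min(left + tile, width)


def count_tiles(height, width, tile=128):
    """Number of decode passes needed for one image."""
    return sum(1 for _ in tile_boxes(height, width, tile))


for h, w in [(512, 512), (1024, 1024), (832, 1248)]:
    print(f"{h}x{w}: {count_tiles(h, w)} tiles of at most 128x128 px")
```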

Citation

@techreport{bonsaiimage4b,
    title   = {Bonsai Image 4B: Low-Bit Diffusion on Apple Silicon and Consumer GPUs},
    author  = {Prism ML},
    year    = {2026},
    month   = {May},
    url     = {https://prismml.com}
}