Bonsai Image

Prism ML Website  |  Whitepaper  |  Demo & Examples  |  Discord

bonsai-image-binary-4B-mlx-1bit

Binary weight (1-bit) text-to-image diffusion transformer deployment for Apple Silicon

0.93 GB transformer | 8.3× smaller than FP16 | 9.4 s / 512² on iPhone 17 Pro Max | 6 s / 512² on M4 Pro | runs on Mac, iPhone, iPad

Highlights

  • 0.93 GB diffusion transformer, down from 7.75 GB for the FP16 FLUX.2 Klein 4B transformer
  • Binary {−1, +1} transformer weights with FP16 group-wise scaling in the matrix-heavy transformer layers (Q/K/V projections, output projections, MLP weights)
  • 3.42 GB Apple Silicon deployment payload including the 4-bit text encoder and FP16 VAE. The text encoder is offloaded after prompt encode, so the denoising loop only keeps the compact transformer and VAE resident
  • 4-step FlowMatch-Euler sampler with guidance = 1.0 and shift = 3.0; no CFG and no negative prompts needed
  • MLX-native 1-bit format for Apple Silicon, the same kernel path as our 1-bit language-model releases
  • Cross-platform companion: also available as gemlite 1-bit for NVIDIA GPUs

Resources

  • Whitepaper: full benchmarks, kernels, and memory analysis
  • Demo repo: one-command setup for Mac / Linux / Windows
  • Discord: community and support
  • Kernels: MLX fork (Apple Silicon) · mlx-swift fork (iOS / macOS); upstream PRs pending

Model Overview

Item | Specification
Base architecture | FLUX.2 Klein 4B (MMDiT diffusion transformer)
Parameters | ~4.0B (transformer trunk)
Blocks | 25 MMDiT blocks: 5 double-stream + 20 single-stream
Sampler | FlowMatchEuler, 4 steps, guidance = 1.0, shift = 3.0
Text encoder | Qwen3-4B at 4-bit (≈ 2.28 GB on-device, offloaded after prompt encode)
VAE | Flux2 32-channel latent, tiled decode (128 px tiles)
Native resolution | 1024×1024 (also supports 512×512 and arbitrary multiples of 32)
Weight format | MLX 1-bit g128, binary values + FP16 group-wise scales
Transformer size | 0.93 GB (8.3× smaller than 7.75 GB FP16)
Total payload | 3.42 GB (4.7× smaller than the 15.97 GB FP16 transformer + text encoder + VAE)
1-bit coverage | All 100 matmul-heavy linears in the 25 MMDiT blocks
License | Apache 2.0

Binary Weight Representation: 1-bit g128

Each binary weight takes a value from {−1, +1}, with one shared FP16 scale per group of 128 weights:

w_i = scale_g * b_i,    b_i in {−1, +1}
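
The representation above can be illustrated with a minimal, standard-library-only sketch. It is not the MLX kernel or the on-disk layout used by this release. The helper names are hypothetical, and the choice of scale (mean absolute value of the group) is an assumption made for illustration, since the text does not say how scales are derived.

```python
import struct

GROUP_SIZE = 128  # g128: one shared FP16 scale per 128 weights


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_group(weights):
    """Binarize one group into packed sign bits plus one FP16 scale.

    Assumption (not from the model card): scale = mean |w| of the group.
    """
    assert len(weights) == GROUP_SIZE
    scale = to_fp16(sum(abs(w) for w in weights) / GROUP_SIZE)
    bits = 0
    for i, w in enumerate(weights):
        if w >= 0:  # bit 1 encodes +1, bit 0 encodes -1
            bits |= 1 << i
    return bits.to_bytes(GROUP_SIZE // 8, "little"), scale


def dequantize_group(packed: bytes, scale: float):
    """Reconstruct w_i = scale_g * b_i with b_i in {-1, +1}."""
    bits = int.from_bytes(packed, "little")
    return [scale if (bits >> i) & 1 else -scale for i in range(GROUP_SIZE)]


# Toy group alternating between 0.5 and -0.25
weights = [0.5 if i % 2 == 0 else -0.25 for i in range(GROUP_SIZE)]
packed, scale = quantize_group(weights)
restored = dequantize_group(packed, scale)
```

In this sketch each group occupies 16 bytes of sign bits plus 2 bytes for the scale. Every restored weight has the same magnitude as the scale and keeps only the sign of the original.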

Binary values carry exactly 1 bit of information per weight. With one FP16 scale per group of 128, the effective storage is

b_eff = 1 + 16/128 = 1.125 bits/weight

This gives an idealized 16 / 1.125 ≈ 14.2× reduction relative to FP16 for the binary transformer layers. A small set of precision-sensitive supporting tensors remains in FP16, so the final 1-bit Bonsai Image 4B diffusion transformer is 0.93 GB, an 8.3× reduction from the 7.75 GB FP16 FLUX.2 Klein 4B transformer.
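
As a quick check of the arithmetic, the following sketch recomputes the effective bits per weight, the idealized ratio for the binary layers, and the observed whole-transformer ratio from the reported sizes. The variable names are illustrative only.

```python
FP16_BITS = 16     # bits per FP16 value
GROUP_SIZE = 128   # weights sharing one scale (g128)

# One sign bit per weight plus the FP16 scale amortized over the group
bits_per_weight = 1 + FP16_BITS / GROUP_SIZE

# Idealized ratio, valid only for layers stored in the binary format
ideal_ratio = FP16_BITS / bits_per_weight

# Reported whole-transformer sizes in GB (FP16 baseline vs 1-bit release)
fp16_gb, bonsai_gb = 7.75, 0.93
observed_ratio = fp16_gb / bonsai_gb

print(f"{bits_per_weight:.3f} bits/weight")
print(f"ideal {ideal_ratio:.1f}x, observed {observed_ratio:.1f}x")
```

The gap between the two ratios reflects the supporting tensors that stay in FP16.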

The binary representation is applied to the matrix-heavy transformer layers, including Q / K / V projections, attention output projections, MLP linears, and the double-stream add-K / Q / V linears. Supporting tensors (less than 5% of the total parameters), such as modulation streams, embedders, and the final output norm and output projection, remain FP16 for image quality and stability.

Memory

Format | Transformer size | Reduction | Ratio
FP16 FLUX.2 Klein 4B | 7.75 GB | - | 1.0×
1-bit Bonsai Image 4B | 0.93 GB | 88.0% | 8.3×

Apple Silicon deployment:

Component | Size
MLX 1-bit diffusion transformer | 0.97 GB
Compressed text encoder | 2.28 GB
FP16 VAE | 0.17 GB
Total payload | 3.42 GB

At runtime, the text encoder is offloaded after prompt encoding. During the denoising loop, resident memory is therefore dominated by the compact binary diffusion transformer and the active image-generation components rather than by the full payload.
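
The effect of the offload on resident weights can be estimated with a short sketch. The component sizes are the ones listed in the deployment table; the dictionary keys and the helper function are hypothetical, and the estimate counts weights only, so activations and runtime buffers come on top.

```python
# Reported component sizes in GB, taken from the deployment table
PAYLOAD_GB = {
    "transformer_1bit": 0.97,
    "text_encoder_4bit": 2.28,
    "vae_fp16": 0.17,
}


def resident_weights_gb(payload, offloaded=()):
    """Sum the weights still held in memory, skipping offloaded components."""
    return sum(size for name, size in payload.items() if name not in offloaded)


total = resident_weights_gb(PAYLOAD_GB)
denoise = resident_weights_gb(PAYLOAD_GB, offloaded=("text_encoder_4bit",))

print(f"total payload: {total:.2f} GB")
print(f"weights resident while denoising: {denoise:.2f} GB")
```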

End-to-end mean-active memory pressure on the Mac M4 Pro at 1024² is 1.95 GB, a 7.4× reduction vs the stock FP16 MFLUX pipeline (14.39 GB).

Best Practices

  • Sampler: FlowMatchEuler-discrete with 4 steps, guidance = 1.0 (no classifier-free guidance), shift = 3.0. The model is designed for 4 steps; running more steps does not improve quality significantly and can introduce artifacts.
  • Resolution: native 1024² is the design target; 512² works for quick previews.
  • Aspect ratios: multiples of 32 are supported, including 832×1248 and 1248×832.
  • Prompting: natural-language prompts. Negative prompts are not required.
  • Runtime memory: the text encoder is offloaded after prompt encoding, so the denoising loop is memory-light.
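
The sampler settings above can be made concrete with a minimal sketch of a shifted flow-matching Euler loop. This is not the pipeline's scheduler code. The uniform base grid and the static shift formula sigma' = shift * sigma / (1 + (shift - 1) * sigma) are assumptions borrowed from common flow-matching schedulers, and the `velocity` callable is a stand-in for the diffusion transformer. With guidance = 1.0 there is a single conditional model call per step.

```python
def shifted_sigmas(num_steps: int = 4, shift: float = 3.0) -> list[float]:
    """Noise levels from 1.0 down to 0.0 after a static time shift.

    Assumption: uniform base grid, then
    sigma' = shift * sigma / (1 + (shift - 1) * sigma).
    """
    base = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in base]


def euler_sample(x, velocity, sigmas):
    """Integrate dx/dsigma = velocity(x, sigma) with explicit Euler steps."""
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        x = x + (sigma_next - sigma) * velocity(x, sigma)
    return x


# Toy 1-D check: on the straight path x = (1 - sigma) * data + sigma * noise
# the true velocity is the constant (noise - data), so Euler recovers `data`.
data, noise = -1.0, 2.0
sigmas = shifted_sigmas()
result = euler_sample(noise, lambda x, sigma: noise - data, sigmas)
```

Under these assumptions the four steps visit noise levels 1.0, 0.9, 0.75, and 0.5 before reaching 0.0, so a shift above 1 spends more of the step budget at high noise.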

Quickstart

MLX (Python)

The simplest path is the Bonsai Image Demo repo, which sets up the full Bonsai Studio (FastAPI backend + Next.js frontend):

git clone https://github.com/PrismML-Eng/Bonsai-Image-Demo.git
cd Bonsai-Image-Demo
./setup.sh
BONSAI_VARIANT=binary ./scripts/download_model.sh
BONSAI_VARIANT=binary ./scripts/serve.sh

For a one-shot render without the studio frontend:

BONSAI_VARIANT=binary ./scripts/generate.sh --prompt "A bonsai tree in a quiet ceramic studio, soft morning light"

MLX Swift (iOS / macOS)

Binary Bonsai Image 4B runs natively on iPhone and iPad via MLX Swift. Bonsai Studio for iPhone is available on the App Store; under the hood, it loads this model with the kernels in our mlx-swift fork.

Throughput (MLX / Apple Silicon)

Mac M4 Pro (48 GB unified memory), 4 denoising steps, fixed prompt and seed:

Resolution | s / step | s / image (mean ± std) | Speedup vs stock MFLUX FP16
512 × 512 | 1.50 | 6.01 ± 0.31 s | 3.03×
1024 × 1024 | 6.02 | 24.07 ± 0.03 s | 5.60×

iPhone 17 Pro Max (A19 Pro, 12 GB unified memory), MLX Swift, same methodology:

Resolution | s / step | s / image
128 × 128 | 0.68 | 2.7 s
256 × 256 | 0.95 | 3.8 s
512 × 512 | 2.35 | 9.4 s
1024 × 1024 | 8.15 | 32.6 s

Stock FP16 FLUX.2 Klein 4B does not fit within iPhone 17 Pro Max's 12 GB unified memory budget; Bonsai Image 4B models do.

Benchmarks

Evaluated with matched generation settings across the comparison set on H100. GenEval uses the official 512×512 protocol. For HPSv3 and DPG-Bench, larger-backbone rows are evaluated at 1024×1024, while smaller-backbone rows are evaluated at their native 512×512 setting. Higher is better for all three benchmarks.

Model | Transformer (GB) | GenEval | HPSv3 | DPG-Bench
Bonsai Image · Binary 4B | 0.93 | 0.671 | 11.15 | 0.822
Bonsai Image · Ternary 4B | 1.21 | 0.723 | 12.22 | 0.851
FLUX.2 Klein 4B | 7.75 | 0.819 | 12.84 | 0.853
FLUX.1-schnell | 23.8 | 0.716 | 12.67 | 0.848
SDXL | 5.14 | 0.300 | 10.05 | 0.740
PixArt-Σ XL 2 | 1.20 | 0.541 | 11.93 | 0.769
Stable Diffusion 1.5 | 1.72 | 0.396 | 4.20 | 0.601
BK-SDM-Small | 0.98 | 0.297 | 3.05 | 0.559

The benchmark results show the intended quality-footprint trade-off. 1-bit Bonsai Image 4B is the footprint-oriented variant: it reduces the diffusion transformer below 1 GB while retaining about 82% of FLUX.2 Klein 4B's GenEval score (0.671 vs 0.819), 87% of its HPSv3 score (11.15 vs 12.84), and 96% of its DPG-Bench score (0.822 vs 0.853). The ternary companion is the quality-oriented variant: at 1.21 GB it comes close to the original FLUX.2 Klein 4B on HPSv3 (12.22 vs 12.84) and DPG-Bench (0.851 vs 0.853), with a larger gap on GenEval (0.723 vs 0.819).

Together, the Bonsai Image variants move the quality-footprint frontier: they bring modern diffusion-transformer behavior into a memory range previously occupied by much smaller, lower-capability models.

Use Cases

  • Local creative tooling: image generation directly on Mac, iPhone, and iPad
  • Private generation: prompts and generated assets can remain local
  • Rapid iteration: lower local latency and no remote queue for iterative creative workflows
  • Mobile deployment: image generation on devices with unified-memory, thermal, and connectivity constraints
  • Commodity-GPU serving: lower transformer footprint and reduced memory pressure for serving on CUDA GPUs
  • Enterprise and controlled inference: local or private environments for data residency and compliance-sensitive workflows

Limitations

  • 1-bit Bonsai Image 4B is not bit-identical to the FP16 FLUX.2 Klein 4B model; it is a compact binary-weight deployment designed to deliver similar practical behavior at much smaller size.
  • Image-generation quality remains prompt- and workflow-dependent. Small text, fine details, object counts, and strict compositional constraints should be evaluated for the target use case.
  • Current commodity inference stacks do not yet expose fully native binary execution as a standard hardware path. This release uses practical MLX low-bit kernel paths on Apple Silicon and Gemlite low-bit GEMM on CUDA.
  • After the diffusion transformer is made compact, other components such as the VAE can become more visible memory bottlenecks. The runtime mitigates this with text-encoder offload and tiled VAE decoding.

Citation

@techreport{bonsaiimage4b,
    title   = {Bonsai Image 4B: Low-Bit Diffusion on Apple Silicon and Consumer GPUs},
    author  = {Prism ML},
    year    = {2026},
    month   = {May},
    url     = {https://prismml.com}
}

Contact

For questions, feedback, or collaboration inquiries: contact@prismml.com
