bonsai-image-binary-4B-gemlite-1bit

Binary weight (1-bit) text-to-image diffusion transformer deployment for NVIDIA GPUs

0.93 GB transformer | 8.3× smaller than FP16 | 4.5 s / 1024² on RTX 3080 | 2.7 s / 1024² on A100 | runs natively on Linux and Windows

Highlights

0.93 GB diffusion transformer, down from 7.75 GB for the FP16 FLUX.2 Klein 4B transformer
Binary {-1, +1} transformer weights with FP16 group-wise scaling in the matrix-heavy transformer layers (Q/K/V projections, output projections, MLP weights)
4.09 GB CUDA deployment payload including the 4-bit text encoder and FP16 VAE — text encoder is offloaded after prompt encode, so the denoising loop only keeps the compact transformer and VAE resident
4-step FlowMatch-Euler sampler with guidance = 1.0 and shift = 3.0 — no CFG, no negative prompts needed
Gemlite low-bit GEMM path for NVIDIA GPUs, with HQQ used for the compressed text encoder
Runs on Linux and Windows natively through the same CUDA / Gemlite deployment stack
Cross-platform companion: also available as MLX 1-bit for Apple Silicon

Resources

Whitepaper — full benchmarks, kernels, and memory analysis
Demo repo — one-command setup for Mac / Linux / Windows
Discord — community + support
Kernels: gemlite (fused low-bit GEMM) · HQQ (low-bit quantization runtime) · triton-windows (Windows path)

Model Overview

Item	Specification
Base architecture	FLUX.2 Klein 4B (MMDiT diffusion transformer)
Parameters	~4.0B (transformer trunk)
Blocks	25 MMDiT blocks: 5 double-stream + 20 single-stream
Sampler	FlowMatchEuler, 4 steps, guidance = 1.0, shift = 3.0
Text encoder	Qwen3-4B at 4-bit HQQ (≈ 2.84 GB CUDA payload, offloaded after prompt encode)
VAE	Flux2 32-channel latent, tiled decode (128 px tiles)
Native resolution	1024×1024 (also supports 512×512 and arbitrary multiples of 32)
Weight format	Gemlite INT1 pack, binary values + FP16 group-wise scales
Transformer size	0.93 GB model-level Bonsai representation; 1.08 GB CUDA packed deployment size
Total payload	4.09 GB CUDA deployment payload (transformer + 4-bit text encoder + FP16 VAE)
1-bit coverage	All 100 matmul-heavy linears in the 25 MMDiT blocks
Platforms	Linux x86_64 + Windows native on NVIDIA GPUs
License	Apache 2.0

Binary Weight Representation: 1-bit g128

Each binary weight takes a value from {−1, +1} with one shared FP16 scale per group of 128 weights:

w_i = scale_g * b_i,    b_i in {−1, +1}
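
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the release's quantizer: the function names are hypothetical, and taking each group's scale as the mean absolute weight is an assumption (it is the classic choice for sign quantization). The release stores scales in FP16, while this sketch keeps ordinary Python floats.

```python
import random

GROUP_SIZE = 128  # the "g128" group size described above


def quantize_binary(weights, group_size=GROUP_SIZE):
    """Return (signs, scales): signs in {-1, +1}, one scale per group.

    Assumption: scale_g = mean(|w|) over the group. The model card does not
    say how the released scales were obtained.
    """
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    signs, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scales.append(sum(abs(w) for w in group) / group_size)
        signs.extend(1 if w >= 0 else -1 for w in group)
    return signs, scales


def dequantize_binary(signs, scales, group_size=GROUP_SIZE):
    """Reconstruct w_i = scale_g * b_i for every weight."""
    return [scales[i // group_size] * b for i, b in enumerate(signs)]


# Toy data: two groups of small Gaussian weights.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(2 * GROUP_SIZE)]
signs, scales = quantize_binary(weights)
restored = dequantize_binary(signs, scales)
```

After the round trip, every restored weight in a group has the same magnitude (the group scale) and keeps the sign of the original weight.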

Binary values carry exactly 1 bit of information per weight. With one FP16 scale per group of 128, the effective storage is

b_eff = 1 + 16/128 = 1.125 bits/weight

This gives an idealized 14.2× reduction relative to FP16 for the binary transformer layers. A small set of precision-sensitive supporting tensors remains in FP16, so the final 1-bit Bonsai Image 4B diffusion transformer is 0.93 GB, an 8.3× reduction from the 7.75 GB FP16 FLUX.2 Klein 4B transformer.
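
The arithmetic behind these ratios can be checked with a short script. The sizes come from this card. The helper `mixed_size_gb` and the 5% FP16 fraction passed to it are illustrative assumptions, since the card only states that supporting tensors are "less than 5%" of the parameters.

```python
GROUP_SIZE = 128
SCALE_BITS = 16   # one FP16 scale per group
FP16_BITS = 16

# Effective storage per binary weight and the idealized ratio vs. FP16.
bits_per_weight = 1 + SCALE_BITS / GROUP_SIZE   # 1.125
ideal_ratio = FP16_BITS / bits_per_weight       # about 14.2

# Reported transformer sizes from this card.
FP16_SIZE_GB = 7.75
BONSAI_SIZE_GB = 0.93
actual_ratio = FP16_SIZE_GB / BONSAI_SIZE_GB    # about 8.3


def mixed_size_gb(fp16_size_gb, fp16_fraction):
    """Estimate size when a fraction of parameters stays in FP16.

    Hypothetical helper: ignores packing, alignment, and metadata overhead.
    """
    binary_part = fp16_size_gb * (1 - fp16_fraction) * bits_per_weight / FP16_BITS
    fp16_part = fp16_size_gb * fp16_fraction
    return binary_part + fp16_part


# Illustrative only: 5% is an assumed upper bound, not a published figure.
estimate_gb = mixed_size_gb(FP16_SIZE_GB, 0.05)
```

Under that assumed fraction the estimate lands near 0.9 GB, in the same range as the reported 0.93 GB. It also shows why the realized ratio falls well short of 14.2×: a few percent of FP16 tensors account for a large share of the final size.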

The binary representation is applied to the matrix-heavy transformer layers, including Q / K / V projections, output projections, MLP linears, and the double-stream add-K / Q / V linears. Supporting tensors (less than 5% of the total parameters), such as the modulation streams, embedders, output norm, and final output projection, remain in FP16 for image quality and stability.

The CUDA deployment uses a Gemlite INT1 packed format. The model-level Bonsai representation is 0.93 GB; the deployed CUDA pack is 1.08 GB on disk due to runtime packing and alignment overhead in the current Gemlite path.
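
To make the idea of a packed 1-bit format concrete, here is a generic bit-packing sketch that stores eight binary weights per byte. It is not Gemlite's actual layout, which this card does not describe: the bit order, the mapping of +1 to a set bit, and the function names are all assumptions made for illustration.

```python
def pack_signs(signs):
    """Pack values from {-1, +1} into bytes, 8 per byte, least significant bit first.

    Assumed mapping: +1 -> bit 1, -1 -> bit 0.
    """
    if len(signs) % 8 != 0:
        raise ValueError("number of signs must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(signs), 8):
        byte = 0
        for offset, sign in enumerate(signs[start:start + 8]):
            if sign == 1:
                byte |= 1 << offset
        packed.append(byte)
    return bytes(packed)


def unpack_signs(packed):
    """Inverse of pack_signs: expand each byte back into eight {-1, +1} values."""
    return [1 if (byte >> offset) & 1 else -1
            for byte in packed
            for offset in range(8)]


# One group of 128 binary weights occupies 16 bytes before its FP16 scale.
group = [1 if i % 3 == 0 else -1 for i in range(128)]
packed_group = pack_signs(group)
```

A real kernel-facing layout typically adds padding and reorders bits for efficient loads, which is consistent with the deployed pack being somewhat larger than the model-level representation.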

Memory

Format	Transformer size	Reduction	Ratio
FP16 FLUX.2 Klein 4B	7.75 GB	—	1.0×
1-bit Bonsai Image 4B	0.93 GB	88.0%	8.3×

CUDA deployment:

Component	Size
Gemlite INT1 diffusion transformer	1.08 GB
HQQ 4-bit text encoder	2.84 GB
FP16 VAE	0.17 GB
Total payload	4.09 GB

At runtime, the text encoder is offloaded after prompt encoding. During denoising, only the compact binary diffusion transformer, the VAE, and activation memory stay resident, rather than the full 4.09 GB payload.
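
As a quick check on the deployment table, the weight footprint before and after text-encoder offload can be tallied as follows. This counts weights only: activation memory and framework overhead are not included, so it is not a peak-memory estimate.

```python
# Component sizes in GB, copied from the CUDA deployment table above.
PAYLOAD_GB = {
    "transformer": 1.08,   # Gemlite INT1 diffusion transformer
    "text_encoder": 2.84,  # HQQ 4-bit text encoder
    "vae": 0.17,           # FP16 VAE
}

total_gb = sum(PAYLOAD_GB.values())

# Weights still resident once the text encoder has been offloaded.
resident_gb = total_gb - PAYLOAD_GB["text_encoder"]

print(f"total payload: {total_gb:.2f} GB")
print(f"resident weights during denoising: {resident_gb:.2f} GB")
```

The text encoder is the largest single component, so offloading it removes most of the weight payload from the denoising loop.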

Peak VRAM at 1024² on RTX 3080 is ~6.4 GiB end-to-end (transformer + VAE + activation memory).

Best Practices

Sampler: FlowMatchEuler (discrete) with 4 steps, guidance = 1.0, shift = 3.0. The model is designed for 4 steps; additional steps do not significantly improve quality and can introduce artifacts.
Resolution: native 1024² is the design target. 512² works for quick previews.
Aspect ratios: multiples of 32 are supported, including 832×1248 and 1248×832.
Prompting: natural-language prompts. Negative prompts are not required.
Runtime memory: the text encoder is offloaded after prompt encoding, so the denoising loop is memory-light.
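
The sampler settings above can be illustrated with a simplified flow-matching Euler loop. This is a sketch under stated assumptions, not the release's scheduler: it uses linearly spaced sigmas from 1 to 0 and the common static time-shift formula, and a toy constant velocity field stands in for the diffusion transformer. The function names are hypothetical.

```python
def shifted_sigmas(num_steps=4, shift=3.0):
    """Linearly spaced sigmas from 1 to 0 (num_steps + 1 points), then time-shifted.

    Assumed shift formula: sigma' = shift * sigma / (1 + (shift - 1) * sigma).
    """
    linear = [1 - i / num_steps for i in range(num_steps + 1)]
    return [shift * s / (1 + (shift - 1) * s) for s in linear]


def euler_sample(velocity_fn, x, sigmas):
    """Take one Euler step per sigma interval: x <- x + (sigma_next - sigma) * v."""
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        v = velocity_fn(x, sigma)
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the model: a straight path from noise to a known target,
# whose velocity (noise - target) is constant along the path.
target = [0.5, -1.0, 2.0]
noise = [1.0, 1.0, -1.0]


def toy_velocity(x, sigma):
    return [n - t for n, t in zip(noise, target)]


sigmas = shifted_sigmas()          # approximately [1.0, 0.9, 0.75, 0.5, 0.0]
result = euler_sample(toy_velocity, noise, sigmas)
```

With shift = 3.0 the early steps are small and the later steps large, so more of the 4-step budget is spent at high noise levels. Because the toy velocity is constant, the four Euler steps recover the target up to floating-point error; a real model's velocity changes with `x` and `sigma`.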

Quickstart

Bonsai Studio (Linux / Windows)

The simplest path is the Bonsai Image Demo repo, which sets up the full Bonsai Studio (FastAPI backend + Next.js frontend) and selects gemlite automatically on Linux / Windows:

git clone https://github.com/PrismML-Eng/Bonsai-Image-Demo.git
cd Bonsai-Image-Demo
./setup.sh
BONSAI_VARIANT=binary ./scripts/download_model.sh
BONSAI_VARIANT=binary ./scripts/serve.sh

On Windows (PowerShell):

Set-ExecutionPolicy -Scope CurrentUser RemoteSigned   # one-time
.\setup.ps1
$env:BONSAI_VARIANT = 'binary'
.\scripts\download_model.ps1
.\scripts\serve.ps1

Python API (backend_gpu)

For inference without the studio frontend:

from backend_gpu.server import build_pipeline

pipe = build_pipeline(model_id="prism-ml/bonsai-image-binary-4B-gemlite-1bit")
image = pipe(
    prompt="A bonsai tree in a quiet ceramic studio, soft morning light",
    num_inference_steps=4,
    guidance_scale=1.0,
    height=1024,
    width=1024,
).images[0]
image.save("bonsai.png")
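
Since this card lists multiples of 32 as the supported resolutions, it can help to normalize a requested size before passing `height` and `width` to the pipeline. The helper below is a hypothetical convenience, not part of `backend_gpu`.

```python
def snap_to_multiple(value, multiple=32):
    """Round value to the nearest multiple, never returning less than one multiple."""
    return max(multiple, round(value / multiple) * multiple)


def normalize_size(height, width, multiple=32):
    """Return (height, width) snapped to the supported grid."""
    return snap_to_multiple(height, multiple), snap_to_multiple(width, multiple)


# Example: a roughly 2:3 request snaps onto the supported grid.
height, width = normalize_size(830, 1250)
```

Here `normalize_size(830, 1250)` yields 832 and 1248, one of the aspect ratios named in Best Practices, while sizes that are already multiples of 32 pass through unchanged.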

Throughput (CUDA / gemlite)

Warmed wall-clock per image, 4 sampler steps, guidance = 1.0, same prompts as the Mac and iPhone measurements. Linux + locally built gemlite kernels except where noted.

Platform	512² (s)	1024² (s)	Notes
A100 (Colab)	1.0	2.7	Ampere datacenter (40 GB)
RTX PRO 6000 Blackwell (Colab)	1.0	1.8	NVIDIA Blackwell, 96 GB VRAM
RTX 3080 10 GB	1.5	4.5	Ampere consumer; 6.4 GiB peak VRAM at 1024²
RTX 3060 6 GB (laptop)	4.4	24.8	Ampere mobile; memory-bound at 1024²

The sub-2-bit pack is what keeps generation viable on commodity GPUs at 1024² — the consumer RTX 3080 reaches 4.5 s/image while the 6 GB laptop 3060 is the slow tail (memory-pressure limited).
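
A warmed wall-clock measurement of the kind reported above can be approximated with a small timing harness. This is a generic sketch, not the script behind these numbers: `time_warmed` is a hypothetical helper, the warmup and run counts are arbitrary, and `fake_generate` is a placeholder for a real pipeline call.

```python
import statistics
import time


def time_warmed(generate, warmup=1, runs=3):
    """Return the median wall-clock seconds of generate() after warmup calls."""
    for _ in range(warmup):
        generate()  # untimed: absorbs kernel compilation, caching, allocation
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fake_generate():
    """Placeholder workload standing in for one image generation."""
    time.sleep(0.01)


seconds_per_image = time_warmed(fake_generate)
```

When timing a real GPU pipeline, the callable should only return once the image is actually available on the host. If it can return while GPU work is still queued, an explicit synchronization (for example `torch.cuda.synchronize()` in PyTorch) is needed before stopping the clock.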

Benchmarks

Evaluated on H100 with matched generation settings across the comparison set. GenEval uses the official 512×512 protocol. For HPSv3 and DPG-Bench, larger-backbone rows are evaluated at 1024×1024, while smaller-backbone rows are evaluated at their native 512×512 setting. Higher is better for all three benchmarks.

Model	Transformer (GB)	GenEval	HPSv3	DPG-Bench
Bonsai Image · Binary 4B	0.93	0.671	11.15	0.822
Bonsai Image · Ternary 4B	1.21	0.723	12.22	0.851
FLUX.2 Klein 4B	7.75	0.819	12.84	0.853
FLUX.1-schnell	23.8	0.716	12.67	0.848
SDXL	5.14	0.300	10.05	0.740
PixArt-Σ XL 2	1.20	0.541	11.93	0.769
Stable Diffusion 1.5	1.72	0.396	4.20	0.601
BK-SDM-Small	0.98	0.297	3.05	0.559

The benchmark results show the intended quality-footprint trade-off. 1-bit Bonsai Image 4B is the footprint-oriented variant: it reduces the diffusion transformer below 1 GB while scoring 0.671 on GenEval, 11.15 on HPSv3, and 0.822 on DPG-Bench, against 0.819, 12.84, and 0.853 for the FP16 FLUX.2 Klein 4B model. The ternary companion is the quality-oriented variant: at 1.21 GB it comes closer to the original FLUX.2 Klein 4B on all three benchmarks and nearly matches it on DPG-Bench (0.851 vs. 0.853).

Together, the Bonsai Image variants move the quality-footprint frontier: they bring modern diffusion-transformer behavior into a memory range previously occupied by much smaller, lower-capability models.
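
One way to read the table is to express each Bonsai score as a fraction of the FP16 FLUX.2 Klein 4B score. The script below does that using only numbers from the table. Treating the ratio as "retention" is a convenience assumed here, not a metric defined by the benchmarks, and it is especially rough for HPSv3, whose scale is not obviously ratio-like.

```python
# FP16 FLUX.2 Klein 4B scores from the table above.
BASELINE = {"GenEval": 0.819, "HPSv3": 12.84, "DPG-Bench": 0.853}

# (transformer size in GB, scores) for the two Bonsai variants.
VARIANTS = {
    "Binary 4B": (0.93, {"GenEval": 0.671, "HPSv3": 11.15, "DPG-Bench": 0.822}),
    "Ternary 4B": (1.21, {"GenEval": 0.723, "HPSv3": 12.22, "DPG-Bench": 0.851}),
}


def retention(scores, baseline=BASELINE):
    """Return each score divided by the corresponding baseline score."""
    return {name: scores[name] / baseline[name] for name in baseline}


report = {name: retention(scores) for name, (_, scores) in VARIANTS.items()}

for name, (size_gb, _) in VARIANTS.items():
    cells = ", ".join(f"{k} {v:.1%}" for k, v in report[name].items())
    print(f"{name} ({size_gb} GB): {cells}")
```

Read this way, the binary variant keeps roughly 82% of the baseline GenEval score and about 96% of its DPG-Bench score, while the ternary variant keeps roughly 88% and nearly 100% respectively.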

Use Cases

Local creative tooling: image generation directly on CUDA-equipped workstations and consumer GPUs
Private generation: prompts and generated assets can remain in local or controlled environments
Rapid iteration: lower local latency and no remote queue for iterative creative workflows
Commodity-GPU serving: lower transformer footprint and reduced memory pressure for serving on NVIDIA GPUs
Windows and Linux deployment: native paths through the same Gemlite deployment stack
Enterprise and controlled inference: local or private environments for data residency and compliance-sensitive workflows

Limitations

1-bit Bonsai Image 4B is not bit-identical to the FP16 FLUX.2 Klein 4B model; it is a compact binary-weight deployment designed to deliver similar practical behavior at much smaller size.
Image-generation quality remains prompt- and workflow-dependent. Small text, fine details, object counts, and strict compositional constraints should be evaluated for the target use case.
Current commodity inference stacks do not yet expose fully native binary execution as a standard hardware path. This release uses practical Gemlite low-bit GEMM kernels on CUDA.
After the diffusion transformer is made compact, other components such as the VAE can become more visible memory bottlenecks. The runtime mitigates this with text-encoder offload and tiled VAE decoding.
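
The tiled VAE decoding mentioned above bounds decoder memory by processing the image in fixed-size pieces. The sketch below only enumerates tile coordinates for the 128 px tile size listed in the Model Overview. It assumes non-overlapping tiles measured in output pixels and omits any overlap or seam blending the real runtime may use; `tile_boxes` is a hypothetical helper.

```python
def tile_boxes(height, width, tile=128):
    """Return (top, left, bottom, right) boxes that cover the image without overlap."""
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            bottom = min(top + tile, height)   # clip the last row of tiles
            right = min(left + tile, width)    # clip the last column of tiles
            boxes.append((top, left, bottom, right))
    return boxes


# A 1024 x 1024 image splits into an 8 x 8 grid of 128 px tiles.
boxes = tile_boxes(1024, 1024)
```

Decoding one tile at a time means the decoder's activation memory scales with the tile size rather than the full image, at the cost of more decoder calls per image.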

Citation

@techreport{bonsaiimage4b,
    title   = {Bonsai Image 4B: Low-Bit Diffusion on Apple Silicon and Consumer GPUs},
    author  = {Prism ML},
    year    = {2026},
    month   = {May},
    url     = {https://prismml.com}
}