How to use prism-ml/bonsai-image-binary-4B-gemlite-1bit with Diffusers:
pip install -U diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline
# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("prism-ml/bonsai-image-binary-4B-gemlite-1bit", dtype=torch.bfloat16, device_map="cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]
Prism ML Website | Whitepaper | Demo & Examples | Discord
bonsai-image-binary-4B-gemlite-1bit
Binary weight (1-bit) text-to-image diffusion transformer deployment for NVIDIA GPUs
0.93 GB transformer | 8.3× smaller than FP16 | 4.5 s / 1024² on RTX 3080 | 2.7 s / 1024² on A100 | runs natively on Linux and Windows
Highlights
- 0.93 GB diffusion transformer, down from 7.75 GB for the FP16 FLUX.2 Klein 4B transformer
- Binary {-1, +1} transformer weights with FP16 group-wise scaling in the matrix-heavy transformer layers (Q/K/V projections, output projections, MLP weights)
- 4.09 GB CUDA deployment payload including the 4-bit text encoder and FP16 VAE — text encoder is offloaded after prompt encode, so the denoising loop only keeps the compact transformer and VAE resident
- 4-step FlowMatch-Euler sampler with guidance = 1.0 and shift = 3.0 — no CFG, no negative prompts needed
- Gemlite low-bit GEMM path for NVIDIA GPUs, with HQQ used for the compressed text encoder
- Runs on Linux and Windows natively through the same CUDA / Gemlite deployment stack
- Cross-platform companion: also available as MLX 1-bit for Apple Silicon
Resources
- Whitepaper — full benchmarks, kernels, and memory analysis
- Demo repo — one-command setup for Mac / Linux / Windows
- Discord — community + support
- Kernels: gemlite (fused low-bit GEMM) · HQQ (low-bit quantization runtime) · triton-windows (Windows path)
Model Overview
| Item | Specification |
|---|---|
| Base architecture | FLUX.2 Klein 4B (MMDiT diffusion transformer) |
| Parameters | ~4.0B (transformer trunk) |
| Blocks | 25 MMDiT blocks: 5 double-stream + 20 single-stream |
| Sampler | FlowMatchEuler, 4 steps, guidance = 1.0, shift = 3.0 |
| Text encoder | Qwen3-4B at 4-bit HQQ (≈ 2.84 GB CUDA payload, offloaded after prompt encode) |
| VAE | Flux2 32-channel latent, tiled decode (128 px tiles) |
| Native resolution | 1024×1024 (also supports 512×512 and arbitrary multiples of 32) |
| Weight format | Gemlite INT1 pack, binary values + FP16 group-wise scales |
| Transformer size | 0.93 GB model-level Bonsai representation; 1.08 GB CUDA packed deployment size |
| Total payload | 4.09 GB CUDA deployment payload (transformer + 4-bit text encoder + FP16 VAE) |
| 1-bit coverage | All 100 matmul-heavy linears in the 25 MMDiT blocks |
| Platforms | Linux x86_64 + Windows native on NVIDIA GPUs |
| License | Apache 2.0 |
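The sampler row above can be illustrated with a small noise-schedule sketch. It assumes the static time-shift formula commonly used by flow-matching Euler schedulers and linearly spaced pre-shift sigmas; the release may build its schedule differently, and the function names here are hypothetical.

```python
def shifted_sigmas(num_steps: int = 4, shift: float = 3.0) -> list[float]:
    """Return num_steps + 1 noise levels, from 1.0 down to a terminal 0.0."""
    # Assumption: pre-shift sigmas are linearly spaced from 1 to 1/num_steps.
    base = [(num_steps - i) / num_steps for i in range(num_steps)]
    # Static time shift: a larger `shift` keeps the steps at higher noise levels.
    shifted = [shift * s / (1.0 + (shift - 1.0) * s) for s in base]
    return shifted + [0.0]


def euler_step(x: float, velocity: float, sigma: float, sigma_next: float) -> float:
    """One flow-matching Euler update along the predicted velocity."""
    return x + (sigma_next - sigma) * velocity


sigmas = shifted_sigmas()
print(sigmas)  # [1.0, 0.9, 0.75, 0.5, 0.0]
```

Under these assumptions, the four steps cover noise intervals of 0.1, 0.15, 0.25, and 0.5, so the last step does the largest share of the denoising.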
Binary Weight Representation: 1-bit g128
Each binary weight takes a value from {−1, +1} with one shared FP16 scale per group of 128 weights:
w_i = scale_g * b_i, b_i in {−1, +1}
Binary values carry exactly 1 bit of information per weight. With one FP16 scale per group of 128, the effective storage is
b_eff ≈ 1 + 16/128 ≈ 1.125 bits/weight
This gives an idealized 16 / 1.125 ≈ 14.2× reduction relative to FP16 for the binary transformer layers. A small set of precision-sensitive supporting tensors remains in FP16, so the final 1-bit Bonsai Image 4B diffusion transformer is 0.93 GB, an 8.3× reduction from the 7.75 GB FP16 FLUX.2 Klein 4B transformer.
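The storage arithmetic above can be reproduced in a few lines. This is only a worked check of the numbers quoted in this section; the function name is illustrative.

```python
FP16_BITS = 16
GROUP_SIZE = 128


def effective_bits_per_weight(weight_bits: int = 1,
                              scale_bits: int = FP16_BITS,
                              group_size: int = GROUP_SIZE) -> float:
    """Bits stored per weight once the shared group scale is amortized."""
    return weight_bits + scale_bits / group_size


bits = effective_bits_per_weight()
ideal_ratio = FP16_BITS / bits     # binary layers only
measured_ratio = 7.75 / 0.93       # whole transformer, FP16 tensors included

print(f"{bits:.3f} bits/weight")           # 1.125 bits/weight
print(f"ideal: {ideal_ratio:.1f}x")        # ideal: 14.2x
print(f"measured: {measured_ratio:.1f}x")  # measured: 8.3x
```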
The binary representation is applied to the matrix-heavy linear layers inside the MMDiT blocks: Q / K / V projections, attention output projections, MLP linears, and the double-stream add-K / Q / V linears. Supporting tensors (less than 5% of the total parameters), such as modulation streams, embedders, the output norm, and the model's final output projection, remain in FP16 for image quality and stability.
The CUDA deployment uses a Gemlite INT1 packed format. The model-level Bonsai representation is 0.93 GB; the deployed CUDA pack is 1.08 GB on disk due to runtime packing and alignment overhead in the current Gemlite path.
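The representation w_i = scale_g * b_i can be sketched in plain Python. Two things here are assumptions rather than the release's method: the scale is taken as the mean absolute value of the group, and the byte layout is a simple LSB-first bit pack, not Gemlite's actual INT1 format.

```python
GROUP_SIZE = 128


def binarize_group(weights):
    """Map one group of floats to (scale, signs), with signs in {-1, +1}."""
    # Assumption: scale = mean |w|. The source does not say how scales are chosen.
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def pack_signs(signs):
    """Pack signs 8 per byte, LSB first; a set bit means +1 (illustrative layout)."""
    packed = bytearray((len(signs) + 7) // 8)
    for i, s in enumerate(signs):
        if s > 0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_and_dequantize(scale, packed, count):
    """Rebuild w_i = scale * b_i from the packed sign bits."""
    return [scale * (1 if (packed[i // 8] >> (i % 8)) & 1 else -1)
            for i in range(count)]


group = [0.02 * ((i % 7) - 3) for i in range(GROUP_SIZE)]  # toy weights
scale, signs = binarize_group(group)
packed = pack_signs(signs)
restored = unpack_and_dequantize(scale, packed, len(group))
print(len(packed))  # 16
```

The 16 bytes of sign bits plus one 2-byte FP16 scale come to 144 bits for 128 weights, which matches the 1.125 bits/weight figure above.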
Memory
| Format | Transformer size | Reduction | Ratio |
|---|---|---|---|
| FP16 FLUX.2 Klein 4B | 7.75 GB | — | 1.0× |
| 1-bit Bonsai Image 4B | 0.93 GB | 88.0% | 8.3× |
CUDA deployment:
| Component | Size |
|---|---|
| Gemlite INT1 diffusion transformer | 1.08 GB |
| HQQ 4-bit text encoder | 2.84 GB |
| FP16 VAE | 0.17 GB |
| Total payload | 4.09 GB |
At runtime, the text encoder is offloaded after prompt encoding. The denoising loop therefore keeps only the compact binary diffusion transformer and the VAE resident, plus activations, rather than the full 4.09 GB payload.
Peak VRAM at 1024² on RTX 3080 is ~6.4 GiB end-to-end (transformer + VAE + activation memory).
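As a quick consistency check on the tables above, the following sketch sums the deployment components and estimates the weight footprint left after the text encoder is offloaded. It counts weights only; activation memory is not modeled, which is why the measured peak is higher.

```python
components_gb = {
    "Gemlite INT1 diffusion transformer": 1.08,
    "HQQ 4-bit text encoder": 2.84,
    "FP16 VAE": 0.17,
}

total_gb = sum(components_gb.values())
# Weights still resident during denoising, once the text encoder is offloaded.
resident_gb = total_gb - components_gb["HQQ 4-bit text encoder"]
# Size reduction of the model-level transformer relative to FP16.
reduction = 1 - 0.93 / 7.75

print(f"total payload: {total_gb:.2f} GB")       # total payload: 4.09 GB
print(f"resident weights: {resident_gb:.2f} GB")  # resident weights: 1.25 GB
print(f"reduction: {reduction:.1%}")              # reduction: 88.0%
```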
Best Practices
- Sampler: FlowMatchEuler-discrete with 4 steps, guidance = 1.0, shift = 3.0. The model is designed for 4 steps; running more steps does not improve quality significantly and can introduce artifacts.
- Resolution: native 1024² is the design target. 512² works for quick previews.
- Aspect ratios: multiples of 32 are supported, including 832x1248 and 1248x832.
- Prompting: natural-language prompts. Negative prompts are not required.
- Runtime memory: the text encoder is offloaded after prompt encoding, so the denoising loop is memory-light.
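The recommendations above can be captured in a small helper that rejects unsupported sizes before a pipeline call. The helper names are hypothetical. The keyword names follow the Python API example in this card, and shift = 3.0 is left out because the card does not show how it is passed.

```python
RECOMMENDED = {"num_inference_steps": 4, "guidance_scale": 1.0}


def check_resolution(width: int, height: int, multiple: int = 32):
    """Raise ValueError unless both sides are positive multiples of `multiple`."""
    for name, value in (("width", width), ("height", height)):
        if value <= 0 or value % multiple:
            raise ValueError(f"{name}={value} is not a positive multiple of {multiple}")
    return width, height


def generation_kwargs(prompt: str, width: int = 1024, height: int = 1024) -> dict:
    """Build keyword arguments using the recommended sampler settings."""
    check_resolution(width, height)
    return {"prompt": prompt, "width": width, "height": height, **RECOMMENDED}


print(generation_kwargs("a bonsai tree", width=832, height=1248))
```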
Quickstart
Bonsai Studio (Linux / Windows)
The simplest path is the Bonsai Image Demo repo, which sets up the full Bonsai Studio (FastAPI backend + Next.js frontend) and selects gemlite automatically on Linux / Windows:
git clone https://github.com/PrismML-Eng/Bonsai-Image-Demo.git
cd Bonsai-Image-Demo
./setup.sh
BONSAI_VARIANT=binary ./scripts/download_model.sh
BONSAI_VARIANT=binary ./scripts/serve.sh
On Windows (PowerShell):
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned # one-time
.\setup.ps1
$env:BONSAI_VARIANT = 'binary'
.\scripts\download_model.ps1
.\scripts\serve.ps1
Python API (backend_gpu)
For inference without the studio frontend:
from backend_gpu.server import build_pipeline
pipe = build_pipeline(model_id="prism-ml/bonsai-image-binary-4B-gemlite-1bit")
image = pipe(
    prompt="A bonsai tree in a quiet ceramic studio, soft morning light",
    num_inference_steps=4,
    guidance_scale=1.0,
    height=1024,
    width=1024,
).images[0]
image.save("bonsai.png")
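For several prompts, the call above can be wrapped in a loop. This is a minimal sketch: `generate_batch` is a hypothetical helper, and it assumes only what the example above shows, namely that the pipeline is callable with these keyword arguments and returns an object whose `.images[0]` has a `save` method.

```python
from pathlib import Path


def generate_batch(pipe, prompts, out_dir="outputs", size=1024):
    """Generate one image per prompt and return the saved file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, prompt in enumerate(prompts):
        image = pipe(
            prompt=prompt,
            num_inference_steps=4,
            guidance_scale=1.0,
            height=size,
            width=size,
        ).images[0]
        path = out / f"bonsai_{index:03d}.png"
        image.save(path)
        paths.append(path)
    return paths
```

Pass it the `pipe` object returned by `build_pipeline` in the example above, along with a list of prompt strings.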
Throughput (CUDA / gemlite)
Warm wall-clock time per image (measured after warmup), 4 sampler steps, guidance = 1.0, same prompts as the Mac and iPhone measurements. Measured on Linux with locally built gemlite kernels except where noted.
| Platform | 512² (s) | 1024² (s) | Notes |
|---|---|---|---|
| A100 (Colab) | 1.0 | 2.7 | Ampere datacenter (40 GB) |
| RTX PRO 6000 Blackwell (Colab) | 1.0 | 1.8 | NVIDIA Blackwell, 96 GB VRAM |
| RTX 3080 10 GB | 1.5 | 4.5 | Ampere consumer; 6.4 GiB peak VRAM at 1024² |
| RTX 3060 6 GB (laptop) | 4.4 | 24.8 | Ampere mobile; memory-bound at 1024² |
The sub-2-bit pack is what keeps 1024² generation viable on commodity GPUs: the consumer RTX 3080 reaches 4.5 s/image, while the 6 GB laptop RTX 3060 is the slow tail at 24.8 s/image because it is limited by memory pressure.
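One way to read the table is to compare each GPU's slowdown from 512² to 1024² with the 4× increase in pixel count. The sketch below does this using only the timings listed above; the variable names are illustrative.

```python
timings_s = {  # platform: (seconds at 512², seconds at 1024²)
    "A100 (Colab)": (1.0, 2.7),
    "RTX PRO 6000 Blackwell (Colab)": (1.0, 1.8),
    "RTX 3080 10 GB": (1.5, 4.5),
    "RTX 3060 6 GB (laptop)": (4.4, 24.8),
}
PIXEL_RATIO = (1024 * 1024) / (512 * 512)  # 4.0

slower_than_pixel_scaling = []
for name, (t512, t1024) in timings_s.items():
    slowdown = t1024 / t512
    images_per_min = 60 / t1024
    print(f"{name}: {images_per_min:.1f} img/min at 1024², {slowdown:.1f}x slowdown")
    if slowdown > PIXEL_RATIO:
        slower_than_pixel_scaling.append(name)

print(slower_than_pixel_scaling)  # ['RTX 3060 6 GB (laptop)']
```

Only the laptop RTX 3060 slows down by more than the pixel ratio, which is consistent with the memory-pressure note in the table.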
Benchmarks
All models were evaluated on an H100 with matched generation settings. GenEval uses the official 512x512 protocol. For HPSv3 and DPG-Bench, larger-backbone rows are evaluated at 1024x1024, while smaller-backbone rows are evaluated at their native 512x512 setting. Higher is better for all three benchmarks.
| Model | Transformer (GB) | GenEval | HPSv3 | DPG-Bench |
|---|---|---|---|---|
| Bonsai Image · Binary 4B | 0.93 | 0.671 | 11.15 | 0.822 |
| Bonsai Image · Ternary 4B | 1.21 | 0.723 | 12.22 | 0.851 |
| FLUX.2 Klein 4B | 7.75 | 0.819 | 12.84 | 0.853 |
| FLUX.1-schnell | 23.8 | 0.716 | 12.67 | 0.848 |
| SDXL | 5.14 | 0.300 | 10.05 | 0.740 |
| PixArt-Σ XL 2 | 1.20 | 0.541 | 11.93 | 0.769 |
| Stable Diffusion 1.5 | 1.72 | 0.396 | 4.20 | 0.601 |
| BK-SDM-Small | 0.98 | 0.297 | 3.05 | 0.559 |
The benchmark results show the intended quality-footprint trade-off. 1-bit Bonsai Image 4B is the footprint-oriented variant: it reduces the diffusion transformer below 1 GB while still delivering strong GenEval, HPSv3, and DPG-Bench results. The ternary companion is the quality-oriented variant, using a slightly larger representation to achieve very close visual quality and prompt fidelity to the original FLUX.2 Klein 4B model.
Together, the Bonsai Image variants move the quality-footprint frontier: they bring modern diffusion-transformer behavior into a memory range previously occupied by much smaller, lower-capability models.
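To put the trade-off in numbers, the sketch below expresses each Bonsai variant's score as a fraction of the FLUX.2 Klein 4B score from the table above. Treating these ratios as "retention" is an illustrative simplification; benchmark scores, HPSv3 in particular, are not necessarily on a ratio scale.

```python
scores = {  # model: (GenEval, HPSv3, DPG-Bench), copied from the table above
    "binary": (0.671, 11.15, 0.822),
    "ternary": (0.723, 12.22, 0.851),
    "flux2_klein_4b": (0.819, 12.84, 0.853),
}
BENCHMARKS = ("GenEval", "HPSv3", "DPG-Bench")


def retention(model: str, reference: str = "flux2_klein_4b") -> dict:
    """Score of `model` divided by the reference score, per benchmark."""
    return {
        bench: value / ref
        for bench, value, ref in zip(BENCHMARKS, scores[model], scores[reference])
    }


for variant in ("binary", "ternary"):
    ratios = retention(variant)
    print(variant, {bench: f"{ratio:.1%}" for bench, ratio in ratios.items()})
```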
Use Cases
- Local creative tooling: image generation directly on CUDA-equipped workstations and consumer GPUs
- Private generation: prompts and generated assets can remain in local or controlled environments
- Rapid iteration: lower local latency and no remote queue for iterative creative workflows
- Commodity-GPU serving: lower transformer footprint and reduced memory pressure for serving on NVIDIA GPUs
- Windows and Linux deployment: native paths through the same Gemlite deployment stack
- Enterprise and controlled inference: local or private environments for data residency and compliance-sensitive workflows
Limitations
- 1-bit Bonsai Image 4B is not bit-identical to the FP16 FLUX.2 Klein 4B model; it is a compact binary-weight deployment designed to deliver similar practical behavior at much smaller size.
- Image-generation quality remains prompt- and workflow-dependent. Small text, fine details, object counts, and strict compositional constraints should be evaluated for the target use case.
- Current commodity inference stacks do not yet expose fully native binary execution as a standard hardware path. This release uses practical Gemlite low-bit GEMM kernels on CUDA.
- After the diffusion transformer is made compact, other components such as the VAE can become more visible memory bottlenecks. The runtime mitigates this with text-encoder offload and tiled VAE decoding.
Citation
@techreport{bonsaiimage4b,
title = {Bonsai Image 4B: Low-Bit Diffusion on Apple Silicon and Consumer GPUs},
author = {Prism ML},
year = {2026},
month = {May},
url = {https://prismml.com}
}
Contact
For questions, feedback, or collaboration inquiries: contact@prismml.com
Model tree for prism-ml/bonsai-image-binary-4B-gemlite-1bit
Base model
prism-ml/bonsai-image-binary-4B-unpacked