| license: apache-2.0 | |
| pipeline_tag: text-to-image | |
| tags: | |
| - 1-bit | |
| - gemlite | |
| - hqq | |
| - cuda | |
| - text-to-image | |
| - diffusion | |
| - flux | |
| - prismml | |
| - bonsai | |
| base_model: | |
| - prism-ml/bonsai-image-binary-4B-unpacked | |
| <p align="center"> | |
| <img src="./assets/bonsai-logo.svg" width="280" alt="Bonsai Image"> | |
| </p> | |
| <p align="center"> | |
| <a href="https://prismml.com"><b>Prism ML Website</b></a> | | |
| <a href="https://github.com/PrismML-Eng/Bonsai-Image-Demo/blob/main/bonsai-image-4b-whitepaper.pdf"><b>Whitepaper</b></a> | | |
| <a href="https://github.com/PrismML-Eng/Bonsai-Image-Demo"><b>Demo & Examples</b></a> | | |
| <a href="https://discord.gg/prismml"><b>Discord</b></a> | |
| </p> | |
| # bonsai-image-binary-4B-gemlite-1bit | |
| Binary-weight (1-bit) text-to-image diffusion transformer deployment for NVIDIA GPUs | |
| > **0.93 GB transformer** | **8.3×** smaller than FP16 | **4.5 s / 1024²** on RTX 3080 | **2.7 s / 1024²** on A100 | runs natively on Linux and Windows | |
| ## Highlights | |
| - **0.93 GB** diffusion transformer, down from **7.75 GB** for the FP16 FLUX.2 Klein 4B transformer | |
| - Binary {-1, +1} transformer weights with FP16 group-wise scaling in the matrix-heavy transformer layers (Q/K/V projections, output projections, MLP weights) | |
| - 4.09 GB CUDA deployment payload including the 4-bit text encoder and FP16 VAE — text encoder is offloaded after prompt encode, so the denoising loop only keeps the compact transformer and VAE resident | |
| - 4-step FlowMatch-Euler sampler with guidance = 1.0 and shift = 3.0 — no CFG, no negative prompts needed | |
| - Gemlite low-bit GEMM path for NVIDIA GPUs, with HQQ used for the compressed text encoder | |
| - Runs on Linux and Windows natively through the same CUDA / Gemlite deployment stack | |
| - Cross-platform companion: also available as [MLX 1-bit](https://huggingface.co/prism-ml/bonsai-image-binary-4B-mlx-1bit) for Apple Silicon | |
| ## Resources | |
| - **[Whitepaper](https://github.com/PrismML-Eng/Bonsai-Image-Demo/blob/main/bonsai-image-4b-whitepaper.pdf)** — full benchmarks, kernels, and memory analysis | |
| - **[Demo repo](https://github.com/PrismML-Eng/Bonsai-Image-Demo)** — one-command setup for Mac / Linux / Windows | |
| - **[Discord](https://discord.gg/prismml)** — community + support | |
| - **Kernels**: [gemlite](https://github.com/mobiusml/gemlite) (fused low-bit GEMM) · [HQQ](https://github.com/mobiusml/hqq) (low-bit quantization runtime) · [triton-windows](https://github.com/triton-lang/triton-windows) (Windows path) | |
| ## Model Overview | |
| | Item | Specification | | |
| | :-------------------- | :----------------------------------------------------------------------------------------------| | |
| | Base architecture | FLUX.2 Klein 4B (MMDiT diffusion transformer) | | |
| | Parameters | ~4.0B (transformer trunk) | | |
| | Blocks | 25 MMDiT blocks: 5 double-stream + 20 single-stream | | |
| | Sampler | FlowMatchEuler, **4 steps**, guidance = 1.0, shift = 3.0 | | |
| | Text encoder | Qwen3-4B at 4-bit HQQ (≈ 2.84 GB CUDA payload, offloaded after prompt encode) | | |
| | VAE | Flux2 32-channel latent, tiled decode (128 px tiles) | | |
| | Native resolution | 1024×1024 (also supports 512×512 and arbitrary multiples of 32) | | |
| | Weight format | Gemlite INT1 pack, binary values + FP16 group-wise scales | | |
| | **Transformer size** | **0.93 GB** model-level Bonsai representation; **1.08 GB** CUDA packed deployment size | | |
| | Total payload | **4.09 GB** CUDA deployment payload (transformer + 4-bit text encoder + FP16 VAE) | | |
| | 1-bit coverage | All 100 matmul-heavy linears in the 25 MMDiT blocks | | |
| | Platforms | Linux x86_64 + Windows native on NVIDIA GPUs | | |
| | License | Apache 2.0 | | |
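The resolution rule in the table (sizes that are multiples of 32) can be checked before a request reaches the pipeline. The following is a minimal sketch under the assumption that both height and width must be multiples of 32; `is_supported_resolution` and `snap_to_multiple` are hypothetical helper names and are not part of the Bonsai codebase.

```python
def is_supported_resolution(height: int, width: int, multiple: int = 32) -> bool:
    """Return True when both sides are positive multiples of `multiple`."""
    return (
        height > 0
        and width > 0
        and height % multiple == 0
        and width % multiple == 0
    )


def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round `value` to the nearest multiple (ties round up), never below one multiple."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


# Example: turn an arbitrary request into a size that satisfies the rule.
requested_height, requested_width = 830, 1250
height = snap_to_multiple(requested_height)
width = snap_to_multiple(requested_width)
print(height, width, is_supported_resolution(height, width))
```

Snapping rather than rejecting is a design choice for interactive tools; a serving endpoint may prefer to return an error so callers see exactly what was generated.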
| ## Binary Weight Representation: 1-bit g128 | |
| Each binary weight takes a value from {−1, +1} with one shared FP16 scale per group of 128 weights: | |
| ```text | |
| w_i = scale_g * b_i, b_i in {−1, +1} | |
| ``` | |
| Binary values carry exactly 1 bit of information per weight. With one FP16 scale per group of 128, the effective storage is | |
| ```text | |
| b_eff ≈ 1 + 16/128 ≈ 1.125 bits/weight | |
| ``` | |
| This gives an idealized **14.2× reduction** relative to FP16 for the binary transformer layers. A small set of precision-sensitive supporting tensors remains in FP16, so the final 1-bit Bonsai Image 4B diffusion transformer is **0.93 GB**, an 8.3× reduction from the 7.75 GB FP16 FLUX.2 Klein 4B transformer. | |
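The storage arithmetic above can be reproduced in a few lines. This sketch uses only the figures quoted in this section (group size 128, FP16 scales, 7.75 GB and 0.93 GB transformer sizes); the function name `effective_bits` is illustrative.

```python
FP16_BITS = 16


def effective_bits(group_size: int, scale_bits: int = 16, payload_bits: int = 1) -> float:
    """Bits per weight when each group of weights shares one scale."""
    return payload_bits + scale_bits / group_size


bits_per_weight = effective_bits(group_size=128)   # 1 + 16/128
ideal_reduction = FP16_BITS / bits_per_weight      # binary layers only
whole_model_reduction = 7.75 / 0.93                # includes the FP16 supporting tensors

print(f"effective bits/weight: {bits_per_weight}")
print(f"idealized reduction:   {ideal_reduction:.1f}x")
print(f"whole-model reduction: {whole_model_reduction:.1f}x")
```

The gap between the two ratios comes from the supporting tensors that stay in FP16, as described in the paragraph above.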
| The binary representation is applied to the matrix-heavy transformer layers, including Q / K / V projections, output projections, MLP linear layers, and the double-stream add-K / Q / V linear layers. Supporting tensors (less than 5% of the total parameters) such as modulation streams, embedders, the output norm, and the final output projection remain FP16 for image quality and stability. | |
| The CUDA deployment uses a Gemlite INT1 packed format. The model-level Bonsai representation is **0.93 GB**; the deployed CUDA pack is **1.08 GB** on disk due to runtime packing and alignment overhead in the current Gemlite path. | |
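To make the 1-bit g128 representation concrete, here is a toy round trip for a single group of 128 weights. It is a sketch under stated assumptions, not the Bonsai or Gemlite implementation: the scale is taken as the mean absolute value of the group (the model card does not say how scales are chosen), and the bit layout (8 signs per byte, most significant bit first) is illustrative rather than the actual Gemlite INT1 pack.

```python
import random


def quantize_group(weights):
    """Binarize one group: one sign bit per weight plus one shared scale.

    Assumption: scale = mean(|w|); the real scale selection is not documented here.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    bits = [1 if w >= 0 else 0 for w in weights]  # 1 -> +1, 0 -> -1
    return scale, bits


def pack_bits(bits):
    """Pack 0/1 values into bytes, 8 per byte, most significant bit first."""
    if len(bits) % 8:
        raise ValueError("bit count must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)


def unpack_bits(packed):
    """Inverse of pack_bits."""
    return [(byte >> (7 - k)) & 1 for byte in packed for k in range(8)]


def dequantize_group(scale, bits):
    """Reconstruct w_i = scale * b_i with b_i in {-1, +1}."""
    return [scale if bit else -scale for bit in bits]


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(128)]  # synthetic weights

scale, bits = quantize_group(weights)
packed = pack_bits(bits)
restored = dequantize_group(scale, unpack_bits(packed))

fp16_bytes = 2 * len(weights)      # 256 bytes at 16 bits per weight
binary_bytes = len(packed) + 2     # 16 bytes of signs + one FP16 scale
print(f"{fp16_bytes} B -> {binary_bytes} B ({fp16_bytes / binary_bytes:.1f}x smaller)")
```

For one group this reproduces the idealized ratio from the formula above; real packed files add alignment and runtime metadata, which is consistent with the 0.93 GB versus 1.08 GB difference noted in the text.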
| ### Memory | |
| | Format | Transformer size | Reduction | Ratio | | |
| | :------------------------------ | ---------------: | --------: | -------: | | |
| | FP16 FLUX.2 Klein 4B | 7.75 GB | — | 1.0× | | |
| | **1-bit Bonsai Image 4B** | **0.93 GB** | **88.0%** | **8.3×** | | |
| CUDA deployment: | |
| | Component | Size | | |
| | :------------------------------ | ------: | | |
| | Gemlite INT1 diffusion transformer | 1.08 GB | | |
| | HQQ 4-bit text encoder | 2.84 GB | | |
| | FP16 VAE | 0.17 GB | | |
| | **Total payload** | **4.09 GB** | | |
| At runtime, the text encoder is offloaded after prompt encoding. During denoising, the repeated image-generation loop is dominated by the compact binary diffusion transformer and active image-generation components rather than the full payload. | |
| Peak GPU memory at 1024² on RTX 3080 is ~6.4 GiB end-to-end (transformer + VAE + activation memory). | |
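The payload figures can be tallied directly from the table. This small sketch uses only the sizes listed above and separates the full download from the weights that stay resident once the text encoder has been offloaded; the dictionary keys are illustrative labels.

```python
# Component sizes in GB, copied from the CUDA deployment table.
COMPONENTS_GB = {
    "transformer": 1.08,   # Gemlite INT1 diffusion transformer
    "text_encoder": 2.84,  # HQQ 4-bit text encoder
    "vae": 0.17,           # FP16 VAE
}

total_payload = sum(COMPONENTS_GB.values())

# After prompt encoding the text encoder is offloaded, so only these weights remain.
resident_weights = COMPONENTS_GB["transformer"] + COMPONENTS_GB["vae"]

text_encoder_share = COMPONENTS_GB["text_encoder"] / total_payload

print(f"total payload:      {total_payload:.2f} GB")
print(f"resident weights:   {resident_weights:.2f} GB")
print(f"text encoder share: {text_encoder_share:.0%}")
```

The resident weight figure excludes activations and runtime buffers, which account for the rest of the measured peak.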
| ## Best Practices | |
| - Sampler: FlowMatchEuler-discrete with 4 steps, guidance = 1.0, shift = 3.0. The model is designed for 4 steps; running more steps does not improve quality significantly and can introduce artifacts. | |
| - Resolution: native 1024² is the design target. 512² works for quick previews. | |
| - Aspect ratios: non-square sizes are supported when both sides are multiples of 32, for example 832×1248 and 1248×832. | |
| - Prompting: natural-language prompts. Negative prompts are not required. | |
| - Runtime memory: the text encoder is offloaded after prompt encoding, so the denoising loop is memory-light. | |
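The sampler settings above (4 steps, shift = 3.0, no CFG) can be illustrated with a toy flow-matching Euler loop. This is a sketch under assumptions, not the scheduler shipped with the model: it assumes a static time shift of the form `shift * s / (1 + (shift - 1) * s)` applied to linearly spaced noise levels, and it replaces the diffusion transformer with a hypothetical scalar `toy_velocity` function. The actual pipeline may build its schedule differently.

```python
def shifted_sigmas(num_steps: int, shift: float) -> list[float]:
    """Return num_steps + 1 noise levels from 1.0 down to 0.0.

    Assumption: linearly spaced sigmas warped by a static shift.
    """
    base = [1.0 - i / num_steps for i in range(num_steps)]
    warped = [shift * s / (1.0 + (shift - 1.0) * s) for s in base]
    return warped + [0.0]


def euler_sample(velocity_fn, x: float, sigmas: list[float]) -> float:
    """Take one Euler step per sigma interval: x += (sigma_next - sigma) * v."""
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        x = x + (sigma_next - sigma) * velocity_fn(x, sigma)
    return x


TARGET = 0.25  # hypothetical "clean" value standing in for an image


def toy_velocity(x: float, sigma: float) -> float:
    """Exact velocity for the straight path x = TARGET + sigma * (noise - TARGET)."""
    return (x - TARGET) / sigma


sigmas = shifted_sigmas(num_steps=4, shift=3.0)
sample = euler_sample(toy_velocity, x=1.7, sigmas=sigmas)
print(sigmas)
print(sample)
```

With shift = 3.0 the schedule spends its early steps at high noise levels and takes one large final step to zero. With guidance = 1.0 there is a single model evaluation per step, so a 4-step run costs four transformer passes.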
| ## Quickstart | |
| ### Bonsai Studio (Linux / Windows) | |
| The simplest path is the [Bonsai Image Demo repo](https://github.com/PrismML-Eng/Bonsai-Image-Demo), which sets up the full Bonsai Studio (FastAPI backend + Next.js frontend) and selects gemlite automatically on Linux / Windows: | |
| ```bash | |
| git clone https://github.com/PrismML-Eng/Bonsai-Image-Demo.git | |
| cd Bonsai-Image-Demo | |
| ./setup.sh | |
| BONSAI_VARIANT=binary ./scripts/download_model.sh | |
| BONSAI_VARIANT=binary ./scripts/serve.sh | |
| ``` | |
| On Windows (PowerShell): | |
| ```powershell | |
| Set-ExecutionPolicy -Scope CurrentUser RemoteSigned # one-time | |
| .\setup.ps1 | |
| $env:BONSAI_VARIANT = 'binary' | |
| .\scripts\download_model.ps1 | |
| .\scripts\serve.ps1 | |
| ``` | |
| ### Python API (backend_gpu) | |
| For inference without the studio frontend: | |
| ```python | |
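| # backend_gpu is assumed to be importable from a Bonsai-Image-Demo checkout. | |
| from backend_gpu.server import build_pipeline | |
| # Build the pipeline for the 1-bit Gemlite checkpoint. | |
| pipe = build_pipeline(model_id="prism-ml/bonsai-image-binary-4B-gemlite-1bit") | |
| # Recommended settings: 4 steps, guidance = 1.0, native 1024x1024 resolution. | |
| image = pipe( | |
|     prompt="A bonsai tree in a quiet ceramic studio, soft morning light", | |
|     num_inference_steps=4, | |
|     guidance_scale=1.0, | |
|     height=1024, | |
|     width=1024, | |
| ).images[0] | |
| image.save("bonsai.png") | |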
| ``` | |
| ## Throughput (CUDA / gemlite) | |
| Warmed-up wall-clock time per image, with 4 sampler steps and guidance = 1.0, using the same prompts as the Mac and iPhone measurements. Measured on Linux with locally built gemlite kernels except where noted. | |
| | Platform | 512² (s) | 1024² (s) | Notes | | |
| | :------------------------ | -------: | --------: | :------------------------------------------ | | |
| | **A100** (Colab) | 1.0 | **2.7** | Ampere datacenter (40 GB) | | |
| | **RTX PRO 6000 Blackwell** (Colab) | 1.0 | **1.8** | NVIDIA Blackwell, 96 GB VRAM | | |
| | **RTX 3080** 10 GB | 1.5 | **4.5** | Ampere consumer; 6.4 GiB peak VRAM at 1024² | | |
| | **RTX 3060** 6 GB (laptop)| 4.4 | 24.8 | Ampere mobile; memory-bound at 1024² | | |
| The sub-2-bit pack is what keeps generation viable on commodity GPUs at 1024² — the consumer RTX 3080 reaches 4.5 s/image while the 6 GB laptop 3060 is the slow tail (memory-pressure limited). | |
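One way to read the throughput table is to compare how latency grows when the pixel count quadruples from 512² to 1024². The sketch below uses only the timings from the table; `scaling_report` is an illustrative helper name. A slowdown above 4× means time grew faster than pixel count, which is consistent with the memory-pressure note on the laptop RTX 3060.

```python
# (seconds at 512², seconds at 1024²) per image, copied from the table above.
TIMINGS_S = {
    "A100": (1.0, 2.7),
    "RTX PRO 6000 Blackwell": (1.0, 1.8),
    "RTX 3080": (1.5, 4.5),
    "RTX 3060 laptop": (4.4, 24.8),
}

PIXEL_RATIO = (1024 * 1024) / (512 * 512)  # 4.0


def scaling_report(timings):
    """Per GPU: 512² -> 1024² slowdown and megapixels generated per second at 1024²."""
    report = {}
    for gpu, (t512, t1024) in timings.items():
        report[gpu] = {
            "slowdown": t1024 / t512,
            "mpix_per_s": (1024 * 1024 / 1e6) / t1024,
        }
    return report


report = scaling_report(TIMINGS_S)
for gpu, stats in report.items():
    flag = "super-linear" if stats["slowdown"] > PIXEL_RATIO else "sub-linear"
    print(f"{gpu:24s} {stats['slowdown']:.2f}x slowdown ({flag}), "
          f"{stats['mpix_per_s']:.2f} MP/s")
```

These are end-to-end wall-clock numbers, so the ratios reflect the whole pipeline rather than the transformer alone.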
| ## Benchmarks | |
| Evaluated with matched generation settings across the comparison set on H100. GenEval uses the official 512x512 protocol. For HPSv3 and DPG-Bench, larger-backbone rows are evaluated at 1024x1024, while smaller-backbone rows are evaluated at their native 512x512 setting. Higher is better for all three benchmarks. | |
| | Model | Transformer (GB) | GenEval | HPSv3 | DPG-Bench | | |
| | :-------------------------- | ---------------: | ------: | -----: | --------: | | |
| | **Bonsai Image · Binary 4B** | **0.93** | **0.671** | **11.15** | **0.822** | | |
| | **Bonsai Image · Ternary 4B**| **1.21** | **0.723** | **12.22** | **0.851** | | |
| | FLUX.2 Klein 4B | 7.75 | 0.819 | 12.84 | 0.853 | | |
| | FLUX.1-schnell | 23.8 | 0.716 | 12.67 | 0.848 | | |
| | SDXL | 5.14 | 0.300 | 10.05 | 0.740 | | |
| | PixArt-Σ XL 2 | 1.20 | 0.541 | 11.93 | 0.769 | | |
| | Stable Diffusion 1.5 | 1.72 | 0.396 | 4.20 | 0.601 | | |
| | BK-SDM-Small | 0.98 | 0.297 | 3.05 | 0.559 | | |
| The benchmark results show the intended quality-footprint trade-off. 1-bit Bonsai Image 4B is the footprint-oriented variant: it reduces the diffusion transformer below 1 GB while scoring 0.671 on GenEval, 11.15 on HPSv3, and 0.822 on DPG-Bench, ahead of every non-Bonsai baseline under 2 GB in the table on GenEval and DPG-Bench. The ternary companion is the quality-oriented variant: it uses a slightly larger representation (1.21 GB) to bring visual quality and prompt fidelity close to the original FLUX.2 Klein 4B model, most visibly on DPG-Bench (0.851 vs. 0.853). | |
| Together, the Bonsai Image variants move the quality-footprint frontier: they bring modern diffusion-transformer behavior into a memory range previously occupied by much smaller, lower-capability models. | |
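The trade-off can be quantified as the fraction of the FLUX.2 Klein 4B score that each Bonsai variant retains. The sketch below uses only the numbers from the benchmark table; "retention" is an illustrative metric introduced here and does not appear in the whitepaper.

```python
# Scores copied from the benchmark table above.
BASELINE = {"GenEval": 0.819, "HPSv3": 12.84, "DPG-Bench": 0.853}  # FLUX.2 Klein 4B
VARIANTS = {
    "binary": {"size_gb": 0.93, "GenEval": 0.671, "HPSv3": 11.15, "DPG-Bench": 0.822},
    "ternary": {"size_gb": 1.21, "GenEval": 0.723, "HPSv3": 12.22, "DPG-Bench": 0.851},
}
BASELINE_SIZE_GB = 7.75


def retention(variant):
    """Fraction of the baseline score kept on each benchmark."""
    return {name: variant[name] / score for name, score in BASELINE.items()}


retained = {name: retention(scores) for name, scores in VARIANTS.items()}
for name, scores in retained.items():
    size_ratio = BASELINE_SIZE_GB / VARIANTS[name]["size_gb"]
    summary = ", ".join(f"{bench} {frac:.1%}" for bench, frac in scores.items())
    print(f"{name:8s} {size_ratio:.1f}x smaller: {summary}")
```

Ratios across benchmarks are not directly comparable because each metric has its own scale, so this is best read per column rather than as a single quality number.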
| ## Use Cases | |
| - **Local creative tooling**: image generation directly on CUDA-equipped workstations and consumer GPUs | |
| - **Private generation**: prompts and generated assets can remain in local or controlled environments | |
| - **Rapid iteration**: lower local latency and no remote queue for iterative creative workflows | |
| - **Commodity-GPU serving**: lower transformer footprint and reduced memory pressure for serving on NVIDIA GPUs | |
| - **Windows and Linux deployment**: native paths through the same Gemlite deployment stack | |
| - **Enterprise and controlled inference**: local or private environments for data residency and compliance-sensitive workflows | |
| ## Limitations | |
| - 1-bit Bonsai Image 4B is not bit-identical to the FP16 FLUX.2 Klein 4B model; it is a compact binary-weight deployment designed to deliver similar practical behavior at much smaller size. | |
| - Image-generation quality remains prompt- and workflow-dependent. Small text, fine details, object counts, and strict compositional constraints should be evaluated for the target use case. | |
| - Current commodity inference stacks do not yet expose fully native binary execution as a standard hardware path. This release uses practical Gemlite low-bit GEMM kernels on CUDA. | |
| - After the diffusion transformer is made compact, other components such as the VAE can become more visible memory bottlenecks. The runtime mitigates this with text-encoder offload and tiled VAE decoding. | |
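Tiled VAE decoding, mentioned in the last limitation, bounds decode memory by processing the image in fixed-size pieces. The sketch below only enumerates tile boxes and is not the runtime's decoder: it assumes the 128 px tile size from the Model Overview refers to output pixels and that tiles do not overlap, and `tile_boxes` is a hypothetical helper name. A production decoder typically blends overlapping tiles to hide seams.

```python
def tile_boxes(height: int, width: int, tile: int = 128):
    """Return (top, left, bottom, right) boxes that cover the image without overlap."""
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            bottom = min(top + tile, height)
            right = min(left + tile, width)
            boxes.append((top, left, bottom, right))
    return boxes


square = tile_boxes(1024, 1024)
portrait = tile_boxes(1248, 832)  # height 1248, width 832; edge tiles are partial
print(len(square), len(portrait))
```

Decoding one tile at a time keeps the peak activation size tied to the tile area rather than the full image area, at the cost of more decoder launches.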
| ## Citation | |
| ```bibtex | |
| @techreport{bonsaiimage4b, | |
| title = {Bonsai Image 4B: Low-Bit Diffusion on Apple Silicon and Consumer GPUs}, | |
| author = {Prism ML}, | |
| year = {2026}, | |
| month = {May}, | |
| url = {https://prismml.com} | |
| } | |
| ``` | |
| ## Contact | |
| For questions, feedback, or collaboration inquiries: **contact@prismml.com** | |