---
license: apache-2.0
pipeline_tag: text-to-image
tags:
- 1-bit
- gemlite
- hqq
- cuda
- text-to-image
- diffusion
- flux
- prismml
- bonsai
base_model:
- prism-ml/bonsai-image-binary-4B-unpacked
---

<p align="center">
  <img src="./assets/bonsai-logo.svg" width="280" alt="Bonsai Image">
</p>

<p align="center">
  <a href="https://prismml.com"><b>Prism ML Website</b></a> &nbsp;|&nbsp;
  <a href="https://github.com/PrismML-Eng/Bonsai-Image-Demo/blob/main/bonsai-image-4b-whitepaper.pdf"><b>Whitepaper</b></a> &nbsp;|&nbsp;
  <a href="https://github.com/PrismML-Eng/Bonsai-Image-Demo"><b>Demo &amp; Examples</b></a> &nbsp;|&nbsp;
  <a href="https://discord.gg/prismml"><b>Discord</b></a>
</p>

# bonsai-image-binary-4B-gemlite-1bit

Binary-weight (1-bit) text-to-image diffusion transformer deployment for NVIDIA GPUs.

> **0.93 GB transformer** | **8.3×** smaller than FP16 | **4.5 s / 1024²** on RTX 3080 | **2.7 s / 1024²** on A100 | runs natively on Linux and Windows

## Highlights

- **0.93 GB** diffusion transformer, down from **7.75 GB** for the FP16 FLUX.2 Klein 4B transformer
- Binary {-1, +1} transformer weights with FP16 group-wise scaling in the matrix-heavy transformer layers (Q/K/V projections, output projections, MLP weights)
- 4.09 GB CUDA deployment payload including the 4-bit text encoder and FP16 VAE. The text encoder is offloaded after prompt encoding, so the denoising loop keeps only the compact transformer and VAE resident
- 4-step FlowMatchEuler sampler with guidance = 1.0 and shift = 3.0; no CFG or negative prompts needed
- Gemlite low-bit GEMM path for NVIDIA GPUs, with HQQ used for the compressed text encoder
- Runs on Linux and Windows natively through the same CUDA / Gemlite deployment stack
- Cross-platform companion: also available as [MLX 1-bit](https://huggingface.co/prism-ml/bonsai-image-binary-4B-mlx-1bit) for Apple Silicon

## Resources

- **[Whitepaper](https://github.com/PrismML-Eng/Bonsai-Image-Demo/blob/main/bonsai-image-4b-whitepaper.pdf)** — full benchmarks, kernels, and memory analysis
- **[Demo repo](https://github.com/PrismML-Eng/Bonsai-Image-Demo)** — one-command setup for Mac / Linux / Windows
- **[Discord](https://discord.gg/prismml)** — community + support
- **Kernels**: [gemlite](https://github.com/mobiusml/gemlite) (fused low-bit GEMM) · [HQQ](https://github.com/mobiusml/hqq) (low-bit quantization runtime) · [triton-windows](https://github.com/triton-lang/triton-windows) (Windows path)

## Model Overview

| Item                  | Specification                                                                                  |
| :-------------------- | :----------------------------------------------------------------------------------------------|
| Base architecture     | FLUX.2 Klein 4B (MMDiT diffusion transformer)                                                  |
| Parameters            | ~4.0B (transformer trunk)                                                                      |
| Blocks                | 25 MMDiT blocks: 5 double-stream + 20 single-stream                                            |
| Sampler               | FlowMatchEuler, **4 steps**, guidance = 1.0, shift = 3.0                                       |
| Text encoder          | Qwen3-4B at 4-bit HQQ (≈ 2.84 GB CUDA payload, offloaded after prompt encode)                  |
| VAE                   | Flux2 32-channel latent, tiled decode (128 px tiles)                                           |
| Native resolution     | 1024×1024 (also supports 512×512 and arbitrary multiples of 32)                                |
| Weight format         | Gemlite INT1 pack, binary values + FP16 group-wise scales                                      |
| **Transformer size**  | **0.93 GB** model-level Bonsai representation; **1.08 GB** CUDA packed deployment size         |
| Total payload         | **4.09 GB** CUDA deployment payload (transformer + 4-bit text encoder + FP16 VAE)              |
| 1-bit coverage        | All 100 matmul-heavy linears in the 25 MMDiT blocks                                            |
| Platforms             | Linux x86_64 + Windows native on NVIDIA GPUs                                                   |
| License               | Apache 2.0                                                                                     |
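
The sampler row can be read as a short noise schedule. The following is a minimal sketch, assuming the static time-shift formula commonly used by flow-matching Euler schedulers; `shifted_sigmas` and `euler_sample` are hypothetical names, and the scheduler shipped with the release may differ in detail.

```python
def shifted_sigmas(num_steps=4, shift=3.0):
    """Noise levels from 1.0 down to 0.0 with a static time shift (assumed form)."""
    # Uniformly spaced sigmas, highest noise first: 1.0, 0.75, 0.5, 0.25 for 4 steps.
    base = [1.0 - i / num_steps for i in range(num_steps)]
    # The shift pushes the schedule toward higher noise levels.
    shifted = [shift * s / (1.0 + (shift - 1.0) * s) for s in base]
    return shifted + [0.0]


def euler_sample(velocity_fn, x, sigmas):
    """Explicit Euler integration of dx/dsigma = v(x, sigma)."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma)  # in practice, one transformer forward pass
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


sigmas = shifted_sigmas()
print([round(s, 3) for s in sigmas])  # [1.0, 0.9, 0.75, 0.5, 0.0]
```

With guidance = 1.0 there is a single velocity prediction per step, so a 4-step run corresponds to four transformer forward passes in this sketch.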

## Binary Weight Representation: 1-bit g128

Each binary weight takes a value from {−1, +1} with one shared FP16 scale per group of 128 weights:

```text
w_i = scale_g * b_i,    b_i in {−1, +1}
```

Binary values carry exactly 1 bit of information per weight. With one FP16 scale per group of 128, the effective storage is

```text
b_eff = 1 + 16/128 = 1.125 bits/weight
```

This gives an idealized **14.2× reduction** (16 / 1.125) relative to FP16 for the binary transformer layers. A small set of precision-sensitive supporting tensors remains in FP16, so the final 1-bit Bonsai Image 4B diffusion transformer is **0.93 GB**, an 8.3× reduction from the 7.75 GB FP16 FLUX.2 Klein 4B transformer.
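
To make the storage arithmetic concrete, here is a minimal sketch of a sign-plus-scale encoding for one group of 128 weights. It is illustrative only: using the mean absolute value as the group scale is an assumption, the helper names are hypothetical, and this is not the Gemlite packing layout or the procedure used to produce the released weights.

```python
import struct

GROUP_SIZE = 128  # weights sharing one FP16 scale ("g128")


def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_group(weights):
    """Encode one group as (sign_bits, scale); bit i set means b_i = +1."""
    # Assumed scale rule: mean absolute value of the group.
    scale = to_fp16(sum(abs(w) for w in weights) / len(weights))
    bits = 0
    for i, w in enumerate(weights):
        if w >= 0:
            bits |= 1 << i
    return bits, scale


def dequantize_group(bits, scale, n=GROUP_SIZE):
    """Rebuild w_i = scale_g * b_i from the packed sign bits."""
    return [scale if (bits >> i) & 1 else -scale for i in range(n)]


def effective_bits(group_size=GROUP_SIZE, scale_bits=16):
    """One sign bit per weight plus the amortized cost of the shared scale."""
    return 1 + scale_bits / group_size


group = [((i * 37) % 11 - 5) / 100.0 for i in range(GROUP_SIZE)]
bits, scale = quantize_group(group)
restored = dequantize_group(bits, scale)
print(effective_bits())                 # 1.125
print(round(16 / effective_bits(), 1))  # 14.2
```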

The binary representation is applied to the matrix-heavy transformer layers, including Q / K / V projections, attention output projections, MLP linears, and the double-stream add-K / Q / V linears. Supporting tensors (less than 5% of the total parameters), such as modulation streams, embedders, and the final output norm and output projection, remain FP16 for image quality and stability.

The CUDA deployment uses a Gemlite INT1 packed format. The model-level Bonsai representation is **0.93 GB**; the deployed CUDA pack is **1.08 GB** on disk due to runtime packing and alignment overhead in the current Gemlite path.

### Memory

| Format                          | Transformer size | Reduction | Ratio    |
| :------------------------------ | ---------------: | --------: | -------: |
| FP16 FLUX.2 Klein 4B            | 7.75 GB          | —         | 1.0×     |
| **1-bit Bonsai Image 4B**       | **0.93 GB**    | **88.0%** | **8.3×** |

CUDA deployment:

| Component                       | Size    |
| :------------------------------ | ------: |
| Gemlite INT1 diffusion transformer | 1.08 GB |
| HQQ 4-bit text encoder          | 2.84 GB |
| FP16 VAE                        | 0.17 GB |
| **Total payload**    | **4.09 GB** |

At runtime, the text encoder is offloaded after prompt encoding. The denoising loop therefore keeps only the compact binary diffusion transformer and the VAE resident, so its memory footprint is set by those components and their activations rather than by the full 4.09 GB payload.

Peak GPU memory at 1024² on RTX 3080 is ~6.4 GiB end-to-end (transformer + VAE + activation memory).
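
The figures in the two tables above can be cross-checked with a few lines of arithmetic. This is only a sketch over the published sizes; the dictionary keys are hypothetical labels, and the "resident weights" figure excludes activation memory, which is why it is well below the measured peak.

```python
FP16_TRANSFORMER_GB = 7.75
BONSAI_TRANSFORMER_GB = 0.93

# Component sizes from the CUDA deployment table (keys are illustrative labels).
CUDA_PAYLOAD_GB = {
    "gemlite_int1_transformer": 1.08,
    "hqq_4bit_text_encoder": 2.84,
    "fp16_vae": 0.17,
}


def summarize():
    ratio = FP16_TRANSFORMER_GB / BONSAI_TRANSFORMER_GB
    reduction_pct = 100.0 * (1.0 - BONSAI_TRANSFORMER_GB / FP16_TRANSFORMER_GB)
    total = sum(CUDA_PAYLOAD_GB.values())
    # Weights that stay resident once the text encoder has been offloaded.
    resident = CUDA_PAYLOAD_GB["gemlite_int1_transformer"] + CUDA_PAYLOAD_GB["fp16_vae"]
    return {
        "ratio": round(ratio, 1),
        "reduction_pct": round(reduction_pct, 1),
        "total_payload_gb": round(total, 2),
        "resident_weights_gb": round(resident, 2),
    }


print(summarize())
# {'ratio': 8.3, 'reduction_pct': 88.0, 'total_payload_gb': 4.09, 'resident_weights_gb': 1.25}
```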

## Best Practices

- Sampler: FlowMatchEuler with 4 steps, guidance = 1.0, shift = 3.0. The model is designed for 4 steps; running more steps does not improve quality significantly and can introduce artifacts.
- Resolution: native 1024² is the design target. 512² works for quick previews.
- Aspect ratios: multiples of 32 are supported, including 832×1248 and 1248×832.
- Prompting: natural-language prompts. Negative prompts are not required.
- Runtime memory: the text encoder is offloaded after prompt encoding, so the denoising loop is memory-light.
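
The resolution rule above is easy to enforce before calling the pipeline. A minimal sketch follows; `is_supported_size` and `snap_size` are hypothetical helpers, not part of the released code.

```python
MULTIPLE = 32  # both sides must be multiples of 32


def is_supported_size(width, height, multiple=MULTIPLE):
    """True when both sides are positive multiples of `multiple`."""
    return (
        width > 0
        and height > 0
        and width % multiple == 0
        and height % multiple == 0
    )


def snap_size(width, height, multiple=MULTIPLE):
    """Round each side to the nearest positive multiple of `multiple`."""
    def snap(value):
        return max(multiple, int(value / multiple + 0.5) * multiple)

    return snap(width), snap(height)


print(is_supported_size(832, 1248))  # True
print(is_supported_size(1000, 1000))  # False
print(snap_size(830, 1250))  # (832, 1248)
```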

## Quickstart

### Bonsai Studio (Linux / Windows)

The simplest path is the [Bonsai Image Demo repo](https://github.com/PrismML-Eng/Bonsai-Image-Demo), which sets up the full Bonsai Studio (FastAPI backend + Next.js frontend) and selects gemlite automatically on Linux / Windows:

```bash
git clone https://github.com/PrismML-Eng/Bonsai-Image-Demo.git
cd Bonsai-Image-Demo
./setup.sh
BONSAI_VARIANT=binary ./scripts/download_model.sh
BONSAI_VARIANT=binary ./scripts/serve.sh
```

On Windows (PowerShell):

```powershell
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned   # one-time
.\setup.ps1
$env:BONSAI_VARIANT = 'binary'
.\scripts\download_model.ps1
.\scripts\serve.ps1
```

### Python API (backend_gpu)

For inference without the studio frontend:

```python
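# build_pipeline lives in the backend_gpu package of the Bonsai-Image-Demo repo,
# so this is expected to run from a set-up checkout of that repo.
from backend_gpu.server import build_pipeline

# Sampler settings below follow Best Practices: 4 steps, guidance = 1.0.
pipe = build_pipeline(model_id="prism-ml/bonsai-image-binary-4B-gemlite-1bit")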
image = pipe(
    prompt="A bonsai tree in a quiet ceramic studio, soft morning light",
    num_inference_steps=4,
    guidance_scale=1.0,
    height=1024,
    width=1024,
).images[0]
image.save("bonsai.png")
```

## Throughput (CUDA / gemlite)

Warmed wall-clock per image, 4 sampler steps, guidance = 1.0, same prompts as the Mac and iPhone measurements. Linux + locally built gemlite kernels except where noted.

| Platform                  | 512² (s) | 1024² (s) | Notes                                       |
| :------------------------ | -------: | --------: | :------------------------------------------ |
| **A100** (Colab)          | 1.0      | **2.7**   | Ampere datacenter (40 GB)                   |
| **RTX PRO 6000 Blackwell** (Colab) | 1.0 | **1.8** | NVIDIA Blackwell, 96 GB VRAM                |
| **RTX 3080** 10 GB        | 1.5      | **4.5**   | Ampere consumer; 6.4 GiB peak GPU memory at 1024² |
| **RTX 3060** 6 GB (laptop)| 4.4      | 24.8      | Ampere mobile; memory-bound at 1024²        |

The sub-2-bit pack is what keeps 1024² generation viable on commodity GPUs: the consumer RTX 3080 reaches 4.5 s/image, while the 6 GB laptop RTX 3060 is the slow tail at 24.8 s/image because it is memory-bound at that resolution.
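
One way to read the table is to compare how much slower each GPU gets when moving from 512² to 1024², which is 4× the pixels. The sketch below uses only the timings above; treating a ratio above 4× as a sign of memory pressure is an interpretation consistent with the table notes, not a measured result.

```python
# (512² seconds, 1024² seconds) per image, copied from the throughput table.
TIMINGS_S = {
    "A100": (1.0, 2.7),
    "RTX PRO 6000 Blackwell": (1.0, 1.8),
    "RTX 3080 10 GB": (1.5, 4.5),
    "RTX 3060 6 GB (laptop)": (4.4, 24.8),
}

PIXEL_RATIO = (1024 * 1024) / (512 * 512)  # 4.0


def scaling_factors(timings=TIMINGS_S):
    """Slowdown factor when going from 512² to 1024² on each GPU."""
    return {gpu: round(t1024 / t512, 2) for gpu, (t512, t1024) in timings.items()}


for gpu, factor in scaling_factors().items():
    flag = "above pixel ratio" if factor > PIXEL_RATIO else "below pixel ratio"
    print(f"{gpu}: {factor}x ({flag})")
```

Only the 6 GB laptop RTX 3060 scales worse than the pixel count (about 5.6×), which lines up with the memory-bound note in its row.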

## Benchmarks

Evaluated with matched generation settings across the comparison set on H100. GenEval uses the official 512×512 protocol. For HPSv3 and DPG-Bench, larger-backbone rows are evaluated at 1024×1024, while smaller-backbone rows are evaluated at their native 512×512 setting. Higher is better for all three benchmarks.

| Model                       | Transformer (GB) | GenEval | HPSv3  | DPG-Bench |
| :-------------------------- | ---------------: | ------: | -----: | --------: |
| **Bonsai Image · Binary 4B** | **0.93**        | **0.671** | **11.15** | **0.822** |
| **Bonsai Image · Ternary 4B**| **1.21**        | **0.723** | **12.22** | **0.851** |
| FLUX.2 Klein 4B             | 7.75             | 0.819   | 12.84  | 0.853     |
| FLUX.1-schnell              | 23.8             | 0.716   | 12.67  | 0.848     |
| SDXL                        | 5.14             | 0.300   | 10.05  | 0.740     |
| PixArt-Σ XL 2               | 1.20             | 0.541   | 11.93  | 0.769     |
| Stable Diffusion 1.5        | 1.72             | 0.396   | 4.20   | 0.601     |
| BK-SDM-Small                | 0.98             | 0.297   | 3.05   | 0.559     |

The benchmark results show the intended quality-footprint trade-off. 1-bit Bonsai Image 4B is the footprint-oriented variant: it reduces the diffusion transformer below 1 GB while scoring 0.671 on GenEval, 11.15 on HPSv3, and 0.822 on DPG-Bench. The ternary companion is the quality-oriented variant: at 1.21 GB it comes close to the original FLUX.2 Klein 4B model on HPSv3 (12.22 vs. 12.84) and DPG-Bench (0.851 vs. 0.853), with a larger gap on GenEval (0.723 vs. 0.819).

Together, the Bonsai Image variants move the quality-footprint frontier: they bring modern diffusion-transformer behavior into a memory range previously occupied by much smaller, lower-capability models.
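
The frontier claim can be checked against the table with a small Pareto computation. The sketch below considers only transformer size and GenEval within this comparison set; the function name is hypothetical, and the other two benchmarks would need the same treatment for a full picture.

```python
# (model, transformer size in GB, GenEval) copied from the benchmark table.
MODELS = [
    ("Bonsai Image Binary 4B", 0.93, 0.671),
    ("Bonsai Image Ternary 4B", 1.21, 0.723),
    ("FLUX.2 Klein 4B", 7.75, 0.819),
    ("FLUX.1-schnell", 23.8, 0.716),
    ("SDXL", 5.14, 0.300),
    ("PixArt-Σ XL 2", 1.20, 0.541),
    ("Stable Diffusion 1.5", 1.72, 0.396),
    ("BK-SDM-Small", 0.98, 0.297),
]


def pareto_frontier(models):
    """Models not beaten by any smaller-or-equal-size model with a higher score."""
    frontier = []
    best_score = float("-inf")
    # Walk from smallest to largest; keep a model only if it raises the best score.
    for name, size_gb, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


print(pareto_frontier(MODELS))
# ['Bonsai Image Binary 4B', 'Bonsai Image Ternary 4B', 'FLUX.2 Klein 4B']
```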

## Use Cases

- **Local creative tooling**: image generation directly on CUDA-equipped workstations and consumer GPUs
- **Private generation**: prompts and generated assets can remain in local or controlled environments
- **Rapid iteration**: lower local latency and no remote queue for iterative creative workflows
- **Commodity-GPU serving**: lower transformer footprint and reduced memory pressure for serving on NVIDIA GPUs
- **Windows and Linux deployment**: native paths through the same Gemlite deployment stack
- **Enterprise and controlled inference**: local or private environments for data residency and compliance-sensitive workflows

## Limitations

- 1-bit Bonsai Image 4B is not bit-identical to the FP16 FLUX.2 Klein 4B model; it is a compact binary-weight deployment designed to deliver similar practical behavior at much smaller size.
- Image-generation quality remains prompt- and workflow-dependent. Small text, fine details, object counts, and strict compositional constraints should be evaluated for the target use case.
- Current commodity inference stacks do not yet expose fully native binary execution as a standard hardware path. This release uses practical Gemlite low-bit GEMM kernels on CUDA.
- After the diffusion transformer is made compact, other components such as the VAE can become more visible memory bottlenecks. The runtime mitigates this with text-encoder offload and tiled VAE decoding.


## Citation

```bibtex
@techreport{bonsaiimage4b,
    title   = {Bonsai Image 4B: Low-Bit Diffusion on Apple Silicon and Consumer GPUs},
    author  = {Prism ML},
    year    = {2026},
    month   = {May},
    url     = {https://prismml.com}
}
```

## Contact

For questions, feedback, or collaboration inquiries: **contact@prismml.com**