	---
	license: apache-2.0
	pipeline_tag: text-to-image
	tags:
	- 1-bit
	- gemlite
	- hqq
	- cuda
	- text-to-image
	- diffusion
	- flux
	- prismml
	- bonsai
	base_model:
	- prism-ml/bonsai-image-binary-4B-unpacked
	---

	<p align="center">
	<img src="./assets/bonsai-logo.svg" width="280" alt="Bonsai Image">
	</p>

	<p align="center">
	<a href="https://prismml.com"><b>Prism ML Website</b></a>  |
	<a href="https://github.com/PrismML-Eng/Bonsai-Image-Demo/blob/main/bonsai-image-4b-whitepaper.pdf"><b>Whitepaper</b></a>  |
	<a href="https://github.com/PrismML-Eng/Bonsai-Image-Demo"><b>Demo & Examples</b></a>  |
	<a href="https://discord.gg/prismml"><b>Discord</b></a>
	</p>

	# bonsai-image-binary-4B-gemlite-1bit

	Binary-weight (1-bit) text-to-image diffusion transformer deployment for NVIDIA GPUs.

	> 0.93 GB transformer | 8.3× smaller than FP16 | 4.5 s / 1024² on RTX 3080 | 2.7 s / 1024² on A100 | runs natively on Linux and Windows

	## Highlights

	- 0.93 GB diffusion transformer, down from 7.75 GB for the FP16 FLUX.2 Klein 4B transformer
	- Binary {-1, +1} transformer weights with FP16 group-wise scaling in the matrix-heavy transformer layers (Q/K/V projections, output projections, MLP weights)
	- 4.09 GB CUDA deployment payload including the 4-bit text encoder and FP16 VAE — text encoder is offloaded after prompt encode, so the denoising loop only keeps the compact transformer and VAE resident
	- 4-step FlowMatch-Euler sampler with guidance = 1.0 and shift = 3.0 — no CFG, no negative prompts needed
	- Gemlite low-bit GEMM path for NVIDIA GPUs, with HQQ used for the compressed text encoder
	- Runs on Linux and Windows natively through the same CUDA / Gemlite deployment stack
	- Cross-platform companion: also available as [MLX 1-bit](https://huggingface.co/prism-ml/bonsai-image-binary-4B-mlx-1bit) for Apple Silicon

	## Resources

	- [Whitepaper](https://github.com/PrismML-Eng/Bonsai-Image-Demo/blob/main/bonsai-image-4b-whitepaper.pdf) — full benchmarks, kernels, and memory analysis
	- [Demo repo](https://github.com/PrismML-Eng/Bonsai-Image-Demo) — one-command setup for Mac / Linux / Windows
	- [Discord](https://discord.gg/prismml) — community + support
	- Kernels: [gemlite](https://github.com/mobiusml/gemlite) (fused low-bit GEMM) · [HQQ](https://github.com/mobiusml/hqq) (low-bit quantization runtime) · [triton-windows](https://github.com/triton-lang/triton-windows) (Windows path)

	## Model Overview

	| Item | Specification |
	| :-------------------- | :---------------------------------------------------------------------------------------------- |
	| Base architecture | FLUX.2 Klein 4B (MMDiT diffusion transformer) |
	| Parameters | ~4.0B (transformer trunk) |
	| Blocks | 25 MMDiT blocks: 5 double-stream + 20 single-stream |
	| Sampler | FlowMatchEuler, 4 steps, guidance = 1.0, shift = 3.0 |
	| Text encoder | Qwen3-4B at 4-bit HQQ (≈ 2.84 GB CUDA payload, offloaded after prompt encode) |
	| VAE | Flux2 32-channel latent, tiled decode (128 px tiles) |
	| Native resolution | 1024×1024 (also supports 512×512 and arbitrary multiples of 32) |
	| Weight format | Gemlite INT1 pack, binary values + FP16 group-wise scales |
	| Transformer size | 0.93 GB model-level Bonsai representation; 1.08 GB CUDA packed deployment size |
	| Total payload | 4.09 GB CUDA deployment payload (transformer + 4-bit text encoder + FP16 VAE) |
	| 1-bit coverage | All 100 matmul-heavy linears in the 25 MMDiT blocks |
	| Platforms | Linux x86_64 + Windows native on NVIDIA GPUs |
	| License | Apache 2.0 |
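
The sampler row can be made concrete with a small sketch. It assumes the static timestep-shift formula commonly used by flow-matching Euler schedulers, `sigma' = shift * sigma / (1 + (shift - 1) * sigma)`, applied to a uniform 4-step grid. The exact schedule used by the Bonsai runtime is not stated in this card, so `shifted_sigmas`, `euler_sample`, and the toy velocity function are hypothetical illustrations rather than the shipped implementation.

```python
def shifted_sigmas(num_steps: int, shift: float) -> list[float]:
    """Uniform noise levels 1.0 .. 1/num_steps, warped by `shift`, plus a final 0.0."""
    base = [1.0 - i / num_steps for i in range(num_steps)]
    warped = [shift * s / (1.0 + (shift - 1.0) * s) for s in base]
    return warped + [0.0]


def euler_sample(velocity_fn, x, sigmas):
    """Plain Euler integration: x <- x + (sigma_next - sigma) * v(x, sigma)."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma)  # one model call per step (guidance = 1.0, no CFG pass)
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


# Toy check: with x_sigma = (1 - sigma) * x0 + sigma * noise, the exact velocity is noise - x0.
x0 = [0.2, -0.4, 0.9]
noise = [1.0, 0.5, -1.5]
exact_velocity = lambda x, sigma: [(xi - ci) / sigma for xi, ci in zip(x, x0)]

sigmas = shifted_sigmas(num_steps=4, shift=3.0)
sample = euler_sample(exact_velocity, noise, sigmas)
print([round(s, 3) for s in sigmas])
```

Under these assumptions, shift = 3.0 turns the uniform grid 1.0, 0.75, 0.5, 0.25 into 1.0, 0.9, 0.75, 0.5, so more of the four steps are spent at high noise levels before the final jump to 0.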

	## Binary Weight Representation: 1-bit g128

	Each binary weight takes a value from {−1, +1} with one shared FP16 scale per group of 128 weights:

	```text
	w_i = scale_g * b_i, b_i in {−1, +1}
	```
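
As a minimal sketch of this representation, the snippet below binarizes one group of 128 weights. It assumes a sign plus mean-absolute-value rule (a common choice, since the mean absolute value is the least-squares-optimal scale for fixed signs). This card does not say how Bonsai's scales are obtained, and the Gemlite INT1 layout is not reproduced here. `quantize_group`, `dequantize_group`, and `pack_signs` are hypothetical helper names.

```python
GROUP_SIZE = 128


def quantize_group(weights):
    """Return (scale, signs) with signs[i] in {-1, +1} and one shared scale."""
    scale = sum(abs(w) for w in weights) / len(weights)  # assumed scale rule
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def dequantize_group(scale, signs):
    """Reconstruct w_i = scale_g * b_i."""
    return [scale * b for b in signs]


def pack_signs(signs):
    """Pack 8 signs per byte, most significant bit first (+1 -> 1, -1 -> 0)."""
    assert len(signs) % 8 == 0
    packed = bytearray()
    for start in range(0, len(signs), 8):
        byte = 0
        for b in signs[start:start + 8]:
            byte = (byte << 1) | (1 if b > 0 else 0)
        packed.append(byte)
    return bytes(packed)


# Deterministic toy group of 128 weights in [-0.5, 0.5].
group = [((i * 37) % 17 - 8) / 16 for i in range(GROUP_SIZE)]
scale, signs = quantize_group(group)
restored = dequantize_group(scale, signs)
packed = pack_signs(signs)
print(len(packed), "bytes of signs for", GROUP_SIZE, "weights")
```

Every reconstructed weight has magnitude `scale`; only the sign differs within a group.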

	Binary values carry exactly 1 bit of information per weight. With one FP16 scale per group of 128, the effective storage is

	```text
	b_eff = 1 + 16/128 = 1.125 bits/weight
	```

	This gives an idealized 16 / 1.125 ≈ 14.2× reduction relative to FP16 for the binary transformer layers. A small set of precision-sensitive supporting tensors remains in FP16, so the final 1-bit Bonsai Image 4B diffusion transformer is 0.93 GB, an 8.3× reduction from the 7.75 GB FP16 FLUX.2 Klein 4B transformer.
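
The arithmetic behind these figures can be checked in a few lines. This is only a worked calculation using the numbers quoted in this section; `effective_bits` is a hypothetical helper name.

```python
def effective_bits(payload_bits: float, scale_bits: float, group_size: int) -> float:
    """Bits per weight when one scale is shared by `group_size` weights."""
    return payload_bits + scale_bits / group_size


b_eff = effective_bits(payload_bits=1, scale_bits=16, group_size=128)
ideal_ratio = 16 / b_eff          # versus 16-bit weights
measured_ratio = 7.75 / 0.93      # transformer sizes quoted in this section

print(f"{b_eff} bits/weight, ideal {ideal_ratio:.1f}x, measured {measured_ratio:.1f}x")
```

The gap between the ideal and measured ratios reflects the supporting tensors that stay in FP16.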

	The binary representation is applied to the matrix-heavy transformer layers, including Q / K / V projections, output projections, MLP linears, and the double-stream add-K / Q / V linears. Supporting tensors (less than 5% of the total parameters) such as modulation streams, embedders, output norm, and output projection remain FP16 for image quality and stability.

	The CUDA deployment uses a Gemlite INT1 packed format. The model-level Bonsai representation is 0.93 GB; the deployed CUDA pack is 1.08 GB on disk due to runtime packing and alignment overhead in the current Gemlite path.

	### Memory

	| Format | Transformer size | Reduction | Ratio |
	| :------------------------------ | ---------------: | --------: | -------: |
	| FP16 FLUX.2 Klein 4B | 7.75 GB | n/a | 1.0× |
	| 1-bit Bonsai Image 4B | 0.93 GB | 88.0% | 8.3× |

	CUDA deployment:

	| Component | Size |
	| :------------------------------ | ------: |
	| Gemlite INT1 diffusion transformer | 1.08 GB |
	| HQQ 4-bit text encoder | 2.84 GB |
	| FP16 VAE | 0.17 GB |
	| Total payload | 4.09 GB |

	At runtime, the text encoder is offloaded after prompt encoding, so memory use during the denoising loop is dominated by the compact binary diffusion transformer, the VAE, and activations rather than by the full payload.
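
A short calculation, using only the sizes from the tables in this section, shows what stays resident once the text encoder is offloaded. The dictionary keys are illustrative labels, not identifiers from the runtime.

```python
payload_gb = {
    "gemlite_int1_transformer": 1.08,
    "hqq_4bit_text_encoder": 2.84,
    "fp16_vae": 0.17,
}

total_gb = sum(payload_gb.values())
# Weights that stay resident once the text encoder has been offloaded.
denoise_weights_gb = payload_gb["gemlite_int1_transformer"] + payload_gb["fp16_vae"]
# Packing overhead of the CUDA pack relative to the 0.93 GB model-level size.
packing_overhead = payload_gb["gemlite_int1_transformer"] / 0.93 - 1

print(f"total {total_gb:.2f} GB, denoising weights {denoise_weights_gb:.2f} GB, "
      f"packing overhead {packing_overhead:.0%}")
```

These are weight sizes only; activation memory comes on top and depends on resolution.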

	Peak GPU memory at 1024² on RTX 3080 is ~6.4 GiB end-to-end (transformer + VAE + activation memory).

	## Best Practices

	- Sampler: FlowMatchEuler-discrete with 4 steps, guidance = 1.0, shift = 3.0. The model is designed for 4 steps; running more steps does not improve quality significantly and can introduce artifacts.
	- Resolution: native 1024² is the design target. 512² works for quick previews.
	- Aspect ratios: any width and height that are multiples of 32 are supported, including 832×1248 and 1248×832.
	- Prompting: natural-language prompts. Negative prompts are not required.
	- Runtime memory: the text encoder is offloaded after prompt encoding, so the denoising loop is memory-light.
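
The multiple-of-32 rule is easy to enforce before calling the pipeline. The helpers below are a hypothetical sketch (`is_supported` and `snap` are not part of the Bonsai codebase) and check only divisibility, not any upper limit the runtime may impose.

```python
MULTIPLE = 32


def is_supported(width: int, height: int) -> bool:
    """True when both sides are positive multiples of 32."""
    return width > 0 and height > 0 and width % MULTIPLE == 0 and height % MULTIPLE == 0


def snap(value: int) -> int:
    """Round to the nearest multiple of 32 (ties round up), never below 32."""
    return max(MULTIPLE, (value + MULTIPLE // 2) // MULTIPLE * MULTIPLE)


print(is_supported(832, 1248), is_supported(1000, 1000), snap(1000))
```

Snapping a requested size first avoids passing an unsupported resolution to the sampler.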

	## Quickstart

	### Bonsai Studio (Linux / Windows)

	The simplest path is the [Bonsai Image Demo repo](https://github.com/PrismML-Eng/Bonsai-Image-Demo), which sets up the full Bonsai Studio (FastAPI backend + Next.js frontend) and selects gemlite automatically on Linux / Windows:

	```bash
	git clone https://github.com/PrismML-Eng/Bonsai-Image-Demo.git
	cd Bonsai-Image-Demo
	./setup.sh
	BONSAI_VARIANT=binary ./scripts/download_model.sh
	BONSAI_VARIANT=binary ./scripts/serve.sh
	```

	On Windows (PowerShell):

	```powershell
	Set-ExecutionPolicy -Scope CurrentUser RemoteSigned # one-time
	.\setup.ps1
	$env:BONSAI_VARIANT = 'binary'
	.\scripts\download_model.ps1
	.\scripts\serve.ps1
	```

	### Python API (backend_gpu)

	For inference without the studio frontend:

	```python
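	# build_pipeline is provided by the backend_gpu package; this assumes you run
	# from a Bonsai-Image-Demo checkout where that package is importable.
	from backend_gpu.server import build_pipeline

	pipe = build_pipeline(model_id="prism-ml/bonsai-image-binary-4B-gemlite-1bit")

	# Recommended settings: 4 steps, guidance 1.0 (no CFG), native 1024x1024.
	image = pipe(
	    prompt="A bonsai tree in a quiet ceramic studio, soft morning light",
	    num_inference_steps=4,
	    guidance_scale=1.0,
	    height=1024,
	    width=1024,
	).images[0]
	image.save("bonsai.png")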
	```

	## Throughput (CUDA / gemlite)

	Warmed wall-clock per image, 4 sampler steps, guidance = 1.0, same prompts as the Mac and iPhone measurements. Linux + locally built gemlite kernels except where noted.

	| Platform | 512² (s) | 1024² (s) | Notes |
	| :------------------------ | -------: | --------: | :------------------------------------------ |
	| A100 (Colab) | 1.0 | 2.7 | Ampere datacenter (40 GB) |
	| RTX PRO 6000 Blackwell (Colab) | 1.0 | 1.8 | NVIDIA Blackwell, 96 GB VRAM |
	| RTX 3080 10 GB | 1.5 | 4.5 | Ampere consumer; 6.4 GiB peak VRAM at 1024² |
	| RTX 3060 6 GB (laptop) | 4.4 | 24.8 | Ampere mobile; memory-bound at 1024² |

	The sub-2-bit pack is what keeps generation viable on commodity GPUs at 1024² — the consumer RTX 3080 reaches 4.5 s/image while the 6 GB laptop 3060 is the slow tail (memory-pressure limited).
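
One rough way to read the table is to compare each GPU's 1024² / 512² slowdown with the 4× increase in pixel count. The sketch below uses only the timings listed above; the pixel-ratio comparison is a heuristic, not a performance model, and the per-step figure is an upper bound if the wall-clock time also covers prompt encoding and VAE decode.

```python
# (512² seconds, 1024² seconds) copied from the throughput table.
timings = {
    "A100": (1.0, 2.7),
    "RTX PRO 6000 Blackwell": (1.0, 1.8),
    "RTX 3080 10 GB": (1.5, 4.5),
    "RTX 3060 6 GB (laptop)": (4.4, 24.8),
}
STEPS = 4
PIXEL_RATIO = (1024 * 1024) / (512 * 512)  # 4x more pixels

slowdowns = {gpu: t1024 / t512 for gpu, (t512, t1024) in timings.items()}

for gpu, (t512, t1024) in timings.items():
    per_step = t1024 / STEPS  # upper bound on seconds per sampler step
    flag = "above pixel ratio" if slowdowns[gpu] > PIXEL_RATIO else "at or below pixel ratio"
    print(f"{gpu:<24} {slowdowns[gpu]:4.1f}x  <= {per_step:.2f} s/step  ({flag})")
```

Only the 6 GB laptop GPU slows down by more than the pixel ratio, which is consistent with the memory-pressure note in the table.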

	## Benchmarks

	Evaluated with matched generation settings across the comparison set on H100. GenEval uses the official 512x512 protocol. For HPSv3 and DPG-Bench, larger-backbone rows are evaluated at 1024x1024, while smaller-backbone rows are evaluated at their native 512x512 setting. Higher is better for all three benchmarks.

	| Model | Transformer (GB) | GenEval | HPSv3 | DPG-Bench |
	| :-------------------------- | ---------------: | ------: | -----: | --------: |
	| Bonsai Image · Binary 4B | 0.93 | 0.671 | 11.15 | 0.822 |
	| Bonsai Image · Ternary 4B | 1.21 | 0.723 | 12.22 | 0.851 |
	| FLUX.2 Klein 4B | 7.75 | 0.819 | 12.84 | 0.853 |
	| FLUX.1-schnell | 23.8 | 0.716 | 12.67 | 0.848 |
	| SDXL | 5.14 | 0.300 | 10.05 | 0.740 |
	| PixArt-Σ XL 2 | 1.20 | 0.541 | 11.93 | 0.769 |
	| Stable Diffusion 1.5 | 1.72 | 0.396 | 4.20 | 0.601 |
	| BK-SDM-Small | 0.98 | 0.297 | 3.05 | 0.559 |

	The benchmark results show the intended quality-footprint trade-off. 1-bit Bonsai Image 4B is the footprint-oriented variant: it reduces the diffusion transformer below 1 GB while still delivering strong GenEval, HPSv3, and DPG-Bench results. The ternary companion is the quality-oriented variant, using a slightly larger representation to achieve very close visual quality and prompt fidelity to the original FLUX.2 Klein 4B model.

	Together, the Bonsai Image variants move the quality-footprint frontier: they bring modern diffusion-transformer behavior into a memory range previously occupied by much smaller, lower-capability models.
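
To put numbers on the trade-off, the sketch below expresses each Bonsai variant's scores as a percentage of the FP16 FLUX.2 Klein 4B row, alongside the size ratio. It uses only values from the benchmark table; a simple score ratio is a rough summary, not a calibrated quality measure, and the row labels are shortened for illustration.

```python
# (transformer GB, GenEval, HPSv3, DPG-Bench) copied from the benchmark table.
rows = {
    "Bonsai Binary 4B": (0.93, 0.671, 11.15, 0.822),
    "Bonsai Ternary 4B": (1.21, 0.723, 12.22, 0.851),
    "FLUX.2 Klein 4B": (7.75, 0.819, 12.84, 0.853),
}
METRICS = ("GenEval", "HPSv3", "DPG-Bench")
reference = rows["FLUX.2 Klein 4B"]


def retention(name: str) -> dict:
    """Score of `name` as a percentage of the FP16 reference, per metric."""
    scores = rows[name][1:]
    return {m: 100 * s / r for m, s, r in zip(METRICS, scores, reference[1:])}


for name in ("Bonsai Binary 4B", "Bonsai Ternary 4B"):
    size_ratio = reference[0] / rows[name][0]
    kept = ", ".join(f"{m} {v:.1f}%" for m, v in retention(name).items())
    print(f"{name}: {size_ratio:.1f}x smaller; {kept}")
```

By this measure the binary variant gives up the most on GenEval and the least on DPG-Bench, while the ternary variant stays closer to the reference on all three.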

	## Use Cases

	- Local creative tooling: image generation directly on CUDA-equipped workstations and consumer GPUs
	- Private generation: prompts and generated assets can remain in local or controlled environments
	- Rapid iteration: lower local latency and no remote queue for iterative creative workflows
	- Commodity-GPU serving: lower transformer footprint and reduced memory pressure for serving on NVIDIA GPUs
	- Windows and Linux deployment: native paths through the same Gemlite deployment stack
	- Enterprise and controlled inference: local or private environments for data residency and compliance-sensitive workflows

	## Limitations

	- 1-bit Bonsai Image 4B is not bit-identical to the FP16 FLUX.2 Klein 4B model; it is a compact binary-weight deployment designed to deliver similar practical behavior at much smaller size.
	- Image-generation quality remains prompt- and workflow-dependent. Small text, fine details, object counts, and strict compositional constraints should be evaluated for the target use case.
	- Current commodity inference stacks do not yet expose fully native binary execution as a standard hardware path. This release uses practical Gemlite low-bit GEMM kernels on CUDA.
	- After the diffusion transformer is made compact, other components such as the VAE can become more visible memory bottlenecks. The runtime mitigates this with text-encoder offload and tiled VAE decoding.
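
Tiled decoding bounds VAE activation memory by processing one region at a time. The sketch below only enumerates tile boxes; it assumes non-overlapping 128 px tiles measured in output pixels, whereas the actual runtime may tile in latent space or overlap and blend tiles. `tile_boxes` is a hypothetical helper name.

```python
def tile_boxes(width: int, height: int, tile: int = 128):
    """Yield (left, top, right, bottom) boxes that cover a width x height image."""
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            yield left, top, min(left + tile, width), min(top + tile, height)


boxes = list(tile_boxes(1024, 1024))
covered = sum((r - l) * (b - t) for l, t, r, b in boxes)
print(len(boxes), "tiles,", covered == 1024 * 1024)
```

Edge tiles are clipped to the image bounds, so sizes that are not multiples of the tile size are still covered exactly once.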


	## Citation

	```bibtex
	@techreport{bonsaiimage4b,
	title = {Bonsai Image 4B: Low-Bit Diffusion on Apple Silicon and Consumer GPUs},
	author = {Prism ML},
	year = {2026},
	month = {May},
	url = {https://prismml.com}
	}
	```

	## Contact

	For questions, feedback, or collaboration inquiries: contact@prismml.com